mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 23:31:13 +00:00
6044a33197
Bundled hygiene commit before cycle-3 /implement (AZ-776, AZ-777). Mixes two concerns by user choice (autodev option B): - Cycle-3 autodev artifacts not yet committed by Step 9 (new-task): task specs for AZ-776 / AZ-777 under _docs/02_tasks/todo/ and the updated _docs/02_tasks/_dependencies_table.md. - Accumulated skill / rule tooling maintenance under .cursor/ (skills: autodev, code-review, decompose, deploy, implement, new-task, plan, refactor, retrospective, test-spec; rules: coderule, cursor-meta, meta-rule, testing; new release skill scaffolding). - Autodev bootstrap state: _docs/_autodev_state.md (step 10 in_progress) and _docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md (replay timestamp refreshed; gtsam 4.2 still numpy<2-only). Co-authored-by: Cursor <cursoragent@cursor.com>
291 lines
22 KiB
Markdown
291 lines
22 KiB
Markdown
---
|
||
name: release
|
||
description: |
|
||
Executes the deployment plan produced by /deploy against a target environment.
|
||
Closes the loop between "we have a plan" and "the new version is running in production with a verdict on disk."
|
||
6-phase workflow: pre-release gate, strategy select, execute, smoke test, watch window, commit-or-rollback.
|
||
Outputs _docs/04_release/release_<version>.md with a definitive Released / Rolled-Back / Aborted verdict.
|
||
Trigger phrases:
|
||
- "release", "ship", "go live", "release this version"
|
||
- "deploy to prod", "promote to staging", "roll out"
|
||
- "rollback", "abort the release"
|
||
category: ship
|
||
tags: [release, deployment, rollback, smoke-test, observability, production]
|
||
disable-model-invocation: true
|
||
---
|
||
|
||
# Release Execution
|
||
|
||
The `/deploy` skill produces a plan and scripts. The `/release` skill **runs** them, verifies the live system, watches it for a defined window, and produces a definitive verdict on disk.
|
||
|
||
## Core Principles
|
||
|
||
- **Real execution, not simulation**: every phase must actually run against the target environment. If a phase cannot be executed (missing scripts, no SSH access, disabled secrets, registry auth failure), STOP — do not pretend a step succeeded. See `meta-rule.mdc` § "Real Results, Not Simulated Ones".
|
||
- **Verifiable rollback path**: the release does not start until rollback is proven viable for this version. "We can roll back" without evidence is not a rollback path.
|
||
- **Quiet failure is a release failure**: a deploy script that exits 0 but emits no observable signal in the watch window is treated as a regression, not a success.
|
||
- **One release per invocation**: a single `/release` execution targets exactly one version against exactly one environment. Multi-stage promotion (staging → prod) is two invocations, not one.
|
||
- **Never skip the watch window**: even successful deploys can degrade after 5–60 minutes (cache warm-up, scheduled jobs, downstream backpressure). The watch window is mandatory.
|
||
- **Autonomous rollback on hard regressions**: critical health-check failure, error-rate spike above threshold, or smoke-test failure → automatic rollback. Soft regressions (latency drift, capacity warnings) escalate to the user.
|
||
|
||
## Context Resolution
|
||
|
||
Fixed paths:
|
||
|
||
- DEPLOY_DIR: `_docs/04_deploy/`
|
||
- RELEASE_DIR: `_docs/04_release/`
|
||
- SCRIPTS_DIR: `scripts/`
|
||
- DEPLOY_SCRIPT: `scripts/deploy.sh`
|
||
- HEALTH_SCRIPT: `scripts/health-check.sh`
|
||
- ENV_TEMPLATE: `.env.example`
|
||
- OBSERVABILITY_DOC: `_docs/04_deploy/observability.md`
|
||
- ENVIRONMENT_DOC: `_docs/04_deploy/environment_strategy.md`
|
||
- PROCEDURES_DOC: `_docs/04_deploy/deployment_procedures.md`
|
||
- ARCHITECTURE: `_docs/02_document/architecture.md`
|
||
- RESTRICTIONS: `_docs/00_problem/restrictions.md`
|
||
|
||
Announce the resolved paths and the **target environment + version + strategy** to the user before any phase that touches the live system.
|
||
|
||
## Inputs (BLOCKING prerequisites)
|
||
|
||
| Input | Required | Source |
|
||
|-------|----------|--------|
|
||
| Target environment | Yes — ASK user | `environment_strategy.md` enumerates valid options |
|
||
| Target version / image tag | Yes — ASK user | Must exist in the registry; verified in Phase 1 |
|
||
| Rollback target version | Yes — ASK user | Defaults to currently-deployed version if discoverable |
|
||
| `scripts/deploy.sh` | Yes | Produced by `/deploy` Step 7. STOP if missing → run `/deploy` first |
|
||
| `scripts/health-check.sh` | Yes | Same |
|
||
| `_docs/04_deploy/deployment_procedures.md` | Yes | Defines per-environment runbook, manual approval rules, change-window restrictions |
|
||
| `_docs/04_deploy/observability.md` | Yes | Defines watch metrics, thresholds, and dashboards |
|
||
| `_docs/04_deploy/environment_strategy.md` | Yes | Defines target hostnames, registries, secrets, deploy strategy per env |
|
||
|
||
## Outputs
|
||
|
||
```
|
||
RELEASE_DIR/
|
||
├── release_<version>_<env>_<YYYY-MM-DD-HHmm>.md (mandatory; one per invocation)
|
||
├── rollback_<version>_<env>_<YYYY-MM-DD-HHmm>.md (only when rollback fires; pairs with the release file)
|
||
└── manual_approvals/
|
||
└── approval_<version>_<env>.md (when restrictions require manual approval, written before Phase 3)
|
||
```
|
||
|
||
The release report (`templates/release-report.md`) is appended to as each phase completes — it is durable across phase failures and reflects partial progress so the next operator can resume or audit.
|
||
|
||
## Phases
|
||
|
||
```
|
||
┌────────────────────────────────────────────────────────────────┐
|
||
│ Release Execution (6-Phase Method) │
|
||
├────────────────────────────────────────────────────────────────┤
|
||
│ PREREQ: deploy artifacts on disk; tests green at HEAD │
|
||
│ │
|
||
│ 1. Pre-Release Gate → AC + change summary + readiness │
|
||
│ [BLOCKING: user confirms or aborts] │
|
||
│ 2. Strategy Select → all-at-once / blue-green / canary │
|
||
│ [BLOCKING: user picks strategy] │
|
||
│ 3. Execute → run deploy.sh, capture exit + logs │
|
||
│ [AUTO-ROLLBACK on non-zero exit] │
|
||
│ 4. Smoke Test → /test-run prod-smoke in target env │
|
||
│ [AUTO-ROLLBACK on failure] │
|
||
│ 5. Watch Window → poll observability for N minutes │
|
||
│ [AUTO-ROLLBACK on hard threshold breach] │
|
||
│ 6. Commit or Rollback → finalize verdict, update tracker │
|
||
│ [BLOCKING: user confirms only if soft regression escalated] │
|
||
├────────────────────────────────────────────────────────────────┤
|
||
│ Verdicts: Released · Rolled-Back · Aborted │
|
||
└────────────────────────────────────────────────────────────────┘
|
||
```
|
||
|
||
### Phase 1: Pre-Release Gate
|
||
|
||
**Goal**: Refuse to start if the system is not ready for a real release.
|
||
|
||
1. **Acceptance criteria check**: read `_docs/00_problem/acceptance_criteria.md`. If any AC is marked unmet OR if any AC has no associated test marked `Passed` in the latest `test-run` report, STOP and surface the unmet items. Do not let the user override with "ship anyway" without a recorded reason in the release report.
|
||
2. **Test status check**: read the most recent `_docs/06_metrics/perf_*.md` (if perf is required by restrictions) and the latest functional test report. Any failing or skipped test that maps to a critical-path AC blocks the release.
|
||
3. **Change summary**: read the git log between the version-tag-of-last-release and HEAD (or, if no prior release exists, from the project root commit). Render a short list grouped by component: features, fixes, breaking changes, security fixes. Cross-reference against the latest implementation reports under `_docs/03_implementation/`.
|
||
4. **Rollback readiness**:
|
||
- Confirm the previous version's image is still pullable from the registry (do not deploy without this).
|
||
- Confirm `scripts/deploy.sh --rollback` works as documented (read the script; if `--rollback` flag is missing, STOP — that is a deploy-skill bug).
|
||
- Confirm a rollback target exists (e.g., previously-deployed image tag) and is recorded in the release report under `Rollback Plan`.
|
||
5. **Restrictions**: read `_docs/00_problem/restrictions.md` for change-window rules, manual-approval rules, blackout windows, regulatory requirements (e.g., 4-eyes review, ITAR controls). If any apply, gate accordingly — write a `manual_approvals/approval_<version>_<env>.md` file once received.
|
||
6. **Tracker check**: list tracker tickets in the release scope (per `tracker.mdc` rules). Any ticket still in `In Progress` or `Code Review` that maps to a change in the release scope blocks Phase 1. Move-and-deploy is not allowed.
|
||
|
||
**BLOCKING gate**: present the assembled summary to the user using Choose A/B/C:
|
||
|
||
```
|
||
══════════════════════════════════════
|
||
PRE-RELEASE GATE
|
||
══════════════════════════════════════
|
||
Target env: {env}
|
||
Target version: {version} ({git-sha})
|
||
Rollback target: {previous-version}
|
||
Changes: N tickets, M components
|
||
- {summary list}
|
||
Open risks: {summary or "none"}
|
||
Blocking issues: {summary or "none"}
|
||
══════════════════════════════════════
|
||
A) Proceed to Strategy Select
|
||
B) Abort — fix blocking issue and re-invoke
|
||
C) Edit release scope — exclude a ticket and reassemble
|
||
══════════════════════════════════════
|
||
```
|
||
|
||
If A → write Phase 1 section to release report, proceed. If B → write `Aborted` verdict to release report with reason, exit. If C → loop back into Phase 1 with edited scope.
|
||
|
||
### Phase 2: Strategy Select
|
||
|
||
**Goal**: Pick the deployment strategy that fits the change risk and environment capability.
|
||
|
||
Read `environment_strategy.md` and `deployment_procedures.md` to learn which strategies the target env supports. Strategies and when each is appropriate:
|
||
|
||
| Strategy | When to pick | Risk if wrong |
|
||
|----------|--------------|---------------|
|
||
| **all-at-once** | Internal tools, low traffic, well-rehearsed change, env supports nothing else | All users hit the new version simultaneously — bug blast radius is 100% |
|
||
| **blue-green** | Stateless services with a load balancer, env has dual-stack capability | Cutover is binary — observability must be ready to detect issues fast |
|
||
| **canary** | Customer-facing, traffic-tier load balancer in place, gradual rollout possible | Canary metric thresholds must be well-tuned or canary fails for harmless reasons |
|
||
| **manual** | Non-automatable env (one-off VMs, regulated infrastructure, non-Docker host) | The whole release becomes a runbook and the watch window phases are operator-driven; the release skill records but does not execute |
|
||
|
||
Recommend a default based on:
|
||
- Risk level inferred from change summary (any breaking change → bias toward canary or blue-green)
|
||
- Restrictions (e.g., regulatory rules forcing manual approval at each step)
|
||
- Environment capability (some envs may only support all-at-once)
|
||
|
||
**BLOCKING gate**: Choose A/B/C/D between strategies. Record the choice in the release report.
|
||
|
||
### Phase 3: Execute
|
||
|
||
**Goal**: Actually run the deploy. Capture exit code and full stdout/stderr.
|
||
|
||
1. Validate environment file (`.env`) exists, all required vars from `.env.example` are set, no placeholder secrets remain.
|
||
2. Source the env file and run `scripts/deploy.sh` against the target host. The script produced by `/deploy` Step 7 is the point of execution; do NOT bypass it. If a strategy-specific flag is needed (e.g., `--canary 5%`), pass it through.
|
||
3. Stream stdout/stderr to the release report, with timestamps, in a fenced code block under `## Phase 3: Execute`.
|
||
4. Capture exit code.
|
||
5. **AUTO-ROLLBACK trigger**: non-zero exit code → immediately invoke Phase 6 with verdict `Rolled-Back: deploy script failure`. Do NOT continue to Phase 4.
|
||
|
||
If `deploy.sh` emits no output for more than the configured idle threshold (default 5 minutes; check `deployment_procedures.md` for an explicit value), treat it as hung — capture a snapshot of what's running on the target, kill the script, and AUTO-ROLLBACK with reason `Deploy hung — manual investigation required`.
|
||
|
||
**Manual strategy**: if Phase 2 picked `manual`, write a checklist of operator steps from `deployment_procedures.md` to the release report and pause until the user types `done` or `failed`. Phase 3 then records the user's report verbatim.
|
||
|
||
### Phase 4: Smoke Test
|
||
|
||
**Goal**: Verify the new version is *actually serving traffic correctly* in the target environment.
|
||
|
||
1. Resolve the smoke-test command from `_docs/02_document/tests/blackbox-tests.md` § Production Smoke Tests, OR delegate to `/test-run` in `--prod-smoke` mode against the target environment.
|
||
2. The smoke-test set must (a) hit each public endpoint of each component, (b) include at least one read AND one write per public endpoint where applicable, and (c) complete in under 5 minutes total.
|
||
3. Capture pass/fail per case to the release report.
|
||
4. **AUTO-ROLLBACK trigger**: any smoke-test failure → invoke Phase 6 with verdict `Rolled-Back: smoke test failure: <test-name>`.
|
||
|
||
If smoke tests are **missing** for the target environment (no production-mode test set), STOP — write a leftover entry to `_docs/_process_leftovers/` per `tracker.mdc`, do not proceed to watch window without smoke coverage. Write `Aborted: smoke tests missing for prod-mode target` and ASK the user.
|
||
|
||
### Phase 5: Watch Window
|
||
|
||
**Goal**: Observe the live system for a defined window to catch latent regressions.
|
||
|
||
1. Read `observability.md` for the project's metrics, dashboards, and threshold definitions. Required watch metrics for any production target (per cursor-meta convention) include error rate, request rate, p99 latency, and saturation (CPU/memory/queue-depth).
|
||
2. Compute the watch-window duration from `deployment_procedures.md`. If unspecified, default to **15 minutes** for staging and **60 minutes** for production.
|
||
3. Poll the observability backend at 1-minute intervals (or the configured cadence). For each interval, record metric snapshots to the release report.
|
||
4. Threshold rules:
|
||
- **Hard breach** (auto-rollback): error-rate ≥ 2× baseline, p99 latency ≥ 3× baseline, any health-check failure persisting for 2 consecutive intervals.
|
||
- **Soft breach** (escalate): metric drift between 1.5× and 2× baseline, single-interval health blip, queue-depth steady but elevated.
|
||
- **No data** (escalate): if metrics are not flowing within the first 3 minutes, treat the absence as a hard breach — observability is itself broken.
|
||
5. **AUTO-ROLLBACK trigger**: hard breach at any interval. Move to Phase 6 with verdict `Rolled-Back: <metric> breached <multiplier>× baseline at T+<minutes>`.
|
||
6. **ESCALATE trigger**: soft breach. Pause polling, surface the metric, and ask the user A/B/C:
|
||
- A) Continue watch — accept current drift, keep polling
|
||
- B) Roll back now — treat soft drift as hard
|
||
- C) Extend watch window by N minutes
|
||
7. End of watch window with no breach → proceed to Phase 6.
|
||
|
||
The watch window cannot be skipped. If the user explicitly demands skipping (e.g., emergency rollforward), record the override reason in the release report and continue, but mark the verdict as `Released-with-override` — this triggers an automatic incident retrospective per `retrospective/SKILL.md`.
|
||
|
||
### Phase 6: Commit or Rollback
|
||
|
||
**Goal**: Finalize the release with a definitive verdict on disk.
|
||
|
||
**Path A — Commit (clean release)**:
|
||
1. Update tracker tickets: every ticket in scope moves to `Released` (or `Done`, per project convention defined in `tracker.mdc` / `_docs/_repo-config.yaml`).
|
||
2. Tag the git HEAD with `release/<version>` (or the project's tag convention from `deployment_procedures.md`).
|
||
3. Write the final `Released` verdict to the release report with a summary table.
|
||
4. Trigger `/retrospective --cycle-end` with this release as the cycle terminus.
|
||
5. Auto-chain to autodev's next step (Retrospective in greenfield, or feature-cycle loop start in existing-code).
|
||
|
||
**Path B — Rollback (auto-fired or user-elected)**:
|
||
1. Run `scripts/deploy.sh --rollback` with the rollback target captured in Phase 1.
|
||
2. Stream output to a new file `RELEASE_DIR/rollback_<version>_<env>_<YYYY-MM-DD-HHmm>.md` AND append a summary to the original release report under `## Rollback`.
|
||
3. Re-run Phase 4 (smoke test) and a 5-minute mini watch window against the rolled-back version. If THAT also fails, escalate immediately — the system is in an unknown state and needs human takeover.
|
||
4. Update tracker tickets back to `Ready for Release` (or the project's pre-release status).
|
||
5. Write the final `Rolled-Back` verdict with full reason chain.
|
||
6. Auto-trigger `/retrospective --incident` with this release as the incident anchor (per `retrospective/SKILL.md` incident mode).
|
||
7. Do NOT auto-chain to anything else — the user owns the next step.
|
||
|
||
**Path C — Aborted**:
|
||
Reached only via Phase 1 Choose B, Phase 4 smoke-tests-missing escalation, or any phase that detects a precondition violation. Write `Aborted: <reason>` to the release report. Do not auto-chain.
|
||
|
||
## Self-verification
|
||
|
||
- [ ] Release report exists at `RELEASE_DIR/release_<version>_<env>_<timestamp>.md` with verdict (Released / Rolled-Back / Aborted)
|
||
- [ ] Every phase that ran has a section in the release report with timestamps and tool output
|
||
- [ ] On Released: tracker tickets moved to release status; git tag pushed (if convention)
|
||
- [ ] On Rolled-Back: rollback report exists at `RELEASE_DIR/rollback_<version>_<env>_<timestamp>.md`; tracker tickets moved back to pre-release status; incident retrospective scheduled
|
||
- [ ] On Aborted: reason recorded; no live-system changes attempted; no tracker movement
|
||
- [ ] No phase was skipped without an explicit reason recorded in the release report
|
||
|
||
## Escalation Rules
|
||
|
||
| Situation | Action |
|
||
|-----------|--------|
|
||
| `scripts/deploy.sh` missing or `--rollback` unsupported | STOP — return to `/deploy` Step 7, do not patch the script in `/release` |
|
||
| Registry auth failure during pre-release | STOP — fix credentials at infra layer (per `coderule.mdc`); do not embed creds in the script |
|
||
| Smoke tests missing for prod target | STOP — write a leftover; do not improvise smoke tests in `/release` |
|
||
| Observability backend unreachable | STOP — observability blindness is itself a release blocker |
|
||
| User asks to skip the watch window | Record override, mark verdict `Released-with-override`, fire incident retro |
|
||
| Rollback also fails its smoke test | ESCALATE to user — system is in unknown state; do not loop deploys |
|
||
| Tracker MCP returns Unauthorized during ticket movement | Per `tracker.mdc`, write a leftover entry; do NOT silently continue without confirming the move |
|
||
| Multiple environments named in user request | STOP — one release per invocation; ask user to pick one |
|
||
| Production smoke test would touch real customer data | STOP — that is a `coderule.mdc` violation; ask user to define a smoke endpoint or test account |
|
||
|
||
## Common Mistakes
|
||
|
||
- **Skipping the watch window when "everything looks fine after deploy"** — a deploy that exited 0 is not a release that's stable. Watch is mandatory.
|
||
- **Faking smoke tests** to pass the gate when the prod test set is incomplete. STOP and surface the gap; do not embed prod URLs into ad-hoc curl commands.
|
||
- **Rolling forward through a failure** ("the next deploy will fix it"). Roll back first, fix the cause, then deploy a real fix.
|
||
- **Treating the release report as optional** when only an internal tool changed. Every release writes a report — the audit trail is the value, not the prose volume.
|
||
- **Approving manual gates yourself** without the user's input when restrictions require human approval. The release skill records, the human approves.
|
||
- **Reusing `release_<version>` filenames** across attempted releases. Always include the timestamp in the filename so re-attempts are visible side-by-side.
|
||
- **Letting tracker drift silently** between release attempts. If Phase 6 cannot move tickets, the release is not complete — write a leftover and stop.
|
||
|
||
## Project Mode vs Standalone
|
||
|
||
- **Project mode** (default): autodev invokes `/release` after `/deploy`. State writes occur under `_docs/_autodev_state.md`. Full integration with retrospective and feature-cycle loop.
|
||
- **Standalone mode**: `/release` invoked directly with `@<artifact>` (rare; usually only for re-running a rollback against a specific version). All outputs still go to `RELEASE_DIR/`.
|
||
|
||
## Methodology Quick Reference
|
||
|
||
```
|
||
┌────────────────────────────────────────────────────────────────┐
|
||
│ Release (6 phases, 3 verdicts) │
|
||
├────────────────────────────────────────────────────────────────┤
|
||
│ Phase 1 Pre-Release Gate │
|
||
│ AC + tests + change summary + rollback path │
|
||
│ [BLOCKING — user A/B/C] │
|
||
│ Phase 2 Strategy Select │
|
||
│ all-at-once · blue-green · canary · manual │
|
||
│ [BLOCKING — user picks] │
|
||
│ Phase 3 Execute │
|
||
│ scripts/deploy.sh, capture exit code + logs │
|
||
│ [AUTO-ROLLBACK on non-zero or hang] │
|
||
│ Phase 4 Smoke Test │
|
||
│ /test-run --prod-smoke against target │
|
||
│ [AUTO-ROLLBACK on any failure] │
|
||
│ Phase 5 Watch Window │
|
||
│ Poll observability for N minutes │
|
||
│ [AUTO-ROLLBACK on hard breach; escalate on soft] │
|
||
│ Phase 6 Commit or Rollback │
|
||
│ Released → tracker, tag, retrospective │
|
||
│ Rolled-Back → tracker reset, incident retrospective │
|
||
│ Aborted → no live-system change │
|
||
├────────────────────────────────────────────────────────────────┤
|
||
│ Principles: real execution · verifiable rollback · │
|
||
│ quiet failure = release failure · │
|
||
│ watch window mandatory │
|
||
└────────────────────────────────────────────────────────────────┘
|
||
```
|