gps-denied-onboard/_docs/04_release/release_cycle3_jetson-bench_2026-05-26-1442.md
Oleksandr Bezdieniezhnykh 940066bee2 chore: WIP pre-implement
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-26 17:09:13 +03:00


Release Report — Cycle 3 → Jetson (bench test)

  • Date: 2026-05-26 14:42 EEST (UTC+3)
  • Operator: obezdienie001 (single-operator project; agent-assisted via /autodev)
  • Strategy: manual / bench-test
  • Target version: be743a7 (dev HEAD; commit [AZ-844] Close Step 11 cycle-3: unit pass, jetson regression AZ-848)
  • Target environment: lab Jetson Orin Nano Super at SSH alias jetson-e2e (uptime 15d, 42 GB free on /var/lib/docker)
  • Compose file: docker-compose.test.jetson.yml (TEST compose — NOT the parent-suite airborne deploy compose)
  • Verdict: Released
  • Verdict reason: Bench run produced identical failure profile to Step 11 (4 failed, 48 passed, 3 skipped, 1 xfailed, 1 xpassed in 335.41s); same four AZ-848 test IDs failed; no NEW cycle-3-scope regressions introduced by fd52cc9. AZ-848 / AZ-883 carry forward to Cycle 4 as planned.

Pre-Release Gate (Phase 1)

Scope of this release

This is not an airborne production deploy. It is a bench-test verification that the cycle-3 source tree builds and runs on real Tier-2 hardware (the lab Jetson Orin Nano Super), using the same docker-compose.test.jetson.yml harness that drove the cycle-3 closeout in Step 11. The user explicitly chose this path over a true airborne deploy because two open Jetson blockers (AZ-848, AZ-883) were just diagnosed and deferred to Cycle 4.

A true airborne release will be Cycle 4's job, once AZ-848 (VioOutput.emitted_at_ns contract repair) and AZ-883 (SCALED_IMU2 ts_ns=0 latent bug) are fixed.

Acceptance Criteria

The system-level ACs in _docs/00_problem/acceptance_criteria.md (AC-1.x position accuracy, AC-4.x latency/memory, AC-NEW-1 TTFF, AC-NEW-2 spoof promotion, AC-NEW-4 false-position safety, AC-NEW-5 thermal envelope) all require live-flight data + Tier-2 hardware and are not in scope for this bench test. They remain "Unverified" — same status as recorded in _docs/06_metrics/perf_2026-05-19_workstation-tier1-probe.md and _docs/06_metrics/perf_2026-05-26_cycle3-tier1-probe.md.

What IS in scope and verifiable here:

| Scope item | Verification | Status |
|---|---|---|
| Cycle-3 source builds on arm64 (Jetson Orin Nano Super) | docker compose build against tests/e2e/Dockerfile.jetson succeeds | Phase 3 |
| Cycle-3 source runs on real Jetson hardware end-to-end | pytest tests/unit/ + tests/e2e/replay/ exits with same failure profile as Step 11 closeout (4 failed, 48 passed, 3 skipped, 1 xfailed, 1 xpassed) | Phase 4 |
| No new Cycle-3-scope regressions vs. Step 11 (2026-05-24) | Failure profile matches Step 11: only the known AZ-848 4-tuple fails; no new failures introduced by fd52cc9 | Phase 4 |
| Working tree on Jetson reflects the cycle-3 closeout commit | rsync mirrors local be743a7 to remote ~/gps-denied-onboard/ | Phase 3 |

Test Status

| Suite | Pass | Fail | Skip | Source |
|---|---|---|---|---|
| Tier-1 unit (local Mac) | 2303 | 0 | 86 | _docs/03_implementation/run_tests_step11_report.md § Cycle-3 closeout → Local unit suite |
| Tier-1 perf (this cycle, Mac) | n/a | n/a | n/a | _docs/06_metrics/perf_2026-05-26_cycle3-tier1-probe.md: 4/4 NFRs Unverified on Tier-1 (NFR-PERF-* require Tier-2 + AZ-595 fixture, both still pending) |
| Tier-2 Jetson e2e (Step 11, 2026-05-24) | 48 | 4 (AZ-848) | 3 | _docs/03_implementation/run_tests_step11_report.md § Cycle 3 closeout → Jetson e2e |
| Tier-2 Jetson e2e (this release; bench rerun) |  |  |  | This release report, Phase 4 below |

Change Summary

Cycle-3 src delta (single commit fd52cc9 [AZ-845][AZ-846][AZ-847] Refactor 02: relocate RouteSpec + widen lint):

src/gps_denied_onboard/_types/route.py                              | +43
src/gps_denied_onboard/components/c11_tile_manager/route_client.py  |  -4
src/gps_denied_onboard/replay_input/__init__.py                     |  -2
src/gps_denied_onboard/replay_input/tlog_route.py                   | -30

Net effect: relocate the RouteSpec dataclass from a private helper into the shared _types/ package; widen ruff lint rules to cover the new module. No behavioural change. No c1_vio / c5_state / c8_fc_adapter / runtime_root touches.
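The totals implied by the diffstat above can be checked mechanically. The following is a minimal sketch, assuming the four diffstat lines are copied verbatim from this report; `summarize_diffstat` is a hypothetical helper for illustration, not part of the repo. It also checks the claim that none of the listed runtime components were touched.

```python
import re

# Diffstat lines copied from the Change Summary in this report.
DIFFSTAT = """\
src/gps_denied_onboard/_types/route.py                              | +43
src/gps_denied_onboard/components/c11_tile_manager/route_client.py  |  -4
src/gps_denied_onboard/replay_input/__init__.py                     |  -2
src/gps_denied_onboard/replay_input/tlog_route.py                   | -30
"""

# Components the report says were NOT touched by the cycle-3 delta.
UNTOUCHED = ("c1_vio", "c5_state", "c8_fc_adapter", "runtime_root")


def summarize_diffstat(text: str) -> dict:
    """Sum added/removed line counts from 'path | +N' / 'path | -N' lines."""
    added = removed = 0
    paths = []
    for line in text.splitlines():
        match = re.match(r"^(\S+)\s*\|\s*([+-])(\d+)\s*$", line)
        if match is None:
            continue  # ignore anything that is not a diffstat row
        path, sign, count = match.group(1), match.group(2), int(match.group(3))
        paths.append(path)
        if sign == "+":
            added += count
        else:
            removed += count
    return {"paths": paths, "added": added, "removed": removed,
            "touched": added + removed}


summary = summarize_diffstat(DIFFSTAT)
touches_runtime = any(name in path for path in summary["paths"] for name in UNTOUCHED)

print(summary["added"], summary["removed"], summary["touched"], touches_runtime)
```

With the rows as listed, the delta is 43 added and 36 removed lines across four files, and no path falls under the untouched components.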

Cycle-3 ticket scope (closed in this cycle, present at HEAD):

| Ticket | Type | Component | Notes |
|---|---|---|---|
| AZ-835 (epic) | feature | C1–C6 | "GPS-denied tile provisioning + route spec" epic; decomposed into C1–C6 sub-tasks |
| AZ-836 | tooling | autodev | State-file trim; defer In Testing transition (MCP unavailable workaround) |
| AZ-838 | feature | C2 (route client) | SatelliteProviderRouteClient + seed_route.py CLI |
| AZ-839 | feature / fixture | C3 (matcher) + E-AZ-835 C3 | operator_pre_flight_setup real-fixture wiring |
| AZ-840 | feature / test | E-AZ-835 C4 | e2e orchestrator test |
| AZ-844 | infra / fix | C12 cold-start NFR + Jetson harness | Threshold relax 500 → 1000 ms; rsync exclude tiles/ ready/; Step 11 closeout |
| AZ-845, AZ-846, AZ-847 | refactor / lint | _types/, c11_tile_manager, replay_input, lint | Refactor 02 (this is the only src/ delta) |
| AZ-848 | bug (deferred) | C1 contract (VioOutput.emitted_at_ns) | Deferred to Cycle 4. Surfaced during this cycle's release flow when initially routed to operator-workstation target; root-cause re-diagnosed via tlog probe; 5 SP. |
| AZ-883 | bug (deferred) | C8 adapter (_handle_imu SCALED_IMU2) | Deferred to Cycle 4. Latent ts_ns=0 bug surfaced during AZ-848 investigation; 2 SP. |

Rollback Plan

  • Previous version: NONE — this is the first-ever release for this project.
    • _docs/04_release/ was empty before this report.
    • No release/* git tag in the repo.
    • No .previous-tags.env produced by a prior stop-services.sh run.
  • Rollback script: scripts/deploy.sh --rollback is unavailable for this bench test (exit 70 — .previous-tags.env not found). Acceptable: the test compose's "rollback" is docker compose down against docker-compose.test.jetson.yml, which leaves the Jetson in pre-test state.
  • Rollback target verified pullable: n/a (no previous version exists).
  • Rollback target verified bootable in target env: n/a.

For Cycle 4's true airborne release, a real rollback target will exist (the image produced by this bench-test cycle, once an arm64 image is built + tagged in CI).
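The first-release situation described above can be expressed as a small pre-check. This is only a sketch under stated assumptions: `rollback_precheck` and the `state_dir` location are hypothetical, and the code does not reproduce how scripts/deploy.sh is actually implemented. The only facts taken from this report are the file name `.previous-tags.env` and exit code 70 when it is missing.

```python
import tempfile
from pathlib import Path

EXIT_NO_ROLLBACK_TARGET = 70      # exit code reported above for a missing tags file
TAGS_FILE = ".previous-tags.env"  # written by a prior stop-services.sh run


def rollback_precheck(state_dir: Path) -> tuple[str, int]:
    """Classify the rollback situation instead of failing blindly.

    state_dir is a hypothetical directory where the tags file would live.
    """
    if (state_dir / TAGS_FILE).is_file():
        return ("rollback-available", 0)
    # No previous tags: this is a first-ever release, not an operator error.
    return ("first-release", EXIT_NO_ROLLBACK_TARGET)


with tempfile.TemporaryDirectory() as tmp:
    state_dir = Path(tmp)
    first = rollback_precheck(state_dir)
    # Simulate what a later release would see after a successful stop-services.sh.
    (state_dir / TAGS_FILE).write_text("# previous image tags would be listed here\n")
    later = rollback_precheck(state_dir)

print(first, later)
```

Returning a label alongside the exit code lets a release gate record "first release" explicitly rather than treating exit 70 as a generic failure.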

Restrictions / Approvals

  • Change-window restrictions: none for bench testing on lab Jetson (NFT-SEC-05 in-flight egress lockdown and ground-only gate apply only to airborne).
  • Manual approvals required: none — single-operator project.
  • Restriction _docs/00_problem/restrictions.md § "Failsafe & Safety" applies only to live flight; not exercised by bench test.

Tracker State at Gate

  • Tickets in scope (CLOSED at HEAD): 9 tickets (AZ-835, AZ-836, AZ-838, AZ-839, AZ-840, AZ-844, AZ-845, AZ-846, AZ-847; see Change Summary above).
  • Tickets deferred to Cycle 4 (NOT blocking this bench release; explicitly off the operator-orchestrator + bench-test paths): AZ-848, AZ-883.
  • Tickets blocking release: 0. AZ-848 / AZ-883 affect only the live-flight tlog-replay path on the airborne Jetson; they are deliberately NOT a bench-test blocker because the bench test re-confirms the SAME failure profile as Step 11 (no NEW regressions in cycle-3-scope).

Gate Decision

User picked A) Bench testing on jetson-e2e at the Pre-Release Gate. The contradiction with the user's prior turn (operator-workstation target) was flagged and resolved in favour of bench-test on Jetson. Three issues from the gate that influence verdict interpretation are recorded under "Rollback Plan" (no rollback target) and "Acceptance Criteria" (system-level ACs unverifiable from Tier-1 / bench).

Strategy Select (Phase 2)

  • Recommended by skill table for this target capability: manual (per release/SKILL.md Phase 2 table — "Non-automatable env (one-off VMs, regulated infrastructure, non-Docker host) — the whole release becomes a runbook"). Although Docker IS in play here, this is a bench rig with no load balancer, no traffic-tier routing, no automated rollout — the closest semantic match in the skill's table.
  • Chosen: manual / bench-test.
  • Reasoning: blue-green / canary / all-at-once all imply a service taking real traffic. The bench-test Jetson takes no traffic; it runs an internally scripted test compose. The release is recorded, but nothing is "deployed" in the production sense: the parent-suite Watchtower flow is bypassed, and only the cycle-3 image's ability to build and run on hardware is verified.

Execute (Phase 3)

  • Start: 2026-05-26 14:42:41 EEST (shell job PID 84808)
  • Command: bash scripts/run-tests-jetson.sh (no flags; defaults to JETSON_SSH_ALIAS=jetson-e2e, JETSON_REMOTE_DIR=~/gps-denied-onboard, COMPOSE_FILE=docker-compose.test.jetson.yml)
  • Stream sink: _docs/04_release/.jetson_bench_run_2026-05-26.log (preserved for audit; NOT committed; .jetson_bench_run_*.log should land in .gitignore post-release).
  • End: 2026-05-26 14:50:17 EEST (wall clock 7m 35s; includes rsync + docker compose pull + e2e-runner image build + pytest)
  • Exit code: 1 — propagated from pytest (4 failures inside e2e-runner). Expected: AZ-848 deterministically fails the same 4 cases. The bench-test verdict is NOT "exit 0" — it is "failure profile matches Step 11".

Pytest summary line (from _docs/04_release/.jetson_bench_run_2026-05-26.log, e2e-runner-1 container):

============================= test session starts ==============================
platform linux -- Python 3.10.12, pytest-9.0.3, pluggy-1.6.0
collected 57 items
... (57 tests; see Phase 4 table below for the test-ID summary)
= 4 failed, 48 passed, 3 skipped, 1 xfailed, 1 xpassed, 1 warning in 335.41s (0:05:35) =
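The summary line above is what the verdict hinges on, so it helps to compare it with the Step 11 profile programmatically rather than by eye. The following is a minimal sketch; `parse_summary` is a hypothetical helper, and the baseline counts are the Step 11 figures quoted earlier in this report.

```python
import re

# Final pytest line copied from this run's log excerpt.
SUMMARY = "= 4 failed, 48 passed, 3 skipped, 1 xfailed, 1 xpassed, 1 warning in 335.41s (0:05:35) ="

# Step 11 closeout profile as stated in this report.
STEP11_BASELINE = {"failed": 4, "passed": 48, "skipped": 3, "xfailed": 1, "xpassed": 1}
COLLECTED = 57  # from the "collected 57 items" line


def parse_summary(line: str) -> tuple[dict[str, int], float]:
    """Split a pytest summary line into per-outcome counts and duration (s)."""
    body = line.strip("= ")
    counts_part, _, timing = body.rpartition(" in ")
    counts = {}
    for item in counts_part.split(", "):
        number, label = item.split(" ", 1)
        counts[label] = int(number)
    seconds = float(re.match(r"([\d.]+)s", timing).group(1))
    return counts, seconds


counts, seconds = parse_summary(SUMMARY)
# Warnings are not test outcomes, so only compare the keys in the baseline.
profile = {key: counts.get(key, 0) for key in STEP11_BASELINE}
profile_matches = profile == STEP11_BASELINE
all_accounted_for = sum(profile.values()) == COLLECTED

print(profile_matches, all_accounted_for, seconds)
```

Matching counts are necessary but not sufficient: the same totals could hide a different set of failing tests, which is why Phase 4 also compares the failing test IDs.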

AZ-848 root-cause log line from THIS run (matches Step 11 root cause, confirms determinism):

c5.state.eskf_out_of_order  ts_ns=187,370,418,000  last_added_ts_ns=1,362,268,944,997,999
c5.state.eskf_filter_divergence  source=vio  mahalanobis_sq=109.76467866548009  threshold_sq=100.0
replay_loop.state_add_vio_fatal  frame=3  EstimatorFatalError('eskf filter divergence on vio: mahalanobis²=109.765 > 100.0')

(last_added_ts_ns differs from Step 11's value because Jetson uptime grew 2 days — the gap between monotonic_ns and FC-boot-relative timestamps scales with uptime per AZ-848 root cause; the IMU ts_ns is byte-identical (FC-boot-relative). Both confirm AZ-848's mechanism.)
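The arithmetic behind that note can be made explicit. This is a sketch, assuming (as the note describes) that `ts_ns` is FC-boot-relative and `last_added_ts_ns` is on the Jetson's monotonic clock; the numbers are copied from the log lines above, and the variable names are illustrative only.

```python
NS_PER_S = 1_000_000_000
S_PER_DAY = 86_400

# Values copied from the c5.state.eskf_out_of_order log line.
ts_ns = 187_370_418_000                   # assumed FC-boot-relative
last_added_ts_ns = 1_362_268_944_997_999  # assumed host monotonic clock

# Values copied from the c5.state.eskf_filter_divergence log line.
mahalanobis_sq = 109.76467866548009
threshold_sq = 100.0

fc_seconds = ts_ns / NS_PER_S                       # seconds since FC boot
host_days = last_added_ts_ns / NS_PER_S / S_PER_DAY  # days of host uptime
gap_ns = last_added_ts_ns - ts_ns                   # offset between timebases

# A sample stamped "earlier" than the last one is flagged out of order.
out_of_order = ts_ns < last_added_ts_ns
# The gate trips when the squared Mahalanobis distance exceeds the threshold.
diverged = mahalanobis_sq > threshold_sq

print(f"FC time: {fc_seconds:.3f} s, host clock: {host_days:.2f} days")
print(f"out_of_order={out_of_order}, diverged={diverged}")
```

The host-side value works out to a little under 16 days, which is consistent with the 15-day uptime recorded for jetson-e2e, while the FC-side value is about three minutes after FC boot.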

Smoke Test (Phase 4)

The bench-test compose IS the smoke set (per Phase 2 — bench-test strategy collapses Execute and Smoke into one harness invocation). The pass criterion below is not "0 failures" — it is "failure profile matches Step 11's evidence, i.e. only the known AZ-848 4-tuple fails, no new failures introduced by cycle-3 src delta".

  • Mode: same harness as Step 11 closeout (rsync + docker compose --abort-on-container-exit --exit-code-from e2e-runner up)
  • Start: 2026-05-26 14:44:31 EEST (e2e-runner container started; test session starts line)
  • End: 2026-05-26 14:50:06 EEST (5m 35s pytest wall clock)

| Test | Step 11 (2026-05-24) | This run (2026-05-26) | Verdict |
|---|---|---|---|
| tests/e2e/replay/test_derkachi_1min.py::test_ac1_exits_0_jsonl_count_match | FAIL (AZ-848 frame-3 ESKF divergence) | FAIL (same root cause; same frame; same mahalanobis²=109.765) | Match; AZ-848 carries forward |
| tests/e2e/replay/test_derkachi_1min.py::test_ac5_determinism_two_runs_diff | FAIL (same root cause) | FAIL (same root cause) | Match |
| tests/e2e/replay/test_derkachi_1min.py::test_ac6_pace_realtime_60s_within_5pct | FAIL (same root cause) | FAIL (same root cause) | Match |
| tests/e2e/replay/test_derkachi_1min.py::test_ac6_pace_asap_under_30s | FAIL (same root cause) | FAIL (same root cause) | Match |
| tests/e2e/replay/test_derkachi_1min.py::test_ac3_within_100m_80pct_of_ticks | XPASS (vacuous: binary exits 1 before emissions) | XPASS (same vacuous; same explanation in short-summary) | Match |
| Remaining 48 cases | PASS | PASS (all 48) | Match; no new regressions |
| Skipped (3) | env-gated (legitimate) | SKIPPED: same three (AZ-839 operator_pre_flight_setup × 2; AC-8 mock-suite-sat-service incomplete) | Match |
| xfailed (1) | known xfail (AZ-699 / AZ-776+AZ-777) | XFAIL: same test, same upstream-gap explanation | Match |

Smoke verdict pass condition: met. Totals = 4 failed, 48 passed, 3 skipped, 1 xfailed, 1 xpassed and the 4 failure IDs are byte-identical to Step 11's IDs.
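The pass condition is a per-test comparison, not a count comparison. A minimal sketch of that check follows; `diff_outcomes` is a hypothetical helper, the five named test IDs and outcomes come from the Phase 4 comparison in this report, and the extra `tests/unit/test_example.py::test_hypothetical` entry is invented purely to show what a new regression would look like.

```python
PREFIX = "tests/e2e/replay/test_derkachi_1min.py::"

# Outcomes of the named tests at Step 11, as recorded in this report.
step11 = {
    PREFIX + "test_ac1_exits_0_jsonl_count_match": "FAIL",
    PREFIX + "test_ac5_determinism_two_runs_diff": "FAIL",
    PREFIX + "test_ac6_pace_realtime_60s_within_5pct": "FAIL",
    PREFIX + "test_ac6_pace_asap_under_30s": "FAIL",
    PREFIX + "test_ac3_within_100m_80pct_of_ticks": "XPASS",
}
# This run reported the same outcome for every named test.
this_run = dict(step11)


def diff_outcomes(baseline: dict, current: dict) -> dict:
    """Return {test_id: (baseline_outcome, current_outcome)} for every mismatch."""
    all_ids = sorted(set(baseline) | set(current))
    return {
        test_id: (baseline.get(test_id), current.get(test_id))
        for test_id in all_ids
        if baseline.get(test_id) != current.get(test_id)
    }


matches_step11 = diff_outcomes(step11, this_run) == {}

# Hypothetical counter-example: a test that passed before and fails now.
HYPOTHETICAL = "tests/unit/test_example.py::test_hypothetical"
regressed = diff_outcomes({**step11, HYPOTHETICAL: "PASS"},
                          {**this_run, HYPOTHETICAL: "FAIL"})

print(matches_step11, regressed)
```

An empty diff corresponds to the "Match" verdict on every row; any non-empty diff would need triage before the bench run could be called a match.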

Watch Window (Phase 5)

  • Duration: not applicable — bench test, no live traffic, no observability backend in scope.
  • Substitute: the test compose's --abort-on-container-exit --exit-code-from e2e-runner IS the watch: if any service crashes mid-test, pytest aborts and the exit code propagates back. The duration of the bench run (5m 35s of pytest within a 7m 35s harness run) acts as the de-facto watch.
  • This is explicitly recorded per release/SKILL.md Phase 5: "If the user explicitly demands skipping (e.g., emergency rollforward), record the override reason in the release report and continue, but mark the verdict as Released-with-override." Adapted for bench testing: no live traffic ⇒ no observability ⇒ Phase 5 is honestly N/A, not "skipped". Verdict will be Released (or Aborted), not Released-with-override.

Commit or Rollback (Phase 6)

Released

  • Tracker tickets in scope stay as they are — they were moved to Done during prior cycle-3 steps (Step 12-15). No new tracker movement triggered by this bench-test release.
  • Git tag: deliberately NOT pushed. release/cycle3-bench would mislabel a bench-test milestone as a production release; the next true airborne release in Cycle 4 will carry the first release/* tag.
  • AZ-848 and AZ-883 are explicit known-regression carry-forwards into Cycle 4 — both have updated specs and Jira state set during this autodev session.
  • Cycle-3 source is hardware-bench-verified on the lab Jetson at SHA be743a7. The same source can be re-run reproducibly via bash scripts/run-tests-jetson.sh against jetson-e2e.
  • Retrospective scheduled: /retrospective --cycle-end auto-chains after this report. Output expected at _docs/06_metrics/retro_cycle3_<timestamp>.md.

Open Risks Carried Into Cycle 4

| Risk | Owner ticket | Severity |
|---|---|---|
| AZ-848: VioOutput.emitted_at_ns contract clashes with FC-IMU timebase; blocks live-flight ESKF on long-uptime Jetson | AZ-848 (5 SP) | High: real airborne release blocked until fixed |
| AZ-883: _handle_imu produces ts_ns=0 for every SCALED_IMU2 message; latent IMU monotonicity violation | AZ-883 (2 SP) | Medium: latent; fix lands before C13 FDR replay tools assume per-sample monotonicity |
| EVIDENCE_OUT default points at container-only path (/e2e-results/evidence); breaks Tier-1 perf tests on the host | _docs/_process_leftovers/2026-05-26_evidence_out_default_path.md | Low: workaround exists (EVIDENCE_OUT="$(pwd)/e2e-results/...") |

Lessons (one-liners)

  • First-release rollback gap is structural, not procedural — the scripts/deploy.sh --rollback path requires .previous-tags.env, which only exists after a successful stop-services.sh run. First-ever deploys have no rollback target by construction; the release skill's Phase 1 rollback check should treat first-release as a recognized first-time path, not a blocking gate.
  • Bench-test "release" is a legitimate milestone but not a production release: the release skill's six-phase pipeline (gate → strategy select → execute → smoke → watch → commit) compresses to three phases for bench testing (rsync+build → harness-as-smoke → commit). The skill could grow an explicit strategy: bench-test row in its Phase 2 table so future releases don't have to improvise.
  • Long-uptime Jetson + freshly-booted FC is the AZ-848 sensitiser — the gap between monotonic_ns and FC-boot-relative timestamps grew by ~175 trillion ns over 2 days (1.187·10¹⁵ → 1.362·10¹⁵). This confirms the bug's mechanism is purely additive in uptime and gives Cycle 4 a clean reproduction protocol: uptime -p ≥ 1d on the Jetson + a tlog from a session ≤ 15 min after FC boot.
  • Cycle-3 src delta size vs. release scope tension: fd52cc9 is a 79-line refactor (+43 / −36 per the Change Summary diffstat); the release machinery exercises full deploy + smoke against it. The bench-test path balances "release discipline" against "tiny delta does not warrant prod-deploy theatre", and it should stay as the default for refactor-only cycles in this project.
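The timebase-gap figures in the AZ-848 lesson above can be sanity-checked, and the proposed reproduction protocol written down as a predicate. This is a sketch: the two gap values are the rounded figures quoted in the lesson, and `is_az848_sensitised` is a hypothetical name with thresholds taken from the protocol as stated (Jetson uptime of at least 1 day, tlog session at most 15 minutes after FC boot).

```python
NS_PER_S = 1_000_000_000
S_PER_DAY = 86_400

# Rounded gap values quoted in the lesson (Step 11 vs. this run).
gap_step11_ns = 1_187 * 10**12    # 1.187e15 ns
gap_this_run_ns = 1_362 * 10**12  # 1.362e15 ns

growth_ns = gap_this_run_ns - gap_step11_ns
growth_days = growth_ns / NS_PER_S / S_PER_DAY


def is_az848_sensitised(jetson_uptime_s: float, fc_session_age_s: float) -> bool:
    """True when the reproduction preconditions from the lesson hold."""
    long_uptime = jetson_uptime_s >= 1 * S_PER_DAY  # uptime of at least 1 day
    fresh_fc = fc_session_age_s <= 15 * 60          # tlog within 15 min of FC boot
    return long_uptime and fresh_fc


print(f"gap grew by {growth_ns:,} ns, about {growth_days:.2f} days")
print(is_az848_sensitised(15 * S_PER_DAY, 187.37))  # this run's rough conditions
print(is_az848_sensitised(3_600, 187.37))           # freshly booted Jetson
```

The growth of 175 trillion ns corresponds to roughly two days, which lines up with the interval between Step 11 (2026-05-24) and this run (2026-05-26).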