[AZ-701] HTTP replay API service (FastAPI + magic-byte upload validation)

New replay_api component: FastAPI service wrapping the offline
gps-denied-replay pipeline. POST tlog+video (multipart) → either
sync 200 with result/map/report URLs, or async 202 + job id with
/jobs/{id} polling. Magic-byte validation, bearer auth, in-memory
JobRegistry with concurrency + queue caps (429 on overflow).

Helper accuracy_report.py promoted from tests/ to src/ because the
API needs the Markdown report writer at runtime; all AZ-699 imports
re-pointed. OpenAPI spec exported to docs.

18/18 unit tests pass (AC-1 sync, AC-2 async, AC-3 state machine,
AC-5 auth, AC-6 health, AC-8 concurrency, AC-9 magic-byte). Full
unit suite: 2251 pass, 86 skip, 1 pre-existing C12 cold-start flake
(unchanged). mypy --strict clean on the new surface.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-20 17:30:26 +03:00
parent b66b68ff76
commit 7d53cef0cf
22 changed files with 2854 additions and 13 deletions
@@ -0,0 +1,146 @@
# Batch 102 — Cycle 2 — AZ-701
**Date**: 2026-05-20
**Tasks**: AZ-701 (HTTP replay API service).
**Story points**: 5.
**Jira status**: AZ-701 → `In Testing`.
## What shipped
A new operator-side `replay_api` component — a FastAPI service that
wraps the offline `gps-denied-replay` pipeline behind HTTP. Operators
can POST a multipart `(tlog + video [+ calibration])` payload and
receive back either a synchronous result (small flights) or a
202-job-id for polling (large flights). Once a job completes, the
JSONL emissions, the AZ-700 HTML map, and the AZ-699 accuracy
report are served as static files under stable URLs.
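
From the client side, the async path amounts to polling `/jobs/{id}` until the job settles. The following is a minimal sketch of such a loop, not the service's own client: the `state` field name, the `queued` / `running` / `done` / `failed` values, and the `poll_job` helper are all assumptions made for illustration. The HTTP call is injected as a callable so the loop can be exercised without a running server.

```python
import time
from typing import Callable, Mapping

# Assumed terminal state names; the real values live in the JobState enum.
TERMINAL_STATES = frozenset({"done", "failed"})


def poll_job(
    fetch_status: Callable[[], Mapping[str, str]],
    interval_s: float = 1.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, str]:
    """Call fetch_status until the job reaches a terminal state or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status["state"] in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval_s)


# Usage with canned responses standing in for GET /jobs/{id}.
responses = iter([{"state": "queued"}, {"state": "running"}, {"state": "done"}])
final = poll_job(lambda: next(responses), sleep=lambda _s: None)
```

In a real client, `fetch_status` would wrap an authenticated GET against the job URL returned with the 202 response.
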
Estimator code is unchanged — the service shells out to the existing
`gps-denied-replay` and `gps-denied-render-map` console scripts. The
contract explicitly forbids re-implementing the pipeline in the API
layer.
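
The runner seam described here can be pictured roughly as follows. This is a sketch under assumptions rather than the contents of `interface.py` or `app.py`: the `run(job_dir)` signature, the `SubprocessRunner` and `FakeRunner` classes, and the `result.jsonl` file name are hypothetical, and the command lines are passed in pre-built so that no CLI flags are invented.

```python
from __future__ import annotations

import subprocess
import tempfile
from pathlib import Path
from typing import Protocol, Sequence


class ReplayRunner(Protocol):
    """DI seam: anything that can turn uploaded inputs into result files."""

    def run(self, job_dir: Path) -> None: ...


class SubprocessRunner:
    """Runs pre-built command lines in order; stops at the first failure."""

    def __init__(self, commands: Sequence[Sequence[str]]) -> None:
        self._commands = [list(c) for c in commands]

    def run(self, job_dir: Path) -> None:
        for argv in self._commands:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(argv, cwd=job_dir, check=True, capture_output=True)


class FakeRunner:
    """Test double: writes a placeholder result instead of spawning anything."""

    def run(self, job_dir: Path) -> None:
        (job_dir / "result.jsonl").write_text('{"fake": true}\n')


def execute(runner: ReplayRunner, job_dir: Path) -> list[str]:
    """Run a job through whichever runner was injected and list its outputs."""
    runner.run(job_dir)
    return sorted(p.name for p in job_dir.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    produced = execute(FakeRunner(), Path(tmp))
```

Because both classes satisfy the Protocol structurally, tests can inject the fake without subclassing or monkeypatching.
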
Bearer-token auth is on by default (configurable env var), magic-byte
validation rejects misnamed uploads at the door, and a thread-pool
worker enforces a `max_concurrent` / `max_queued` cap with a 429 on
overflow.
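
One way to get the cap-and-reject behaviour is to count in-flight jobs in front of a `ThreadPoolExecutor`. The sketch below illustrates that idea only; `BoundedJobRegistry`, `QueueFullError`, and the single combined counter are assumptions, not the actual `JobRegistry` in `jobs.py`.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable


class QueueFullError(Exception):
    """Raised on overflow; an HTTP layer would map this to a 429 response."""


class BoundedJobRegistry:
    def __init__(self, max_concurrent: int, max_queued: int) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent)
        self._capacity = max_concurrent + max_queued
        self._in_flight = 0  # jobs that are running or waiting in the queue
        self._lock = threading.Lock()

    def submit(self, work: Callable[[], object]) -> "Future[object]":
        with self._lock:
            if self._in_flight >= self._capacity:
                raise QueueFullError("replay queue is full")
            self._in_flight += 1
        try:
            future = self._pool.submit(work)
        except BaseException:
            self._release()
            raise
        # Free the slot whether the job succeeds or raises.
        future.add_done_callback(lambda _f: self._release())
        return future

    def _release(self) -> None:
        with self._lock:
            self._in_flight -= 1

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


# Usage: one worker plus one queue slot, so the third submission overflows.
gate = threading.Event()
registry = BoundedJobRegistry(max_concurrent=1, max_queued=1)
first = registry.submit(gate.wait)   # occupies the single worker
second = registry.submit(gate.wait)  # waits in the queue
try:
    registry.submit(gate.wait)
    overflowed = False
except QueueFullError:
    overflowed = True
gate.set()  # let the blocked jobs finish
registry.shutdown()
```

Rejecting before the job reaches the executor keeps the executor's internal queue bounded, since `ThreadPoolExecutor` itself accepts unlimited submissions.
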
## Files changed
Production (10):
- `src/gps_denied_onboard/replay_api/__init__.py` (new)
- `src/gps_denied_onboard/replay_api/errors.py` (new — typed HTTP
error families with stable `error_code` strings)
- `src/gps_denied_onboard/replay_api/interface.py` (new — DTOs,
`JobState` enum, `ReplayRunner` Protocol seam for DI)
- `src/gps_denied_onboard/replay_api/storage.py` (new — per-job
temp-dir lifecycle)
- `src/gps_denied_onboard/replay_api/jobs.py` (new — `JobRegistry`
with concurrency/queue limits and `ThreadPoolExecutor`)
- `src/gps_denied_onboard/replay_api/handlers.py` (new — magic-byte
validation, size limits, bearer-token extraction)
- `src/gps_denied_onboard/replay_api/app.py` (new — FastAPI factory
and `SubprocessReplayRunner`)
- `src/gps_denied_onboard/cli/replay_api_entrypoint.py` (new —
`replay-api` console script)
- `src/gps_denied_onboard/helpers/accuracy_report.py` (promoted from
`tests/e2e/replay/_report_writer.py`; needed at runtime by the API)
- `src/gps_denied_onboard/helpers/__init__.py` (re-exports)
Tests (1):
- `tests/unit/replay_api/test_az701_replay_api.py` (18 tests, all pass locally)
Docs / contract (2):
- `_docs/02_document/contracts/replay_api/replay_api_protocol.md` (new)
- `_docs/02_document/contracts/replay_api/openapi.yaml` (new — exported
from the running FastAPI app)
Docker / CI (2):
- `docker/replay-api.Dockerfile` (new)
- `e2e/docker/docker-compose.test.yml` (added `replay-api` service +
`replay-api-storage` volume, profile `replay-api`)
Build / packaging (1):
- `pyproject.toml` (`fastapi`, `uvicorn`, `python-multipart` added to
`[operator-tools]` extra; `replay-api` console script registered)
Imports re-pointed to the promoted helper, plus the removed original (3):
- `tests/e2e/replay/test_derkachi_real_tlog.py`
- `tests/unit/test_az699_report_writer.py`
- `tests/e2e/replay/_report_writer.py` (deleted; replaced by the
promoted production module)
## AC coverage
| AC | Status | Evidence |
|----|--------|----------|
| AC-1 sync 200 | Pass | `test_post_replay_sync_returns_200_with_result_urls` + `test_post_replay_serves_jsonl_and_map_for_done_job` |
| AC-2 async 202 | Pass | `test_post_replay_async_returns_202_when_video_exceeds_sync_bytes` |
| AC-3 job state | Pass | `test_job_state_transitions_observable_via_polling`, `test_failed_runner_marks_job_failed`, `test_result_endpoints_409_when_job_not_done` |
| AC-5 401 unauth | Pass | `test_post_replay_returns_401_without_bearer_when_required`, `test_post_replay_accepts_correct_bearer` |
| AC-6 health | Pass | `test_healthz_always_returns_200`, `test_readyz_returns_503_when_binary_missing` |
| AC-8 concurrency | Pass | `test_concurrency_limit_queues_excess_jobs`, `test_queue_full_returns_429` |
| AC-9 magic-byte | Pass | 4 tests covering tlog + video validators (unit) and end-to-end POST rejection |
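
For AC-9, checks of the following shape would be enough to reject a misnamed upload from its first few bytes. This is a hedged sketch, not the validators in `handlers.py`: it assumes the common telemetry-log layout in which every record is an 8-byte timestamp followed by a MAVLink frame (start byte `0xFE` for v1, `0xFD` for v2), and an MP4 whose first box is `ftyp`. The function names are hypothetical.

```python
MAVLINK_V1_STX = 0xFE
MAVLINK_V2_STX = 0xFD
TLOG_TIMESTAMP_BYTES = 8  # assumed: each tlog record starts with a 64-bit timestamp


def looks_like_tlog(head: bytes) -> bool:
    """True if a MAVLink start byte follows the first record's timestamp."""
    if len(head) <= TLOG_TIMESTAMP_BYTES:
        return False
    return head[TLOG_TIMESTAMP_BYTES] in (MAVLINK_V1_STX, MAVLINK_V2_STX)


def looks_like_mp4(head: bytes) -> bool:
    """True if the first box is an ISO BMFF 'ftyp' box (bytes 4 to 7)."""
    return len(head) >= 8 and head[4:8] == b"ftyp"


# Usage: hand-built headers standing in for the start of each upload.
tlog_head = (0).to_bytes(8, "big") + bytes([MAVLINK_V2_STX, 0x09])
mp4_head = b"\x00\x00\x00\x18ftypisom"
```

Only the leading bytes are inspected, so the check can run before the rest of the upload is processed.
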
## Test run summary
- **AZ-701 unit slice** (`tests/unit/replay_api/`): 18/18 passed in 4 s.
- **Full unit suite**: 2251 passed, 86 skipped, 1 failed in 85 s.
  - The single failure is `tests/unit/c12_operator_orchestrator/test_cli_console_script.py::TestConsoleScript::test_cold_start_under_500ms_p99`, a pre-existing C12 CLI cold-start performance flake. AZ-701 does not touch C12, and the same failure appears in the batch 100 and batch 101 reports. Non-blocking for AZ-701.
- **Mypy --strict** on AZ-701 surface (`src/gps_denied_onboard/replay_api/`, `helpers/accuracy_report.py`, `cli/replay_api_entrypoint.py`): clean — 9 source files, 0 errors.
## Strict typing
All new modules in `src/gps_denied_onboard/replay_api/*` are
strict-typed (no implicit `Any`, no untyped defs, no untyped
decorators). The `folium`-style untyped-third-party shim is not
needed here: FastAPI, Pydantic, uvicorn, and python-multipart all
ship type information that `mypy --strict` accepts.
## Notable design decisions
- **Subprocess runner, not in-process estimator.** The contract
invariant is "the API layer does NOT re-implement the pipeline."
`SubprocessReplayRunner` shells out to `gps-denied-replay
--auto-trim` and then `gps-denied-render-map`. Easy to swap for a
fake in tests via the `ReplayRunner` Protocol DI seam.
- **Magic-byte validation is mandatory (AC-9).** Misnamed `.tlog`
/ `.mp4` payloads are rejected at the door with a stable
`error_code`. No content-type sniffing fallback.
- **Bearer auth is opt-out, not opt-in.** Default state of the
service is "auth required, token missing → 503 at startup"
unless the operator explicitly sets `REPLAY_API_AUTH_REQUIRED=false`
for a dev environment.
- **In-memory state by design.** The contract says "no persistent
state across restarts" — jobs don't survive a process restart and
the storage root is wiped on shutdown. Operators wanting durability
must layer it externally.
- **`from __future__ import annotations` dropped in `app.py` only.**
FastAPI 0.119 + Pydantic v2 resolve route-parameter annotations
at decoration time and reject forward-ref strings. The rest of the
`replay_api` package keeps the future-annotations import. The
reason is recorded in `app.py`'s module docstring.
- **`_report_writer.py` was promoted from `tests/` to `src/`.** The
API needs to produce the AZ-699 Markdown accuracy report at
runtime; that module was previously test-only. All AZ-699 imports
re-pointed to `gps_denied_onboard.helpers.accuracy_report`.
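
The opt-out auth decision above can be illustrated with a framework-free check. This is a sketch under assumptions, not the code in `handlers.py`: `extract_bearer_token` and `is_authorized` are hypothetical names, and the constant-time comparison is a common precaution rather than something the report states.

```python
import hmac
from typing import Optional


def extract_bearer_token(authorization: Optional[str]) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer <token>' value, else None."""
    if not authorization:
        return None
    scheme, _, token = authorization.strip().partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return None
    return token


def is_authorized(
    authorization: Optional[str],
    expected_token: str,
    auth_required: bool = True,
) -> bool:
    """Auth is on unless explicitly disabled; a failed check maps to a 401."""
    if not auth_required:
        return True
    token = extract_bearer_token(authorization)
    if token is None:
        return False
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(token.encode(), expected_token.encode())
```

With `auth_required` defaulting to `True`, a caller has to opt out deliberately, which mirrors the `REPLAY_API_AUTH_REQUIRED=false` switch described in the list.
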
## Known limitations carried forward
- No request-body streaming — `python-multipart` buffers each part.
Hard 2 GB cap guards memory.
- No rate limiting beyond `max_concurrent` / `max_queued`. A reverse
proxy is the right layer for that.
- `SubprocessReplayRunner` discards stdout/stderr on the success
path; operators wanting per-job audit logs need a follow-on.
- The Derkachi real-flight e2e test (AZ-699's
`test_derkachi_real_tlog.py`) already exercises the underlying
pipeline. A dedicated end-to-end `replay_api` test against real
artefacts is **not** in scope here per the testing-environment
policy (e2e → Jetson only).