[AZ-701] HTTP replay API service (FastAPI + magic-byte upload validation)

New replay_api component: FastAPI service wrapping the offline
gps-denied-replay pipeline. POST tlog+video (multipart) → either
sync 200 with result/map/report URLs, or async 202 + job id with
/jobs/{id} polling. Magic-byte validation, bearer auth, in-memory
JobRegistry with concurrency + queue caps (429 on overflow).

Helper accuracy_report.py promoted from tests/ to src/ because the
API needs the Markdown report writer at runtime; all AZ-699 imports
re-pointed. OpenAPI spec exported to docs.

18/18 unit tests pass (AC-1 sync, AC-2 async, AC-3 state machine,
AC-5 auth, AC-6 health, AC-8 concurrency, AC-9 magic-byte). Full
unit suite: 2251 pass, 86 skip, 1 pre-existing C12 cold-start flake
(unchanged). mypy --strict clean on the new surface.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-20 17:30:26 +03:00
parent b66b68ff76
commit 7d53cef0cf
22 changed files with 2854 additions and 13 deletions
@@ -0,0 +1,234 @@
# HTTP Replay API service
**Task**: AZ-701_http_replay_api_service
**Name**: HTTP API for offline replay (POST tlog+video, return GPS fixes + map URL)
**Description**: New `replay_api` component (FastAPI) wrapping the offline replay pipeline. One primary endpoint `POST /replay` accepts multipart `(tlog + video [+ calibration])` and returns either a synchronous JSONL+summary or an async job id. Returns links to the map artifact rendered by AZ-700.
**Complexity**: 5 points
**Dependencies**: AZ-699, AZ-700
**Component**: replay_api (new component)
**Tracker**: AZ-701
**Epic**: AZ-696
## Problem
The product today has zero HTTP surface. The only ways to invoke the
estimator on a recorded flight are:
1. The airborne binary (real-time MAVLink GPS_INPUT — needs the
aircraft + FC).
2. `gps-denied-replay` CLI (operator workstation, Python install
required).
3. `operator-orchestrator` CLI (Click, pre-flight cache only — does
NOT run the estimator).
External consumers (operator tools, suite web UIs, demo dashboards,
other suite services) cannot validate flights without installing the
full Python stack. The user's pipeline framing explicitly calls for
"part of the api — tlog and video uploading. and emits gps fixes back
to the user."
## Outcome
- A new HTTP service exposes `POST /replay` and the supporting `GET /jobs/{id}*` polling endpoints.
- The service wraps `gps-denied-replay` and AZ-700's map renderer behind a single multipart upload.
- Containerized; runs in `docker-compose.test.yml`; OpenAPI spec is committed.
- Authentication via bearer token. It can be switched off only by an explicit dev-mode setting, in which case the service logs a WARN.
## Scope
### Included
- New component `src/gps_denied_onboard/replay_api/`:
  - `app.py` (FastAPI instance)
  - `handlers.py` (multipart upload, validation)
  - `jobs.py` (sync ≤ 2 min videos / async > 2 min)
  - `storage.py` (temp file lifecycle, cleanup)
  - `interface.py` (`ReplayRunner` Protocol so handlers are decoupled)
  - `errors.py` (custom HTTP error families)
- Endpoints: `POST /replay`, `GET /jobs/{id}`, `GET /jobs/{id}/result`, `GET /jobs/{id}/map`, `GET /healthz`, `GET /readyz`.
- Bearer-token auth: `REPLAY_API_BEARER_TOKEN` env var; explicit dev opt-out via `REPLAY_API_AUTH_REQUIRED=false`.
- Upload size limit + concurrent-job limit, env-configurable.
- New `replay-api` console script (uvicorn entrypoint) in `pyproject.toml`.
- New `docker/replay-api.Dockerfile` + `docker-compose.test.yml` entry.
- OpenAPI spec exported to `_docs/02_document/contracts/replay_api/openapi.yaml`.
- Contract file `_docs/02_document/contracts/replay_api/replay_api_protocol.md` (per shared/api decompose Step 4.5 rule).
- File-upload magic-byte validation for `.tlog` + `.mp4`.
### Excluded
- Web UI (parent-suite concern).
- Persistent job database (in-memory + temp disk is sufficient for v1).
- Multi-node job distribution.
- WebSocket streaming of progress.
## Acceptance Criteria
**AC-1: Sync happy path (short video, dev mode)**
Given `REPLAY_API_AUTH_REQUIRED=false` and a 60 s video
When `POST /replay` runs with multipart `tlog + video`
Then response is 200 with JSONL of GPS fixes + accuracy summary inline
**AC-2: Async happy path (long video)**
Given a > 2-minute video
When `POST /replay` runs
Then response is 202 with `Location: /jobs/{id}` and `{job_id, status_url}`
**AC-3: Job state transitions**
Given an async job
When polled via `GET /jobs/{id}`
Then state transitions `queued → running → done` are observable
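
The transitions in AC-3 can be pictured as a small table-driven state machine. This is a minimal sketch, not the shipped `JobState`: the lowercase string values, the `failed` terminal state at this level, and the `advance` helper are assumptions made for illustration.

```python
from enum import Enum


class JobState(str, Enum):
    # String values are assumed; the contract file defines the real wire format.
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


# Allowed forward transitions. Terminal states have no successors.
ALLOWED_TRANSITIONS = {
    JobState.QUEUED: {JobState.RUNNING},
    JobState.RUNNING: {JobState.DONE, JobState.FAILED},
    JobState.DONE: set(),
    JobState.FAILED: set(),
}


def advance(current: JobState, target: JobState) -> JobState:
    """Return target if the transition is legal, otherwise raise ValueError."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

A poller that records each observed state can then assert that every consecutive pair is a legal transition.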
**AC-4: Result + map served from job id**
Given a `done` job
When `GET /jobs/{id}/result` is called
Then it streams the JSONL; `GET /jobs/{id}/map` returns the HTML map (from AZ-700)
**AC-5: Auth enforced when configured**
Given `REPLAY_API_BEARER_TOKEN=secret`
When `POST /replay` runs without `Authorization: Bearer secret`
Then response is 401
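
One way the AC-5 check could be written, as a sketch only. The function name `check_bearer` and its parameters are hypothetical, and the constant-time comparison is a suggested hardening rather than something this spec mandates.

```python
import hmac
from typing import Optional


def check_bearer(
    authorization: Optional[str],
    expected_token: str,
    auth_required: bool = True,
) -> bool:
    """Return True when the request may proceed, False when it should get a 401."""
    if not auth_required:
        # Dev opt-out (REPLAY_API_AUTH_REQUIRED=false in the spec).
        return True
    if not authorization:
        return False
    scheme, _, presented = authorization.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return False
    # compare_digest avoids leaking the match length through timing.
    return hmac.compare_digest(presented.strip().encode(), expected_token.encode())
```

With `expected_token="secret"`, only a header of the form `Authorization: Bearer secret` passes; a missing header, a wrong token, or a non-bearer scheme all map to 401.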
**AC-6: Health endpoints**
Given the service is up and `gps-denied-replay` console-script is on PATH
When `GET /healthz` and `GET /readyz` are called
Then both return 200
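
A readiness probe along the lines of AC-6 could look like the sketch below. The `readiness` helper is hypothetical. The binary names come from this task's implementation notes, and the 503 status for missing binaries follows the unit-test description there.

```python
import shutil
from typing import Iterable, List, Tuple

# Console scripts the service shells out to, per the implementation notes.
REQUIRED_BINARIES = ("gps-denied-replay", "gps-denied-render-map")


def readiness(binaries: Iterable[str] = REQUIRED_BINARIES) -> Tuple[int, List[str]]:
    """Return (status_code, missing) where status is 200 only if all binaries are on PATH."""
    missing = [name for name in binaries if shutil.which(name) is None]
    return (200 if not missing else 503), missing
```

`/healthz` would stay unconditional (process is alive), while `/readyz` reports whether the service can actually run a job.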
**AC-7: OpenAPI + contract documented**
Given the service is running
When the OpenAPI spec is exported
Then `_docs/02_document/contracts/replay_api/openapi.yaml` is committed; `replay_api_protocol.md` documents the versioning rules
**AC-8: Concurrency limit enforced**
Given `REPLAY_API_MAX_CONCURRENT_JOBS=1`
When 3 jobs are submitted in quick succession
Then exactly 1 is `running`; 2 are `queued`
**AC-9: Magic-byte upload validation**
Given a `POST /replay` with a misnamed `.tlog` (actually a `.zip`)
When the handler validates
Then response is 400 with a clear error
## Non-Functional Requirements
**Performance**
- For a 60 s Derkachi video, sync `POST /replay` returns within the `gps-denied-replay` ASAP-mode wall-clock time plus 5 s of overhead on a Tier-2 Jetson.
**Security**
- Magic-byte file validation; reject anything not matching `.tlog` (MAVLink magic 0xFD/0xFE) or `.mp4` (ftyp).
- Bearer auth is always available and required by default; it is turned off only when `REPLAY_API_AUTH_REQUIRED=false` is set explicitly.
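
The magic-byte rule above can be sketched as two predicates over the first bytes of each upload. The helper names `looks_like_tlog` and `looks_like_mp4` are hypothetical and are not the shipped validators; the byte offsets follow the wording used elsewhere in this task (first byte for MAVLink, `ftyp` within bytes 4-12 for MP4).

```python
MAVLINK_V1_MAGIC = 0xFE
MAVLINK_V2_MAGIC = 0xFD


def looks_like_tlog(head: bytes) -> bool:
    """True when the first byte is a MAVLink v1 or v2 start-of-frame marker."""
    return len(head) >= 1 and head[0] in (MAVLINK_V1_MAGIC, MAVLINK_V2_MAGIC)


def looks_like_mp4(head: bytes) -> bool:
    """True when the `ftyp` box tag appears in bytes 4-12 of the file.

    In the usual ISO BMFF layout the tag sits at offsets 4..7, right after
    the 4-byte box size.
    """
    return len(head) >= 8 and b"ftyp" in head[4:12]


# A zip renamed to .tlog or .mp4 starts with "PK\x03\x04" and fails both checks.
ZIP_HEAD = b"PK\x03\x04" + b"\x00" * 12
```

Only the first dozen bytes need to be read, so the check can run before the rest of the upload is written to disk.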
**Compatibility**
- FastAPI / uvicorn / python-multipart pinned; document version compatibility window.
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Sync POST → 200 + JSONL | Round-trip succeeds with synth fixtures |
| AC-2 | Async POST → 202 + job id | 202 with Location header |
| AC-3 | Job state machine | Transitions observed |
| AC-5 | Missing/wrong bearer → 401 | Strict failure |
| AC-8 | Concurrency limit | 2 of 3 queued |
| AC-9 | Wrong magic bytes → 400 | Clear error |
## Blackbox Tests
| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|------------------------|-------------|-------------------|----------------|
| AC-1, AC-4 | Real derkachi.tlog + video | `curl` round-trip in docker-compose | 200 + JSONL + map HTML | Perf |
| AC-6 | Container up | Health endpoint checks | 200 OK | — |
## Constraints
- FastAPI MUST live in an operator-only build target; ADR-002 binary-exclusion applies. Airborne binary cold-start regression test must remain green.
- New component MUST follow interface-first + constructor-injection (Principle #13 in architecture.md).
- Contract file MUST exist before the endpoint is callable in CI (per decompose Step 4.5 rule).
## Risks & Mitigation
**Risk 1: FastAPI / uvicorn dep weight on airborne binary**
- *Risk*: Adding the API dep to the airborne binary regresses cold-start.
- *Mitigation*: Place `replay_api/` in an operator-only optional-dependencies group; CMake / build-time exclusion enforces the split.
**Risk 2: HTTP timeout on long videos**
- *Risk*: Sync mode + a long video → HTTP timeout.
- *Mitigation*: Async mode triggers automatically above the configured video-length threshold.
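
A sketch of the sync/async decision. It assumes the threshold is expressed as an upload size in bytes, which the unit-test name `..._when_video_exceeds_sync_bytes` suggests but this spec does not state; `choose_mode`, `async_response`, and `sync_max_bytes` are hypothetical names.

```python
from typing import Dict, Tuple


def choose_mode(video_size_bytes: int, sync_max_bytes: int) -> str:
    """Return "sync" for uploads at or under the threshold, otherwise "async"."""
    return "sync" if video_size_bytes <= sync_max_bytes else "async"


def async_response(job_id: str) -> Tuple[int, Dict[str, str], Dict[str, str]]:
    """Build the (status, headers, body) triple described in AC-2."""
    status_url = f"/jobs/{job_id}"
    headers = {"Location": status_url}
    body = {"job_id": job_id, "status_url": status_url}
    return 202, headers, body
```

Deciding from the upload size keeps the handler from having to probe the video duration before it answers.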
**Risk 3: File-upload abuse**
- *Risk*: Malicious uploads (huge files, zip bombs, fake MIME types).
- *Mitigation*: Hard size limit (2 GB default), magic-byte validation, temp-file cleanup, configurable disk quota.
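
The hard size limit can be enforced while the upload is copied into the per-job temp file, so an oversized body is rejected without first landing on disk in full. A minimal sketch; `copy_with_limit` and `UploadTooLarge` are hypothetical names, and the HTTP status the error maps to is left to the contract.

```python
import io
from typing import BinaryIO


class UploadTooLarge(Exception):
    """Raised when an upload exceeds the configured byte limit."""


def copy_with_limit(
    src: BinaryIO,
    dst: BinaryIO,
    max_bytes: int,
    chunk_size: int = 1024 * 1024,
) -> int:
    """Copy src to dst in chunks; raise UploadTooLarge once max_bytes is exceeded."""
    written = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return written
        written += len(chunk)
        if written > max_bytes:
            # The offending chunk is never written to dst.
            raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")
        dst.write(chunk)


# Usage with in-memory streams standing in for the upload and the temp file.
copied = copy_with_limit(io.BytesIO(b"x" * 10), io.BytesIO(), max_bytes=10, chunk_size=4)
```

The caller is still responsible for deleting the partially written temp file when the exception fires.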
## Contract
This task produces the contract at `_docs/02_document/contracts/replay_api/replay_api_protocol.md`.
Consumers MUST read that file — not this task spec — to discover the interface and versioning rules.
## Implementation Notes (Batch 102, Cycle 2)
### Files Changed
**New production code** in `src/gps_denied_onboard/replay_api/`:
- `__init__.py`: public exports (`create_app`, DTOs, error families).
- `errors.py`: `ReplayApiError` hierarchy with stable `error_code` + HTTP `status_code`.
- `interface.py`: `JobState`, `ReplayInputs`, `ReplayJobResult`, `JobSnapshot`, `ReplayRunner` Protocol.
- `storage.py`: per-job temp directory lifecycle (`StorageRoot.allocate_job/release_job/cleanup_all`).
- `jobs.py`: in-memory `JobRegistry` with `max_concurrent` / `max_queued`, `ThreadPoolExecutor` worker pool.
- `handlers.py`: magic-byte validation (`validate_tlog_kind` for MAVLink v1/v2, `validate_video_kind` for MP4 `ftyp`, `validate_calibration_kind` for JSON), size limits, bearer-token extraction.
- `app.py`: `create_app(...)` FastAPI factory + `SubprocessReplayRunner` (shells out to `gps-denied-replay --auto-trim` and `gps-denied-render-map`).
**New CLI entrypoint** in `src/gps_denied_onboard/cli/replay_api_entrypoint.py`:
- `replay-api` console script wired in `pyproject.toml` under the `operator-tools` extra.
- Parses `--host`, `--port`, `--storage-root`, `--reload`; reads `REPLAY_API_*` env knobs.
**Helper promoted from tests** to `src/gps_denied_onboard/helpers/accuracy_report.py`:
- Was `tests/e2e/replay/_report_writer.py` (AZ-699 batch). Promoted because `replay_api` needs it at runtime to produce `accuracy_report.md`. Re-exported from `helpers/__init__.py`. All AZ-699 imports re-pointed.
**Contract**: `_docs/02_document/contracts/replay_api/replay_api_protocol.md` (purpose, invariants, endpoints, error families, env config, versioning).
**OpenAPI spec**: `_docs/02_document/contracts/replay_api/openapi.yaml` (auto-exported from the FastAPI app; checked in alongside the contract doc).
**Docker**: `docker/replay-api.Dockerfile` + `e2e/docker/docker-compose.test.yml` (`replay-api` service, gated behind the `replay-api` profile, with a `replay-api-storage` volume).
**Dependencies** in `pyproject.toml`:
- `operator-tools` extra now also pulls `fastapi>=0.111,<0.120`, `uvicorn>=0.30,<1.0`, `python-multipart>=0.0.9,<1.0`.
- New console script `replay-api`.
**Unit tests** in `tests/unit/replay_api/test_az701_replay_api.py` (18 tests, all passing):
- AC-1 sync: POST → 200 with `result_url`/`map_url`/`report_url`; JSONL + HTML map served from those URLs.
- AC-2 async: large video (> sync threshold) → 202 + `Location: /jobs/{id}`.
- AC-3: job state visible via polling `RUNNING → DONE` and `RUNNING → FAILED`.
- AC-5: missing bearer → 401; correct bearer → 200.
- AC-6: `/healthz` always 200; `/readyz` returns 503 when binaries missing.
- AC-8: third job queued when concurrency limit is 2; 4th rejected with 429.
- AC-9: zip renamed to `.tlog` or `.mp4` → 400 with stable `error_code`.
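
The AC-8 behaviour exercised above (run up to the concurrency cap, queue up to the queue cap, reject the rest) can be sketched without the worker pool. `AdmissionControl` and `QueueFull` are hypothetical names; the real `JobRegistry` also drives a `ThreadPoolExecutor`, which is omitted here. The caller would map `QueueFull` to a 429 response.

```python
import threading
from collections import deque
from typing import Deque, Optional, Set


class QueueFull(Exception):
    """Raised when both the running slots and the queue are full."""


class AdmissionControl:
    def __init__(self, max_concurrent: int, max_queued: int) -> None:
        self.max_concurrent = max_concurrent
        self.max_queued = max_queued
        self._lock = threading.Lock()
        self._running: Set[str] = set()
        self._queued: Deque[str] = deque()

    def submit(self, job_id: str) -> str:
        """Admit a job and return its initial state, or raise QueueFull."""
        with self._lock:
            if len(self._running) < self.max_concurrent:
                self._running.add(job_id)
                return "running"
            if len(self._queued) < self.max_queued:
                self._queued.append(job_id)
                return "queued"
            raise QueueFull(job_id)

    def finish(self, job_id: str) -> Optional[str]:
        """Free a running slot and promote the oldest queued job, if any."""
        with self._lock:
            self._running.discard(job_id)
            if self._queued and len(self._running) < self.max_concurrent:
                promoted = self._queued.popleft()
                self._running.add(promoted)
                return promoted
            return None
```

With `max_concurrent=2` and an assumed `max_queued=1`, the third submission is queued and the fourth is rejected, matching the AC-8 unit test described above.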
### AC Coverage Matrix
| AC | Status | Evidence |
|----|--------|----------|
| AC-1 sync 200 | Done | `test_post_replay_sync_returns_200_with_result_urls` + `test_post_replay_serves_jsonl_and_map_for_done_job` |
| AC-2 async 202 | Done | `test_post_replay_async_returns_202_when_video_exceeds_sync_bytes` |
| AC-3 job state machine | Done | `test_job_state_transitions_observable_via_polling`, `test_failed_runner_marks_job_failed`, `test_result_endpoints_409_when_job_not_done` |
| AC-5 401 on bad bearer | Done | `test_post_replay_returns_401_without_bearer_when_required` + `test_post_replay_accepts_correct_bearer` |
| AC-6 health endpoints | Done | `test_healthz_always_returns_200` + `test_readyz_returns_503_when_binary_missing` |
| AC-8 concurrency cap | Done | `test_concurrency_limit_queues_excess_jobs` + `test_queue_full_returns_429` |
| AC-9 magic-byte rejection | Done | `test_validate_tlog_kind_rejects_zip_renamed_to_tlog`, `test_validate_video_kind_rejects_arbitrary_bytes`, `test_post_replay_rejects_misnamed_zip_as_tlog`, `test_post_replay_rejects_misnamed_zip_as_video` |
### Test Run Summary
- **AZ-701 unit slice**: 18/18 passed (`tests/unit/replay_api/`).
- **Full unit suite**: 2251 passed, 86 skipped, 1 failed (`test_cold_start_under_500ms_p99` — pre-existing C12 CLI flake unrelated to AZ-701; same failure observed in batches 100 and 101).
- **Mypy --strict on AZ-701 surface**: clean (9 source files: `replay_api/*`, `helpers/accuracy_report.py`, `cli/replay_api_entrypoint.py`).
### Design Decisions
- **Subprocess runner, not in-process estimator**: `SubprocessReplayRunner` invokes the existing `gps-denied-replay` console script. Keeps the API a thin transport layer; matches the invariant in the contract that the API does NOT re-implement the pipeline.
- **Pre-allocated job_id**: the handler allocates a job_id, writes uploads into the matching storage dir, then passes the id to `JobRegistry.submit(job_id=...)`. An earlier draft used a separate registry-assigned id and tried to "release-then-resubmit"; that path deleted the dir holding the uploads. Fixed by adding the optional `job_id` parameter.
- **`from __future__ import annotations` deliberately dropped in `app.py`**: FastAPI 0.119 + Pydantic v2 resolve route-parameter annotations at decoration time. Forward-ref strings break `Annotated[UploadFile, File()]`. The rest of the `replay_api` package keeps the future-annotations import. The reason is captured in the `app.py` module docstring.
- **Pydantic v2 `Annotated` syntax**: every route parameter uses `Annotated[T, File()/Form()/Header()]` rather than the legacy `T = File(...)` form. The older form raised `PydanticUserError: 'UploadFile' is not fully defined`.
- **Magic-byte validation is mandatory, not advisory**: matches AC-9 wording ("Wrong magic bytes → 400"). Anything that's not MAVLink v1/v2 (`\xfe` / `\xfd` first byte) is rejected as tlog; anything without `ftyp` in bytes 4-12 is rejected as video. No `application/x-mavlink` content-type sniffing.
- **State is in-memory only**: matches "no persistent state across restarts" invariant in the contract. Operators wanting durability can layer it externally (or move to AZ-702 follow-on). Documented in the contract.
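
The pre-allocated job_id decision can be illustrated with a stripped-down storage class and handler. Only the names `StorageRoot`, `allocate_job`, and `release_job` come from this task; their signatures, the `accept_upload` helper, the `flight.tlog` filename, and the UUID-based id are assumptions for the sketch.

```python
import shutil
import uuid
from pathlib import Path
from typing import Callable, Tuple


class StorageRoot:
    """Per-job temp directory lifecycle (simplified)."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def allocate_job(self) -> Tuple[str, Path]:
        # The id is minted here, before the registry ever sees the job.
        job_id = uuid.uuid4().hex
        job_dir = self.root / job_id
        job_dir.mkdir(parents=True)
        return job_id, job_dir

    def release_job(self, job_id: str) -> None:
        shutil.rmtree(self.root / job_id, ignore_errors=True)


def accept_upload(
    storage: StorageRoot,
    registry_submit: Callable[..., None],
    tlog_bytes: bytes,
) -> str:
    """Write the upload under a fresh job dir, then submit with that same id."""
    job_id, job_dir = storage.allocate_job()
    (job_dir / "flight.tlog").write_bytes(tlog_bytes)
    # Passing the id through means the registry never needs to release and
    # re-create the directory that already holds the uploads.
    registry_submit(job_id=job_id)
    return job_id
```

Because the directory and the registry entry share one id from the start, there is no window in which a release call can delete files that a queued job still needs.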
### Known Limitations
- `SubprocessReplayRunner` returns `result.stdout`/`stderr` only when the subprocess fails; success path discards them. Operators wanting a per-job audit log will need a follow-on.
- No request body streaming — `python-multipart` buffers each part. The 2 GB hard limit guards memory.
- No rate limiting beyond the concurrency/queue caps. A reverse proxy is the right place for that.
- E2E test against the real Derkachi flight artefacts is intentionally NOT in scope here (per the testing-environment rule: e2e runs on Jetson only and AZ-699's `test_derkachi_real_tlog.py` already exercises the underlying pipeline).