mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 09:31:14 +00:00
[AZ-701] HTTP replay API service (FastAPI + magic-byte upload validation)
ci/woodpecker/push/02-build-push Pipeline failed
New replay_api component: FastAPI service wrapping the offline
gps-denied-replay pipeline. POST tlog+video (multipart) → either
sync 200 with result/map/report URLs, or async 202 + job id with
/jobs/{id} polling. Magic-byte validation, bearer auth, in-memory
JobRegistry with concurrency + queue caps (429 on overflow).

Helper accuracy_report.py promoted from tests/ to src/ because the
API needs the Markdown report writer at runtime; all AZ-699 imports
re-pointed. OpenAPI spec exported to docs.

18/18 unit tests pass (AC-1 sync, AC-2 async, AC-3 state machine,
AC-5 auth, AC-6 health, AC-8 concurrency, AC-9 magic-byte). Full
unit suite: 2251 pass, 86 skip, 1 pre-existing C12 cold-start flake
(unchanged). mypy --strict clean on the new surface.

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -0,0 +1,234 @@
# HTTP Replay API service

**Task**: AZ-701_http_replay_api_service
**Name**: HTTP API for offline replay (POST tlog+video, return GPS fixes + map URL)
**Description**: New `replay_api` component (FastAPI) wrapping the offline replay pipeline. One primary endpoint `POST /replay` accepts multipart `(tlog + video [+ calibration])` and returns either a synchronous JSONL+summary or an async job id. Returns links to the map artifact rendered by AZ-700.
**Complexity**: 5 points
**Dependencies**: AZ-699, AZ-700
**Component**: replay_api (new component)
**Tracker**: AZ-701
**Epic**: AZ-696

## Problem

The product today has zero HTTP surface. The only ways to invoke the
estimator on a recorded flight are:
1. The airborne binary (real-time MAVLink GPS_INPUT — needs the
   aircraft + FC).
2. `gps-denied-replay` CLI (operator workstation, Python install
   required).
3. `operator-orchestrator` CLI (Click, pre-flight cache only — does
   NOT run the estimator).

External consumers (operator tools, suite web UIs, demo dashboards,
other suite services) cannot validate flights without installing the
full Python stack. The user's pipeline framing explicitly calls for
"part of the api — tlog and video uploading. and emits gps fixes back
to the user."

## Outcome

- A new HTTP service exposes `POST /replay` and the supporting `GET /jobs/{id}*` polling endpoints.
- The service wraps `gps-denied-replay` and AZ-700's map renderer behind a single multipart upload.
- Containerized; runs in `docker-compose.test.yml`; OpenAPI spec is committed.
- Authentication uses a bearer token; it can only be switched off explicitly for dev mode, and doing so logs a WARN.

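As a rough client-side illustration of the polling flow listed above, the sketch below waits until a job reaches a terminal state. `wait_for_job`, the injected `fetch_status` callable, and the `state` response field are hypothetical names chosen for this example; the real response schema lives in the contract file, not in this spec. The `failed` state is taken from the implementation notes further down.

```python
import time
from typing import Callable, List, Mapping

TERMINAL_STATES = {"done", "failed"}


def wait_for_job(
    fetch_status: Callable[[str], Mapping[str, str]],
    status_url: str,
    interval_s: float = 1.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll status_url until the job is terminal; return the distinct states seen."""
    seen: List[str] = []
    deadline = time.monotonic() + timeout_s
    while True:
        # fetch_status is expected to GET the URL and return the decoded JSON body.
        state = fetch_status(status_url)["state"]
        if not seen or seen[-1] != state:
            seen.append(state)
        if state in TERMINAL_STATES:
            return seen
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job at {status_url} still {state!r} after {timeout_s} s")
        sleep(interval_s)
```

Injecting `fetch_status` keeps the loop independent of any particular HTTP client and makes it easy to exercise with canned responses.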
## Scope

### Included
- New component `src/gps_denied_onboard/replay_api/`:
  - `app.py` (FastAPI instance)
  - `handlers.py` (multipart upload, validation)
  - `jobs.py` (sync ≤ 2 min videos / async > 2 min)
  - `storage.py` (temp file lifecycle, cleanup)
  - `interface.py` (`ReplayRunner` Protocol so handlers are decoupled)
  - `errors.py` (custom HTTP error families)
- Endpoints: `POST /replay`, `GET /jobs/{id}`, `GET /jobs/{id}/result`, `GET /jobs/{id}/map`, `GET /healthz`, `GET /readyz`.
- Bearer-token auth: `REPLAY_API_BEARER_TOKEN` env var; explicit dev opt-out via `REPLAY_API_AUTH_REQUIRED=false`.
- Upload size limit + concurrent-job limit, env-configurable.
- New `replay-api` console script (uvicorn entrypoint) in `pyproject.toml`.
- New `docker/replay-api.Dockerfile` + `docker-compose.test.yml` entry.
- OpenAPI spec exported to `_docs/02_document/contracts/replay_api/openapi.yaml`.
- Contract file `_docs/02_document/contracts/replay_api/replay_api_protocol.md` (per shared/api decompose Step 4.5 rule).
- File-upload magic-byte validation for `.tlog` + `.mp4`.

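The bearer-token rule in the list above could be checked roughly as follows. This is a minimal sketch, not the service's actual code: `is_authorized` is a hypothetical helper, and only the two environment variable names come from this spec. It assumes auth stays on unless `REPLAY_API_AUTH_REQUIRED` is literally `false`.

```python
import hmac
import logging
from typing import Mapping, Optional

log = logging.getLogger("replay_api.sketch")


def is_authorized(authorization: Optional[str], env: Mapping[str, str]) -> bool:
    """Return True when a request may proceed (hypothetical helper)."""
    # Auth stays on unless the operator opts out explicitly.
    if env.get("REPLAY_API_AUTH_REQUIRED", "true").strip().lower() == "false":
        log.warning("bearer auth disabled via REPLAY_API_AUTH_REQUIRED=false")
        return True
    expected = env.get("REPLAY_API_BEARER_TOKEN", "")
    if not expected or authorization is None:
        return False
    scheme, _, presented = authorization.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking token contents through timing.
    return hmac.compare_digest(presented.strip().encode(), expected.encode())
```

A handler would map a `False` result to a 401 response; passing `os.environ` as `env` wires it to the real process environment.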
### Excluded
- Web UI (parent-suite concern).
- Persistent job database (in-memory + temp disk is sufficient for v1).
- Multi-node job distribution.
- WebSocket streaming of progress.

## Acceptance Criteria

**AC-1: Sync happy path (short video, dev mode)**
Given `REPLAY_API_AUTH_REQUIRED=false` and a 60 s video
When `POST /replay` runs with multipart `tlog + video`
Then response is 200 with JSONL of GPS fixes + accuracy summary inline

**AC-2: Async happy path (long video)**
Given a > 2-minute video
When `POST /replay` runs
Then response is 202 with `Location: /jobs/{id}` and `{job_id, status_url}`

**AC-3: Job state transitions**
Given an async job
When polled via `GET /jobs/{id}`
Then state transitions `queued → running → done` are observable

**AC-4: Result + map served from job id**
Given a `done` job
When `GET /jobs/{id}/result` is called
Then it streams the JSONL; `GET /jobs/{id}/map` returns the HTML map (from AZ-700)

**AC-5: Auth enforced when configured**
Given `REPLAY_API_BEARER_TOKEN=secret`
When `POST /replay` runs without `Authorization: Bearer secret`
Then response is 401

**AC-6: Health endpoints**
Given the service is up and `gps-denied-replay` console-script is on PATH
When `GET /healthz` and `GET /readyz` are called
Then both return 200

**AC-7: OpenAPI + contract documented**
Given the service is running
When the OpenAPI spec is exported
Then `_docs/02_document/contracts/replay_api/openapi.yaml` is committed; `replay_api_protocol.md` documents the versioning rules

**AC-8: Concurrency limit enforced**
Given `REPLAY_API_MAX_CONCURRENT_JOBS=1`
When 3 jobs are submitted in quick succession
Then exactly 1 is `running`; 2 are `queued`

**AC-9: Magic-byte upload validation**
Given a `POST /replay` with a misnamed `.tlog` (actually a `.zip`)
When the handler validates
Then response is 400 with a clear error

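For AC-6, a readiness probe could be sketched as a PATH lookup for the console scripts the service shells out to. The function name `readiness` and the response body shape are assumptions made for this example; the two tool names appear in the implementation notes, and the 503 on missing binaries matches the unit-test description there.

```python
import shutil
from typing import Callable, Dict, List, Optional, Sequence, Tuple

REQUIRED_TOOLS = ("gps-denied-replay", "gps-denied-render-map")


def readiness(
    tools: Sequence[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Tuple[int, Dict[str, object]]:
    """Return (http_status, body): 200 when every tool is on PATH, else 503."""
    missing: List[str] = [tool for tool in tools if which(tool) is None]
    status = 503 if missing else 200
    return status, {"ready": not missing, "missing": missing}
```

`/healthz` would stay a constant 200, since it only reports that the process is alive; `/readyz` is the endpoint that depends on the environment.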
## Non-Functional Requirements

**Performance**
- For a 60 s Derkachi video, sync `POST /replay` returns within the `gps-denied-replay` ASAP-mode wall-clock time plus 5 s of overhead on a Tier-2 Jetson.

**Security**
- Magic-byte file validation; reject anything not matching `.tlog` (MAVLink magic 0xFD/0xFE) or `.mp4` (ftyp).
- Bearer auth is always available; it is disabled only when `REPLAY_API_AUTH_REQUIRED=false` is set explicitly.

**Compatibility**
- Pin FastAPI / uvicorn / python-multipart and document the version compatibility window.

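The magic-byte rule in the Security requirement can be illustrated with a few lines of standard-library Python. The helper names `looks_like_tlog` and `looks_like_mp4` are hypothetical (the implementation notes name the real validators differently), and the checks follow only what this spec states: a first byte of 0xFE or 0xFD for the tlog, and an `ftyp` box type at byte offset 4 for the MP4.

```python
MAVLINK_V1_MAGIC = 0xFE
MAVLINK_V2_MAGIC = 0xFD


def looks_like_tlog(head: bytes) -> bool:
    """True when the first byte is a MAVLink v1 or v2 start-of-frame marker."""
    return len(head) >= 1 and head[0] in (MAVLINK_V1_MAGIC, MAVLINK_V2_MAGIC)


def looks_like_mp4(head: bytes) -> bool:
    """True when bytes 4..8 hold the ISO base media 'ftyp' box type."""
    return len(head) >= 8 and head[4:8] == b"ftyp"
```

Both helpers only need the first few bytes of the upload, so a handler can reject a renamed `.zip` before writing the rest of the file to disk.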
## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Sync POST → 200 + JSONL | Round-trip succeeds with synth fixtures |
| AC-2 | Async POST → 202 + job id | 202 with Location header |
| AC-3 | Job state machine | Transitions observed |
| AC-5 | Missing/wrong bearer → 401 | Strict failure |
| AC-8 | Concurrency limit | 2 of 3 queued |
| AC-9 | Wrong magic bytes → 400 | Clear error |

## Blackbox Tests

| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|------------------------|-------------|-------------------|----------------|
| AC-1, AC-4 | Real derkachi.tlog + video | `curl` round-trip in docker-compose | 200 + JSONL + map HTML | Perf |
| AC-6 | Container up | Health endpoint checks | 200 OK | — |

## Constraints

- FastAPI MUST live in an operator-only build target; ADR-002 binary-exclusion applies. Airborne binary cold-start regression test must remain green.
- New component MUST follow interface-first + constructor-injection (Principle #13 in architecture.md).
- Contract file MUST exist before the endpoint is callable in CI (per decompose Step 4.5 rule).

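The interface-first, constructor-injection constraint can be pictured with a small `typing.Protocol` example. Everything named here (`RunnerLike`, `Inputs`, `UploadHandler`, `FakeRunner`, and the `run` signature) is hypothetical and chosen for illustration; the actual `ReplayRunner` Protocol is defined in `interface.py` and its signature is not shown in this spec.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Protocol


@dataclass(frozen=True)
class Inputs:
    tlog: Path
    video: Path
    calibration: Optional[Path] = None


class RunnerLike(Protocol):
    """Structural interface: anything with a matching run() satisfies it."""

    def run(self, inputs: Inputs, out_dir: Path) -> Path: ...


class UploadHandler:
    def __init__(self, runner: RunnerLike) -> None:
        # The dependency arrives through the constructor, never via a global.
        self._runner = runner

    def handle(self, inputs: Inputs, out_dir: Path) -> Path:
        return self._runner.run(inputs, out_dir)


class FakeRunner:
    """Test double: records calls instead of launching the pipeline."""

    def __init__(self) -> None:
        self.calls: List[Inputs] = []

    def run(self, inputs: Inputs, out_dir: Path) -> Path:
        self.calls.append(inputs)
        return out_dir / "result.jsonl"
```

Because the handler depends only on the Protocol, unit tests can pass a fake like the one above while production code passes a runner that shells out to the real pipeline.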
## Risks & Mitigation

**Risk 1: FastAPI / uvicorn dep weight on airborne binary**
- *Risk*: Adding the API dep to the airborne binary regresses cold-start.
- *Mitigation*: Place `replay_api/` in an operator-only optional-dependencies group; CMake / build-time exclusion enforces the split.

**Risk 2: HTTP timeout on long videos**
- *Risk*: Sync mode + a long video → HTTP timeout.
- *Mitigation*: Async mode triggers automatically above the configured video-length threshold.

**Risk 3: File-upload abuse**
- *Risk*: Malicious uploads (huge files, zip bombs, fake MIME types).
- *Mitigation*: Hard size limit (2 GB default), magic-byte validation, temp-file cleanup, configurable disk quota.

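One way to picture the hard size limit from Risk 3 is a chunked copy that aborts as soon as the cap is exceeded. This is a sketch under assumptions: `copy_with_limit` and `UploadTooLarge` are hypothetical names, and the caller is assumed to delete the partial temp file when the exception is raised.

```python
from typing import BinaryIO


class UploadTooLarge(Exception):
    """Raised when an upload exceeds the configured byte limit."""


def copy_with_limit(
    src: BinaryIO, dst: BinaryIO, max_bytes: int, chunk_size: int = 1024 * 1024
) -> int:
    """Copy src to dst in chunks; raise once more than max_bytes has been read."""
    written = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return written
        written += len(chunk)
        if written > max_bytes:
            # Stop before writing the chunk that crosses the limit.
            raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")
        dst.write(chunk)
```

Counting bytes while copying means the limit holds even when the client sends no `Content-Length` header or sends a misleading one.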
## Contract

This task produces the contract at `_docs/02_document/contracts/replay_api/replay_api_protocol.md`.
Consumers MUST read that file — not this task spec — to discover the interface and versioning rules.

## Implementation Notes (Batch 102, Cycle 2)

### Files Changed

**New production code** — `src/gps_denied_onboard/replay_api/`:
- `__init__.py` — public exports (`create_app`, DTOs, error families).
- `errors.py` — `ReplayApiError` hierarchy with stable `error_code` + HTTP `status_code`.
- `interface.py` — `JobState`, `ReplayInputs`, `ReplayJobResult`, `JobSnapshot`, `ReplayRunner` Protocol.
- `storage.py` — per-job temp directory lifecycle (`StorageRoot.allocate_job/release_job/cleanup_all`).
- `jobs.py` — in-memory `JobRegistry` with `max_concurrent` / `max_queued`, `ThreadPoolExecutor` worker pool.
- `handlers.py` — magic-byte validation (`validate_tlog_kind` for MAVLink v1/v2, `validate_video_kind` for MP4 `ftyp`, `validate_calibration_kind` for JSON), size limits, bearer-token extraction.
- `app.py` — `create_app(...)` FastAPI factory + `SubprocessReplayRunner` (shells out to `gps-denied-replay --auto-trim` and `gps-denied-render-map`).

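An error hierarchy with a stable `error_code` and an HTTP `status_code`, as described for `errors.py`, might look like the sketch below. The class names and the `error_code` strings are invented for illustration; only the idea of pairing each family with one code and one status, and the 400/401/429 statuses themselves, come from this document.

```python
from typing import Dict


class SketchApiError(Exception):
    """Base class: every subclass pins one error_code and one HTTP status."""

    error_code = "internal_error"
    status_code = 500

    def to_body(self) -> Dict[str, str]:
        # Clients branch on error_code; detail is for humans and may change.
        return {"error_code": self.error_code, "detail": str(self)}


class InvalidUploadError(SketchApiError):
    error_code = "invalid_upload"
    status_code = 400


class UnauthorizedError(SketchApiError):
    error_code = "unauthorized"
    status_code = 401


class QueueFullError(SketchApiError):
    error_code = "queue_full"
    status_code = 429
```

A single exception handler in the web layer can then turn any subclass into a JSON response using `status_code` and `to_body()`, without the handlers knowing about HTTP.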
**New CLI entrypoint** — `src/gps_denied_onboard/cli/replay_api_entrypoint.py`:
- `replay-api` console script wired in `pyproject.toml` under the `operator-tools` extra.
- Parses `--host`, `--port`, `--storage-root`, `--reload`; reads `REPLAY_API_*` env knobs.

**Helper promoted from tests** — `src/gps_denied_onboard/helpers/accuracy_report.py`:
- Was `tests/e2e/replay/_report_writer.py` (AZ-699 batch). Promoted because `replay_api` needs it at runtime to produce `accuracy_report.md`. Re-exported from `helpers/__init__.py`. All AZ-699 imports re-pointed.

**Contract** — `_docs/02_document/contracts/replay_api/replay_api_protocol.md` (purpose, invariants, endpoints, error families, env config, versioning).

**OpenAPI spec** — `_docs/02_document/contracts/replay_api/openapi.yaml` (auto-exported from the FastAPI app; checked in alongside the contract doc).

**Docker** — `docker/replay-api.Dockerfile` + `e2e/docker/docker-compose.test.yml` (`replay-api` service, profile-gated `replay-api`, with `replay-api-storage` volume).

**Dependencies** — `pyproject.toml`:
- `operator-tools` extra now also pulls `fastapi>=0.111,<0.120`, `uvicorn>=0.30,<1.0`, `python-multipart>=0.0.9,<1.0`.
- New console script `replay-api`.

**Unit tests** — `tests/unit/replay_api/test_az701_replay_api.py` (18 tests, all passing):
- AC-1 sync: POST → 200 with `result_url`/`map_url`/`report_url`; JSONL + HTML map served from those URLs.
- AC-2 async: large video (> sync threshold) → 202 + `Location: /jobs/{id}`.
- AC-3: job state visible via polling `RUNNING → DONE` and `RUNNING → FAILED`.
- AC-5: missing bearer → 401; correct bearer → 200.
- AC-6: `/healthz` always 200; `/readyz` returns 503 when binaries missing.
- AC-8: third job queued when concurrency limit is 2; 4th rejected with 429.
- AC-9: zip renamed to `.tlog` or `.mp4` → 400 with stable `error_code`.

### AC Coverage Matrix

| AC | Status | Evidence |
|----|--------|----------|
| AC-1 sync 200 | Done | `test_post_replay_sync_returns_200_with_result_urls` + `test_post_replay_serves_jsonl_and_map_for_done_job` |
| AC-2 async 202 | Done | `test_post_replay_async_returns_202_when_video_exceeds_sync_bytes` |
| AC-3 job state machine | Done | `test_job_state_transitions_observable_via_polling`, `test_failed_runner_marks_job_failed`, `test_result_endpoints_409_when_job_not_done` |
| AC-5 401 on bad bearer | Done | `test_post_replay_returns_401_without_bearer_when_required` + `test_post_replay_accepts_correct_bearer` |
| AC-6 health endpoints | Done | `test_healthz_always_returns_200` + `test_readyz_returns_503_when_binary_missing` |
| AC-8 concurrency cap | Done | `test_concurrency_limit_queues_excess_jobs` + `test_queue_full_returns_429` |
| AC-9 magic-byte rejection | Done | `test_validate_tlog_kind_rejects_zip_renamed_to_tlog`, `test_validate_video_kind_rejects_arbitrary_bytes`, `test_post_replay_rejects_misnamed_zip_as_tlog`, `test_post_replay_rejects_misnamed_zip_as_video` |

### Test Run Summary

- **AZ-701 unit slice**: 18/18 passed (`tests/unit/replay_api/`).
- **Full unit suite**: 2251 passed, 86 skipped, 1 failed (`test_cold_start_under_500ms_p99` — pre-existing C12 CLI flake unrelated to AZ-701; same failure observed in batches 100 and 101).
- **Mypy --strict on AZ-701 surface**: clean (9 source files: `replay_api/*`, `helpers/accuracy_report.py`, `cli/replay_api_entrypoint.py`).

### Design Decisions

- **Subprocess runner, not in-process estimator**: `SubprocessReplayRunner` invokes the existing `gps-denied-replay` console script. This keeps the API a thin transport layer and matches the invariant in the contract that the API does NOT re-implement the pipeline.
- **Pre-allocated job_id**: the handler allocates a job_id, writes uploads into the matching storage dir, then passes the id to `JobRegistry.submit(job_id=...)`. An earlier draft used a separate registry-assigned id and tried to "release-then-resubmit"; that path deleted the dir holding the uploads. Fixed by adding the optional `job_id` parameter.
- **`from __future__ import annotations` deliberately dropped in `app.py`**: FastAPI 0.119 + Pydantic v2 resolve route-parameter annotations at decoration time. Forward-ref strings break `Annotated[UploadFile, File()]`. The rest of the `replay_api` package keeps the future-annotations import. The reason is captured in the `app.py` module docstring.
- **Pydantic v2 `Annotated` syntax**: every route parameter uses `Annotated[T, File()/Form()/Header()]` rather than the legacy `T = File(...)` form. The legacy form raised `PydanticUserError: 'UploadFile' is not fully defined`.
- **Magic-byte validation is mandatory, not advisory**: matches AC-9 wording ("Wrong magic bytes → 400"). Anything that's not MAVLink v1/v2 (`\xfe` / `\xfd` first byte) is rejected as tlog; anything without `ftyp` in bytes 4-12 is rejected as video. No `application/x-mavlink` content-type sniffing.
- **State is in-memory only**: matches the "no persistent state across restarts" invariant in the contract. Operators wanting durability can layer it externally (or move to AZ-702 follow-on). Documented in the contract.

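The pre-allocated job_id decision above is easier to see in code. The sketch below is an illustration under assumptions: `SketchStorage` and `accept_upload` are hypothetical, the method names only echo `allocate_job` / `release_job` from `storage.py`, and their real signatures are not shown in this document. The point is that the id is created once, the upload is written under it, and the same id is what would later be submitted to the registry.

```python
import shutil
import uuid
from pathlib import Path
from typing import Tuple


class SketchStorage:
    """Per-job directories under one root (hypothetical stand-in)."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def allocate_job(self) -> Tuple[str, Path]:
        job_id = uuid.uuid4().hex
        job_dir = self._root / job_id
        job_dir.mkdir(parents=True)
        return job_id, job_dir

    def release_job(self, job_id: str) -> None:
        shutil.rmtree(self._root / job_id, ignore_errors=True)


def accept_upload(storage: SketchStorage, tlog_bytes: bytes) -> Tuple[str, Path]:
    job_id, job_dir = storage.allocate_job()
    tlog_path = job_dir / "flight.tlog"
    tlog_path.write_bytes(tlog_bytes)
    # The caller submits the job under this same job_id, so the directory
    # holding the upload is never released and re-created in between.
    return job_id, tlog_path
```

Releasing the directory is left to whoever owns the job's lifetime, which avoids the release-then-resubmit failure the earlier draft ran into.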
### Known Limitations

- `SubprocessReplayRunner` returns `result.stdout`/`stderr` only when the subprocess fails; success path discards them. Operators wanting a per-job audit log will need a follow-on.
- No request body streaming — `python-multipart` buffers each part. The 2 GB hard limit guards memory.
- No rate limiting beyond the concurrency/queue caps. A reverse proxy is the right place for that.
- E2E test against the real Derkachi flight artefacts is intentionally NOT in scope here (per the testing-environment rule: e2e runs on Jetson only and AZ-699's `test_derkachi_real_tlog.py` already exercises the underlying pipeline).
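The first limitation (output surfaced only on failure) corresponds to a common `subprocess` pattern, sketched here with the standard library. `run_tool` and `ToolFailed` are hypothetical names, and the argv is supplied by the caller; no flags of `gps-denied-replay` are assumed beyond what this document already mentions.

```python
import subprocess
from typing import Sequence


class ToolFailed(Exception):
    """Carries the captured output of a failed subprocess."""

    def __init__(self, returncode: int, stdout: str, stderr: str) -> None:
        super().__init__(f"tool exited with status {returncode}")
        self.returncode = returncode
        self.stdout = stdout
        self.stderr = stderr


def run_tool(argv: Sequence[str], timeout_s: float = 3600.0) -> None:
    """Run argv; keep stdout/stderr only when the exit status is non-zero."""
    proc = subprocess.run(list(argv), capture_output=True, text=True, timeout=timeout_s)
    if proc.returncode != 0:
        raise ToolFailed(proc.returncode, proc.stdout, proc.stderr)
    # Success path: the captured output is dropped here, which is the
    # audit-log gap the limitation describes.
```

A follow-on that wants a per-job audit log could write `proc.stdout` and `proc.stderr` into the job directory on the success path as well.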