annotations/_docs/02_document/architecture.md
Oleksandr Bezdieniezhnykh 03f879206e docs+src: complete Steps 1-3 outcomes + auth re-sync baseline
This commit captures everything produced during autodev existing-code
Steps 1 (Document), 2 (Architecture Baseline Scan), and 3 (Test Spec),
together with the targeted auth + CORS re-sync triggered on 2026-05-14
when codebase drift was detected at Step 4 entry. None of this work was
previously committed.

Step 1 (Document) — 50+ _docs/02_document/ files: problem, solution,
architecture, system flows, glossary, module-layout, per-component
specs (01..06), modules, deployment, diagrams, data model, FINAL
report, verification log, discovery.

Step 2 (Architecture Baseline) — architecture_compliance_baseline.md.
Verdict PASS_WITH_WARNINGS (0 Critical, 0 High, 1 Medium, 2 Low). No
High/Critical findings; auto-chained to Step 3 per existing-code flow.

Step 3 (Test Spec) — _docs/02_document/tests/* (67 scenarios across
blackbox, security, resilience, resource-limit, performance), plus
e2e/docker-compose.test.yml, e2e/seed/run.sh, scripts/run-tests.sh,
scripts/run-performance-tests.sh. Coverage 88% over the active scope
(40 of 45 items covered, 6 RB-deferred, 5 documented-as-uncovered).

Targeted auth + CORS re-sync — replaces the deleted in-house token
issuer with a JWKS-verifier model. AuthController and TokenService
removed; JwtExtensions switched from HS256 symmetric to ES256 over
admin's JWKS. ConfigurationResolver and CorsConfigurationValidator
added under src/Infrastructure/. ADR-002 and ADR-006 retired; SEC-01,
SEC-02, SEC-03 marked Closed. One new testability risk recorded in
architecture.md Open Risks Section 6 (JWKS HTTPS gating).

Source changes:
- src/Auth/JwtExtensions.cs (modified) — ES256, JWKS, alg pinning
- src/Program.cs (modified) — DI wiring for ConfigurationResolver
  and CorsConfigurationValidator
- src/Controllers/AuthController.cs (deleted) — no in-service issuance
- src/Services/TokenService.cs (deleted) — same
- src/Infrastructure/ConfigurationResolver.cs (new)
- src/Infrastructure/CorsConfigurationValidator.cs (new)
- .env.example (new) — required env var documentation
- .gitignore (updated)

Cross-repo coordination: _docs/cross-repo/flights_h1_h2_h3_change_spec
captures the change-spec for downstream services that consumed the now
deleted /auth endpoints.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-14 20:19:05 +03:00

# Azaion.Annotations — Architecture
> **Source of truth for service-internal architecture.** Suite-level integration narrative lives in `../../../suite/_docs/01_annotations.md`. This file documents what the code in `src/` actually implements, derived bottom-up from module and component docs.
## Architecture Vision
**Status**: confirmed-by-user 2026-05-14.
Azaion.Annotations is a single .NET 10 ASP.NET Core service in the Azaion suite that owns the authoritative HTTP + streaming surface for annotation lifecycle, media upload, dataset exploration, and system metadata. State of record is PostgreSQL (Linq2DB + Npgsql) with an idempotent boot-time migrator. Real-time fan-out is in-process SSE; durable cross-service export is a transactional-outbox + RabbitMQ Stream pipeline producing MessagePack frames consumed by the admin sync worker and the AI training pipeline. The runtime is one container per node, ARM64-first via Woodpecker CI, with branch-driven image tags (`dev` | `stage` | `main`).
### Components & responsibilities
- **06 Platform** — shared kernel: DB, enums, JWT, error middleware, paths, composition root.
- **02 Annotations realtime & sync** — SSE channel + RabbitMQ Stream failsafe drainer.
- **01 Annotations REST** — annotation CRUD + image/thumbnail file routes; the lifecycle producer.
- **03 Media** — upload (single + batch), list, download, delete.
- **04 Dataset** — read-heavy `/dataset` surface + `DATASET`-policy status writes (planned to route through `01 Annotations REST` per RB-08).
- **05 Settings & metadata** — system / directory / camera / user settings + `/classes` catalog (becoming admin-managed per RB-06).
### Major data flows
- **F1 — Annotation create**: bytes → image file → DB rows → label file → SSE → outbox; will be wrapped in a business transaction (ADR-008).
- **F3 — SSE subscription**: UI holds a long-lived `text/event-stream` connection on `/annotations/events` (server push, not long-polling).
- **F4 — Outbox drain**: `FailsafeProducer` pumps queue rows to the RabbitMQ stream `azaion-annotations`.
- **F2 / F5 / F6 / F7 / F8** — read paths, media uploads, auth refresh (retired; callers now refresh against admin, see Section 7), directory cache reset, dataset bulk status.
### Principles / non-negotiables
- **Wire enums are integer-stable** (suite contract). [inferred-from: `modules/wire-enums.md`, `suite/_docs/01_annotations.md`]
- **Annotation id is content-addressed** via a sampled image-bytes hash; remains file-size-independent (videos to ~5 GB). [inferred-from: `AnnotationService.ComputeHash`, ADR-004]
- **PostgreSQL is the state of record**; the filesystem is a content-addressed cache. [inferred-from: `data_model.md`, `system-flows.md` F1]
- **The transactional outbox is the durability boundary**; SSE is best-effort. [inferred-from: ADR-003 / ADR-008]
- **Lifecycle observability is World B**: every mutation publishes SSE and enqueues the outbox. [inferred-from: `FailsafeProducer` drainer plumbing for `Validated`/`Deleted`; maintainer resolution 2026-05-14 → ADR-009 / RB-01]
- **Soft-delete with file relocation**: `DeleteAnnotation` flips status to `AnnotationStatus.Deleted = 40` and moves files to a deleted-files directory rather than removing rows. [inferred-from: maintainer resolution 2026-05-14 → ADR-009 / RB-01]
- **Stream consumer dedupe contract is owned by this service**: outbox messages must carry enough metadata for downstream consumers to dedupe on `(annotationId, operation, dateTime)`. [inferred-from: maintainer resolution 2026-05-14 → ADR-013 / RB-09]
- **Mission is the canonical domain term**: code currently uses `FlightId`; the suite spec uses `missionId`. Code aligns to suite (rename, not the other way). [inferred-from: `00_discovery.md` drift list; maintainer resolution 2026-05-14 → ADR-012 / RB-07]
- **Dataset writes flow through the annotation domain service**: `04 Dataset` does not edit `annotations` rows directly. [inferred-from: `module-layout.md` Verification Needed §1; maintainer resolution 2026-05-14 → RB-08]
- **DB-driven runtime config**: directory roots and detection classes change at runtime via `ADM` endpoints, not redeploy. [inferred-from: `PathResolver.Reset`, ADR-011]
### Open questions / drift signals (residual)
- `UserId` body field vs JWT `NameIdentifier` (suite spec lists `UserId` on `POST /annotations`; code uses JWT subject). Reconcile in suite or code.
- The exact dedupe key shape for downstream consumers — `(annotationId, operation, dateTime)` is the working assumption per RB-09; suite consumer doc must be updated to match.
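A downstream consumer might apply the working dedupe key roughly like this. It is a minimal Python sketch, not the suite consumer: `OutboxMessage` and its field names are hypothetical stand-ins for a decoded MessagePack frame, and the seen-set is kept in memory only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Set, Tuple


@dataclass(frozen=True)
class OutboxMessage:
    # Hypothetical decoded frame; the real MessagePack layout is set by the stream contract.
    annotation_id: str
    operation: str  # "Created", "Validated", or "Deleted"
    date_time: datetime


class DedupingConsumer:
    """Applies each (annotationId, operation, dateTime) key at most once."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str, datetime]] = set()
        self.applied: List[OutboxMessage] = []

    def handle(self, msg: OutboxMessage) -> bool:
        key = (msg.annotation_id, msg.operation, msg.date_time)
        if key in self._seen:
            return False  # redelivery of an already-applied message: skip it
        self._seen.add(key)
        self.applied.append(msg)  # stand-in for the consumer's real side effect
        return True
```

A replayed `Created` with the same timestamp is dropped, while a later `Validated` for the same annotation is still applied. A real consumer would presumably persist the seen keys alongside its stream offset so the guarantee survives restarts.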
---
## 1. System Context
**Problem being solved**: Provide the canonical HTTP + streaming API for **annotation lifecycle** (create / update / status / delete / list / files), **media** (upload, list, download), **dataset exploration** (`DATASET` policy reads + bulk status writes), and **system metadata** (settings + detection class catalog), with **real-time SSE** push to UI consumers and **failsafe** export to RabbitMQ Stream consumers (admin sync, AI training).
**System boundaries**:
- **Inside**: a single ASP.NET Core process (`Azaion.Annotations.dll`), its embedded migrator, in-memory SSE channel, in-process `BackgroundService` outbox drain, and the on-disk image / label / thumbnail / results layout under `directory_settings`.
- **Outside**: PostgreSQL (state of record), RabbitMQ Streams (durable annotation export), the on-disk media/data filesystem (mounted), and every authenticated HTTP / SSE consumer (UIs, detections service, admin sync worker, AI training).
**External systems**:
| System | Integration Type | Direction | Purpose |
|--------|------------------|-----------|---------|
| PostgreSQL | DB (Linq2DB / Npgsql) | Both | State of record (annotations, media, queue, settings, classes) |
| RabbitMQ Streams | Stream client (`RabbitMQ.Stream.Client`) | Outbound | Durable export of annotation lifecycle (`azaion-annotations` stream) |
| Filesystem (mounted) | File I/O | Both | Annotation images, YOLO label `.txt`, thumbnails, results, GPS routes/sat |
| Annotator UI / Dataset Explorer UI | REST + SSE | Inbound | User flows (suite `01_annotations.md`, `09_dataset_explorer.md`) |
| Detections service (suite `detections`) | REST | Inbound | POST annotations after model inference; long-running tokens are refreshed against admin (annotations no longer mints tokens) |
| Admin sync worker / AI training | RabbitMQ Streams | Outbound | Consume `azaion-annotations` stream offsets (suite `Annotation Sync`) |
## 2. Technology Stack
| Layer | Technology | Version | Rationale |
|-------|------------|---------|-----------|
| Language | C# | `net10.0` (`src/Azaion.Annotations.csproj`) | Single language across suite .NET services |
| Framework | ASP.NET Core (minimal hosting + controllers) | net10.0 | Built-in JWT, CORS, Swagger, hosted services |
| ORM / DB driver | Linq2DB + Npgsql | per `csproj` | Linq2DB used for `ITable<>` repositories; Npgsql under the hood |
| Database | PostgreSQL | not pinned in code (URL-driven) | Suite-wide datastore |
| Auth | JWT Bearer (`Microsoft.AspNetCore.Authentication.JwtBearer`) — verifier-only, ES256 over admin's JWKS | net10.0 | Issuer/audience/lifetime/signature all validated; admin is the sole issuer (see Section 7) |
| Messaging | RabbitMQ Streams (`RabbitMQ.Stream.Client`) + MessagePack | per `csproj` | Durable, replayable annotation export |
| API docs | Swashbuckle (Swagger / Swagger UI) | per `csproj` | Always mounted (see ADR-005) |
| Hashing | `System.IO.Hashing` | net10.0 stdlib | Annotation id derived from image bytes hash |
| Hosting | `WebApplication` + `IHostedService` | net10.0 | `FailsafeProducer` runs in-process |
| Container | `mcr.microsoft.com/dotnet/aspnet:10.0` | linux/arm64 + linux/amd64 | Multi-arch image, ARM-first per Woodpecker |
| CI | Woodpecker CI (`.woodpecker/build-arm.yml`) | n/a | Branch-based image tag (`${BRANCH}-arm`) |
**Key constraints (evidenced in code/config)**:
- `DATABASE_URL` is **required** at startup — `ConfigurationResolver.ResolveRequiredOrThrow` throws if not set. `Program.ConvertPostgresUrl` converts the `postgresql://user:pass@host:port/db` URI form into the Npgsql keyword/value connection string (`Host=…;Username=…`) that Linq2DB's Npgsql provider consumes.
- JWT verification is **required** at startup — `JWT_ISSUER`, `JWT_AUDIENCE`, and `JWT_JWKS_URL` are all resolved by `ConfigurationResolver.ResolveRequiredOrThrow`. There is no insecure fallback. The JWKS URL must be HTTPS (`HttpDocumentRetriever { RequireHttps = true }`).
- Default directory roots are `/data/{videos,images,labels,results,thumbnails,gps_sat,gps_route}` (migrator `directory_settings` defaults) → operator must mount or override at the DB level via `PUT /settings/directories`.
- CORS is **environment-gated**: `CorsConfigurationValidator.EnsureSafeForEnvironment` refuses to start in `Production` when `CorsConfig:AllowedOrigins` is empty unless `CorsConfig:AllowAnyOrigin=true` is set explicitly. ADR-006 was retired together with the wide-open default.
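The URI-to-keyword conversion in the first bullet can be sketched as below. This is an illustrative Python version, not `Program.ConvertPostgresUrl` itself: it assumes the URI carries only user, password, host, an optional port (defaulting to PostgreSQL's standard 5432), and database, ignores query-string options, and does not quote values that contain `;`.

```python
from urllib.parse import unquote, urlparse


def convert_postgres_url(url: str) -> str:
    """Turn postgresql://user:pass@host:port/db into Npgsql keyword/value form."""
    parsed = urlparse(url)
    if parsed.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    parts = {
        "Host": parsed.hostname or "",
        "Port": str(parsed.port or 5432),  # assumption: default port when omitted
        "Database": parsed.path.lstrip("/"),
        # urlparse leaves credentials percent-encoded, so decode them explicitly.
        "Username": unquote(parsed.username or ""),
        "Password": unquote(parsed.password or ""),
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())
```

For example, `postgresql://ann:s%40cret@db.local:6543/annotations` becomes `Host=db.local;Port=6543;Database=annotations;Username=ann;Password=s@cret`.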
## 3. Deployment Model
**Environments** (evidenced from CI branches): `dev`, `stage`, `main` → image tag `${CI_COMMIT_BRANCH}-arm` pushed to a private registry resolved from `REGISTRY_HOST` secret.
**Infrastructure**:
- Single .NET service container; container exposes port `8080`.
- Multi-arch build supported in the Dockerfile (`--platform=$BUILDPLATFORM`, `$TARGETARCH`); the ARM Woodpecker pipeline currently only emits `arm64`.
- Scaling is **vertical-only** as written: SSE uses an in-process `Channel<AnnotationEventDto>`, and the `FailsafeProducer` outbox drainer is a per-instance `BackgroundService` — see "Open Architectural Risks".
**Environment-specific configuration** (defaults vs production):
| Config | Source | Development default | Production behavior |
|--------|--------|---------------------|---------------------|
| `DATABASE_URL` | env or `Database:Url` config key | none — fail-fast on missing (`ConfigurationResolver`) | MUST set |
| `JWT_ISSUER` | env or `Jwt:Issuer` config key | none — fail-fast | MUST set (matches admin's issuer) |
| `JWT_AUDIENCE` | env or `Jwt:Audience` config key | none — fail-fast | MUST set (matches admin's audience for this service) |
| `JWT_JWKS_URL` | env or `Jwt:JwksUrl` config key | none — fail-fast; HTTPS required | MUST set to admin's JWKS endpoint |
| `RABBITMQ_HOST` / `RABBITMQ_STREAM_PORT` | env | `127.0.0.1` / `5552` | Override per environment |
| `RABBITMQ_PRODUCER_USER` / `_PASS` | env | `azaion_producer` / `producer_pass` | Override |
| `RABBITMQ_STREAM_NAME` | env | `azaion-annotations` | Usually kept (suite contract) |
| `CorsConfig:AllowedOrigins` | `IConfiguration` (string array) | empty | MUST set (or set `AllowAnyOrigin=true` explicitly) — `CorsConfigurationValidator` refuses to start in Production otherwise |
| `CorsConfig:AllowAnyOrigin` | `IConfiguration` (bool) | false | Explicit opt-in for permissive policy |
| Directory roots (`/data/...`) | DB `directory_settings` | hard-coded SQL defaults | Tune via `PUT /settings/directories` (calls `PathResolver.Reset`) |
| Swagger UI | `Program.cs` | mounted | **Also mounted in prod** (ADR-005) |
| `AZAION_REVISION` | Dockerfile build arg `CI_COMMIT_SHA` | `unknown` | Stamped per-image |
## 4. Data Model Overview
> Detailed ERD, indexes, and migration semantics live in `data_model.md`. This section is the cross-component summary.
**Core entities** (owned by `06_platform`; consumed by feature components):
| Entity | Description | Owned by component |
|--------|-------------|---------------------|
| `media` | Uploaded image/video reference (waypoint-scoped) | `03_media` (writes) / `01_annotations-rest` (reads) |
| `annotations` | Annotation row keyed by image-bytes hash, soft-versioned by `created_date`, `time` (BIGINT ticks) | `01_annotations-rest` |
| `detection` | YOLO bounding boxes (`center_x/y, width, height`, class, affiliation, combat readiness) per annotation | `01_annotations-rest` |
| `annotations_queue_records` | Outbox for failsafe stream sync (`operation`, `annotation_ids` JSON array) | `02_annotations-realtime-sync` (owner; `FailsafeProducer` drains) / `01_annotations-rest` (enqueues via `FailsafeProducer.EnqueueAsync`) |
| `system_settings` | Singleton-ish org settings + `generate_annotated_image`, `silent_detection` toggles | `05_settings-metadata` |
| `directory_settings` | Filesystem roots consumed by `PathResolver` | `05_settings-metadata` |
| `detection_classes` | Seeded class catalog for UI label/color (ids 0–18, names + Cyrillic short names + hex colors) | `05_settings-metadata` (read-only `ClassesController`) |
| `user_settings` | Per-user UI prefs (panel widths, selected flight) | `05_settings-metadata` |
| `camera_settings` | Calibration (altitude, focal length, sensor width) | `05_settings-metadata` |
**Key relationships**:
- `annotations.media_id` → `media.id` (FK).
- `detection.annotation_id` → `annotations.id` (FK; on annotation update, the cascade to `detection` rows is implemented in the service layer, not by the DB).
- `annotations_queue_records.annotation_ids` is a **JSON array of TEXT ids** (no FK); single-row outbox entry can reference multiple annotations (bulk).
**Data flow summary**:
- **Inbound write (Create)** — *today*: HTTP body → `AnnotationService.CreateAnnotation` → image bytes to `images_dir/{id}.jpg`, optional `media` row insert, `annotations` + `detection` rows, YOLO label to `labels_dir/{id}.txt`, SSE publish, then (if `silent_detection != true`) outbox row → drained by `FailsafeProducer` → MessagePack frame on RabbitMQ stream. **Thumbnails are not produced by this flow** — they are read-only via `PhysicalFile` and presumed populated out-of-band.
- **Inbound write (Update / UpdateStatus / Delete annotations, dataset PATCH / bulk-status)** — *today*: DB-only, silent. *Target* (RB-01): every mutation publishes SSE and enqueues the outbox with the appropriate `QueueOperation` (`Created`, `Validated`, or `Deleted`).
- **Lifecycle ordering** — *target* (RB-03): all DB writes plus the outbox row commit inside a single business transaction; FS writes (image / label / future thumbnail generation) and SSE publish are post-commit, with the outbox row as the durable promise.
- **Inbound read**: HTTP query → DB joins (`annotations × detection × media`) → JSON list (`PaginatedResponse<AnnotationListItem>`); image/thumbnail served as `PhysicalFile`.
## 5. Integration Points
### Internal communication (in-process)
| From | To | Protocol | Pattern | Notes |
|------|----|----------|---------|-------|
| `01_annotations-rest` (`AnnotationService`) | `02_annotations-realtime-sync` (`AnnotationEventService`) | C# call | Fire-and-forget publish to `Channel<>` | **Today**: only on Create. **Target (RB-01)**: every mutation publishes (Create, Update, UpdateStatus, Delete) |
| `01_annotations-rest` (`AnnotationService`) | `02_annotations-realtime-sync` (`annotations_queue_records` table) | DB INSERT via `FailsafeProducer.EnqueueAsync` (static helper) | Outbox | **Today**: Create only, gated by `silent_detection`. **Target (RB-01 + RB-02)**: every mutation enqueues with the appropriate `QueueOperation`; gating flag removed |
| `02_annotations-realtime-sync` (`FailsafeProducer`) | `06_platform` (`AppDataConnection`, `PathResolver`) | C# call | Read-then-delete | Drainer is **already plumbed** for `Created`, `Validated`, and `Deleted` operations (see `FailsafeProducer.cs:108–123`) |
| `04_dataset` (`DatasetService.UpdateStatus` / `BulkUpdateStatus`) | `01_annotations-rest` (`AnnotationEventService`) + outbox | shared DB + cross-component call | Direct write today; lifecycle publish + enqueue per RB-01 | Bulk path enqueues a single `Validated` outbox record carrying all ids |
| `05_settings-metadata` (directory PUT) | `06_platform` (`PathResolver.Reset`) | C# call | Cache invalidation | Required after directory change |
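The `PathResolver.Reset` row amounts to a cached lookup with explicit invalidation. A minimal Python sketch, assuming a hypothetical `load_settings` callable that stands in for the `directory_settings` query; the lock is there because such a cache would be shared across concurrent requests.

```python
import threading
from typing import Callable, Dict, Optional


class PathResolver:
    """Caches directory roots loaded from the DB until reset() is called."""

    def __init__(self, load_settings: Callable[[], Dict[str, str]]) -> None:
        self._load = load_settings
        self._cache: Optional[Dict[str, str]] = None
        self._lock = threading.Lock()

    def get(self, key: str) -> str:
        with self._lock:
            if self._cache is None:
                self._cache = dict(self._load())  # query the DB only on a cold cache
            return self._cache[key]

    def reset(self) -> None:
        # Called after PUT /settings/directories so the next get() reloads.
        with self._lock:
            self._cache = None
```

Without the `reset()` call after `PUT /settings/directories`, readers would keep serving the old root until the process restarts.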
### External integrations
| External system | Protocol | Auth | Rate limits | Failure mode |
|-----------------|----------|------|-------------|--------------|
| PostgreSQL | TCP / Linq2DB / Npgsql | Conn string | n/a | Surfaced as 500 via `ErrorHandlingMiddleware` |
| RabbitMQ Stream `azaion-annotations` | Stream protocol (5552) | Stream user/pass (`azaion_producer` default) | Stream-level | `FailsafeProducer` retries; rows stay in `annotations_queue_records` until drained |
| Filesystem (`/data/...`) | POSIX | OS perms | n/a | `IOException` → 500; missing image on GET → 404 |
| HTTP clients (UIs, detections, admin) | REST + SSE | JWT Bearer (`ANN`, `DATASET`, `ADM`) | n/a | `401` if invalid; `403` if missing claim |
## 6. Non-Functional Requirements
> Pulled only from code-level evidence — config defaults, validators, health checks, idempotent migrator. Anything not evidenced is left blank rather than guessed.
| Requirement | Target | Measurement | Priority | Source |
|-------------|--------|-------------|----------|--------|
| Liveness | 200 OK on `GET /health` | route in `Program.cs` | High | `Program.cs` |
| Idempotent startup | DB schema applies cleanly on every boot | `DatabaseMigrator.Migrate` uses `CREATE TABLE IF NOT EXISTS` + `ALTER TABLE … IF NOT EXISTS` and `INSERT … ON CONFLICT DO NOTHING` | High | `Database/DatabaseMigrator.cs` |
| Recovery: queue durability | Annotation lifecycle events are not lost across pod restarts | DB-backed outbox (`annotations_queue_records`) drained by `FailsafeProducer` | High | `Services/FailsafeProducer.cs` |
| Auth lifetime / clock skew | per `JwtExtensions.AddJwtAuth` config | `auth-identity` module | Medium | `Auth/JwtExtensions.cs` |
| Pagination defaults | `PaginatedResponse<T>` total/page/pageSize | applied in list endpoints | Medium | `DTOs/PaginatedResponse.cs` |
| Thumbnail dimensions | `240×135` with `10` border (defaults) | `system_settings.thumbnail_*` | Low | migrator defaults |
| Throughput / latency / availability targets | **not evidenced in code** | — | — | open question, see `00_problem` extraction (Step 6) |
## 7. Security Architecture
**Authentication**: JWT Bearer; **ES256 signature** verified against admin's JWKS endpoint (`JWT_JWKS_URL`, e.g. `https://admin.azaion.com/.well-known/jwks.json`; there is no built-in default, see Section 3). `ValidateIssuer`, `ValidateAudience`, `RequireSignedTokens`, and `RequireExpirationTime` are all enforced; algorithms are pinned to `EcdsaSha256` to block HS256 algorithm-confusion forgeries. Admin is the sole token issuer for the suite: annotations no longer holds an HMAC secret and no longer mints tokens (`TokenService` and `POST /auth/refresh` were removed; callers refresh against admin).
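The pinned-algorithm, issuer, audience, and expiry checks can be illustrated with the standard library alone. This Python sketch deliberately skips the ES256 signature verification against the JWKS key (in the service that is the JwtBearer handler's job) and ignores clock-skew tolerance; `iss`, `aud`, and `exp` are the registered JWT claim names.

```python
import base64
import json
import time
from typing import Any, Dict, Optional


def _b64url_json(segment: str) -> Dict[str, Any]:
    # JWT segments are unpadded base64url; restore the padding before decoding.
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def precheck_token(token: str, issuer: str, audience: str, now: Optional[float] = None) -> Dict[str, Any]:
    """Apply algorithm pinning plus issuer, audience, and expiry checks (no signature check)."""
    header_b64, payload_b64, _signature = token.split(".")  # ValueError unless 3 parts
    header = _b64url_json(header_b64)
    if header.get("alg") != "ES256":
        # Pinning rejects "none", HS256, and anything else before any key lookup.
        raise ValueError(f"algorithm not allowed: {header.get('alg')!r}")
    claims = _b64url_json(payload_b64)
    if claims.get("iss") != issuer:
        raise ValueError("issuer mismatch")
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if audience not in audiences:
        raise ValueError("audience mismatch")
    if "exp" not in claims:
        raise ValueError("missing exp claim")  # RequireExpirationTime
    current = time.time() if now is None else now
    if current >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

Rejecting on the header's `alg` before any key lookup is what keeps a forged HS256 token from ever reaching an HMAC code path.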
**Authorization** (per-endpoint policy claims, all evidenced in controllers):
- `ANN` → `AnnotationsController`, `MediaController`.
- `DATASET` → `DatasetController` (status writes including bulk).
- `ADM` → mutating routes on `SettingsController`.
- `[Authorize]` (any authenticated user) → read endpoints on settings, `ClassesController`.
- `[AllowAnonymous]` → `/health`.
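The `401` / `403` split listed in Section 5 follows from this mapping. A minimal Python sketch: the route-group keys are illustrative labels, and claims are modeled as a plain set of strings; the actual ASP.NET policy definitions (which claim type and value each policy checks) are not shown in this document.

```python
from typing import Dict, Optional, Set

# Route group -> required policy claim, mirroring the list above (keys are illustrative).
ROUTE_POLICY: Dict[str, Optional[str]] = {
    "AnnotationsController": "ANN",
    "MediaController": "ANN",
    "DatasetController": "DATASET",
    "SettingsController.mutate": "ADM",
    "SettingsController.read": None,  # [Authorize]: any authenticated user
    "ClassesController": None,
}
ANONYMOUS_ROUTES = {"/health"}  # [AllowAnonymous]


def authorize(route: str, authenticated: bool, claims: Set[str]) -> int:
    """Return the status the auth pipeline would produce before the action runs."""
    if route in ANONYMOUS_ROUTES:
        return 200
    if not authenticated:
        return 401  # missing, expired, or badly signed token
    required = ROUTE_POLICY[route]
    if required is not None and required not in claims:
        return 403  # valid token that lacks the policy claim
    return 200
```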
**User identity**: server resolves user from JWT `NameIdentifier` (e.g., `AnnotationsController.Create` parses `User.FindFirstValue(ClaimTypes.NameIdentifier)` → `Guid`). Suite spec sometimes lists `UserId` in body — drift recorded in `00_discovery.md`.
**Data protection**:
- **At rest**: nothing in-code — relies on the underlying Postgres deployment + filesystem.
- **In transit**: terminated outside the container; service speaks plain HTTP on `:8080`.
- **Secrets**: env-driven (`DATABASE_URL`, `JWT_ISSUER`, `JWT_AUDIENCE`, `JWT_JWKS_URL`, `RABBITMQ_*`). `DATABASE_URL` and the three JWT vars now fail-fast on startup if unset (no insecure default). ADR-002 was retired together with `JWT_SECRET`.
- **CORS**: config-driven allow-list (`CorsConfig:AllowedOrigins`); `CorsConfigurationValidator.EnsureSafeForEnvironment` refuses to start in `Production` with an empty list unless `CorsConfig:AllowAnyOrigin=true` is explicitly set. ADR-006 was retired together with the wide-open default.
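The CORS startup rule can be sketched as a pure function. Python, illustrative only: the exception type and the returned mode strings are hypothetical, and the fallback to a permissive policy outside Production is an assumption drawn from the warning described in ADR-006's retirement note.

```python
import logging
from typing import Sequence

log = logging.getLogger("cors")


class UnsafeCorsConfiguration(RuntimeError):
    """Raised to abort startup when CORS settings are unsafe for Production."""


def ensure_safe_for_environment(environment: str, allowed_origins: Sequence[str], allow_any_origin: bool) -> str:
    """Return the effective CORS mode ("any" or "allow-list"), or raise."""
    if allow_any_origin:
        return "any"  # explicit opt-in is honored in every environment
    if allowed_origins:
        return "allow-list"
    if environment == "Production":
        raise UnsafeCorsConfiguration(
            "CorsConfig:AllowedOrigins is empty; set it or opt in with CorsConfig:AllowAnyOrigin=true"
        )
    # Assumption: outside Production an empty list falls back to the permissive default.
    log.warning("CORS is running with the permissive default in %s", environment)
    return "any"
```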
**Audit logging**: not evidenced beyond ASP.NET Core defaults — open gap; flag in retro/security audit.
**Input validation**: surfaces through model binding + `ErrorHandlingMiddleware` mapping (`400 / 404 / 409 / 500`); detailed validators per DTO live in `DTOs/Requests/` (component specs to confirm during Step 4 verification).
## 8. Key Architectural Decisions (inferred from code)
These ADRs document choices the codebase already evidences. They are descriptive, not prescriptive — call them out so downstream skills can challenge them deliberately.
### ADR-001: In-process SSE via `Channel<T>`
**Context**: Real-time annotation activity must reach the Annotator UI within 100ms of a write.
**Decision**: Use a singleton `AnnotationEventService` exposing an unbounded `Channel<AnnotationEventDto>` and serve subscribers from `AnnotationsController.Events` over `text/event-stream`.
**Alternatives considered (implicitly rejected)**:
1. Broker-backed pub/sub (Redis / RabbitMQ exchange) — rejected because it adds a dependency for what is already a single-process workload, and the failsafe queue covers durable export needs.
2. Server-side polling — rejected because it cannot meet sub-second latency cheaply.
**Consequences**: SSE state is **per-instance only**. Horizontal scaling requires a broker fanout layer or sticky sessions on the LB.
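For reference, one event on the `text/event-stream` connection is a few `field: value` lines followed by a blank line, per the WHATWG server-sent events format. The Python sketch below encodes such a frame; whether the service emits `event:` or `id:` fields, and the payload field names used in the test, are assumptions.

```python
import json
from typing import Any, Mapping, Optional


def format_sse(data: Mapping[str, Any], event: Optional[str] = None, event_id: Optional[str] = None) -> str:
    """Encode one text/event-stream frame: `field: value` lines ended by a blank line."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # Compact JSON escapes newlines inside strings, so the payload fits on one data: line.
    lines.append("data: " + json.dumps(data, separators=(",", ":")))
    return "\n".join(lines) + "\n\n"
```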
### ADR-002 (RETIRED): Symmetric JWT, no issuer/audience validation
**Status**: superseded — annotations is now a JWKS verifier of admin-signed ES256 tokens. `AddJwtAuth(IConfiguration)` pins `ValidAlgorithms = [SecurityAlgorithms.EcdsaSha256]`, enforces `ValidateIssuer`/`ValidateAudience`/`RequireSignedTokens`/`RequireExpirationTime`, and resolves keys through `ConfigurationManager<JsonWebKeySet>` against `JWT_JWKS_URL`. `JWT_SECRET` was removed along with the local refresh path; admin is the sole issuer for the suite. The original ADR is preserved here for historical context only.
### ADR-003: Failsafe outbox + RabbitMQ Stream (not direct publish)
**Context**: Annotation lifecycle must reach external consumers (admin sync, AI training) durably even when RabbitMQ is unavailable at the moment of the write.
**Decision**: Every mutation writes a row to `annotations_queue_records`; the in-process `FailsafeProducer` (`IHostedService`) drains this table and publishes MessagePack frames on the `azaion-annotations` stream, deleting rows after success.
**Alternatives considered**:
1. Direct publish in the request path — rejected because RabbitMQ unavailability would either drop events (`fire-and-forget`) or fail user-visible writes (sync publish).
2. Transactional outbox via Debezium / CDC — heavier, deferred.
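**Consequences**: One outbox drainer per service instance. Multiple instances drain concurrently; no row is lost, because deletion is keyed on `id` and re-reading disk bytes is idempotent. Because the drain is read-then-delete, however, two instances can pick up the same row, and a crash between publish and delete re-publishes it on the next pass; delivery is therefore at-least-once, which is why consumers dedupe (ADR-013 / RB-09). Ordering across consumers is **not** guaranteed.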
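The drain loop described in the Decision can be sketched in Python against SQLite. Assumptions: the `annotations_queue_records` columns are reduced to `id`, `operation`, and a JSON `annotation_ids` text column, and `publish` stands in for MessagePack encoding plus the RabbitMQ Stream producer, signalling an unavailable broker with `ConnectionError`.

```python
import json
import sqlite3
from typing import Callable, List


def drain_once(conn: sqlite3.Connection, publish: Callable[[str, List[str]], None], batch: int = 100) -> int:
    """Publish pending outbox rows in id order; delete each only after publish succeeds.

    Returns the number of rows drained. A publish failure ends the pass and
    leaves the failing row (and every later row) in place for the next pass.
    """
    rows = conn.execute(
        "SELECT id, operation, annotation_ids FROM annotations_queue_records ORDER BY id LIMIT ?",
        (batch,),
    ).fetchall()
    drained = 0
    for row_id, operation, ids_json in rows:
        try:
            publish(operation, json.loads(ids_json))  # one record may carry many ids (bulk)
        except ConnectionError:
            break  # broker unavailable: keep the row, retry on the next pass
        conn.execute("DELETE FROM annotations_queue_records WHERE id = ?", (row_id,))
        conn.commit()
        drained += 1
    return drained
```

Because a row is deleted only after `publish` returns, a crash between the two steps re-sends that row on the next pass; that window is where duplicate deliveries come from.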
### ADR-004: Annotation id from a sampled `XxHash3.Hash128` of image bytes
**Context**: Annotation rows must be deduplicated when the same image is re-uploaded (e.g., re-runs of the detection pipeline). The system also serves video media in the **3–5 GB** range, so hashing must remain **constant-time with respect to file size** to keep create-path latency stable under load.
**Decision** (resolved 2026-05-14): Hash a deterministic **fixed-size sample** with `XxHash3.Hash128` (128-bit output, 32-char lower-case hex). Sample composition is unchanged from the current implementation:
- For inputs **≤ 3072 bytes**: `[length(8 bytes)] + [full bytes]`.
- For inputs **> 3072 bytes**: `[length(8 bytes)] + [first 1024] + [middle 1024 starting at len/2 - 512] + [last 1024]`.
When `MediaId` is provided instead of bytes, the annotation id is reused from the referenced media row.
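The sample construction above can be sketched in Python. Two stand-ins are labeled loudly: Python's standard library has no XxHash3, so `hashlib.blake2b(digest_size=16)` is used purely to get the same 128-bit / 32-hex-char shape (the values will differ from `XxHash3.Hash128`), and the 8-byte length prefix is assumed to be little-endian signed 64-bit, since the byte order used by the C# code is not stated here.

```python
import hashlib
import struct

SMALL_INPUT_LIMIT = 3072  # inputs at or below this size are hashed whole
CHUNK = 1024


def build_sample(data: bytes) -> bytes:
    """Return the fixed-size sample that ADR-004 feeds to the hash."""
    # 8-byte length prefix; little-endian signed 64-bit is an assumption.
    prefix = struct.pack("<q", len(data))
    if len(data) <= SMALL_INPUT_LIMIT:
        return prefix + data
    middle = len(data) // 2 - CHUNK // 2  # len/2 - 512
    return prefix + data[:CHUNK] + data[middle:middle + CHUNK] + data[-CHUNK:]


def annotation_id(data: bytes) -> str:
    # Stand-in hash: BLAKE2b-128 has the same 32-hex-char shape as XxHash3.Hash128
    # but yields different values; a real XxHash3 binding is needed to match the service.
    return hashlib.blake2b(build_sample(data), digest_size=16).hexdigest()
```

Flipping a byte outside the three sampled windows (say, offset 2000 of a 10 KiB input) leaves the id unchanged; that is the deterministic-collision trade-off accepted in exchange for O(1) cost. Against a file on disk, the same sample would be read with three seeks rather than by loading the whole video.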
**Why this combination**:
- **Sampling preserves file-size independence.** Reading a 5 GB video front-to-back just to derive an id is unacceptable on the hot path.
- **`XxHash3.Hash128` over the same sample** keeps the hashing itself O(1) in file size while moving the collision space from 2^64 to 2^128. Distinct large images that happen to share `(length, head 1 KB, middle 1 KB, tail 1 KB)` still collide deterministically — but the practical collision probability among such samples is now negligible at any realistic volume.
**Migration consequences**:
- The annotation `id` column is `TEXT PRIMARY KEY`; switching from 16-char (`XxHash64`) to 32-char (`XxHash3.Hash128`) hex requires no schema change.
- Existing rows keep their 16-char ids; new rows get 32-char ids. Re-creating an image whose original id was generated under `XxHash64` produces a **different** id under `XxHash3.Hash128`, i.e., post-upgrade re-creates no longer collide with their pre-upgrade rows. This is acceptable and expected: old ids stay stable, deduplication holds going forward, and the upgrade is irreversible by design.
**Status**: agreed. Implementation lives in the Refactor Backlog (RB-04).
### ADR-005: Swagger UI mounted in all environments
**Context**: Internal debugging / partner integration friction.
**Decision**: `app.UseSwagger()` and `app.UseSwaggerUI()` are unconditional in `Program.cs`.
**Consequences**: Schema is publicly readable wherever the service is reachable. If the perimeter is not closed, this leaks endpoint surface — treat as a security finding for production-internet exposure.
### ADR-006 (RETIRED): Wide-open CORS
**Status**: superseded — the default policy now reads `CorsConfig:AllowedOrigins` (string array) and `CorsConfig:AllowAnyOrigin` (boolean opt-in). `CorsConfigurationValidator.EnsureSafeForEnvironment` refuses to start in `Production` when origins are empty and `AllowAnyOrigin` is not explicitly set; a `LogWarning` is emitted in non-production when running with the permissive default. The original ADR is preserved here for historical context only.
### ADR-007: Embedded SQL migrator (not EF migrations / Flyway)
**Context**: Suite values single-binary deploys; the team prefers idempotent boot-time DDL over a separate migration tool.
**Decision**: `DatabaseMigrator.Migrate` runs a single multi-statement script via Linq2DB on every startup. Schema evolution is additive (`ALTER … ADD COLUMN IF NOT EXISTS`).
**Consequences**: Backwards-only, no down migrations. Renames or destructive changes need an explicit out-of-band script. Drift detection requires diffing live DB against `Database/DatabaseMigrator.cs`.
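The idempotency the migrator relies on can be shown with SQLite, which shares the `CREATE TABLE IF NOT EXISTS` and `INSERT … ON CONFLICT DO NOTHING` forms (the latter needs SQLite 3.24 or newer). The one-column `directory_settings` table is a hypothetical trim of the real schema; PostgreSQL's `ALTER TABLE … ADD COLUMN IF NOT EXISTS` has no SQLite equivalent, so it is left out.

```python
import sqlite3

# Hypothetical, trimmed-down schema; the real script is Database/DatabaseMigrator.cs on PostgreSQL.
MIGRATION = """
CREATE TABLE IF NOT EXISTS directory_settings (
    id INTEGER PRIMARY KEY,
    images_dir TEXT NOT NULL
);
INSERT INTO directory_settings (id, images_dir) VALUES (1, '/data/images')
    ON CONFLICT DO NOTHING;
"""


def migrate(conn: sqlite3.Connection) -> None:
    # Safe to run on every boot: each statement is a no-op once it has been applied.
    conn.executescript(MIGRATION)
```

Running the script a second time leaves an operator's `/mnt/images` override in place, which is why DB-level directory overrides survive reboots.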
### ADR-008: Annotation lifecycle wrapped in a business transaction (planned)
**Context**: `CreateAnnotation` today touches the filesystem, three DB tables, an in-memory channel, and an outbox row, with no atomicity. World B (lifecycle is observable — see ADR-009) widens this surface to Update / Delete / status-change paths. A naive DB transaction does not wrap the FS writes; we want a single conceptual transactional boundary for the lifecycle, not just for the DB rows.
**Decision** (resolved 2026-05-14, to-be-implemented): introduce a **business-transaction wrapper** for annotation lifecycle operations. Concretely the chosen pattern is the **transactional outbox**:
1. Write all relevant DB rows (annotation / detection / annotations_queue_records) inside a single `db.BeginTransaction` scope.
2. Commit. The outbox row is the durable promise that the post-commit work is owed.
3. **Post-commit**, perform side effects: write image / label / thumbnail files, publish SSE event. These steps are idempotent on retry; the outbox row stays until the drainer succeeds.
4. The drainer (`FailsafeProducer`) is unchanged in role — it consumes the outbox.
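The four steps above can be sketched in Python against SQLite. This shows the *target* ordering from this ADR, not today's code; the table layouts, the `class_num` column, and the `publish_sse` callback are hypothetical, and thumbnail generation is omitted.

```python
import json
import sqlite3
from pathlib import Path
from typing import Callable, List, Tuple

# (class id, center_x, center_y, width, height) in YOLO's normalized form.
Detection = Tuple[int, float, float, float, float]


def create_annotation(
    conn: sqlite3.Connection,
    annotation_id: str,
    image: bytes,
    detections: List[Detection],
    images_dir: Path,
    labels_dir: Path,
    publish_sse: Callable[[str], None],
) -> None:
    # Steps 1-2: annotation, detection, and outbox rows commit atomically.
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("INSERT INTO annotations (id) VALUES (?)", (annotation_id,))
        conn.executemany(
            "INSERT INTO detection (annotation_id, class_num, center_x, center_y, width, height)"
            " VALUES (?, ?, ?, ?, ?, ?)",
            [(annotation_id, *d) for d in detections],
        )
        conn.execute(
            "INSERT INTO annotations_queue_records (operation, annotation_ids) VALUES (?, ?)",
            ("Created", json.dumps([annotation_id])),
        )
    # Step 3: post-commit side effects; rewriting the same paths keeps retries idempotent.
    (images_dir / f"{annotation_id}.jpg").write_bytes(image)
    label = "\n".join(f"{c} {x} {y} {w} {h}" for c, x, y, w, h in detections)
    (labels_dir / f"{annotation_id}.txt").write_text(label)
    publish_sse(annotation_id)
    # Step 4: nothing to do here; FailsafeProducer drains annotations_queue_records later.
```

If an insert fails (for example, a duplicate id), the `with conn:` block rolls back all three tables and no file is written, so there is no orphan file to clean up.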
**Implications**:
- FS write order shifts: today image is first, before any DB row; after the refactor, DB rows + outbox commit first, then FS writes execute (with the outbox row as the recovery anchor).
- A new abstraction (e.g., `AnnotationLifecycleTransaction` or a thin extension on `AppDataConnection`) is the right place to centralize this. Implementation deferred to RB-03.
**Alternatives considered**:
1. Pure DB transaction wrapping current order — rejected: doesn't cover FS, leaves orphan-file risk.
2. Saga / compensation steps with explicit rollback handlers — rejected: overkill for the linear lifecycle here.
**Status**: agreed. Implementation lives in the Refactor Backlog (RB-03).
### ADR-009: Lifecycle observability — World B (planned)
**Context**: Today only `CreateAnnotation` publishes SSE and enqueues the outbox. Update / UpdateStatus / Delete (annotations) and UpdateStatus / BulkUpdateStatus (dataset) are silent. The `QueueOperation` enum already declares `Validated` and `Deleted`, and `FailsafeProducer.cs:108–123` has a dedicated drainer branch for both — strong evidence that the design always intended every lifecycle change to be observable. The producer side simply was never wired (the prior WPF codebase blended UI + backend; lifecycle calls likely came from the UI directly, which the new HTTP backend has not replicated).
**Decision** (resolved 2026-05-14, to-be-implemented): every annotation mutation publishes SSE and enqueues the outbox.
Mapping (initial; sub-questions to be resolved at implementation time):
| Mutation | SSE | Outbox `QueueOperation` |
|----------|-----|--------------------------|
| `AnnotationService.CreateAnnotation` | yes (today) | `Created` (today) |
| `AnnotationService.UpdateAnnotation` (replace detections, status → `Edited`) | yes | open: re-enqueue as `Created` (richer payload) **or** add `QueueOperation.Updated` + corresponding drainer branch |
| `AnnotationService.UpdateStatus` (status → `Validated (30)` or `Deleted (40)`) | yes | `Validated` |
| `AnnotationService.UpdateStatus` (other transitions) | yes | open: skip outbox, or always enqueue `Validated`? |
| `AnnotationService.DeleteAnnotation` | yes | `Deleted`; **soft-delete**: status flips to `AnnotationStatus.Deleted = 40`, the row stays, and image / label / thumbnail files relocate to a `deleted_dir` (new `directory_settings` column added by RB-01) |
| `DatasetService.UpdateStatus` / `BulkUpdateStatus` | yes (per-id for bulk) | `Validated` (single record covers the whole bulk via `AnnotationIds`) |
**Status**: agreed. Implementation lives in the Refactor Backlog (RB-01).
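The mapping table can be read as one dispatch function, which turns "every mutation publishes" into a checkable invariant. This Python sketch is illustrative only. Mutation names are plain strings, and `Effect` and `lifecycle_effect` are hypothetical. The two open sub-questions are filled with placeholder answers that are not decisions: `UpdateAnnotation` re-enqueues as `Created`, and other status transitions skip the outbox.

```python
from enum import Enum, IntEnum
from typing import NamedTuple, Optional


class AnnotationStatus(IntEnum):
    # Only the two values ADR-009 states; other statuses are left out of this sketch.
    VALIDATED = 30
    DELETED = 40


class QueueOperation(Enum):
    CREATED = "Created"
    VALIDATED = "Validated"
    DELETED = "Deleted"


class Effect(NamedTuple):
    publish_sse: bool
    outbox: Optional[QueueOperation]  # None means "no outbox record"


def lifecycle_effect(mutation: str, new_status: Optional[int] = None) -> Effect:
    if mutation == "CreateAnnotation":
        return Effect(True, QueueOperation.CREATED)
    if mutation == "UpdateAnnotation":
        # Open question (a): placeholder; the alternative is a new Updated operation.
        return Effect(True, QueueOperation.CREATED)
    if mutation == "UpdateStatus":
        if new_status in (AnnotationStatus.VALIDATED, AnnotationStatus.DELETED):
            return Effect(True, QueueOperation.VALIDATED)
        # Open question (b): placeholder; skip the outbox for other transitions.
        return Effect(True, None)
    if mutation == "DeleteAnnotation":
        return Effect(True, QueueOperation.DELETED)
    if mutation in ("DatasetUpdateStatus", "DatasetBulkUpdateStatus"):
        # Bulk: SSE per id, one Validated record covering all AnnotationIds.
        return Effect(True, QueueOperation.VALIDATED)
    raise ValueError(f"unmapped mutation: {mutation}")
```

An unmapped mutation raises instead of defaulting to silence. A unit test can also assert that `publish_sse` is true for every mapped name.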
### ADR-010: Remove `system_settings.silent_detection`
**Context**: `silent_detection` was a debug-time switch to keep the RabbitMQ stream clean while a developer iterated locally. Now that the suite has e2e tests with isolated queues (per `_docs/_repo-config.yaml` suite-e2e), the in-product flag is dead code — debug isolation belongs in the test harness, not in `system_settings`.
**Decision** (resolved 2026-05-14, to-be-implemented):
- Remove the gating block in `AnnotationService.CreateAnnotation:100–102` (always enqueue).
- Drop `silent_detection` from `system_settings` (column, entity, migrator `CREATE TABLE`, migrator `ALTER` line, any DTO references).
- Remove the field from `UpdateSystemSettingsRequest` if present.
**Status**: agreed. Implementation lives in the Refactor Backlog (RB-02). Schema column removal is a destructive change explicitly authorized by the maintainer.
### ADR-012: Rename `Flight` → `Mission` to align with suite canonical (planned)
**Context**: The suite product spec (`suite/_docs/01_annotations.md`) calls the domain concept `mission` / `missionId`. The code uses `Flight` / `FlightId`, backed by the `media.waypoint_id` column and exposed as the `FlightId` DTO filter. This drift was flagged in `00_discovery.md`.
**Decision** (resolved 2026-05-14, to-be-implemented): align code to the suite. Rename `Flight*` → `Mission*` across DTOs, controllers, services, and the relevant query-parameter names. The `media.waypoint_id` column stays (it is the underlying physical identifier; mission is the logical grouping concept above it).
**Status**: agreed. Implementation lives in the Refactor Backlog (RB-07). The rename is confined to DTOs and code; no DB column rename is required for this ADR.
### ADR-013: Stream consumer dedupe contract is owned by this service (planned)
**Context**: The failsafe outbox + RabbitMQ Stream pipeline can produce duplicate stream entries when (a) the drainer retries after a partial publish or (b) two service instances both pick up the same outbox row before either deletes it. Today there is no documented dedupe contract; consumers (admin sync, AI training) silently accept whatever they get.
**Decision** (resolved 2026-05-14, to-be-implemented): publish a documented dedupe contract owned by this service. Working shape: consumers MUST dedupe by `(annotationId, operation, dateTime)`. The outbox row's `DateTime` (already populated by `EnqueueAsync`) becomes part of the on-the-wire stream message, alongside the `annotationId` and `operation` already in `AnnotationQueueMessage` / `AnnotationBulkQueueMessage`.
**Status**: agreed. Implementation lives in the Refactor Backlog (RB-09).
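A consumer-side sketch of the ADR-013 contract in Python. It assumes each stream message exposes the three key fields; `StreamMessage` and its field names are hypothetical. Exact duplicates of one outbox row collapse to one, whether they come from a drainer retry or from two instances draining the same row. A later operation on the same annotation has a different `dateTime` and passes through.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator


@dataclass(frozen=True)
class StreamMessage:
    annotation_id: str
    operation: str
    date_time: datetime  # the outbox row's DateTime, carried on the wire per ADR-013


def dedupe(messages: Iterable[StreamMessage]) -> Iterator[StreamMessage]:
    """Yield each (annotationId, operation, dateTime) key at most once."""
    seen: set[tuple[str, str, datetime]] = set()
    for msg in messages:
        key = (msg.annotation_id, msg.operation, msg.date_time)
        if key in seen:
            continue  # same outbox row delivered again
        seen.add(key)
        yield msg


if __name__ == "__main__":
    t1 = datetime(2026, 5, 14, 10, 0, tzinfo=timezone.utc)
    t2 = datetime(2026, 5, 14, 10, 5, tzinfo=timezone.utc)
    stream = [
        StreamMessage("a1", "Created", t1),
        StreamMessage("a1", "Created", t1),  # drainer retry / second instance
        StreamMessage("a1", "Validated", t2),
    ]
    print([m.operation for m in dedupe(stream)])  # ['Created', 'Validated']
```

The in-memory `seen` set grows without bound. A real consumer would need a persisted or time-windowed store, which the contract as written does not specify.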
### ADR-011: Detection class catalog is admin-managed with in-memory cache (planned)
**Context**: `detection_classes` is currently seeded by the migrator (19 rows) and read-only via `GET /classes`. Operators have no way to add or correct classes (e.g., the `Smoke`/`Plane` color clash on `#000080`) without a code change and redeploy.
**Decision** (resolved 2026-05-14, to-be-implemented):
- `ClassesController` exposes `POST /classes`, `PUT /classes/{id}`, `DELETE /classes/{id}` under `[Authorize(Policy = "ADM")]`. `GET /classes` stays `[Authorize]`.
- Reads go through a new `DetectionClassCache` (DI singleton) modeled on `PathResolver`: lazy-load on first read, `Reset()` after any write.
- Migrator-seeded rows remain as the bootstrap state; admin writes overwrite them per id.
**Status**: agreed. Implementation lives in the Refactor Backlog (RB-06). Adds a new feature surface; must land before any UI change relying on dynamic class management.
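A Python sketch of the read-through behavior ADR-011 describes: load on the first read, drop the cached copy on `reset()`, and reload on the next read. `DetectionClassCache` here stands in for the planned C# singleton. The loader is a hypothetical callable that replaces the `detection_classes` query. The lock is an assumption about concurrency handling that the ADR does not specify. The sketch does not attempt to mirror `PathResolver` itself.

```python
import threading
from typing import Callable, Optional


class DetectionClassCache:
    """Lazy-loading cache that is cleared after every admin write."""

    def __init__(self, loader: Callable[[], dict[int, str]]) -> None:
        self._loader = loader  # stands in for a query over detection_classes
        self._lock = threading.Lock()
        self._classes: Optional[dict[int, str]] = None

    def get_all(self) -> dict[int, str]:
        with self._lock:
            if self._classes is None:  # first read, or first read after reset()
                self._classes = self._loader()
            return dict(self._classes)  # copy, so callers cannot mutate the cache

    def reset(self) -> None:
        with self._lock:
            self._classes = None  # the next get_all() reloads


if __name__ == "__main__":
    table = {1: "Smoke", 2: "Plane"}  # toy rows, not the real seed
    cache = DetectionClassCache(lambda: dict(table))
    print(cache.get_all())
    table[3] = "Truck"  # simulated POST /classes
    print(3 in cache.get_all())  # False: still the cached copy
    cache.reset()  # what the write endpoint would call
    print(3 in cache.get_all())  # True
```

Holding the lock while loading means that concurrent first readers wait for a single load instead of each querying the database. The trade-off is that a slow load blocks all readers.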
## Resolved Architectural Decisions (Step 4 verification)
The following items were surfaced during verification and resolved with the maintainer on 2026-05-14. Each one either becomes an ADR above or maps to a refactor backlog entry below.
| # | Concern | Resolution | Tracked as |
|---|---------|------------|------------|
| 1 | Update / Delete / dataset-status changes are silent on SSE + outbox | Treat as gap; lifecycle is observable (World B) — every mutation publishes + enqueues | ADR-009 / RB-01 |
| 2 | `system_settings.silent_detection` semantics | Remove the flag; e2e harness covers debug isolation now | ADR-010 / RB-02 |
| 3 | F1 not transactional across FS + DB + outbox | Wrap lifecycle in a business-transaction (transactional outbox); FS writes happen post-commit | ADR-008 / RB-03 |
| 4 | `XxHash64` over sampled bytes — collision risk | Switch to `XxHash3.Hash128` over the same sample (file-size-independent + 128-bit space) | ADR-004 / RB-04 |
| 5 | `FailsafeProducer.EnqueueAsync` static method does DB I/O, violating `coderule.mdc` | Accept as-is; documented deviation from the rule | (no refactor) |
| 6 | `detection_classes` schema-mutable but no controller writes | Admin-managed CRUD with read-through cache (modeled on `PathResolver`) | ADR-011 / RB-06 |
| 7 | `Flight` (code) vs `mission` (suite spec) drift | Rename code → `Mission*`; suite spec stays canonical | ADR-012 / RB-07 |
| 8 | Dataset writes coupled directly to annotation rows via shared `AppDataConnection` | Route dataset writes through `AnnotationService` (via a public domain interface) | RB-08 |
| 9 | Stream consumer dedupe contract owner | This service owns it; dedupe by `(annotationId, operation, dateTime)` baked into the wire message | ADR-013 / RB-09 |
| 10 | Hard-delete vs soft-delete on `DeleteAnnotation` | Soft-delete: status → `Deleted (40)`, files moved to a `deleted_dir` | ADR-009 (folded in) / RB-01 |
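Row 4 can be quantified with the usual birthday approximation. The chance that at least two of n uniformly random b-bit hashes collide is p ≈ 1 − exp(−n(n−1) / 2^(b+1)). The Python sketch below evaluates this for an illustrative n of ten million ids, which is not a measured figure. It models only hash-space collisions. Two different files whose sampled bytes are identical hash identically at any width, so the bound says nothing about that case.

```python
import math


def collision_probability(n_ids: int, hash_bits: int) -> float:
    """Birthday-bound chance that at least two of n_ids random hashes collide."""
    # p = 1 - exp(-x) with x = n(n-1) / 2^(bits+1); expm1 stays accurate for tiny x.
    x = n_ids * (n_ids - 1) / 2.0 ** (hash_bits + 1)
    return -math.expm1(-x)


if __name__ == "__main__":
    n = 10_000_000  # illustrative id count, not a measured figure
    print(f"64-bit  (16 hex chars): {collision_probability(n, 64):.2e}")
    print(f"128-bit (32 hex chars): {collision_probability(n, 128):.2e}")
```

At ten million ids, the probability is about 2.7 × 10⁻⁶ for 64 bits and about 1.5 × 10⁻²⁵ for 128 bits.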
## Remaining Open Architectural Risks
These are residual risks that still need attention from later autodev steps (Test Spec, Refactor, Security Audit). Items previously listed here that have been resolved as of 2026-05-14 (Flight/mission drift, dataset coupling, hard-vs-soft delete, JWT issuer/audience validation, CORS environment gating, dev secret fallback) moved to the Resolved Architectural Decisions table above and the Refactor Backlog below.
1. **Horizontal scaling**: the SSE channel is per-instance (singleton `AnnotationEventService`), and the failsafe outbox uses no leasing or locking. With two pods, each drains rows independently and deletes them by `id`. Under high concurrency, both pods can pick up the same row before either deletes it, which produces duplicate stream entries. Consumers must dedupe per ADR-013. (RB-03 and RB-09 touch this indirectly but do not solve it.)
2. **Swagger exposure** in production: see ADR-005. Belongs to Step 14 (Security Audit). (CORS exposure was resolved by `CorsConfigurationValidator`; ADR-006 retired.)
3. **`UserId` body field vs JWT `NameIdentifier`** drift (suite spec lists `UserId` on `POST /annotations`; code uses JWT subject). Reconcile in the suite spec.
4. **No automated tests**: addressed by autodev Phase A Steps 3–7 (Test Spec → Implement Tests → Run Tests).
5. **`FailsafeProducer.cs:138` swallows `IOException` on image read silently** (`catch { }`). Direct `coderule.mdc` violation. Symptom in product: a missing or unreadable image yields a stream message with `image = null` and no log/metric — the gap is invisible to operators. Track on Refactor Backlog (RB-05).
6. **JWKS HTTPS-only retrieval blocks containerized test harnesses** that would otherwise serve a static JWKS over plain HTTP. Tests must either run a TLS-terminating sidecar in the test compose stack or rely on test-only configuration that relaxes `RequireHttps`. Not a production risk; a Step 4 (Code Testability Revision) item.
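One possible shape for the RB-05 fix behind item 5, sketched in Python. `OSError` plays the role of .NET's `IOException`. `Metrics`, the logger name, and `read_image_for_message` are hypothetical. Narrowing the handler to I/O errors is a choice made in this sketch. Unlike `catch { }`, any other exception propagates instead of being hidden.

```python
import logging
from pathlib import Path
from typing import Optional

log = logging.getLogger("failsafe_producer")


class Metrics:
    """Hypothetical counter holder; a real service would use its metrics library."""

    def __init__(self) -> None:
        self.image_read_failures = 0


def read_image_for_message(path: Path, annotation_id: str,
                           metrics: Metrics) -> Optional[bytes]:
    try:
        return path.read_bytes()
    except OSError as exc:  # missing file, permissions, disk errors
        # Unlike an empty catch, the failure is logged with context and counted.
        log.error("image read failed for annotation %s at %s: %s",
                  annotation_id, path, exc)
        metrics.image_read_failures += 1
        return None  # the message can still go out, but the gap is now visible
```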
## Refactor Backlog
These items are the implementation work for the resolved decisions above. They are **not** part of Step 4 (Verification) corrections — they will be picked up by the autodev existing-code flow at Step 8 (Refactor) and/or new feature tasks in Phase B.
| ID | Scope | Source ADR / Risk | Notes |
|----|-------|--------------------|-------|
| RB-01 | Wire lifecycle publish + outbox enqueue across Update / UpdateStatus / Delete (annotations + dataset). Includes the soft-delete behavior: `DeleteAnnotation` flips `AnnotationStatus → Deleted (40)`, leaves the row, and moves image / label / thumbnail files to a new `deleted_dir` (added to `directory_settings`). Read paths must filter `Status = Deleted (40)` from default lists. | ADR-009 | Open sub-questions: (a) `UpdateAnnotation` mapping — re-enqueue as `Created` or add `QueueOperation.Updated` + drainer branch; (b) which non-Validated/Deleted status transitions enqueue at all |
| RB-02 | Remove `silent_detection` (schema column, entity field, gating logic, DTOs) | ADR-010 | Destructive schema change explicitly authorized |
| RB-03 | Introduce business-transaction wrapper (transactional outbox) for annotation lifecycle | ADR-008 | Reorders FS writes to post-commit; covers all mutation paths |
| RB-04 | Switch annotation id hashing to `XxHash3.Hash128` over the same sampled buffer | ADR-004 | Existing 16-char ids stay; new ids are 32-char hex |
| RB-05 | Replace `catch { }` at `FailsafeProducer.cs:138` with a logged failure path; surface it as a metric | Open Risk §5 | Downstream consumers should know that an image-less message means a real disk error |
| RB-06 | Admin-managed `detection_classes` (CRUD endpoints `[ADM]`, in-memory cache with `Reset()`) | ADR-011 | Migrator seed remains as bootstrap; admin overrides per id; fix `Smoke`/`Plane` color collision while at it |
| RB-07 | Rename `Flight*` → `Mission*` across DTOs, controllers, services, and query-parameter names. The `media.waypoint_id` column is unchanged (it's the physical id; mission is the logical concept). | ADR-012 | Code-only rename to align with suite spec; suite stays canonical |
| RB-08 | Decouple `04 Dataset` writes from direct `annotations` row mutations — route status writes through a public `AnnotationService` interface. Reads can stay direct for now (read coupling is lower-risk than write coupling). | Open Risks (former §4) | Likely introduces an `IAnnotationLifecycle` (or similar) interface owned by `01 Annotations REST` that `04 Dataset` consumes via DI |
| RB-09 | Bake `(annotationId, operation, dateTime)` into the on-the-wire stream message; document the dedupe contract in `suite/_docs/01_annotations.md`. | ADR-013 | Coordinate the suite-doc update with admin sync + AI training maintainers |
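A Python sketch of the RB-01 soft-delete path, with `sqlite3` and local directories standing in for the real database and file store. The `status` column name, the file-name suffixes for image / label / thumbnail, and the function names are all hypothetical. The sketch shows the behavior stated in RB-01: the status flips to 40 and the row stays, the files move to `deleted_dir`, and the default list query filters out status 40. Files move after the commit, which matches the post-commit FS ordering in ADR-008.

```python
import shutil
import sqlite3
from pathlib import Path

DELETED = 40  # AnnotationStatus.Deleted
# Hypothetical naming for the three files that belong to one annotation.
FILE_SUFFIXES = (".jpg", ".txt", ".thumb.jpg")  # image, label, thumbnail


def soft_delete(conn: sqlite3.Connection, annotation_id: str,
                data_dir: Path, deleted_dir: Path) -> None:
    # Flip the status; the row itself is kept.
    with conn:
        cur = conn.execute("UPDATE annotations SET status = ? WHERE id = ?",
                           (DELETED, annotation_id))
        if cur.rowcount == 0:
            raise KeyError(annotation_id)
    # After the commit, relocate whichever of the files exist.
    deleted_dir.mkdir(parents=True, exist_ok=True)
    for suffix in FILE_SUFFIXES:
        src = data_dir / f"{annotation_id}{suffix}"
        if src.exists():
            shutil.move(str(src), str(deleted_dir / src.name))


def list_annotations(conn: sqlite3.Connection) -> list[str]:
    # Default read path: soft-deleted rows are hidden.
    rows = conn.execute("SELECT id FROM annotations WHERE status <> ? ORDER BY id",
                        (DELETED,))
    return [row[0] for row in rows]
```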
## References
- Suite product spec: `../../../suite/_docs/01_annotations.md` (REST contracts, SSE, Annotation Sync, camera, classes).
- Suite dataset narrative: `../../../suite/_docs/09_dataset_explorer.md`.
- Component specs: `components/01..06_*/description.md`.
- Module docs: `modules/*.md`.
- File ownership (downstream skills): `module-layout.md`.
- Component diagram: `diagrams/components.md`.
- Per-flow diagrams: `diagrams/flows/`.