refactor: enhance JWT authentication and CORS configuration

Updated JWT authentication to use configuration values instead of hardcoded secrets, improving security and flexibility. Enhanced CORS policy to conditionally allow origins based on configuration settings, with logging for permissive defaults. Updated README to reflect project renaming and clarify service context.
This commit is contained in:
Oleksandr Bezdieniezhnykh
2026-05-14 19:48:25 +03:00
parent 2fe394d732
commit 7025f4d075
74 changed files with 8494 additions and 19 deletions
@@ -0,0 +1,96 @@
# CI / CD Pipeline
> **NOTE (forward-looking)**: image registry path reflects the **post-rename** state. Today's pipeline pushes `azaion/flights:${BRANCH}-arm`. Rename tracked under Jira AZ-EPIC child B10 (Dockerfile + Woodpecker + suite compose).
## Source
`./.woodpecker/build-arm.yml` (single CI file). One job: build + push the container image.
```yaml
when:
event: [push, manual]
branch: [dev, stage, main]
labels:
platform: arm64
steps:
- name: build-push
image: docker
environment:
REGISTRY_HOST: { from_secret: registry_host }
REGISTRY_USER: { from_secret: registry_user }
REGISTRY_TOKEN: { from_secret: registry_token }
commands:
- echo "$REGISTRY_TOKEN" | docker login "$REGISTRY_HOST" -u "$REGISTRY_USER" --password-stdin
- export TAG=${CI_COMMIT_BRANCH}-arm
- export BUILD_DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
- |
docker build -f Dockerfile \
--build-arg CI_COMMIT_SHA=$CI_COMMIT_SHA \
--label org.opencontainers.image.revision=$CI_COMMIT_SHA \
--label org.opencontainers.image.created=$BUILD_DATE \
--label org.opencontainers.image.source=$CI_REPO_URL \
-t $REGISTRY_HOST/azaion/missions:$TAG . # post-B10
- docker push $REGISTRY_HOST/azaion/missions:$TAG # post-B10
volumes:
- /var/run/docker.sock:/var/run/docker.sock
```
## Triggers
| Trigger | Branch filter | Outcome |
|---------|---------------|---------|
| `push` | `dev`, `stage`, `main` | Build + push the corresponding `${BRANCH}-arm` image tag |
| `manual` | any of `dev`, `stage`, `main` | Same as `push` — used for rebuilding without a code change (e.g., after a base-image security patch) |
| `pull_request`, other branches | — | **Not built today.** Feature branches do not produce images |
## Runner / platform
- **`labels: platform: arm64`** — the pipeline runs on an ARM64-labeled Woodpecker runner. The build is therefore **native** for the ARM64 image variant (no QEMU). The Dockerfile's `--platform=$BUILDPLATFORM` for the build stage ensures the SDK image matches the runner's architecture.
- For the (currently absent) AMD64 variant a second pipeline file (`build-amd.yml`) on an AMD64-labeled runner would be the natural pattern.
## Secrets
| Secret | Source | Purpose |
|--------|--------|---------|
| `registry_host` | Woodpecker secret store | Hostname of the suite's container registry (Caddy-fronted Gitea Container Registry per `../../../suite/_docs/00_top_level_architecture.md`) |
| `registry_user` | Woodpecker secret store | Service account that has push rights to `azaion/missions` |
| `registry_token` | Woodpecker secret store | Personal access token for the service account |
`docker login` is piped via stdin (`--password-stdin`) — token never lands on the command line or in process listings.
## OCI labels emitted
Three standard OCI labels are baked into every published image:
| Label | Value | Source |
|-------|-------|--------|
| `org.opencontainers.image.revision` | `$CI_COMMIT_SHA` | Git commit driving this build |
| `org.opencontainers.image.created` | `$BUILD_DATE` (ISO 8601 UTC) | `date -u +%Y-%m-%dT%H:%M:%SZ` at build time |
| `org.opencontainers.image.source` | `$CI_REPO_URL` | Suite Gitea repo URL |
These let `docker inspect` answer "which commit and when?" without consulting the registry's metadata.
## What the pipeline does **NOT** do (carry-forward improvements)
- **No `dotnet test` step** — there is no test project in the repo today. Tracked in `../../../suite/_docs/_process_leftovers/2026-04-22_ci-unit-test-lane-missing-projects.md`. When a `tests/Azaion.Missions.Tests/` sibling lands (autodev existing-code Steps 57), the natural insert is a `name: test` step that runs `dotnet test --collect "XPlat Code Coverage"` against the build before the docker step.
- **No security scan** — neither container scanning (Trivy / Grype on the published image) nor source scanning (CodeQL / Semgrep) is wired. Suite-level concern; out of this Epic.
- **No migration check** — the migrator runs at app startup (Flow F6). A CI-time "does the migrator's DDL parse cleanly against an empty PG" smoke test is the next-cheapest safety net once tests exist.
- **No SBOM** — `docker buildx --sbom` would surface base-image CVEs at registry-push time. Carry-forward.
- **No image-signing** — Notary v2 / cosign is not wired. Carry-forward; suite-wide concern.
- **No multi-arch matrix** — only `arm64` builds today (see `containerization.md` § Multi-arch limitation).
## Pipeline run-time characteristics
- Single step, single image. Typical wall-clock time: 25 minutes on the suite's ARM64 runner (most of it `dotnet publish` + the runtime image layer pull).
- No caching layer between builds today. `dotnet restore` re-downloads NuGet packages on every run. A persistent `~/.nuget` volume on the runner would shave ~30 seconds per build.
## Failure modes
| Failure | Symptom in Woodpecker | Recovery |
|---------|------------------------|----------|
| `dotnet publish` fails | Build step fails with compilation errors | Standard — fix source, push again |
| Registry login fails (rotated token) | `docker login` exits non-zero | Rotate `registry_token` in Woodpecker secrets |
| Registry push rate-limited | `docker push` 429 | Retry the manual run; rare in practice with Gitea-hosted registry |
| ARM64 runner offline | Pipeline waits in queue | Bring the runner back online; pipeline auto-resumes |
@@ -0,0 +1,61 @@
# Containerization
> **NOTE (forward-looking)**: image tag, csproj name, and entrypoint reflect the **post-rename** state. Today's `Dockerfile` ENTRYPOINT is `dotnet Azaion.Flights.dll` and the image tag base is `azaion/flights`. Renames tracked under Jira AZ-EPIC children B5 (csproj/namespace) and B10 (Dockerfile entrypoint + Woodpecker image tag).
## Source
`./Dockerfile` (single Dockerfile at the repo root). Multi-arch via build args.
## Build strategy: multi-arch from a single Dockerfile
```dockerfile
FROM --platform=$BUILDPLATFORM mcr.microsoft.com/dotnet/sdk:10.0 AS build
ARG TARGETARCH
WORKDIR /src
COPY . .
RUN arch=$([ "$TARGETARCH" = "amd64" ] && echo "x64" || echo "$TARGETARCH") && \
dotnet publish -c Release -o /app --os linux --arch $arch
FROM mcr.microsoft.com/dotnet/aspnet:10.0
ARG CI_COMMIT_SHA=unknown
ENV AZAION_REVISION=$CI_COMMIT_SHA
WORKDIR /app
COPY --from=build /app .
EXPOSE 8080
ENTRYPOINT ["dotnet", "Azaion.Missions.dll"] # post-B5 + B10
```
Key choices:
- **`--platform=$BUILDPLATFORM`** on the build stage — the SDK runs on the **builder's** native architecture (typically AMD64 in CI), but `dotnet publish --os linux --arch $arch` cross-publishes for the **target** architecture (`arm64` for Jetson / OPi; `amd64` for operator-PC). This avoids needing QEMU emulation for the (slow) build phase.
- **Two-stage**: SDK image (~800 MB) for the build, runtime image (`mcr.microsoft.com/dotnet/aspnet:10.0`, ~210 MB) for the published artifacts. Final image carries no SDK, no source code, no `.csproj`.
- **`AZAION_REVISION` env var** baked from `CI_COMMIT_SHA` at build time — surfaces the source commit at runtime for support / triage. Not consumed by the application code today; visible only via `docker inspect` or `env` in the running container.
- **`EXPOSE 8080`** matches the ASP.NET Core default (no `ASPNETCORE_URLS` override) and the edge compose port mapping `5002:8080`.
## What the Dockerfile does **NOT** do (carry-forward improvements)
- **No `.dockerignore`** — every file under the repo root is copied into the build context, including `_docs/`, `.cursor/`, and the previously-committed `obj/` / `bin/` (cleaned by `dotnet publish` but still copied across the wire). A `.dockerignore` would shrink the build context meaningfully on Jetson-class builders. Tracked as opportunistic improvement.
- **No `HEALTHCHECK` directive** — `/health` exists in the application (Flow F7) but the container itself does not declare a healthcheck. Compose-level healthcheck is the suite's expected mechanism.
- **No non-root `USER`** — the runtime image runs as root. The `aspnet:10.0` base image supports a non-root user (`USER app`) since .NET 8; switching is a one-line change on the next refresh and would tighten container isolation.
- **No build-time test pass** — no `dotnet test` step. The repo has no test project today (tracked in `../../../suite/_docs/_process_leftovers/2026-04-22_ci-unit-test-lane-missing-projects.md`); when a `tests/Azaion.Missions.Tests/` project lands, a pre-publish `dotnet test` step is the natural next addition.
## Image lifecycle on edge devices
1. Woodpecker builds + pushes `${REGISTRY_HOST}/azaion/missions:${BRANCH}-arm` (post-B10) on every push to `dev` / `stage` / `main` (see `ci_cd_pipeline.md`).
2. Watchtower running on each edge device polls the registry; on a new digest for the device's pinned tag, it pulls and re-creates the container.
3. `flight-gate` (per `../../../suite/_docs/00_top_level_architecture.md`) prevents container restart mid-mission. Once the active mission completes, the new image becomes live.
4. Container starts → `Program.cs` → migrator → `app.Run()` (Flow F6).
## Image variants and tag strategy
| Branch | Tag (post-B10) | Audience |
|--------|----------------|----------|
| `dev` | `azaion/missions:dev-arm` | Engineers; rolling latest-dev |
| `stage` | `azaion/missions:stage-arm` | Pre-prod customer demo devices |
| `main` | `azaion/missions:main-arm` | Production edge fleets |
**No semantic version tags today** (no `v1.2.3`-style tags). Watchtower polls the named tags directly. Carry-forward improvement: add `${REGISTRY_HOST}/azaion/missions:${CI_COMMIT_SHA:0:8}-arm` alongside the branch tag so rollback to a specific build is possible without rebuilding.
## Multi-arch — current limitation
Today the Woodpecker pipeline (`ci_cd_pipeline.md`) builds **only the `arm64` variant** (the runner is ARM64-labeled and the tag suffix is `-arm`). For AMD64 (operator-PC) deployments, a second pipeline / second runner / `linux/amd64` build args would be needed. Out of this Epic — the current operator-PC deployment story uses local builds.
@@ -0,0 +1,101 @@
# Environment Strategy
> **NOTE (forward-looking)**: image tag, container name, and namespace reflect the **post-rename** state. Today's edge compose still references `azaion/flights:${BRANCH}-arm` and the container name is typically `flights`. Rename tracked under B10 (suite compose update).
## Environments
| Environment | Where it runs | Audience | Image tag (post-B10) |
|-------------|---------------|----------|----------------------|
| **Development** | Local workstation (`dotnet run` from the repo root, or `docker run` against a local PG) | Engineers | none — built ad-hoc |
| **Edge production** | Each customer-owned edge device (Jetson Orin / OrangePI / operator-PC), one container per device | Operators, autopilot, UI | `azaion/missions:dev-arm` / `stage-arm` / `main-arm` per device tier |
There is **no centralized staging environment** in the suite. Each edge device is its own deployment; the `stage` image tag is for pre-prod customer demo devices, not for a separate cloud staging cluster.
## Configuration sources (precedence)
`Program.cs` resolves each setting in this order:
1. `IConfiguration` (defaults: appsettings.json + ASPNETCORE_-prefixed env vars)
2. `Environment.GetEnvironmentVariable(...)` (legacy fallback for unprefixed env)
3. **Hardcoded dev fallback** (last resort)
### `DATABASE_URL`
| Environment | Value source | Resolved value |
|-------------|--------------|----------------|
| Development (no env set) | hardcoded fallback | `Host=localhost;Database=azaion;Username=postgres;Password=changeme` |
| Development (env set) | env var | Whatever the engineer sets (URL form OR raw Npgsql key=value form both work) |
| Edge production | env var passed via docker compose | `postgresql://postgres:${PG_LOCAL_PASSWORD}@postgres-local/azaion` (per `../../../suite/_docs/00_top_level_architecture.md`) |
**`ConvertPostgresUrl` helper** parses the URL form into Npgsql key=value form. **Does NOT URL-decode user/password** — credentials with `@`, `:`, `/`, `%` will be mis-parsed. `07_host` Caveats #4 calls this out. Mitigation: avoid those characters in the local PG password (the standard suite provisioning does), or pass a raw Npgsql key=value string instead of a URL.
### `JWT_SECRET`
| Environment | Value source | Resolved value |
|-------------|--------------|----------------|
| Development (no env set) | hardcoded fallback | `development-secret-key-min-32-chars!!` (**well-known; never use in production**) |
| Development (env set) | env var | Whatever the engineer sets |
| Edge production | env var (compose `env_file` or per-device-provisioned secret) | The shared HMAC secret used by `admin` + every backend service on the device. **Rotation is suite-coordinated** — see `architecture.md` § Security |
**Critical foot-gun (ADR-005 carry-forward)**: there is **no runtime gate** that blocks startup with the dev fallback in production. A misconfigured production deploy will silently boot with the well-known dev secret. The CMMC L2 scorecard tracks the broader fix at suite level (AZ-487 / AZ-494).
## Other configuration
These settings are **NOT** environment-overridable today (carry-forward improvements):
| Setting | Current behavior | Should it differ between dev and prod? |
|---------|------------------|----------------------------------------|
| Swagger UI mount | Always on | Yes — production should gate on `IsDevelopment()` (ADR-005) |
| CORS policy | `AllowAnyOrigin/Method/Header` always | Yes — production should restrict to the suite's reverse-proxy origin |
| HTTPS redirection | None — assumes upstream TLS termination | No — the suite's reverse proxy handles TLS; this is correct |
| Logging verbosity | ASP.NET Core defaults (Information+) | Probably not at this scale; `LogLevel:Default = Warning` could be useful in prod |
| Migrator `DROP TABLE IF EXISTS` (B9 one-shot) | Runs every startup; idempotent on already-cleaned devices | No — the idempotent design means this is safe everywhere |
## Edge compose excerpt (suite-wide pattern, post-B10)
Per `../../../suite/_docs/00_top_level_architecture.md` § Edge compose:
```yaml
services:
missions: # was: flights (pre-B10)
image: ${REGISTRY_HOST}/azaion/missions:${BRANCH:-main}-arm
container_name: missions
restart: unless-stopped
depends_on:
postgres-local:
condition: service_healthy
environment:
DATABASE_URL: postgresql://postgres:${PG_LOCAL_PASSWORD}@postgres-local/azaion
JWT_SECRET: ${JWT_SECRET}
ports:
- "5002:8080"
networks:
- azaion-edge
```
The actual file lives in the suite repo (`../../../suite/_infra/_compose/`) — the snippet here is illustrative.
## Secrets management
- **Local dev**: hardcoded fallbacks in `Program.cs` (the dev-secret values listed above).
- **Edge production**: env vars sourced from a per-device `.env` file (created at provisioning) OR a local secrets manager (per-customer choice — the suite supports both patterns). The `JWT_SECRET` is suite-wide (one value across all backend services on the device).
- **Rotation**: changing `JWT_SECRET` invalidates every issued token until new ones are minted. Coordinated procedure across the device's backend services + UI re-login. There is **no online rotation** — every backend must be restarted with the new secret simultaneously.
## Network / port layout
- Container `EXPOSE 8080`; bound HTTP only (no TLS in this service).
- Edge compose maps `5002:8080` per the suite convention.
- Reverse proxy (Caddy fronting the suite per `../../../suite/_docs/00_top_level_architecture.md`) terminates TLS upstream.
- No outbound network calls except to `postgres-local` (DB).
## Restart / lifecycle
- Container restart policy: `unless-stopped` in production compose (manually-stopped containers stay stopped; crash → auto-restart).
- Watchtower polls the registry and triggers re-create on new image digest.
- `flight-gate` (suite component) gates container restart so it does not happen mid-mission. Once the active mission completes, Watchtower's queued restart goes through.
- `missions` itself does not implement graceful shutdown beyond ASP.NET Core's defaults — in-flight HTTP requests are allowed to complete; idle connections are closed.
## Backup / disaster recovery
- **Out of scope for this service.** PostgreSQL backup is a per-device, suite-level concern (not per-service). Each edge device runs `postgres-local`; backup cadence and offsite replication are decided at provisioning time and documented at suite level (currently informally).
- **No application-level export** — the only way to read mission / waypoint data out of the device is through the API or `pg_dump` against `postgres-local`.
@@ -0,0 +1,75 @@
# Observability
> **Honest assessment**: observability in this service is **minimal** today. This document records what exists, what does not, and what the natural next steps are. It is intentionally short — there is not much to describe.
## What exists
### Logging
- **Provider**: ASP.NET Core defaults (Console + Debug providers via `Microsoft.Extensions.Logging`). No Serilog, no NLog, no structured logging.
- **Format**: plain text to stdout (Console provider). Docker collects via the standard container log stream; `docker logs missions` reveals everything.
- **Application-emitted logs** (only):
| Source | Level | When | Message |
|--------|-------|------|---------|
| `06_http_conventions/ErrorHandlingMiddleware` | `ERROR` | Unhandled exception caught by the catch-all branch | `"Unhandled exception"` (with stack trace via `LogError(ex, ...)`) |
No `INFO`, `DEBUG`, `WARN` logs are emitted by application code.
- **Framework logs**: `JwtBearerHandler` and ASP.NET Core's request pipeline log token-validation outcomes and request lifecycle at `Information` / `Debug` levels (`05_identity` § Logging). LinqToDB does NOT log SQL by default — `04_persistence` Caveats #7.
### Metrics
- **None.** No `Microsoft.Extensions.Diagnostics.Metrics` consumption, no Prometheus / OpenTelemetry exporters, no application-level counters.
### Tracing
- **None.** No `Activity` / `OpenTelemetry` instrumentation. No correlation IDs, no W3C `traceparent` propagation.
### Health endpoint
- `GET /health``{ "status": "healthy" }` (Flow F7). Confirms process liveness + HTTP pipeline serving. **Does NOT verify DB connectivity** today.
### Build-time metadata
- `AZAION_REVISION` env var baked from `CI_COMMIT_SHA` at build time (`Dockerfile`). Visible via `docker inspect` or `env` inside the running container, but **not surfaced via any HTTP endpoint** today.
## What does not exist (carry-forward)
### Correlation / request tracing
- No request ID generation (would be a 5-line middleware: emit `X-Request-Id` if absent, propagate to logs).
- No client-supplied correlation ID propagation.
- No way to grep logs by anything other than timestamp.
### Structured logging
- Console-only plaintext means log aggregation (Loki / ELK / Splunk) has to parse free-text. A switch to `Microsoft.Extensions.Logging.Console.JsonFormatter` (or Serilog with JSON sink) would emit `{ "Timestamp": ..., "Level": ..., "Message": ..., "Properties": {...} }` and make downstream querying viable.
### Audit logging
- No per-request audit trail (which user / token did what). The JWT's `sub` / `user_id` claim is **not consumed** today — `05_identity` Caveats #2. Adding it would unlock per-user attribution for vehicle changes and mission deletions.
### Application metrics
- No counters for: requests served, error rate, mission-delete cascade duration, `vehicles.is_default` race occurrences, JWT validation failure rate.
- No DB metrics: connection pool utilization, query latency p50 / p95 / p99, slow-query log.
- No SLO tracking — `architecture.md` § NFRs notes that no explicit latency budget is set.
### Distributed tracing
- The cross-service cascade (`missions``media` / `annotations` / `detection` rows in shared PG) would benefit from a single trace ID per cascade walk. Today it is one trace span (the HTTP request) that opaquely runs many SQL statements.
### DB liveness on `/health`
- Flow F7 § Future improvement notes the natural extension: `await db.ExecuteAsync("SELECT 1")` inside the `/health` handler. Today the migrator-at-startup is the only DB-availability gate.
## Natural next steps (in rough priority order)
1. **Request ID middleware + emit it from `ErrorHandlingMiddleware` log + Response header.** ~10 lines, immediately useful for production support.
2. **DB ping in `/health`** — flips `/health` from "process liveness" to "service readiness". Costs <1ms per probe.
3. **Switch console logger to JSON formatter** — flat-rate change; downstream log aggregation becomes feasible.
4. **Surface `AZAION_REVISION` via a `/version` endpoint** (one line on top of the existing `MapGet`) so support knows which build is on the device without `docker inspect`.
5. **OpenTelemetry instrumentation** — once #14 are in, a minimal OTel exporter (HTTP server + LinqToDB SqlClient activity) would give tracing for free.
These are deferred to **post-rename** work (out of this Epic). The rename refactor (B5B10) does not change observability, and the autodev BUILD pipeline (Steps 37 of existing-code flow) will exercise the existing code as-is. Adding observability is a separate pass that should land alongside the testing work in autodev cycle 2 or later.