mirror of
https://github.com/azaion/annotations.git
synced 2026-06-21 17:31:07 +00:00
docs: Step 4 testability refactor — list-of-changes + 2 task specs
autodev existing-code Step 4 (Code Testability Revision) — invoked
refactor skill in guided mode. Phase 0 (baseline) + Phase 1 (discovery
+ validation) + Phase 2 (analysis + task decomposition) artifacts.
list-of-changes.md identifies two surgical fixes required before the
67-scenario blackbox suite (already specified in _docs/02_document/
tests/) can run against the SUT:
C01 — env-gate JWKS RequireHttps on ASPNETCORE_ENVIRONMENT=E2ETest
(architecture.md Open Risks Section 6 prescribes this; the
mock issuer in e2e/docker-compose.test.yml serves plain HTTP)
C02 — DNS-resolve RABBITMQ_HOST in FailsafeProducer.ProcessQueue
(IPAddress.Parse currently throws on every drain cycle when
host is a service name; latent production-relevant bug, not
just a test-env issue)
Two task specs in _docs/02_tasks/todo/ (3 story points total).
Independent — no inter-task dependency.
Tracker: local — Atlassian MCP reported errored at task-creation
time. Deferred Jira writes (epic + 2 tickets) recorded in
_docs/_process_leftovers/2026-05-14_testability-tracker.md for
replay when MCP is restored.
Items explicitly deferred to Step 8 Refactor are enumerated in
list-of-changes.md "Deferred to Step 8 Refactor" — including the
FailsafeProducer static helper (F3), the JWKS GetAwaiter().GetResult()
hot path, RB-05/06/08 backlog items, and the MediaService ffprobe
empty-catch.
State: Step 4 in_progress, sub_step 3 (phase-2-task-decomposition).
Co-authored-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
@@ -0,0 +1,99 @@
|
||||
# Refactor: gate JWKS HTTPS requirement on `ASPNETCORE_ENVIRONMENT=E2ETest`
|
||||
|
||||
**Task**: PENDING_refactor_jwks_https_env_gate
|
||||
**Name**: JWKS HTTPS env gate
|
||||
**Description**: Gate the JWKS document retriever's `RequireHttps` flag on the ASP.NET Core environment name so the e2e test harness (which serves the mock issuer over plain HTTP on the test-only docker network) can fetch the public key set without weakening production HTTPS enforcement.
|
||||
**Complexity**: 1 point
|
||||
**Dependencies**: None
|
||||
**Component**: `06_platform` → Auth (`src/Auth/JwtExtensions.cs`)
|
||||
**Tracker**: pending (tracker MCP unavailable at task-creation time — see `_docs/_process_leftovers/2026-05-14_testability-tracker.md`)
|
||||
**Epic**: pending — `01-testability-refactoring`
|
||||
|
||||
## Problem
|
||||
|
||||
The JWKS retriever in `AddJwtAuth` is constructed with `new HttpDocumentRetriever { RequireHttps = true }`. This is correct for production (where `JWT_JWKS_URL` is `https://admin.azaion.com/.well-known/jwks.json`), but blocks the documented blackbox test harness, which serves a per-test ES256 public key over plain HTTP at `http://e2e-issuer:8080/.well-known/jwks.json`. With the constant true, the service throws on every first JWKS fetch and ~60 of the 67 test scenarios in `_docs/02_document/tests/` cannot exercise the real validation path.
|
||||
|
||||
## Outcome
|
||||
|
||||
- When `ASPNETCORE_ENVIRONMENT` is `E2ETest`, the JWKS retriever accepts plain-HTTP JWKS URLs.
|
||||
- For any other environment name (Development, Staging, Production, unset), the JWKS retriever continues to require HTTPS exactly as today.
|
||||
- No change to issuer / audience / algorithm pinning / signature / lifetime validation. The relaxation is strictly about the *transport* used to fetch the public-key document.
|
||||
|
||||
## Scope
|
||||
|
||||
### Included
|
||||
|
||||
- `src/Auth/JwtExtensions.cs` — `AddJwtAuth` method.
|
||||
- The way `RequireHttps` is set on the `HttpDocumentRetriever` argument passed to `ConfigurationManager<JsonWebKeySet>`.
|
||||
- Use of `IHostEnvironment` (already injected into `IServiceCollection` via the ASP.NET Core host) — either by adding the env name as a parameter on `AddJwtAuth`, or by resolving it from `IConfiguration`/the `ASPNETCORE_ENVIRONMENT` env var inline. Either approach is acceptable; smaller-diff option preferred.
|
||||
|
||||
### Excluded
|
||||
|
||||
- Any change to `TokenValidationParameters` (issuer/audience/lifetime/algorithm/signature).
|
||||
- Any change to the `IssuerSigningKeyResolver` lambda.
|
||||
- Any change to `JwksRetriever`.
|
||||
- Any change to policy registration (`ANN` / `DATASET` / `ADM`).
|
||||
- Adding new env vars or configuration keys.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
**AC-1: HTTPS still enforced outside the test environment**
|
||||
Given `ASPNETCORE_ENVIRONMENT` is unset, `Development`, `Staging`, or `Production`,
|
||||
When `AddJwtAuth` runs and `JWT_JWKS_URL` is `http://anywhere/jwks.json`,
|
||||
Then the service refuses to fetch JWKS over plain HTTP (existing `Microsoft.IdentityModel` behavior is preserved; the retriever throws `InvalidOperationException` for non-HTTPS URLs).
|
||||
|
||||
**AC-2: HTTPS relaxed under E2ETest only**
|
||||
Given `ASPNETCORE_ENVIRONMENT=E2ETest`,
|
||||
When `AddJwtAuth` runs and `JWT_JWKS_URL` is `http://e2e-issuer:8080/.well-known/jwks.json`,
|
||||
Then the JWKS document is fetched successfully over plain HTTP and used to populate the signing-key cache.
|
||||
|
||||
**AC-3: Validation semantics unchanged**
|
||||
Given any environment,
|
||||
When a token presents valid `iss`, `aud`, `exp`, ES256 signature, and `permissions` claim,
|
||||
Then the token is accepted exactly as today.
|
||||
|
||||
**AC-4: Forgery / tamper / cross-policy attacks still rejected**
|
||||
Given the harness from AC-2,
|
||||
When a token is presented with `alg=HS256`, with an expired `exp`, with a wrong `iss` or `aud`, or with a tampered payload,
|
||||
Then the request is rejected with 401 — same behavior as today.
|
||||
|
||||
## Non-Functional Requirements
|
||||
|
||||
**Compatibility**
|
||||
- No env var renames; no breaking change to operators' deployment configs.
|
||||
- No public method signature change on `AddJwtAuth` that breaks the single existing caller (`Program.cs:49`).
|
||||
|
||||
**Reliability**
|
||||
- The env-name read must be deterministic at startup (read once during `AddJwtAuth` invocation; do not re-read per request).
|
||||
|
||||
## Unit Tests
|
||||
|
||||
| AC Ref | What to Test | Required Outcome |
|
||||
|--------|--------------|------------------|
|
||||
| AC-1 | Constructing `HttpDocumentRetriever` under `EnvironmentName="Production"` | `RequireHttps == true` |
|
||||
| AC-2 | Constructing `HttpDocumentRetriever` under `EnvironmentName="E2ETest"` | `RequireHttps == false` |
|
||||
|
||||
(Unit tests above are illustrative — Step 6 will write the executable test code; this task only adjusts source.)
|
||||
|
||||
## Blackbox Tests
|
||||
|
||||
| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|
||||
|--------|------------------------|--------------|-------------------|----------------|
|
||||
| AC-2 | Docker e2e stack with `ASPNETCORE_ENVIRONMENT=E2ETest` and the mock issuer on HTTP | `POST /annotations` with a runner-minted ES256 token whose public key is published by the mock issuer | HTTP 200, body matches `AnnotationDto` (this is FT-P-01 from `blackbox-tests.md`) | — |
|
||||
| AC-3 | Same env | FT-P-12 (Bearer happy path) | HTTP 200 | — |
|
||||
| AC-4 | Same env | NFT-SEC-01..10 (signature mismatch / expired / wrong iss / wrong aud / alg confusion / tamper) | HTTP 401 every time | — |
|
||||
|
||||
## Constraints
|
||||
|
||||
- The integration with `Microsoft.IdentityModel.Protocols.ConfigurationManager<JsonWebKeySet>` must be preserved; only the constructor argument to `HttpDocumentRetriever` changes.
|
||||
- ASP.NET Core convention: read environment via `IHostEnvironment` rather than `Environment.GetEnvironmentVariable("ASPNETCORE_ENVIRONMENT")` directly where possible.
|
||||
|
||||
## Risks & Mitigation
|
||||
|
||||
**Risk 1: Wrong environment-name comparison style**
|
||||
- *Risk*: A case-sensitive equality check could miss `E2ETest` vs `e2etest` if a future operator types the env name differently.
|
||||
- *Mitigation*: Use `string.Equals(envName, "E2ETest", StringComparison.OrdinalIgnoreCase)` — matches the existing `CorsConfigurationValidator.EnsureSafeForEnvironment` pattern.
|
||||
|
||||
**Risk 2: Accidental Production opt-in**
|
||||
- *Risk*: An operator could set `ASPNETCORE_ENVIRONMENT=E2ETest` in production to silence an HTTPS-only error.
|
||||
- *Mitigation*: Documented as test-only in `architecture.md`. `Program.cs`'s `CorsConfigurationValidator` already runs an environment-aware safety check on a related axis; a future Step 8 hardening item can add a similar "Production refuses E2ETest token over HTTP" guard if desired. Out of scope for this task.
|
||||
@@ -0,0 +1,117 @@
|
||||
# Refactor: resolve RabbitMQ broker host via DNS in `FailsafeProducer`
|
||||
|
||||
**Task**: PENDING_refactor_rabbitmq_host_dns_resolution
|
||||
**Name**: RabbitMQ host DNS resolution
|
||||
**Description**: Replace `IPAddress.Parse(config.Host)` with a hostname-aware resolver (literal-IP shortcut + `Dns.GetHostAddressesAsync` fallback) so the `FailsafeProducer` outbox-drain loop can reach the broker when `RABBITMQ_HOST` is a DNS service name — which is the documented test configuration and the production-normal configuration for any container deployment.
|
||||
**Complexity**: 2 points
|
||||
**Dependencies**: None
|
||||
**Component**: `02_annotations-realtime-sync` → `FailsafeProducer` (`src/Services/FailsafeProducer.cs`)
|
||||
**Tracker**: pending (tracker MCP unavailable at task-creation time — see `_docs/_process_leftovers/2026-05-14_testability-tracker.md`)
|
||||
**Epic**: pending — `01-testability-refactoring`
|
||||
|
||||
## Problem
|
||||
|
||||
`FailsafeProducer.ProcessQueue` builds the RabbitMQ Stream endpoint with `new IPEndPoint(IPAddress.Parse(config.Host), config.Port)`. `IPAddress.Parse` throws `FormatException` on any string that is not a literal IPv4 / IPv6 address. The test stack (`e2e/docker-compose.test.yml:82`) sets `RABBITMQ_HOST=rabbitmq` (a docker-compose service name); the in-class default in `RabbitMqConfig` is also `"rabbitmq"`. Every drain cycle today throws on the first line of `ProcessQueue`; the outer catch logs and backs off 10 s, and the outbox never drains. Five test scenarios depend on the drain path working: FT-P-09, NFT-RES-01, NFT-RES-06, NFT-RES-LIM-03, NFT-PERF-OUTBOX-DRAIN-01.
|
||||
|
||||
Beyond tests, this is a latent **production-relevant** logic bug: any deployment that uses container DNS, Kubernetes service names, or any non-IP value in `RABBITMQ_HOST` has the same broken drain. The bug is masked because outbox row inserts (synchronous, via `AnnotationService` → static `EnqueueAsync`) keep returning 200; only consumers of the stream see the absence.
|
||||
|
||||
## Outcome
|
||||
|
||||
- When `RABBITMQ_HOST` is a literal IP, behavior is unchanged.
|
||||
- When `RABBITMQ_HOST` is a DNS hostname, the producer resolves it via `System.Net.Dns` and connects to the resulting IP.
|
||||
- The DNS resolution participates in the existing retry envelope — a DNS failure surfaces through the same `catch (Exception ex)` + 10 s back-off path that exception-throws use today, so operator-visible behavior on broker-unreachable is preserved.
|
||||
- No change to MessagePack payload, gzip compression, queue-table delete, or any aspect of `DrainQueue`.
|
||||
|
||||
## Scope
|
||||
|
||||
### Included
|
||||
|
||||
- `src/Services/FailsafeProducer.cs` — `ProcessQueue` method, specifically the construction of the `IPEndPoint` for `StreamSystemConfig.Endpoints`.
|
||||
- Cancellation-token propagation through the new resolve call (use the existing `ct`).
|
||||
|
||||
### Excluded
|
||||
|
||||
- Any change to `EnqueueAsync` (static helper).
|
||||
- Any change to `DrainQueue` (serialization / publish / delete).
|
||||
- Any change to `ExecuteAsync`'s outer retry envelope.
|
||||
- Any change to `RabbitMqConfig` shape — same env-var contract.
|
||||
- Switching from `IPEndPoint` to `DnsEndPoint` (the library accepts both, but `DnsEndPoint` path is untested in this codebase; pick the smaller-diff option).
|
||||
- Caching resolved IPs across drain cycles (broker may rotate; resolve per cycle is fine and matches the connect-per-cycle pattern).
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
**AC-1: Literal IP host still works**
|
||||
Given `RABBITMQ_HOST=127.0.0.1` (or any literal IPv4/IPv6),
|
||||
When `ProcessQueue` enters its first iteration,
|
||||
Then the producer connects to the broker without any DNS lookup.
|
||||
|
||||
**AC-2: DNS hostname now works**
|
||||
Given `RABBITMQ_HOST=rabbitmq` (or any non-IP hostname),
|
||||
When `ProcessQueue` enters its first iteration in an environment where the name resolves,
|
||||
Then the producer connects to the broker and the outbox drain executes.
|
||||
|
||||
**AC-3: Unresolvable hostname surfaces through existing retry envelope**
|
||||
Given `RABBITMQ_HOST=does-not-resolve.invalid`,
|
||||
When `ProcessQueue` enters its first iteration,
|
||||
Then the resolve call throws (e.g., `SocketException`), the outer `catch` in `ExecuteAsync` logs the exception, and the loop backs off 10 seconds — same surface behavior as a `FormatException` today.
|
||||
|
||||
**AC-4: Cancellation honored during resolution**
|
||||
Given a cancellation request mid-resolve,
|
||||
When `ProcessQueue` is in the resolution call,
|
||||
Then `OperationCanceledException` propagates and `ExecuteAsync`'s `catch (OperationCanceledException) when (ct.IsCancellationRequested)` exits the outer loop cleanly.
|
||||
|
||||
**AC-5: Wire format / consumers unaffected**
|
||||
Given any successful drain after the fix,
|
||||
When a stream consumer reads from `azaion-annotations`,
|
||||
Then the message body (MessagePack-encoded `AnnotationQueueMessage` / `AnnotationBulkQueueMessage`, gzip-compressed) is byte-for-byte identical to what the unchanged `DrainQueue` produced before this task.
|
||||
|
||||
## Non-Functional Requirements
|
||||
|
||||
**Performance**
|
||||
- DNS resolution adds at most one network round-trip per drain cycle (every ~10 s). Negligible.
|
||||
- No per-request DNS lookups; the resolution is in `ProcessQueue`, which runs once per outer retry cycle.
|
||||
|
||||
**Compatibility**
|
||||
- No env-var rename; `RABBITMQ_HOST` semantics expanded from "IP literal" to "IP literal OR DNS hostname".
|
||||
- No change to consumers (admin's `AnnotationSyncWorker`, AI Training consumer).
|
||||
|
||||
**Reliability**
|
||||
- Resolution failure path uses the same retry envelope as the previous parsing failure path; no new failure modes are introduced.
|
||||
|
||||
## Unit Tests
|
||||
|
||||
| AC Ref | What to Test | Required Outcome |
|
||||
|--------|--------------|------------------|
|
||||
| AC-1 | Helper that maps `"127.0.0.1"` to `IPAddress` | returns `IPAddress.Loopback` without performing a DNS lookup |
|
||||
| AC-2 | Helper that maps `"localhost"` to `IPAddress` | returns the first address from `Dns.GetHostAddresses("localhost")` |
|
||||
| AC-3 | Helper invoked with `"does-not-resolve.invalid"` | throws (caller-side handling lives in `ExecuteAsync`'s catch) |
|
||||
|
||||
(Step 6 produces executable test code; these guide the implementation.)
|
||||
|
||||
## Blackbox Tests
|
||||
|
||||
| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|
||||
|--------|------------------------|--------------|-------------------|----------------|
|
||||
| AC-2, AC-5 | Docker e2e stack with `RABBITMQ_HOST=rabbitmq`, broker up | FT-P-08 (outbox row insert) + FT-P-09 (stream message arrives) | row appears, drained within ~10–30 s, consumer receives MessagePack+gzip message matching the documented schema | — |
|
||||
| AC-3 | Docker e2e stack, broker stopped via `rabbitmqctl stop_app` | NFT-RES-01 (broker outage) | SUT does not crash, `/health` stays 200, outbox preserves the row, recovery delivers the deferred message within 60 s of broker `start_app` | — |
|
||||
| AC-5 | Docker e2e stack at sustained 5 RPS | NFT-PERF-OUTBOX-DRAIN-01 | max queue depth ≤ 100 rows | — |
|
||||
|
||||
## Constraints
|
||||
|
||||
- Keep using `RabbitMQ.Stream.Client.StreamSystemConfig.Endpoints` with `IPEndPoint` entries (matches existing examples and the rest of the call).
|
||||
- Resolution must accept a `CancellationToken`; use the `ct` already in scope.
|
||||
- No `IPAddress.Parse` calls on `config.Host` anywhere in the diff (the bug being fixed).
|
||||
|
||||
## Risks & Mitigation
|
||||
|
||||
**Risk 1: DNS round-trip on the drain hot loop**
|
||||
- *Risk*: Sub-second DNS lookup on every drain cycle might surprise operators expecting a pure local-IP fast path.
|
||||
- *Mitigation*: Drain cycle is already ~10 s spacing; one DNS lookup per cycle is negligible. If profiling later reveals a problem, a literal-IP fast path (already in this design — `TryParse` before `Dns`) means IP-literal deployments are not affected at all.
|
||||
|
||||
**Risk 2: Resolved IP becomes stale across cycles**
|
||||
- *Risk*: A broker IP change between cycles would not be picked up if we cached.
|
||||
- *Mitigation*: Don't cache. Resolve every cycle (same cost; matches connect-every-cycle pattern).
|
||||
|
||||
**Risk 3: Multi-address records (round-robin DNS)**
|
||||
- *Risk*: `GetHostAddressesAsync` returns N addresses; picking only the first ignores load balancing.
|
||||
- *Mitigation*: For testability scope, "first address" is fine — broker LBs are a Step 8 concern. Documented here, not implemented.
|
||||
Reference in New Issue
Block a user