docs: Step 4 testability refactor — list-of-changes + 2 task specs

autodev existing-code Step 4 (Code Testability Revision) — invoked
refactor skill in guided mode. Phase 0 (baseline) + Phase 1 (discovery
+ validation) + Phase 2 (analysis + task decomposition) artifacts.

list-of-changes.md identifies two surgical fixes required before the
67-scenario blackbox suite (already specified in _docs/02_document/
tests/) can run against the SUT:

  C01 — env-gate JWKS RequireHttps on ASPNETCORE_ENVIRONMENT=E2ETest
       (architecture.md Open Risks Section 6 prescribes this; the
       mock issuer in e2e/docker-compose.test.yml serves plain HTTP)

  C02 — DNS-resolve RABBITMQ_HOST in FailsafeProducer.ProcessQueue
       (IPAddress.Parse currently throws on every drain cycle when
       host is a service name; latent production-relevant bug, not
       just a test-env issue)

Two task specs in _docs/02_tasks/todo/ (3 story points total).
Independent — no inter-task dependency.

Tracker: local — Atlassian MCP reported an error at task-creation
time. Deferred Jira writes (epic + 2 tickets) recorded in
_docs/_process_leftovers/2026-05-14_testability-tracker.md for
replay when MCP is restored.

Items explicitly deferred to Step 8 Refactor are enumerated in
list-of-changes.md "Deferred to Step 8 Refactor" — including the
FailsafeProducer static helper (F3), the JWKS GetAwaiter().GetResult()
hot path, RB-05/06/08 backlog items, and the MediaService ffprobe
empty-catch.

State: Step 4 in_progress, sub_step 3 (phase-2-task-decomposition).
Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-14 20:19:27 +03:00
parent 03f879206e
commit 13e9731a8f
12 changed files with 775 additions and 0 deletions
@@ -0,0 +1,61 @@
# Discovery — Component: Auth & Identity (scoped to C01)
**Component**: `06_platform` → Auth & Identity subsystem
**Source files in scope**: `src/Auth/JwtExtensions.cs`
**Component spec reference**: `_docs/02_document/modules/auth-identity.md`
## Purpose
JWT validation for API authorization policies (`ANN`, `DATASET`, `ADM`). Annotations is a **verifier-only** service — all token minting is the admin service's responsibility.
## Affected API / behavior
- `JwtExtensions.AddJwtAuth(IServiceCollection, IConfiguration)` — wires the JWT bearer scheme. The line affected by C01 is the `HttpDocumentRetriever` construction (line 33). No method signature changes. No DI graph changes. The `TokenValidationParameters` block (issuer, audience, lifetime, ES256 alg pinning, signed-tokens requirement) is untouched by this change.
## Coupling map (affected only)
```
Program.cs
└─ builder.Services.AddJwtAuth(builder.Configuration) ← caller of the affected code
Auth/JwtExtensions.cs (AddJwtAuth)
├─ ConfigurationResolver.ResolveRequiredOrThrow ← unaffected
├─ new ConfigurationManager<JsonWebKeySet>( ← container, unaffected
│ jwksUrl,
│ new JwksRetriever(),
│ new HttpDocumentRetriever { RequireHttps = true } ← C01 changes this constant to env-gated
│ )
└─ services.AddAuthentication(...).AddJwtBearer(...) ← unaffected
```
## C01 — input file claims vs. code reality
| Claim in `list-of-changes.md` C01 | Verification against `src/Auth/JwtExtensions.cs` | Status |
|----------------------------------|--------------------------------------------------|--------|
| `RequireHttps = true` on line 33 | Confirmed at line 33 (`new HttpDocumentRetriever { RequireHttps = true }`). | ✓ |
| No `IHostEnvironment` parameter on `AddJwtAuth` today | Confirmed — signature is `AddJwtAuth(IServiceCollection services, IConfiguration configuration)`. Reading `Environment.GetEnvironmentVariable("ASPNETCORE_ENVIRONMENT")` inline leaves the public method shape unchanged; adding an environment-name parameter would instead change the signature and the `Program.cs` call site. | ✓ |
| `Program.cs:53` already uses `builder.Environment.EnvironmentName` for the CORS validator | Confirmed (`CorsConfigurationValidator.EnsureSafeForEnvironment(allowedOrigins, allowAnyOrigin, builder.Environment.EnvironmentName)`). | ✓ |
| Test stack sets `ASPNETCORE_ENVIRONMENT=E2ETest` | Confirmed in `e2e/docker-compose.test.yml` line 76 (`ASPNETCORE_ENVIRONMENT: E2ETest`). | ✓ |
| Open Risks §6 in `architecture.md` flags this exact change | Confirmed — `_docs/02_document/architecture.md` Open Risks §6 reads: "JWKS HTTPS-only retrieval blocks plain-HTTP test harness; resolution is `ASPNETCORE_ENVIRONMENT=E2ETest` + relaxed `RequireHttps` for tests, never in production." | ✓ |
| `test-data.md` "Bearer token harness" §2 prescribes the same fix | Confirmed verbatim. | ✓ |
All claims hold; no contradictions to surface to the user.
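To make the intended gate concrete, here is a minimal sketch of the decision C01 describes, written as a pure function so it can be checked without the Microsoft.IdentityModel packages. The helper name `RequireHttpsForJwks` is hypothetical, and the case-insensitive comparison is an assumption rather than something the task spec prescribes.

```csharp
using System;

// Hypothetical helper (not in the codebase): decides whether the JWKS
// retriever must insist on HTTPS. Only the environment name "E2ETest"
// relaxes the requirement; a missing or unrecognized name keeps HTTPS
// mandatory, so the gate fails closed.
static bool RequireHttpsForJwks(string? environmentName) =>
    !string.Equals(environmentName, "E2ETest", StringComparison.OrdinalIgnoreCase);

// Possible call site inside AddJwtAuth (shown as a comment because it needs
// the Microsoft.IdentityModel packages to compile):
//   new HttpDocumentRetriever { RequireHttps = RequireHttpsForJwks(environmentName) }

Console.WriteLine(RequireHttpsForJwks("Production")); // True
Console.WriteLine(RequireHttpsForJwks("E2ETest"));    // False
```

Keeping the decision in one small function means the relaxed path can be covered by a unit test, and any environment other than `E2ETest` retains the current behavior.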
## Issues discovered during scoped analysis (additional to the input file)
None within the C01 scope. The `IssuerSigningKeyResolver` uses `.GetAwaiter().GetResult()` (sync-over-async on the auth hot path) — already enumerated under "Deferred to Step 8 Refactor" in `list-of-changes.md`; the test suite does not depend on substituting it, so no change is required for testability.
## Architecture Vision check
`_docs/02_document/architecture.md` Architecture Vision § "Verifier-only auth, no token issuance in annotations":
- C01 does NOT change verification semantics — algorithm pinning, signature, lifetime, audience, and issuer all remain enforced.
- C01 changes only the *transport requirement* for fetching the public-key document from a non-production issuer URL.
- No contradiction.
## Module-layout check
`_docs/02_document/module-layout.md` Component 06 (`06_platform`) → Auth: the affected file `src/Auth/JwtExtensions.cs` is the documented owner; no boundary crossing.
## Public API impact
None — `AddJwtAuth` signature unchanged; no DTOs, OpenAPI shapes, or HTTP responses affected.
@@ -0,0 +1,75 @@
# Discovery — Component: Realtime Sync / Failsafe Producer (scoped to C02)
**Component**: `02 annotations-realtime-sync`
**Source files in scope**: `src/Services/FailsafeProducer.cs`
**Component spec reference**: `_docs/02_document/modules/rabbitmq-stream-sync.md`
## Purpose
Outbox drain + RabbitMQ Stream producer (`BackgroundService`). Reads `annotations_queue_records`, serializes payloads (MessagePack + gzip), publishes to the `azaion-annotations` stream, then deletes drained rows.
## Affected API / behavior
- `FailsafeProducer.ProcessQueue(CancellationToken)` — lines 54-76 — currently constructs `StreamSystem` via:
```csharp
Endpoints = [new IPEndPoint(IPAddress.Parse(config.Host), config.Port)]
```
Affected by C02. No other call site uses `IPAddress.Parse` against `config.Host`.
- The `FailsafeProducer` constructor (lines 24-29) takes `IServiceScopeFactory`, `PathResolver`, `RabbitMqConfig`, `ILogger`. **Unchanged by C02.**
- The static `FailsafeProducer.EnqueueAsync` (line 195) — synchronous outbox row insert called from `AnnotationService` — does NOT use the broker connection and is unaffected.
## Coupling map (affected only)
```
Program.cs
└─ builder.Services.AddSingleton(rabbitMqConfig) ← unaffected
└─ builder.Services.AddHostedService<FailsafeProducer> ← unaffected
Services/FailsafeProducer.cs
├─ ExecuteAsync ← unaffected (loop / retry envelope)
├─ ProcessQueue ← C02 changes one line here
│ ├─ IPAddress.Parse(config.Host) ← REPLACED by DNS resolution of config.Host
│ └─ StreamSystem.Create / Producer.Create ← unchanged
├─ DrainQueue ← unaffected (queue read / msg build / publish / delete)
└─ EnqueueAsync (static) ← unaffected
```
## C02 — input file claims vs. code reality
| Claim in `list-of-changes.md` C02 | Verification against `src/Services/FailsafeProducer.cs` | Status |
|----------------------------------|---------------------------------------------------------|--------|
| `IPAddress.Parse(config.Host)` on line 56 | Confirmed at line 56. | ✓ |
| `IPAddress.Parse` throws `FormatException` for non-IP strings | Verified against .NET BCL contract for `IPAddress.Parse(string)`. | ✓ |
| `config.Host` is populated from `RABBITMQ_HOST` env var | Confirmed at `Program.cs:40` (`Environment.GetEnvironmentVariable("RABBITMQ_HOST") ?? "127.0.0.1"`). | ✓ |
| Test stack sets `RABBITMQ_HOST=rabbitmq` (DNS hostname) | Confirmed in `e2e/docker-compose.test.yml` line 82. | ✓ |
| Test-environment fallback default in `RabbitMqConfig` class is `"rabbitmq"` | Confirmed in `FailsafeProducer.cs:17` (`public string Host { get; set; } = "rabbitmq"`). This means even ignoring `Program.cs`, the *default* triggers the bug. | ✓ |
| `BackgroundService` catches exceptions in `ExecuteAsync` and backs off 10 s | Confirmed at lines 44-48. | ✓ |
| Outbox insert (`EnqueueAsync`) is synchronous from the request thread and unaffected | Confirmed at line 195; called from `AnnotationService.cs:102`. | ✓ |
| `IPEndPoint` ctor requires `IPAddress`, not hostname | Verified: `IPEndPoint` is the standard-library type the existing code uses, and its constructor takes an `IPAddress`. `StreamSystemConfig.Endpoints` in `RabbitMQ.Stream.Client` is `IList<EndPoint>` and accepts any `System.Net.EndPoint`, so a `DnsEndPoint` is a theoretical alternative. However, every example in the client repo uses `IPEndPoint`, and the endpoint is built synchronously via the `IPEndPoint` constructor today, so the smallest-change path is to keep `IPEndPoint` and resolve the hostname ourselves. | ✓ |
All claims hold; no contradictions to surface to the user.
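The "keep `IPEndPoint` and resolve the hostname ourselves" path could look roughly like the sketch below. It is an illustration under stated assumptions, not the task spec's implementation: the helper name `ResolveEndpointAsync`, the IPv4 preference, and the port value `5552` used in the demo are all hypothetical choices.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: turns a host string (IP literal or DNS name) into an IPEndPoint.
static async Task<IPEndPoint> ResolveEndpointAsync(string host, int port, CancellationToken ct)
{
    // IP literals skip DNS entirely, so an existing "127.0.0.1" config behaves as before.
    if (IPAddress.TryParse(host, out var literal))
        return new IPEndPoint(literal, port);

    // Service names such as "rabbitmq" go through the resolver.
    // An unknown host surfaces as a SocketException from this call.
    IPAddress[] addresses = await Dns.GetHostAddressesAsync(host, ct);
    if (addresses.Length == 0)
        throw new InvalidOperationException($"No addresses resolved for host '{host}'.");

    // Assumption: prefer an IPv4 address when one exists, else take the first result.
    IPAddress chosen =
        Array.Find(addresses, a => a.AddressFamily == AddressFamily.InterNetwork) ?? addresses[0];
    return new IPEndPoint(chosen, port);
}

// Demo with an IP literal; the port is illustrative only.
IPEndPoint endpoint = await ResolveEndpointAsync("127.0.0.1", 5552, CancellationToken.None);
Console.WriteLine(endpoint); // 127.0.0.1:5552
```

Because `ProcessQueue` is re-entered on every retry, a helper like this would also re-resolve the name after each broker outage, which suits container DNS where the address can change.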
## Issues discovered during scoped analysis (additional to the input file)
1. **`IServiceScopeFactory.CreateScope()` is called inside `DrainQueue` to fetch a scoped `AppDataConnection`** (line 80). This is fine — it follows the documented `BackgroundService` pattern. Not in scope.
2. **`catch { }` at line 138 swallows image-read failures** — already enumerated under "Deferred to Step 8 Refactor" in `list-of-changes.md` (RB-05 tracks the proper logging + metric). No change here.
3. **`ProcessQueue` creates a new `StreamSystem` on every entry** — i.e., on every retry. With C02 applied, this remains the behavior — broker reconnects per outage cycle. Acceptable; matches the documented "broker recovers, drain resumes" behavior in NFT-RES-01. No additional change.
## Architecture Vision check
`_docs/02_document/architecture.md` Architecture Vision § "Lifecycle observability via outbox + stream":
- C02 is required to *honor* this vision in any environment where `RABBITMQ_HOST` is a hostname. Today the producer never drains in such environments; the failure is logged on every retry, but nothing is published.
- The fix preserves the documented flow (outbox row → batch read → MessagePack serialize → gzip → publish → delete row).
- No contradiction; this change is squarely aligned with the vision.
## Module-layout check
`_docs/02_document/module-layout.md` Component 02 (`02_annotations-realtime-sync`) → `FailsafeProducer` is the documented owner; the static `EnqueueAsync` helper is part of the Component 02 Public API (per F3 in the baseline) and is unchanged. C02 is internal to the component.
## Public API impact
None — `FailsafeProducer` is a `BackgroundService`; its `ExecuteAsync` is called by the host. The static `EnqueueAsync` (the only external surface) is unchanged. No DTOs, no HTTP shapes, no MessagePack wire format affected.
## Wire-format / stream contract check
C02 changes *how the producer reaches the broker*, not what it sends. Verified by reading `DrainQueue` (lines 78-181): the `MessagePackSerializer.Serialize(...)` calls, `Producer.Send(messages, CompressionType.Gzip)`, and the queue-table delete are all downstream of the line C02 touches and are untouched. Consumers (admin's `AnnotationSyncWorker`, AI Training consumer) see identical messages.
@@ -0,0 +1,78 @@
# Logical Flow Analysis — 01-testability-refactoring (scoped)
**Scope**: only the two flows whose code is touched by C01 and C02.
**Method**: each documented flow walked through actual code line-by-line; classified per phases/01-discovery.md guidance (logic bug / performance waste / design contradiction / documentation drift).
## Flow 1 — Bearer-token verification
**Documented in**: `_docs/02_document/diagrams/flows/` (auth-related sequence), `_docs/02_document/modules/auth-identity.md`, `_docs/02_document/tests/test-data.md` "Bearer token harness", `_docs/02_document/architecture.md` Open Risks §6.
**Walk-through**:
1. Request arrives at any `[Authorize]` controller (e.g., `POST /annotations`).
2. ASP.NET Core auth middleware invokes the configured `JwtBearer` scheme.
3. The scheme reads `TokenValidationParameters` from `JwtExtensions.AddJwtAuth` — these are correct (alg pinned, lifetime/issuer/audience validated, signature required).
4. For signature verification, `IssuerSigningKeyResolver` (lines 52-63) calls `jwksConfigManager.GetConfigurationAsync(...)`.
5. On first call, `ConfigurationManager<JsonWebKeySet>` fetches the JWKS document using `HttpDocumentRetriever`. Today: `RequireHttps = true`. The mock issuer in the e2e stack serves `http://e2e-issuer:8080/.well-known/jwks.json` — a plain-HTTP URL. The retriever rejects the address because it is not HTTPS (error `IDX20108` in Microsoft.IdentityModel), and `ConfigurationManager` surfaces the failure as an `InvalidOperationException`.
6. Token validation surfaces the exception as a 401 or a 500, depending on which error path the framework takes, and **no real validation logic is ever exercised**.
**Findings**:
- **Documentation drift — NONE.** The architecture doc and test-data doc both already specify the expected behavior and the fix (env-gated `RequireHttps`).
- **Logic bug — None in production.** In production the JWKS URL is HTTPS, so the current code works. The failure is environment-specific: it only blocks the test harness.
- **Design contradiction** — the test harness assumes a plain-HTTP JWKS service is acceptable, but the SUT enforces HTTPS unconditionally. C01 resolves this by reading the environment name and relaxing the requirement only under `E2ETest`.
- **Silent data loss** — N/A.
**Classification**: documented design contradiction. Resolution: C01.
**Loop / boundary check**: not applicable (no loops in the auth-init code path).
## Flow 2 — Outbox drain → RabbitMQ stream
**Documented in**: `_docs/02_document/diagrams/flows/flow_failsafe_drain.md`, `_docs/02_document/modules/rabbitmq-stream-sync.md`, `_docs/02_document/architecture.md` (ADR-008 transactional outbox).
**Walk-through**:
1. `AnnotationService.CreateAnnotation` (line 102) calls `FailsafeProducer.EnqueueAsync(db, id, QueueOperation.Created)` — synchronous DB insert into `annotations_queue_records`. **No broker dependency.** Always succeeds when DB is up.
2. The host's `BackgroundService` invokes `FailsafeProducer.ExecuteAsync` shortly after startup (`Task.Delay(5s)` at line 32).
3. `ExecuteAsync` enters its loop and calls `ProcessQueue`.
4. `ProcessQueue` constructs `StreamSystem`:
```csharp
Endpoints = [new IPEndPoint(IPAddress.Parse(config.Host), config.Port)]
```
With `config.Host = "rabbitmq"`, `IPAddress.Parse` throws **`FormatException: An invalid IP address was specified.`**.
5. Control returns up the stack. `ExecuteAsync`'s outer `catch (Exception ex)` at line 44 catches it, logs `ex.Message`, and `await Task.Delay(TimeSpan.FromSeconds(10), ct)`.
6. The loop restarts → step 4 → same exception → same 10s back-off. **The drain never executes.**
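Step 4 can be reproduced in isolation with nothing but the BCL. The sketch below is a standalone check, not code from `FailsafeProducer`; the helper name `ParseOutcome` is hypothetical.

```csharp
using System;
using System.Net;

// Hypothetical helper: reports what IPAddress.Parse does with a given host string.
// Returns "none" when parsing succeeds, otherwise the name of the exception type.
static string ParseOutcome(string host)
{
    try
    {
        IPAddress.Parse(host);
        return "none";
    }
    catch (FormatException ex)
    {
        return ex.GetType().Name;
    }
}

Console.WriteLine(ParseOutcome("127.0.0.1")); // none
Console.WriteLine(ParseOutcome("rabbitmq"));  // FormatException

// TryParse reports the same condition without throwing, which makes it a
// convenient way to tell IP literals apart from DNS names.
Console.WriteLine(IPAddress.TryParse("rabbitmq", out _)); // False
```

The input never changes between retries, so the loop in steps 5 and 6 repeats the same failure indefinitely.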
**Findings**:
- **Silent data loss — YES (production-relevant).** With `RABBITMQ_HOST` set to any non-IP value (typical: docker-compose service name, Kubernetes service DNS, or any deployment using container DNS), the outbox grows monotonically and is never published to consumers. The error IS logged, but unless logs are alerted on, it manifests as "stream consumers see no traffic". This is a **logic bug**, not a documentation drift — the documented flow assumes the drain works.
- **Documentation drift — NONE.** The flow diagram and module spec describe correct behavior; the implementation has a latent bug.
- **Design contradiction — NONE.** The fix is mechanical.
- **Performance waste** — every 10 seconds the producer does the work to throw an exception, log it, and back off. Trivial compared to the real production impact (no messages are published).
- **Why not caught by `architecture_compliance_baseline.md`**: the baseline checks structural properties (layering, public-API respect, cycles, duplicate symbols, cross-cutting concerns). API-level correctness ("is `IPAddress.Parse` the right API for this input?") is not in Phase 7's mandate.
- **Why not caught by `_docs/02_document/00_discovery.md`**: discovery documents *intent*, not implementation correctness against the BCL contract.
- **Why surfaced now**: the test harness uses a service-name hostname, which forces exercise of the bug. Without the test harness, the bug remained latent.
**Classification**: logic bug. Resolution: C02.
**Loop / boundary check**:
- The retry loop in `ExecuteAsync` correctly catches `OperationCanceledException` and propagates cancellation (lines 40-42).
- The drain loop in `ProcessQueue` (lines 65-69) correctly checks `ct.IsCancellationRequested`.
- The `DrainQueue` foreach is correct — it processes every queue record and deletes drained rows in bulk (lines 176-180).
- No silent-drop edge cases inside the drain itself.
The bug is strictly in the *transition from `RabbitMqConfig.Host` to a `System.Net.IPEndPoint`*.
## Cross-flow check
C01 and C02 are **independent**:
- C01 affects only `AddJwtAuth` (a one-shot, at startup).
- C02 affects only `FailsafeProducer.ProcessQueue` (a `BackgroundService` cold-path).
- No shared symbol; no ordering dependency.
Both changes preserve the documented Architecture Vision and module-layout boundaries (see component reports).
## Contradictions surfaced to user — NONE
Both input-file entries are consistent with the code reality. No changes recommended outside the input file. No need to escalate before the Phase 1 BLOCKING gate.