ui/_docs/02_document/deployment/observability.md
Oleksandr Bezdieniezhnykh 510df68bcf [AZ-447] autodev Steps 1-4 baseline: docs, tests, refactor specs
2026-05-11 00:38:49 +03:00


Azaion UI — Observability

Synthesis output of /document Step 3d (observability). Derived from inspection of all module docs + nginx.conf + the absence of any client telemetry SDK in package.json.

1. Status: minimal

The browser-side SPA emits no centralized telemetry today:

  • No analytics SDK (no @sentry/*, @datadog/*, web-vitals, posthog, etc.).
  • No error reporting service.
  • No client-side feature-flag service.
  • Errors that aren't caught by an <ErrorBoundary> (which doesn't exist today — finding in 10_app-shell) end up as console.error only.

This is acceptable as a starting state. A future iteration adds an error-tracking SDK (Sentry candidate) with the SDK key sourced from a runtime /config.json — see environment_strategy.md.
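
A minimal sketch of how a runtime-sourced key could be read, assuming a hypothetical `sentryDsn` field in /config.json (the real schema belongs to environment_strategy.md and is not shown here). The loader is injected, so nothing below depends on a specific SDK or on the browser's fetch.

```typescript
// Hypothetical shape of /config.json; the field name `sentryDsn` is an assumption.
interface RuntimeConfig {
  sentryDsn: string | null;
}

// Validate an untrusted JSON payload and fall back to "telemetry disabled".
function parseRuntimeConfig(raw: unknown): RuntimeConfig {
  if (typeof raw === "object" && raw !== null) {
    const dsn = (raw as Record<string, unknown>)["sentryDsn"];
    if (typeof dsn === "string" && dsn.length > 0) {
      return { sentryDsn: dsn };
    }
  }
  return { sentryDsn: null };
}

// Load the config at startup. `loadJson` is injected; in the SPA it could
// wrap fetch("/config.json") followed by response.json().
async function loadRuntimeConfig(
  loadJson: () => Promise<unknown>,
): Promise<RuntimeConfig> {
  try {
    return parseRuntimeConfig(await loadJson());
  } catch (err) {
    // A missing or malformed config must not block app startup.
    console.error("runtime config unavailable; telemetry disabled", err);
    return { sentryDsn: null };
  }
}
```

With this shape, a null `sentryDsn` simply means the error-tracking SDK is never initialized, which keeps local and test builds free of telemetry.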

2. Existing logging (per module)

| Module | What is logged | How | Why it's unsatisfactory |
| --- | --- | --- | --- |
| 01_api-transport/client.ts | request / response errors | console.error | No retries, no spans, no correlation IDs |
| 01_api-transport/sse.ts | EventSource errors | console.error | No reconnect logic; no telemetry |
| 02_auth/AuthContext.tsx | login / refresh outcomes | console.error | Successful refresh is silent (good); failures are silent (bad: need user-visible recovery flow) |
| 03_shared-ui/FlightContext.tsx | flight load + select-flight errors | swallowed | selectFlight is fire-and-forget, error invisible |
| 06_annotations/AnnotationsSidebar.tsx | AI-detect errors | console.error | User sees no feedback (finding #2123) |
| 06_annotations/AnnotationsPage.tsx | save errors | partial: handleSave has fallback that hides save loss (finding) | Worst case: user thinks the annotation saved but it didn't |
| 07_dataset/DatasetPage.tsx | various | swallowed catch blocks (finding #6) | Same risk |
| 05_flights/FlightsPage.tsx | save partial-failure | not detected | Per-waypoint failures invisible (finding #19) |
| 05_flights/flightPlanUtils.ts | weather fetch errors | swallowed silently | Wind data missing → battery estimate wrong; user not informed |
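
The partial-failure row for 05_flights/FlightsPage.tsx can be illustrated with a small sketch that saves items independently and returns the failures instead of dropping them. `saveOne` is a hypothetical stand-in for the real per-waypoint API call; the actual save path in FlightsPage.tsx is not shown in this document.

```typescript
// Outcome of saving a batch, split so the caller can report what failed.
interface BatchResult<T> {
  saved: T[];
  failed: { item: T; reason: unknown }[];
}

// Save every item independently and collect per-item outcomes.
// `saveOne` is an assumed callback, not an existing function in the codebase.
async function saveAll<T>(
  items: readonly T[],
  saveOne: (item: T) => Promise<void>,
): Promise<BatchResult<T>> {
  // Each item gets its own try/catch, so one rejection cannot hide the others.
  const outcomes = await Promise.all(
    items.map(async (item) => {
      try {
        await saveOne(item);
        return { item, ok: true as const, reason: undefined };
      } catch (reason) {
        return { item, ok: false as const, reason };
      }
    }),
  );

  const result: BatchResult<T> = { saved: [], failed: [] };
  for (const outcome of outcomes) {
    if (outcome.ok) {
      result.saved.push(outcome.item);
    } else {
      result.failed.push({ item: outcome.item, reason: outcome.reason });
    }
  }
  return result;
}
```

A caller can then show something like "2 of 14 waypoints were not saved" whenever `failed` is non-empty, rather than treating the whole save as successful.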

The dominant pattern is "silent catch + console.error" — this is the single biggest observability gap.
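
One way to retire the silent-catch pattern is a small wrapper that always logs for developers and always notifies the user. This is a sketch under assumptions: `notify` stands in for whatever toast mechanism the SPA adopts, and `runReported` is a hypothetical helper name, not an existing function.

```typescript
// Minimal user-notification contract; the real toast API may differ.
type Notify = (message: string) => void;

// Turn any thrown value into a readable message.
function describeError(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

// Run an async action. On failure, log for developers AND notify the user,
// then resolve to undefined so callers can tell success from failure.
async function runReported<T>(
  label: string,
  action: () => Promise<T>,
  notify: Notify,
): Promise<T | undefined> {
  try {
    return await action();
  } catch (err) {
    console.error(`[${label}]`, err);
    notify(`${label} failed: ${describeError(err)}`);
    return undefined;
  }
}
```

Because the failure path is centralized, adding an error-tracking SDK later would only require touching this one helper instead of every catch block.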

3. Server-side logs the UI relies on

The suite services (admin, flights, annotations, detect, etc.) are responsible for:

  • Audit logging (login, logout, role changes, destructive admin actions)
  • Request tracing (the UI does not send a traceparent header today — Step 6 candidate)
  • Performance metrics (UI does not measure RUM)

The UI's bug-reproduction story relies on suite-side logs. A correlation ID injected by the UI on every request would dramatically simplify cross-service debugging — a Step 6 problem-extraction surface.
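
A correlation ID could be injected with a thin wrapper around the transport's fetch call. The sketch below uses the X-Request-Id header name from this document's recommendations; `withRequestId` and `fallbackRequestId` are hypothetical names, and the minimal `FetchLike` type is declared locally so the example does not depend on DOM typings.

```typescript
// Minimal structural types so the sketch is self-contained.
interface RequestOptions {
  method?: string;
  headers?: Record<string, string>;
  body?: string;
}
type FetchLike = (url: string, init?: RequestOptions) => Promise<unknown>;

// Fallback ID generator (not cryptographically strong). In browsers,
// crypto.randomUUID() is likely the better choice where available.
function fallbackRequestId(): string {
  return `${Date.now().toString(36)}-${Math.random().toString(36).slice(2, 10)}`;
}

// Wrap a fetch-like function so every call carries an X-Request-Id header.
function withRequestId(
  baseFetch: FetchLike,
  newId: () => string = fallbackRequestId,
): FetchLike {
  return (url, init = {}) => {
    const headers: Record<string, string> = { ...init.headers };
    // Keep an ID supplied by the caller so retries can reuse it.
    // Simplification: the check is case-sensitive on the header name.
    if (!("X-Request-Id" in headers)) {
      headers["X-Request-Id"] = newId();
    }
    return baseFetch(url, { ...init, headers });
  };
}
```

The native EventSource constructor does not accept custom headers, so SSE connections would need a different carrier for the ID (for example a query parameter); that part is left out of the sketch.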

4. Client-side metrics (none)

No web-vitals or equivalent is installed. Recommended (Step 5 solution surface):

  • CLS (cumulative layout shift) — the canvas + leaflet + chart layout has known shifts on initial load.
  • LCP (largest contentful paint) — the bundle is the dominant cost.
  • INP (interaction latency; it replaced FID as a Core Web Vital in 2024), relevant for the canvas drag and waypoint drag-drop.
  • Custom metrics: time-to-first-flight-list, time-to-first-thumbnail, time-to-first-detection.
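
The custom "time-to-first-X" metrics can be sketched without any library. The names below (`createFirstEventTimer`, `Metric`) are hypothetical, and the clock and the emit target are injected; in the browser the clock could be `() => performance.now()` and `emit` could forward to whatever telemetry endpoint is eventually chosen.

```typescript
// One recorded measurement, e.g. { name: "time-to-first-flight-list", ms: 840 }.
interface Metric {
  name: string;
  ms: number;
}

// Create a `mark(name)` function that reports the elapsed time since creation
// the first time each name is marked. Repeat marks for a name are ignored.
function createFirstEventTimer(
  now: () => number,
  emit: (metric: Metric) => void,
): (name: string) => void {
  const startedAt = now();
  const seen = new Set<string>();
  return (name: string): void => {
    if (seen.has(name)) return;
    seen.add(name);
    emit({ name, ms: now() - startedAt });
  };
}
```

A typical use would be to create the timer once at app start and call `mark("time-to-first-flight-list")` from the code path that first renders the flight list.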

5. Error boundaries

10_app-shell finding: no <ErrorBoundary> wraps the route tree. A single uncaught render error today crashes the whole SPA. Step 4 / Step 5 surface — add a top-level <ErrorBoundary> plus per-feature boundaries for the canvas / map / chart so isolated failures don't take down the whole UI.

6. Recommended next steps

  1. Add a top-level <ErrorBoundary> in App.tsx with a "something broke" recovery card.
  2. Replace silent catches (} catch {}) with console.error + user toast — at minimum.
  3. Inject a correlation ID (X-Request-Id header) on every fetch + EventSource.
  4. Surface AI-detect progress + errors; see Flow F7 (the flow currently doesn't even subscribe).
  5. Add Sentry (or equivalent) with runtime-config-driven DSN.
  6. Add web-vitals + emit to suite admin/ telemetry endpoint.