diff --git a/.cursor/skills/autodev/flows/existing-code.md b/.cursor/skills/autodev/flows/existing-code.md index 5761320..6af1c97 100644 --- a/.cursor/skills/autodev/flows/existing-code.md +++ b/.cursor/skills/autodev/flows/existing-code.md @@ -152,15 +152,17 @@ If `_docs/02_tasks/` subfolders have some task files already (e.g., refactoring --- **Step 6 — Implement Tests** -Condition (folder fallback): `_docs/02_tasks/todo/` contains task files AND `_dependencies_table.md` exists AND `_docs/03_implementation/implementation_report_tests.md` does not exist. +Condition (folder fallback): `_docs/02_tasks/todo/` contains test task files AND `_dependencies_table.md` exists AND `_docs/03_implementation/implementation_report_tests.md` does not exist. State-driven: reached by auto-chain from Step 5. -Action: Read and execute `.cursor/skills/implement/SKILL.md` +Action: Invoke `.cursor/skills/implement/SKILL.md` with task selection context **Test implementation**. -The implement skill reads test tasks from `_docs/02_tasks/todo/` and implements them. +The implement skill reads only test tasks from `_docs/02_tasks/todo/` and implements them. If `_docs/03_implementation/` has batch reports, the implement skill detects completed tasks and continues. +For folder fallback, **test task files** means `*_test_infrastructure.md` plus task specs whose `**Component**` or `**Epic**` identifies `Blackbox Tests`. + --- **Step 7 — Run Tests** diff --git a/.cursor/skills/autodev/flows/greenfield.md b/.cursor/skills/autodev/flows/greenfield.md index 6f186da..4a177fc 100644 --- a/.cursor/skills/autodev/flows/greenfield.md +++ b/.cursor/skills/autodev/flows/greenfield.md @@ -1,6 +1,6 @@ # Greenfield Workflow -Workflow for new projects built from scratch. 
Flows linearly: Problem → Research → Plan → UI Design (if applicable) → Test Spec → Decompose → Implement → Code Testability Revision → Decompose Tests → Implement Tests → Run Tests → Test-Spec Sync → Update Docs → Security Audit (optional) → Performance Test (optional) → Deploy → Retrospective. +Workflow for new projects built from scratch. Flows linearly: Problem → Research → Plan → UI Design (if applicable) → Test Spec → Decompose → Implement + Product Completeness Gate → Code Testability Revision → Decompose Tests → Implement Tests → Run Tests → Test-Spec Sync → Update Docs → Security Audit (optional) → Performance Test (optional) → Deploy → Retrospective. ## Step Reference Table @@ -11,8 +11,8 @@ Workflow for new projects built from scratch. Flows linearly: Problem → Resear | 3 | Plan | plan/SKILL.md | Step 1–6 + Final | | 4 | UI Design | ui-design/SKILL.md | Phase 0–8 (conditional — UI projects only) | | 5 | Test Spec | test-spec/SKILL.md | Phases 1–4 | -| 6 | Decompose | decompose/SKILL.md | Step 1–4 | -| 7 | Implement | implement/SKILL.md | (batch-driven, no fixed sub-steps) | +| 6 | Decompose | decompose/SKILL.md (implementation task decomposition) | Step 1 + Step 1.5 + Step 2 + Step 4 | +| 7 | Implement | implement/SKILL.md | Batch loop + Product Implementation Completeness Gate | | 8 | Code Testability Revision | refactor/SKILL.md (guided mode) | Phases 0–7 (conditional) | | 9 | Decompose Tests | decompose/SKILL.md (tests-only) | Step 1t + Step 3 + Step 4 | | 10 | Implement Tests | implement/SKILL.md | (batch-driven, no fixed sub-steps) | @@ -112,27 +112,36 @@ This step converts the greenfield problem statement, acceptance criteria, soluti **Step 6 — Decompose** Condition: `_docs/02_document/` contains `architecture.md` AND `_docs/02_document/components/` has at least one component AND `_docs/02_document/tests/traceability-matrix.md` exists AND `_docs/02_tasks/todo/` does not exist or has no implementation task files. 
-Action: Read and execute `.cursor/skills/decompose/SKILL.md` in normal implementation mode. Test tasks are intentionally deferred to Step 9 (Decompose Tests) so the first implementation batch stays focused on product functionality. +Action: Invoke `.cursor/skills/decompose/SKILL.md` for **implementation task decomposition**. The greenfield flow selects the implementation entrypoint before handing off: Bootstrap Structure, Module Layout, Component Task Decomposition, and Cross-Task Verification. + +Do not invoke Blackbox Test Task Decomposition from Step 6. Test tasks are intentionally deferred to Step 9 (Decompose Tests) so the first implementation batch stays focused on product functionality and Step 8 can revise testability before test task files exist. If `_docs/02_tasks/` subfolders have some task files already, the decompose skill's resumability handles it. --- **Step 7 — Implement** -Condition: `_docs/02_tasks/todo/` contains implementation task files AND `_dependencies_table.md` exists AND `_docs/03_implementation/` does not contain any product `implementation_report_*.md` file. +Condition: `_docs/02_tasks/todo/` contains implementation task files AND `_dependencies_table.md` exists AND `_docs/03_implementation/` does not contain a valid product implementation report. -Action: Read and execute `.cursor/skills/implement/SKILL.md` +Action: Invoke `.cursor/skills/implement/SKILL.md` with task selection context **Product implementation**. + +The implement skill must run its **Product Implementation Completeness Gate** before it writes any final product implementation report. This gate compares completed product task specs, architecture/component promises, and actual source code so scaffold-only implementations cannot advance to Step 8. A final product implementation report without `_docs/03_implementation/implementation_completeness_cycle[N]_report.md` is incomplete and must not be treated as Step 7 completion. 
If `_docs/03_implementation/` has batch reports, the implement skill detects completed tasks and continues. The FINAL report filename is context-dependent — see implement skill documentation for naming convention. For folder fallback, **implementation task files** means task specs that are not test-only specs: exclude `*_test_infrastructure.md` and task specs whose `**Component**` or `**Epic**` identifies `Blackbox Tests`. -For folder fallback, a **product implementation report** is any `_docs/03_implementation/implementation_report_*.md` file except `_docs/03_implementation/implementation_report_tests.md` and refactor reports. +For folder fallback, a **product implementation report** is any `_docs/03_implementation/implementation_report_*.md` file except `_docs/03_implementation/implementation_report_tests.md` and refactor reports. It is valid for greenfield progression only when: +- the matching `_docs/03_implementation/implementation_completeness_cycle[N]_report.md` exists, +- that completeness report does not contain unresolved `FAIL` classifications, and +- `_docs/02_tasks/todo/` contains no pending implementation task files. + +If a product report exists but any of those validity checks fail, treat product implementation as incomplete and stay in Step 7. --- **Step 8 — Code Testability Revision** -Condition (folder fallback): `_docs/03_implementation/` contains a product implementation report AND `_docs/04_refactoring/01-testability-refactoring/testability_assessment.md` does not exist AND `_docs/04_refactoring/01-testability-refactoring/testability_changes_summary.md` does not exist AND `_docs/03_implementation/implementation_report_tests.md` does not exist AND `_docs/02_tasks/todo/` does not contain test task files. 
+Condition (folder fallback): `_docs/03_implementation/` contains a valid product implementation report, `_docs/03_implementation/implementation_completeness_cycle[N]_report.md` exists without unresolved `FAIL` classifications, `_docs/04_refactoring/01-testability-refactoring/testability_assessment.md` does not exist, `_docs/04_refactoring/01-testability-refactoring/testability_changes_summary.md` does not exist, `_docs/03_implementation/implementation_report_tests.md` does not exist, and `_docs/02_tasks/todo/` does not contain test task files. State-driven: reached by auto-chain from Step 7. **Purpose**: verify the newly built code can be exercised by the planned tests before writing the test suite. Greenfield code should be testable by design; this step catches accidental hardcoded paths, singletons, direct external service construction, or other implementation choices that would make meaningful tests impossible. @@ -184,7 +193,7 @@ Action: Analyze the codebase against the test specs to determine whether the cod --- **Step 9 — Decompose Tests** -Condition (folder fallback): `_docs/02_document/tests/traceability-matrix.md` exists AND workspace contains source code files AND `_docs/03_implementation/` contains a product implementation report AND (`_docs/04_refactoring/01-testability-refactoring/testability_assessment.md` exists OR `_docs/04_refactoring/01-testability-refactoring/testability_changes_summary.md` exists) AND (`_docs/02_tasks/todo/` does not exist or has no test task files) AND `_docs/03_implementation/implementation_report_tests.md` does not exist. 
+Condition (folder fallback): `_docs/02_document/tests/traceability-matrix.md` exists AND workspace contains source code files AND `_docs/03_implementation/` contains a valid product implementation report AND `_docs/03_implementation/implementation_completeness_cycle[N]_report.md` exists without unresolved `FAIL` classifications AND (`_docs/04_refactoring/01-testability-refactoring/testability_assessment.md` exists OR `_docs/04_refactoring/01-testability-refactoring/testability_changes_summary.md` exists) AND (`_docs/02_tasks/todo/` does not exist or has no test task files) AND `_docs/03_implementation/implementation_report_tests.md` does not exist. State-driven: reached by auto-chain from Step 8. Action: Read and execute `.cursor/skills/decompose/SKILL.md` in **tests-only mode** (pass `_docs/02_document/tests/` as input). The decompose skill will: @@ -200,9 +209,9 @@ If `_docs/02_tasks/` subfolders have some task files already, the decompose skil Condition (folder fallback): `_docs/02_tasks/todo/` contains test task files AND `_dependencies_table.md` exists AND `_docs/03_implementation/implementation_report_tests.md` does not exist. State-driven: reached by auto-chain from Step 9. -Action: Read and execute `.cursor/skills/implement/SKILL.md` +Action: Invoke `.cursor/skills/implement/SKILL.md` with task selection context **Test implementation**. -The implement skill reads test tasks from `_docs/02_tasks/todo/` and implements them. +The implement skill reads only test tasks from `_docs/02_tasks/todo/` and implements them. If `_docs/03_implementation/` has batch reports, the implement skill detects completed test tasks and continues. 
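Reviewer note: the folder-fallback rules in the greenfield hunks above (what counts as a test task file, and when a product implementation report is valid) could be sketched roughly as follows. This is a minimal illustration, not the skill's actual logic. It assumes task headers look like `**Component**: Blackbox Tests`, that refactor reports can be recognized by `refactor` in the filename, that any completeness report stands in for the "matching" one, and that any occurrence of `FAIL` in that report counts as unresolved. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: task headers are written as "**Component**: ..." or "**Epic**: ...".
HEADER_RE = re.compile(r"^\*\*(?:Component|Epic)\*\*:.*Blackbox Tests", re.MULTILINE)


def is_test_task(path: Path) -> bool:
    """Test task file: *_test_infrastructure.md, or Component/Epic names Blackbox Tests."""
    if path.name.endswith("_test_infrastructure.md"):
        return True
    return HEADER_RE.search(path.read_text(encoding="utf-8")) is not None


def pending_tasks(todo_dir: Path) -> list[Path]:
    """Task specs in todo/, skipping underscore-prefixed files such as _dependencies_table.md."""
    if not todo_dir.is_dir():
        return []
    return sorted(p for p in todo_dir.glob("*.md") if not p.name.startswith("_"))


def has_valid_product_report(impl_dir: Path, todo_dir: Path) -> bool:
    """Apply the three validity checks listed for Step 7 folder fallback."""
    reports = [
        p for p in impl_dir.glob("implementation_report_*.md")
        # Assumption: refactor reports carry "refactor" in their filename.
        if p.name != "implementation_report_tests.md" and "refactor" not in p.name
    ]
    if not reports:
        return False
    # Simplification: any cycle report is accepted as the "matching" one.
    gates = list(impl_dir.glob("implementation_completeness_cycle*_report.md"))
    if not gates:
        return False
    # Simplification: any "FAIL" text is treated as an unresolved classification.
    if any("FAIL" in g.read_text(encoding="utf-8") for g in gates):
        return False
    # No pending implementation (non-test) task files may remain in todo/.
    return not any(not is_test_task(p) for p in pending_tasks(todo_dir))
```

Under these assumptions, a `False` result corresponds to "stay in Step 7", and a `True` result lets the Step 8 and Step 9 conditions be evaluated.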
@@ -319,7 +328,7 @@ On the next invocation, Flow Resolution rule 1 reads `flow: existing-code` and r | UI Design (4, done or skipped) | Auto-chain → Test Spec (5) | | Test Spec (5) | Auto-chain → Decompose (6) | | Decompose (6) | **Session boundary** — suggest new conversation before Implement | -| Implement (7) | Auto-chain → Code Testability Revision (8) | +| Implement (7) | Auto-chain only after Product Implementation Completeness Gate passes → Code Testability Revision (8) | | Code Testability Revision (8) | Auto-chain → Decompose Tests (9) | | Decompose Tests (9) | **Session boundary** — suggest new conversation before Implement Tests | | Implement Tests (10) | Auto-chain → Run Tests (11) | diff --git a/.cursor/skills/autodev/protocols.md b/.cursor/skills/autodev/protocols.md index beee18b..edc4037 100644 --- a/.cursor/skills/autodev/protocols.md +++ b/.cursor/skills/autodev/protocols.md @@ -110,7 +110,7 @@ Before entering a step from this table for the first time in a session, verify t | Flow | Step | Sub-Step | Tracker Action | |------|------|----------|----------------| | greenfield | Plan | Step 6 — Epics | Create epics for each component | -| greenfield | Decompose | Step 1 + Step 2 + Step 3 — All tasks | Create ticket per task, link to epic | +| greenfield | Decompose | Implementation decomposition Step 1 + Step 2 — Product tasks | Create ticket per product task, link to epic | | greenfield | Decompose Tests | Step 1t + Step 3 — All test tasks | Create ticket per task, link to epic | | existing-code | Decompose Tests | Step 1t + Step 3 — All test tasks | Create ticket per task, link to epic | | existing-code | New Task | Step 7 — Ticket | Create ticket per task, link to epic | diff --git a/.cursor/skills/decompose/SKILL.md b/.cursor/skills/decompose/SKILL.md index fa1e789..8bde60e 100644 --- a/.cursor/skills/decompose/SKILL.md +++ b/.cursor/skills/decompose/SKILL.md @@ -2,8 +2,8 @@ name: decompose description: | Decompose planned components into atomic 
implementable tasks with bootstrap structure plan. - 4-step workflow: bootstrap structure plan, component task decomposition, blackbox test task decomposition, and cross-task verification. - Supports full decomposition (_docs/ structure), single component mode, and tests-only mode. + Workflow entrypoints: implementation task decomposition, single component decomposition, and tests-only decomposition. + The invoking flow decides which entrypoint to run; this skill executes that selected sequence. Trigger phrases: - "decompose", "decompose features", "feature decomposition" - "task decomposition", "break down components" @@ -20,7 +20,7 @@ Decompose planned components into atomic, implementable task specs with a bootst ## Core Principles -- **Atomic tasks**: each task does one thing; if it exceeds 8 complexity points, split it +- **Atomic tasks**: each task does one thing; if it exceeds 5 complexity points, split it - **Behavioral specs, not implementation plans**: describe what the system should do, not how to build it - **Flat structure**: all tasks are tracker-ID-prefixed files in TASKS_DIR — no component subdirectories - **Save immediately**: write artifacts to disk after each task; never accumulate unsaved work @@ -30,14 +30,15 @@ Decompose planned components into atomic, implementable task specs with a bootst ## Context Resolution -Determine the operating mode based on invocation before any other logic runs. +Resolve the selected entrypoint from the invocation context before any other logic runs. The caller decides whether this is implementation, single component, or tests-only decomposition; this skill only executes the selected sequence. 
-**Default** (no explicit input file provided): +**Implementation task decomposition** (default; selected by flows before invoking this skill): - DOCUMENT_DIR: `_docs/02_document/` - TASKS_DIR: `_docs/02_tasks/` - TASKS_TODO: `_docs/02_tasks/todo/` - Reads from: `_docs/00_problem/`, `_docs/01_solution/`, DOCUMENT_DIR +- Produces only implementation tasks. Blackbox/e2e test task files are produced only when the invoking flow selects tests-only decomposition. **Single component mode** (provided file is within `_docs/02_document/` and inside a `components/` subdirectory): @@ -55,24 +56,24 @@ Determine the operating mode based on invocation before any other logic runs. - TESTS_DIR: `DOCUMENT_DIR/tests/` - Reads from: `_docs/00_problem/`, `_docs/01_solution/`, TESTS_DIR -Announce the detected mode and resolved paths to the user before proceeding. +Announce the selected entrypoint and resolved paths to the user before proceeding. ### Step Applicability by Mode -| Step | File | Default | Single | Tests-only | -|------|------|:-------:|:------:|:----------:| +| Step | File | Implementation | Single | Tests-only | +|------|------|:--------------:|:------:|:----------:| | 1 Bootstrap Structure | `steps/01_bootstrap-structure.md` | ✓ | — | — | | 1t Test Infrastructure | `steps/01t_test-infrastructure.md` | — | — | ✓ | | 1.5 Module Layout | `steps/01-5_module-layout.md` | ✓ | — | — | | 2 Task Decomposition | `steps/02_task-decomposition.md` | ✓ | ✓ | — | -| 3 Blackbox Test Tasks | `steps/03_blackbox-test-decomposition.md` | ✓ | — | ✓ | +| 3 Blackbox Test Tasks | `steps/03_blackbox-test-decomposition.md` | — | — | ✓ | | 4 Cross-Verification | `steps/04_cross-verification.md` | ✓ | — | ✓ | ## Input Specification ### Required Files -**Default:** +**Implementation task decomposition:** | File | Purpose | |------|---------| @@ -84,7 +85,7 @@ Announce the detected mode and resolved paths to the user before proceeding. 
| `DOCUMENT_DIR/glossary.md` | Project terminology (confirmed by user in plan Phase 2a.0 or document Step 4.5). Use it to keep task names, component references, and AC wording consistent with the user's vocabulary | | `DOCUMENT_DIR/system-flows.md` | System flows from plan skill | | `DOCUMENT_DIR/components/[##]_[name]/description.md` | Component specs from plan skill | -| `DOCUMENT_DIR/tests/` | Blackbox test specs from plan skill | +| `DOCUMENT_DIR/tests/` | Optional product acceptance context from test-spec skill; do not create test task files from it in this entrypoint | **Single component mode:** @@ -111,7 +112,7 @@ Announce the detected mode and resolved paths to the user before proceeding. ### Prerequisite Checks (BLOCKING) -**Default:** +**Implementation task decomposition:** 1. DOCUMENT_DIR contains `architecture.md` and `components/` — **STOP if missing** 2. Create TASKS_DIR and TASKS_TODO if they do not exist @@ -145,6 +146,8 @@ TASKS_DIR/ **Naming convention**: Each task file is initially saved in `TASKS_TODO/` with a temporary numeric prefix (`[##]_[short_name].md`). After creating the work item ticket, rename the file to use the work item ticket ID as prefix (`[TRACKER-ID]_[short_name].md`). For example: `todo/01_initial_structure.md` → `todo/AZ-42_initial_structure.md`. +If tracker availability fails, follow `.cursor/rules/tracker.mdc` before continuing. Only when the user explicitly chooses `tracker: local` may the numeric prefix remain; in that mode set `Tracker: pending` and `Epic: pending` in the task header and keep the task eligible for later tracker sync. + ### Save Timing | Step | Save immediately after | Filename | @@ -166,11 +169,11 @@ If TASKS_DIR subfolders already contain task files: ## Progress Tracking -At the start of execution, create a TodoWrite with all applicable steps for the detected mode (see Step Applicability table). Update status as each step/component completes. 
+At the start of execution, create a TodoWrite with all applicable steps for the selected entrypoint (see Step Applicability table). Update status as each step/component completes. ## Workflow -### Step 1: Bootstrap Structure Plan (default mode only) +### Step 1: Bootstrap Structure Plan (implementation mode only) Read and follow `steps/01_bootstrap-structure.md`. @@ -182,25 +185,25 @@ Read and follow `steps/01t_test-infrastructure.md`. --- -### Step 1.5: Module Layout (default mode only) +### Step 1.5: Module Layout (implementation mode only) Read and follow `steps/01-5_module-layout.md`. --- -### Step 2: Task Decomposition (default and single component modes) +### Step 2: Task Decomposition (implementation and single component modes) Read and follow `steps/02_task-decomposition.md`. --- -### Step 3: Blackbox Test Task Decomposition (default and tests-only modes) +### Step 3: Blackbox Test Task Decomposition (tests-only mode only) Read and follow `steps/03_blackbox-test-decomposition.md`. --- -### Step 4: Cross-Task Verification (default and tests-only modes) +### Step 4: Cross-Task Verification (implementation and tests-only modes) Read and follow `steps/04_cross-verification.md`. @@ -208,7 +211,7 @@ Read and follow `steps/04_cross-verification.md`. - **Coding during decomposition**: this workflow produces specs, never code - **Over-splitting**: don't create many tasks if the component is simple — 1 task is fine -- **Tasks exceeding 8 points**: split them; no task should be too complex for a single implementer +- **Tasks exceeding 5 points**: split them; no task should be too complex for a single implementer - **Cross-component tasks**: each task belongs to exactly one component - **Skipping BLOCKING gates**: never proceed past a BLOCKING marker without user confirmation - **Creating git branches**: branch creation is an implementation concern, not a decomposition one @@ -221,7 +224,7 @@ Read and follow `steps/04_cross-verification.md`. 
| Situation | Action | |-----------|--------| | Ambiguous component boundaries | ASK user | -| Task complexity exceeds 8 points after splitting | ASK user | +| Task complexity exceeds 5 points after splitting | ASK user | | Missing component specs in DOCUMENT_DIR | ASK user | | Cross-component dependency conflict | ASK user | | Tracker epic not found for a component | ASK user for Epic ID | @@ -233,15 +236,14 @@ Read and follow `steps/04_cross-verification.md`. ┌────────────────────────────────────────────────────────────────┐ │ Task Decomposition (Multi-Mode) │ ├────────────────────────────────────────────────────────────────┤ -│ CONTEXT: Resolve mode (default / single component / tests-only) │ +│ CONTEXT: Invoke the selected entrypoint (implementation / single / tests-only) │ │ │ -│ DEFAULT MODE: │ +│ IMPLEMENTATION TASK DECOMPOSITION: │ │ 1. Bootstrap Structure → steps/01_bootstrap-structure.md │ │ [BLOCKING: user confirms structure] │ │ 1.5 Module Layout → steps/01-5_module-layout.md │ │ [BLOCKING: user confirms layout] │ │ 2. Component Tasks → steps/02_task-decomposition.md │ -│ 3. Blackbox Tests → steps/03_blackbox-test-decomposition.md │ │ 4. Cross-Verification → steps/04_cross-verification.md │ │ [BLOCKING: user confirms dependencies] │ │ │ diff --git a/.cursor/skills/decompose/steps/02_task-decomposition.md b/.cursor/skills/decompose/steps/02_task-decomposition.md index 77ad15a..572a81a 100644 --- a/.cursor/skills/decompose/steps/02_task-decomposition.md +++ b/.cursor/skills/decompose/steps/02_task-decomposition.md @@ -26,7 +26,7 @@ For each component (or the single provided component): 4. Do not create tasks for other components — only tasks for the current component 5. Each task should be atomic, containing 1 API or a list of semantically connected APIs 6. Write each task spec using `templates/task.md` -7. Estimate complexity per task (1, 2, 3, 5, 8 points); no task should exceed 8 points — split if it does +7. 
Estimate complexity per task (1, 2, 3, 5 points); no task should exceed 5 points — split if it does 8. Note task dependencies (referencing tracker IDs of already-created dependency tasks, e.g., `AZ-42_initial_structure`) 9. **Cross-cutting rule**: if a concern spans ≥2 components (logging, config loading, auth/authZ, error envelope, telemetry, feature flags, i18n), create ONE shared task under the cross-cutting epic. Per-component tasks declare it as a dependency and consume it; they MUST NOT re-implement it locally. Duplicate local implementations are an `Architecture` finding (High) in code-review Phase 7 and a `Maintainability` finding in Phase 6. 10. **Shared-models / shared-API rule**: classify the task as shared if ANY of the following is true: @@ -46,7 +46,7 @@ For each component (or the single provided component): ## Self-verification (per component) - [ ] Every task is atomic (single concern) -- [ ] No task exceeds 8 complexity points +- [ ] No task exceeds 5 complexity points - [ ] Task dependencies reference correct tracker IDs - [ ] Tasks cover all interfaces defined in the component spec - [ ] No tasks duplicate work from other components diff --git a/.cursor/skills/decompose/steps/03_blackbox-test-decomposition.md b/.cursor/skills/decompose/steps/03_blackbox-test-decomposition.md index 6dfe929..8073d9a 100644 --- a/.cursor/skills/decompose/steps/03_blackbox-test-decomposition.md +++ b/.cursor/skills/decompose/steps/03_blackbox-test-decomposition.md @@ -1,4 +1,4 @@ -# Step 3: Blackbox Test Task Decomposition (default and tests-only modes) +# Step 3: Blackbox Test Task Decomposition (tests-only mode only) **Role**: Professional Quality Assurance Engineer **Goal**: Decompose blackbox test specs into atomic, implementable task specs. @@ -6,7 +6,6 @@ ## Numbering -- In default mode: continue sequential numbering from where Step 2 left off. - In tests-only mode: start from 02 (01 is the test infrastructure bootstrap from Step 1t). 
## Steps @@ -15,10 +14,9 @@ 2. Group related test scenarios into atomic tasks (e.g., one task per test category or per component under test) 3. Each task should reference the specific test scenarios it implements and the environment/test-data specs 4. Dependencies: - - In default mode: blackbox test tasks depend on the component implementation tasks they exercise - In tests-only mode: blackbox test tasks depend on the test infrastructure bootstrap task (Step 1t) 5. Write each task spec using `templates/task.md` -6. Estimate complexity per task (1, 2, 3, 5, 8 points); no task should exceed 8 points — split if it does +6. Estimate complexity per task (1, 2, 3, 5 points); no task should exceed 5 points — split if it does 7. Note task dependencies (referencing tracker IDs of already-created dependency tasks) 8. **Immediately after writing each task file**: create a work item ticket under the "Blackbox Tests" epic, write the work item ticket ID and Epic ID back into the task header, then rename the file from `todo/[##]_[short_name].md` to `todo/[TRACKER-ID]_[short_name].md`. 
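Reviewer note: the new 5-point ceiling and the post-ticket rename described in this hunk can be illustrated with a small sketch. The helper names `check_complexity` and `tracker_filename` are hypothetical and appear nowhere in the skill files. The sketch assumes the temporary prefix is purely numeric, as in `[##]_[short_name].md`.

```python
import re
from pathlib import PurePosixPath

# Point scale after this change: 1, 2, 3, 5. Anything above 5 must be split.
ALLOWED_POINTS = {1, 2, 3, 5}


def check_complexity(points: int) -> str:
    """Return "ok" for an allowed estimate and "split" for anything above 5 points."""
    if points in ALLOWED_POINTS:
        return "ok"
    if points > 5:
        return "split"
    raise ValueError(f"{points} is not on the 1/2/3/5 scale")


def tracker_filename(temp_path: str, tracker_id: str) -> str:
    """Swap the temporary numeric prefix for the tracker ID, keeping the short name."""
    path = PurePosixPath(temp_path)
    match = re.fullmatch(r"\d+_(?P<short>.+)\.md", path.name)
    if match is None:
        raise ValueError(f"{path.name} does not look like [##]_[short_name].md")
    return str(path.with_name(f"{tracker_id}_{match['short']}.md"))
```

For example, `tracker_filename("todo/01_initial_structure.md", "AZ-42")` yields `todo/AZ-42_initial_structure.md`, matching the rename example in the decompose skill's naming convention.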
@@ -26,8 +24,8 @@ - [ ] Every scenario from `tests/blackbox-tests.md` is covered by a task - [ ] Every scenario from `tests/performance-tests.md`, `tests/resilience-tests.md`, `tests/security-tests.md`, and `tests/resource-limit-tests.md` is covered by a task -- [ ] No task exceeds 8 complexity points -- [ ] Dependencies correctly reference the dependency tasks (component tasks in default mode, test infrastructure in tests-only mode) +- [ ] No task exceeds 5 complexity points +- [ ] Dependencies correctly reference the test infrastructure task - [ ] Every task has a work item ticket linked to the "Blackbox Tests" epic ## Save action diff --git a/.cursor/skills/decompose/steps/04_cross-verification.md b/.cursor/skills/decompose/steps/04_cross-verification.md index 6de043a..3435b82 100644 --- a/.cursor/skills/decompose/steps/04_cross-verification.md +++ b/.cursor/skills/decompose/steps/04_cross-verification.md @@ -1,4 +1,4 @@ -# Step 4: Cross-Task Verification (default and tests-only modes) +# Step 4: Cross-Task Verification (implementation and tests-only modes) **Role**: Professional software architect and analyst **Goal**: Verify task consistency and produce `_dependencies_table.md`. @@ -8,7 +8,7 @@ 1. Verify task dependencies across all tasks are consistent 2. Check no gaps: - - In default mode: every interface in `architecture.md` has tasks covering it + - In implementation mode: every product interface in `architecture.md` has implementation task coverage - In tests-only mode: every test scenario in `traceability-matrix.md` is covered by a task 3. Check no overlaps: tasks don't duplicate work 4. 
Check no circular dependencies in the task graph @@ -16,9 +16,9 @@ ## Self-verification -### Default mode +### Implementation mode -- [ ] Every architecture interface is covered by at least one task +- [ ] Every product interface in `architecture.md` is covered by at least one implementation task - [ ] No circular dependencies in the task graph - [ ] Cross-component dependencies are explicitly noted in affected task specs - [ ] `_dependencies_table.md` contains every task with correct dependencies diff --git a/.cursor/skills/decompose/templates/dependencies-table.md b/.cursor/skills/decompose/templates/dependencies-table.md index 3390fba..868bb76 100644 --- a/.cursor/skills/decompose/templates/dependencies-table.md +++ b/.cursor/skills/decompose/templates/dependencies-table.md @@ -28,4 +28,4 @@ Use this template after cross-task verification. Save as `TASKS_DIR/_dependencie - Dependencies column lists tracker IDs (e.g., "AZ-43, AZ-44") or "None" - No circular dependencies allowed - Tasks should be listed in recommended execution order -- The `/implement` skill reads this table to compute parallel batches +- The `/implement` skill reads this table to compute dependency-aware batches; task execution remains sequential diff --git a/.cursor/skills/decompose/templates/task.md b/.cursor/skills/decompose/templates/task.md index 7b90b71..01b980e 100644 --- a/.cursor/skills/decompose/templates/task.md +++ b/.cursor/skills/decompose/templates/task.md @@ -11,7 +11,7 @@ Save as `TASKS_DIR/[##]_[short_name].md` initially, then rename to `TASKS_DIR/[T **Task**: [TRACKER-ID]_[short_name] **Name**: [short human name] **Description**: [one-line description of what this task delivers] -**Complexity**: [1|2|3|5|8] points +**Complexity**: [1|2|3|5] points **Dependencies**: [AZ-43_shared_models, AZ-44_db_migrations] or "None" **Component**: [component name for context] **Tracker**: [TASK-ID] @@ -102,8 +102,7 @@ Consumers MUST read that file — not this task spec — to discover the 
interfa - 2 points: Non-trivial, low complexity, minimal coordination - 3 points: Multi-step, moderate complexity, potential alignment needed - 5 points: Difficult, interconnected logic, medium-high risk -- 8 points: High difficulty, high ambiguity or coordination, multiple components -- 13 points: Too complex — split into smaller tasks +- 8+ points: Too complex — split into smaller tasks ## Output Guidelines diff --git a/.cursor/skills/implement/SKILL.md b/.cursor/skills/implement/SKILL.md index 099f71f..b9a7ac9 100644 --- a/.cursor/skills/implement/SKILL.md +++ b/.cursor/skills/implement/SKILL.md @@ -25,6 +25,7 @@ For each task the main agent receives a task spec, analyzes the codebase, implem - **Dependency-aware ordering**: tasks run only when all their dependencies are satisfied - **Batching for review, not parallelism**: tasks are grouped into batches so `/code-review` and commits operate on a coherent unit of work — all tasks inside a batch are still implemented one after the other - **Integrated review**: `/code-review` skill runs automatically after each batch +- **Completeness before testing**: product implementation is not done until code is checked against task outcomes, included scope, architecture/component promises, and unresolved scaffold/native placeholders — not just task AC tests - **Auto-start**: batches start immediately — no user confirmation before a batch - **Gate on failure**: user confirmation is required only when code review returns FAIL - **Commit per batch**: after each batch is confirmed, commit. Ask the user whether to push to remote unless the user previously opted into auto-push for this session. 
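Reviewer note: the "dependency-aware ordering" principle (tasks run only when all their dependencies are satisfied, and execution stays sequential) can be sketched with the standard-library `graphlib` module. This is only an illustration under stated assumptions: the dependency table is assumed to be already parsed into a dict of task ID to dependency IDs, and `execution_order` is a hypothetical helper, not part of the implement skill.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(deps: dict[str, set[str]], done: set[str]) -> list[str]:
    """Return a sequential order in which every task follows its dependencies.

    deps maps each pending task ID to the IDs it depends on;
    done holds IDs of tasks already archived as completed.
    """
    referenced = {d for ds in deps.values() for d in ds}
    missing = referenced - set(deps) - done
    if missing:
        raise ValueError(f"unknown dependencies: {sorted(missing)}")
    # Dependencies that are already completed no longer constrain the order.
    pending = {task: {d for d in ds if d not in done} for task, ds in deps.items()}
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(pending).static_order())
```

A caller could catch `CycleError` and stop, since circular dependencies are rejected during validation. Grouping the resulting order into review batches is a separate concern and is not shown here.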
@@ -32,9 +33,26 @@ For each task the main agent receives a task spec, analyzes the codebase, implem ## Context Resolution - TASKS_DIR: `_docs/02_tasks/` -- Task files: all `*.md` files in `TASKS_DIR/todo/` (excluding files starting with `_`) +- Task files: selected `*.md` files in `TASKS_DIR/todo/` (excluding files starting with `_`) - Dependency table: `TASKS_DIR/_dependencies_table.md` +### Task Selection Context + +The invoking flow decides which task category this run should execute. The implement skill must honor that selected context instead of consuming every file in `todo/`. + +| Context | Selected task files | +|---------|---------------------| +| Product implementation | Task specs that are not test-only and not refactoring specs | +| Test implementation | `*_test_infrastructure.md` plus task specs whose `Component` or `Epic` identifies `Blackbox Tests` | +| Refactoring | Task specs whose filename or task ID includes `_refactor_` | + +If no explicit context is provided, infer it from the active autodev step: +- greenfield Step 7 or existing-code Step 10 → Product implementation +- greenfield Step 10 or existing-code Step 6 → Test implementation +- refactor Phase 4 → Refactoring + +Unselected task files remain in `TASKS_DIR/todo/` for their later flow step. + ### Task Lifecycle Folders ``` @@ -47,7 +65,7 @@ TASKS_DIR/ ## Prerequisite Checks (BLOCKING) -1. `TASKS_DIR/todo/` exists and contains at least one task file — **STOP if missing** +1. `TASKS_DIR/todo/` exists and contains at least one task file for the selected context — **STOP if missing** 2. `_dependencies_table.md` exists — **STOP if missing** 3. At least one task is not yet completed — **STOP if all done** 4. **Working tree is clean** — run `git status --porcelain`; the output must be empty. @@ -62,9 +80,9 @@ TASKS_DIR/ ### 1. 
Parse -- Read all task `*.md` files from `TASKS_DIR/todo/` (excluding files starting with `_`) +- Read selected task `*.md` files from `TASKS_DIR/todo/` (excluding files starting with `_`) - Read `_dependencies_table.md` — parse into a dependency graph (DAG) -- Validate: no circular dependencies, all referenced dependencies exist +- Validate: no circular dependencies in the selected task graph, all referenced selected-task dependencies exist or are already completed in `TASKS_DIR/done/` ### 2. Detect Progress @@ -102,7 +120,7 @@ If `_docs/02_document/module-layout.md` is missing or the component is not found ### 5. Update Tracker Status → In Progress -For each task in the batch, transition its ticket status to **In Progress** via the configured work item tracker (see `protocols.md` for tracker detection) before starting work. If `tracker: local`, skip this step. +For each task in the batch, transition its ticket status to **In Progress** via the configured work item tracker (see `protocols.md` for tracker detection) before starting work. If `tracker: local`, skip this step. If a tracker operation fails unexpectedly, follow `.cursor/rules/tracker.mdc`. ### 6. Implement Tasks Sequentially @@ -188,12 +206,14 @@ Track `auto_fix_attempts` and `escalated_findings` in the batch report for retro ### 12. Update Tracker Status → In Testing -After the batch is committed and pushed, transition the ticket status of each task in the batch to **In Testing** via the configured work item tracker. If `tracker: local`, skip this step. +After the batch is committed (and pushed if the user approved pushing), transition the ticket status of each task in the batch to **In Testing** via the configured work item tracker. If `tracker: local`, skip this step. If a tracker operation fails unexpectedly, follow `.cursor/rules/tracker.mdc`. ### 13. Archive Completed Tasks Move each completed task file from `TASKS_DIR/todo/` to `TASKS_DIR/done/`. 
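The Parse-step validation above (no cycles among the selected tasks, and every dependency either selected or already archived in `TASKS_DIR/done/`) can be sketched as follows. This is a minimal illustration, not the implement skill's actual code: the function name `validate_selected_graph` and its in-memory inputs are assumptions, and parsing of `_dependencies_table.md` is left out.

```python
from graphlib import CycleError, TopologicalSorter


def validate_selected_graph(selected, done, deps):
    """Return (order, problems) for the selected task context.

    selected: set of task IDs chosen for this run (hypothetical input)
    done:     set of task IDs already archived in done/
    deps:     mapping of task ID -> list of dependency task IDs
    """
    problems = []
    graph = {}
    for task in sorted(selected):
        in_run = []
        for dep in deps.get(task, []):
            if dep in selected:
                # Dependency is part of this run, so it constrains ordering.
                in_run.append(dep)
            elif dep not in done:
                # Neither selected nor completed: the reference is dangling.
                problems.append(f"{task}: dependency {dep} is neither selected nor done")
        graph[task] = in_run
    try:
        # static_order() yields dependencies before their dependents.
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        problems.append(f"cycle: {exc.args[1]}")
        order = []
    return order, problems
```

Dependencies that are already in `done/` are dropped from the graph rather than ordered, which matches the rule that completed work satisfies a dependency without being re-run.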
+For product implementation, this archive means "batch implementation accepted." The Product Implementation Completeness Gate can still require follow-up remediation tasks before the feature is complete; it does not move original task files back to `todo/`. + ### 14. Loop - Go back to step 2 until all tasks in `todo/` are done @@ -215,16 +235,70 @@ Move each completed task file from `TASKS_DIR/todo/` to `TASKS_DIR/done/`. - **Interaction with Auto-Fix Gate**: Architecture findings (new category from code-review Phase 7) always escalate per the implement auto-fix matrix; they cannot silently auto-fix - **Resumability**: if interrupted, the next invocation checks for the latest `cumulative_review_batches_*.md` and computes the changed-file set from batch reports produced after that review -### 15. Final Test Run +### 15. Product Implementation Completeness Gate -- After all batches are complete, run the full test suite once -- Read and execute `.cursor/skills/test-run/SKILL.md` (detect runner, run suite, diagnose failures, present blocking choices) -- Test failures are a **blocking gate** — do not proceed until the test-run skill completes with a user decision -- When tests pass, report final summary +Run this gate after all **product implementation** tasks are complete and before writing any final product implementation report or allowing autodev to proceed to testability/test decomposition. Skip this gate only when the remaining context is explicitly test implementation or refactoring, as determined by the task files and report filename rules. + +**Goal**: catch the failure mode where narrow tests validate scaffold behavior while the task's actual outcome, included scope, architecture promise, or named integration remains unimplemented. 
+ +Inputs: + +- Completed product task specs from `_docs/02_tasks/done/` for the current cycle +- `_docs/02_document/architecture.md` +- `_docs/02_document/system-flows.md` +- Relevant `_docs/02_document/components/*/description.md` files +- Current source code under each completed task's ownership envelope +- Batch reports and code-review reports for the current cycle + +For each completed product task: + +1. Read these sections from the task spec: `Description`, `Outcome`, `Scope / Included`, `Acceptance Criteria`, `Non-Functional Requirements`, `Constraints`, and explicit named technologies or integrations. +2. Compare those promises against actual source code, not only tests or report prose. +3. Search the task's owned component files for unresolved implementation markers: `placeholder`, `stub`, `reserved`, `TODO`, `NotImplemented`, `pass`, `deterministic`, `fake`, `mock`, `scaffold`, `native bridge`, and empty native/readme-only integration directories. Ignore test fixtures/mocks only when they are under test-owned paths and not used as production behavior. +4. Verify that each named runtime dependency in the task promise is either integrated behind the approved boundary or explicitly documented as a blocked prerequisite in the task/report. Examples: if a task promises FAISS, DINOv2, BASALT, LightGlue, OpenCV, RANSAC, a database, cloud service, or hardware SDK, the production code must contain that integration boundary; a deterministic fallback alone is not complete. +5. Verify tests exercise the real implementation path where local prerequisites exist. Environment-gated tests may skip only with an explicit prerequisite reason; they do not make missing production code complete. +6. Classify each task: + - **PASS**: task promises are implemented or explicitly out of scope in the task itself. 
+ - **BLOCKED**: production code exists but cannot be fully verified due to external hardware/data/license/runtime prerequisites; the blocker is explicit and tests report blocked/skipped with reason. + - **FAIL**: promised production behavior is missing, only scaffolded, or only represented in tests/reports. + +Save the audit to `_docs/03_implementation/implementation_completeness_cycle[N]_report.md` with: + +- Per-task classification +- Evidence files/symbols checked +- Any unresolved scaffold/native placeholders +- Any named promised technologies not integrated +- Required remediation task suggestions, each sized to 5 points or less + +Gate: + +- If every product task is `PASS` or `BLOCKED` with explicit prerequisite evidence, continue to Final Test Run. +- If any product task is `FAIL`, STOP. Do not write the final product implementation report and do not proceed to any downstream autodev step. Completed original task files remain in `done/`; the missing work is represented by remediation tasks. Present a Choose block: + - A) Create remediation tasks now and return to implementation + - B) Mark the missing behavior explicitly out of scope in task/docs, then re-run this gate + - C) Abort for manual correction +- Recommendation must normally be A unless the user deliberately accepts reduced scope. + +Remediation task creation: + +1. For each `FAIL`, create one or more task specs using `.cursor/skills/decompose/templates/task.md`; each remediation task must be sized at 5 points or less. +2. Save each task to `_docs/02_tasks/todo/` with a short name prefixed by `remediate_`. +3. Set **Component** to the failed task's component and set **Dependencies** to the failed task ID plus any remediation prerequisites. +4. Create or defer tracker tickets using the same tracker rules as decompose/new-task: if tracker is available, create tickets immediately; if the user explicitly chose `tracker: local`, keep numeric prefixes with `Tracker: pending` / `Epic: pending`. +5. 
Append the remediation tasks to `_docs/02_tasks/_dependencies_table.md`. +6. Return to Step 1 (Parse) in **Product implementation** context. The final product implementation report can be written only after remediation tasks complete and this gate reruns without `FAIL`. + +### 16. Final Test Run + +- After all batches are complete, run the full test suite once unless the invoking flow's immediate next step is `Run Tests`. +- If the next flow step is `Run Tests`, record a handoff in the final implementation report and let `.cursor/skills/test-run/SKILL.md` own the full-suite gate to avoid duplicate full runs. +- When this step does run, read and execute `.cursor/skills/test-run/SKILL.md` (detect runner, run suite, diagnose failures, present blocking choices). +- Test failures are a **blocking gate** — do not proceed until the test-run skill completes with a user decision. +- When tests pass, report final summary. ## Batch Report Persistence -After each batch completes, save the batch report to `_docs/03_implementation/batch_[NN]_cycle[N]_report.md` for feature implementation (or `batch_[NN]_report.md` for test/refactor runs). Create the directory if it doesn't exist. When all tasks are complete, produce a FINAL implementation report with a summary of all batches. The filename depends on context: +After each batch completes, save the batch report to `_docs/03_implementation/batch_[NN]_cycle[N]_report.md` for feature implementation (or `batch_[NN]_report.md` for test/refactor runs). Create the directory if it doesn't exist. For product implementation, produce the FINAL implementation report only after the Product Implementation Completeness Gate passes. For test and refactor implementation, produce the FINAL report after all selected tasks complete and the full-suite gate is either run or handed off per Step 16. 
The filename depends on context: - **Test implementation** (tasks from test decomposition): `_docs/03_implementation/implementation_report_tests.md` - **Feature implementation**: `_docs/03_implementation/implementation_report_{feature_slug}_cycle{N}.md` where `{feature_slug}` is derived from the batch task names (e.g., `implementation_report_core_api_cycle2.md`) and `{N}` is the current `state.cycle` from `_docs/_autodev_state.md`. If `state.cycle` is absent (pre-migration), default to `cycle1`. @@ -266,6 +340,7 @@ After each batch, produce a structured report: | Same task rewritten 3+ times without green tests | Mark Blocked, continue batch, escalate at batch end | | Task blocked on external dependency (not in task list) | Report and skip | | File ownership violated (task wrote outside OWNED) | ASK user | +| Product completeness gate finds missing promised implementation | STOP — create remediation tasks or get explicit user scope reduction | | Test failure after final test run | Delegate to test-run skill — blocking gate | | All tasks complete | Report final summary, suggest final commit | | `_dependencies_table.md` missing | STOP — run `/decompose` first | @@ -283,4 +358,5 @@ Each batch commit serves as a rollback checkpoint. 
If recovery is needed: - Never start a task whose dependencies are not yet completed - Never run tasks in parallel and never spawn subagents — see `.cursor/rules/no-subagents.mdc` - If a task is flagged as stuck, stop working on it and report — do not let it loop indefinitely -- Always run the full test suite after all batches complete (step 15) +- Always run the Product Implementation Completeness Gate before final product reports +- Always run or hand off the full test suite after all batches complete (step 16) diff --git a/.cursor/skills/new-task/SKILL.md b/.cursor/skills/new-task/SKILL.md index c63e9cf..b630d58 100644 --- a/.cursor/skills/new-task/SKILL.md +++ b/.cursor/skills/new-task/SKILL.md @@ -282,7 +282,7 @@ Present using the Choose format for each decision that has meaningful alternativ - Update **Epic** field: `[EPIC-ID]` 3. Rename the file from `[##]_[short_name].md` to `[TICKET-ID]_[short_name].md` -If the work item tracker is not authenticated or unavailable (`tracker: local`): +If the work item tracker is not authenticated or unavailable, follow `.cursor/rules/tracker.mdc` before continuing. 
Only if the user explicitly chooses `tracker: local`: - Keep the numeric prefix - Set **Tracker** to `pending` - Set **Epic** to `pending` @@ -337,7 +337,7 @@ After the user chooses **Done**: | Research skill hits a blocker | Follow research skill's own escalation rules | | Codebase analysis reveals conflicting architectures | **ASK** user which pattern to follow | | Complexity exceeds 5 points | **WARN** user and suggest splitting into multiple tasks | -| Work item tracker MCP unavailable | **WARN**, continue with local-only task files | +| Work item tracker MCP unavailable | Follow `.cursor/rules/tracker.mdc`; do not continue in local mode unless the user explicitly chooses it | ## Trigger Conditions diff --git a/.cursor/skills/plan/steps/06_work-item-epics.md b/.cursor/skills/plan/steps/06_work-item-epics.md index d131738..fef82fb 100644 --- a/.cursor/skills/plan/steps/06_work-item-epics.md +++ b/.cursor/skills/plan/steps/06_work-item-epics.md @@ -58,4 +58,4 @@ Do NOT create minimal epics with just a summary and short description. The epic 8. **Create "Blackbox Tests" epic** — this epic will parent the blackbox test tasks created by the `/decompose` skill. It covers implementing the test scenarios defined in `tests/`. -**Save action**: Epics created via the configured tracker MCP. Also saved locally in `epics.md` with ticket IDs. If `tracker: local`, save locally only. +**Save action**: Epics created via the configured tracker MCP. Also saved locally in `epics.md` with ticket IDs. If tracker availability fails, follow `.cursor/rules/tracker.mdc`; only if the user explicitly chooses `tracker: local`, save locally only with pending tracker markers. diff --git a/.cursor/skills/plan/templates/epic-spec.md b/.cursor/skills/plan/templates/epic-spec.md index 3d51622..6f653a2 100644 --- a/.cursor/skills/plan/templates/epic-spec.md +++ b/.cursor/skills/plan/templates/epic-spec.md @@ -133,4 +133,4 @@ Link to architecture.md and relevant component spec.] 
- `component` — a normal per-component epic - `cross-cutting` — a shared concern that spans ≥2 components - `tests` — the blackbox-tests epic (always exactly one) -- Complexity points for child issues follow the project standard: 1, 2, 3, 5, 8. Do not create issues above 5 points — split them. +- Complexity points for child issues follow the project standard: 1, 2, 3, 5. Do not create issues above 5 points — split them. diff --git a/.cursor/skills/refactor/SKILL.md b/.cursor/skills/refactor/SKILL.md index def4d75..ac60fb8 100644 --- a/.cursor/skills/refactor/SKILL.md +++ b/.cursor/skills/refactor/SKILL.md @@ -59,7 +59,7 @@ Create REFACTOR_DIR and RUN_DIR if missing. If a RUN_DIR with the same name alre Both modes produce `RUN_DIR/list-of-changes.md` (template: `templates/list-of-changes.md`). Both modes then convert that file into task files in TASKS_DIR during Phase 2. -**Guided mode cleanup**: after `RUN_DIR/list-of-changes.md` is created from the input file, delete the original input file to avoid duplication. +**Guided mode cleanup**: after `RUN_DIR/list-of-changes.md` is created from the input file, delete the original input file only if it lives outside `RUN_DIR`. If the provided file is already the canonical `RUN_DIR/list-of-changes.md`, keep it as the audit record. ## Workflow @@ -81,10 +81,10 @@ Both modes produce `RUN_DIR/list-of-changes.md` (template: `templates/list-of-ch - "refactor [specific target]" → skip phase 1 if docs exist - Default → all phases -**Testability-run specifics** (guided mode invoked by autodev existing-code flow Step 4): +**Testability-run specifics** (guided mode invoked by autodev existing-code Step 4 or greenfield Step 8): - Run name is `01-testability-refactoring`. - Phase 3 (Safety Net) is skipped by design — no tests exist yet. Compensating control: the `list-of-changes.md` gate in Phase 1 must be reviewed and approved by the user before Phase 4 runs. 
-- Scope is MINIMAL and surgical; reject change entries that drift into full refactor territory (see existing-code flow Step 4 for allowed/disallowed lists). Flagged entries go to `RUN_DIR/deferred_to_refactor.md` for Step 8 (optional full refactor) consideration. +- Scope is MINIMAL and surgical; reject change entries that drift into full refactor territory (see the invoking flow's testability step for allowed/disallowed lists). Flagged entries go to `RUN_DIR/deferred_to_refactor.md` for the next optional full-refactor step or backlog consideration. - After Phase 4 (Execution) completes, write `RUN_DIR/testability_changes_summary.md` as Phase 4.5. Format: one bullet per applied change. ```markdown # Testability Changes Summary ({{run_name}}) diff --git a/.cursor/skills/refactor/phases/02-analysis.md b/.cursor/skills/refactor/phases/02-analysis.md index 8f5df2b..d8fd88a 100644 --- a/.cursor/skills/refactor/phases/02-analysis.md +++ b/.cursor/skills/refactor/phases/02-analysis.md @@ -74,7 +74,7 @@ Create a work item tracker epic for this refactoring run: 1. Epic name: the RUN_DIR name (e.g., `01-testability-refactoring`) 2. Create the epic via configured tracker MCP 3. Record the Epic ID — all tasks in 2d will be linked under this epic -4. If tracker unavailable, use `PENDING` placeholder and note for later +4. If tracker is unavailable, follow `.cursor/rules/tracker.mdc`; only use `PENDING` placeholders if the user explicitly chooses `tracker: local` ## 2d. Task Decomposition diff --git a/.cursor/skills/refactor/phases/04-execution.md b/.cursor/skills/refactor/phases/04-execution.md index e165275..c0f8393 100644 --- a/.cursor/skills/refactor/phases/04-execution.md +++ b/.cursor/skills/refactor/phases/04-execution.md @@ -10,7 +10,7 @@ - All `[TRACKER-ID]_refactor_*.md` files are present - Each task file has valid header fields (Task, Name, Description, Complexity, Dependencies) 2. Verify `TASKS_DIR/_dependencies_table.md` includes the refactoring tasks -3. 
Verify all tests pass (safety net from Phase 3 is green) +3. Verify all tests pass (safety net from Phase 3 is green), unless this is a testability run where Phase 3 was intentionally skipped 4. If any check fails, go back to the relevant phase to fix ## 4b. Delegate to Implement Skill @@ -23,7 +23,7 @@ The implement skill will: 3. Compute execution batches for the refactoring tasks 4. Implement tasks sequentially in topological order (no subagents, no parallelism) 5. Run code review after each batch -6. Commit and push per batch +6. Commit per batch and push only when the user approved pushing 7. Update work item ticket status Do NOT modify, skip, or abbreviate any part of the implement skill's workflow. The refactor skill is delegating execution, not optimizing it. @@ -47,7 +47,7 @@ After the implement skill completes: For each successfully completed refactoring task: 1. Transition the work item ticket status to **Done** via the configured tracker MCP -2. If tracker unavailable, note the pending status transitions in `RUN_DIR/execution_log.md` +2. If tracker is unavailable, follow `.cursor/rules/tracker.mdc`; if the user explicitly chose `tracker: local`, note the pending status transitions in `RUN_DIR/execution_log.md` For any failed or blocked tasks, leave their status as-is (the implement skill already set them to In Testing or blocked). diff --git a/.cursor/skills/test-run/SKILL.md b/.cursor/skills/test-run/SKILL.md index e64734e..9651e17 100644 --- a/.cursor/skills/test-run/SKILL.md +++ b/.cursor/skills/test-run/SKILL.md @@ -22,7 +22,7 @@ test-run has two modes. 
The caller passes the mode explicitly; if missing, defau | Mode | Scope | Typical caller | Input artifacts | |------|-------|---------------|-----------------| | `functional` (default) | Unit / integration / blackbox tests — correctness | autodev Steps that verify after Implement Tests or Implement | `scripts/run-tests.sh`, `_docs/02_document/tests/environment.md`, `_docs/02_document/tests/blackbox-tests.md` | -| `perf` | Performance / load / stress / soak tests — latency, throughput, error-rate thresholds | autodev greenfield Step 9, existing-code Step 15 (pre-deploy) | `scripts/run-performance-tests.sh`, `_docs/02_document/tests/performance-tests.md`, AC thresholds in `_docs/00_problem/acceptance_criteria.md` | +| `perf` | Performance / load / stress / soak tests — latency, throughput, error-rate thresholds | autodev greenfield Step 15, existing-code Step 15 (pre-deploy) | `scripts/run-performance-tests.sh`, `_docs/02_document/tests/performance-tests.md`, AC thresholds in `_docs/00_problem/acceptance_criteria.md` | Direct user invocation (`/test-run`) defaults to `functional`. If the user says "perf tests", "load test", "performance", or passes a performance scenarios file, run `perf` mode. diff --git a/_docs/02_tasks/_dependencies_table.md b/_docs/02_tasks/_dependencies_table.md index d3d6322..90c7d4d 100644 --- a/_docs/02_tasks/_dependencies_table.md +++ b/_docs/02_tasks/_dependencies_table.md @@ -1,8 +1,8 @@ # Dependencies Table -**Date**: 2026-05-03 -**Total Tasks**: 14 -**Total Complexity Points**: 60 +**Date**: 2026-05-04 +**Total Tasks**: 24 +**Total Complexity Points**: 108 **Lessons applied**: No `_docs/LESSONS.md` file exists; no prior estimation or dependency lessons were available. 
| Task | Name | Complexity | Dependencies | Epic | @@ -21,9 +21,29 @@ | AZ-230 | satellite_service_vpr_retrieval | 5 | AZ-223, AZ-225, AZ-229 | AZ-214 | | AZ-231 | anchor_verification_matching | 5 | AZ-223, AZ-225, AZ-230 | AZ-215 | | AZ-232 | safety_anchor_state_machine | 5 | AZ-223, AZ-224, AZ-227, AZ-228, AZ-231 | AZ-216 | +| AZ-240 | native_vio_backend_integration | 5 | AZ-228 | AZ-213 | +| AZ-241 | real_satellite_vpr_descriptor_retrieval | 5 | AZ-230 | AZ-214 | +| AZ-242 | real_anchor_feature_matching_ransac | 5 | AZ-231, AZ-241 | AZ-215 | +| AZ-233 | test_infrastructure | 5 | AZ-240, AZ-241, AZ-242 | AZ-218 | +| AZ-234 | replay_geolocation_confidence_tests | 3 | AZ-233 | AZ-218 | +| AZ-235 | vio_replay_performance_tests | 5 | AZ-233, AZ-240 | AZ-218 | +| AZ-236 | satellite_anchor_cache_tests | 5 | AZ-233, AZ-241, AZ-242 | AZ-218 | +| AZ-237 | mavlink_blackout_spoofing_tests | 5 | AZ-233 | AZ-218 | +| AZ-238 | cold_start_restart_tests | 5 | AZ-233 | AZ-218 | +| AZ-239 | jetson_resource_endurance_tests | 5 | AZ-233 | AZ-218 | ## Verification Notes - No task exceeds 5 complexity points. -- E2E/blackbox test work remains outside this product implementation task set and is deferred to the greenfield Decompose Tests phase. -- The graph is acyclic: foundations precede adapters/stores, then VIO/retrieval/matching, then safety wrapper orchestration. +- Test implementation tasks are appended under Blackbox Tests (AZ-218); the test infrastructure bootstrap now depends on the product remediation tasks so tests do not validate scaffold behavior. +- The graph is acyclic: product foundations precede adapters/stores, then VIO/retrieval/matching, then safety wrapper orchestration; remediation tasks close native VIO, real VPR, and real matching gaps before affected blackbox tests run. + +## Test Coverage Verification + +- AZ-234 covers FT-P-01, FT-P-02, and NFT-PERF-01. +- AZ-235 covers FT-P-03 and NFT-PERF-02 after AZ-240 provides the real native VIO path. 
+- AZ-236 covers FT-P-04, FT-N-01, FT-N-03, NFT-PERF-03, NFT-RES-04, NFT-SEC-01, NFT-SEC-02, NFT-SEC-04, and NFT-RES-LIM-03 after AZ-241 and AZ-242 provide real VPR retrieval and anchor matching. +- AZ-237 covers FT-N-02, NFT-RES-01, and NFT-SEC-03. +- AZ-238 covers NFT-RES-02, NFT-RES-03, NFT-PERF-04, and NFT-RES-LIM-05. +- AZ-239 covers NFT-RES-LIM-01, NFT-RES-LIM-02, and NFT-RES-LIM-04. +- All traceability-matrix AC and restriction groups remain covered by at least one test task. diff --git a/_docs/02_tasks/todo/AZ-233_test_infrastructure.md b/_docs/02_tasks/todo/AZ-233_test_infrastructure.md new file mode 100644 index 0000000..26ccd4e --- /dev/null +++ b/_docs/02_tasks/todo/AZ-233_test_infrastructure.md @@ -0,0 +1,163 @@ +# Test Infrastructure + +**Task**: AZ-233_test_infrastructure +**Name**: Test Infrastructure +**Description**: Scaffold the blackbox and e2e test project: runner, deterministic fixtures, isolated replay/SITL environment, reporting, and external dependency stubs. +**Complexity**: 5 points +**Dependencies**: AZ-240_native_vio_backend_integration, AZ-241_real_satellite_vpr_descriptor_retrieval, AZ-242_real_anchor_feature_matching_ransac +**Component**: Blackbox Tests +**Tracker**: AZ-233 +**Epic**: AZ-218 + +## Test Project Folder Layout + +```text +e2e/ +├── replay/ +│ ├── run_replay.py +│ ├── scenarios/ +│ └── reports/ +├── fixtures/ +│ ├── cache/ +│ ├── mavlink/ +│ ├── telemetry/ +│ └── expected/ +├── tests/ +│ ├── test_still_image_replay.py +│ ├── test_vio_replay.py +│ ├── test_satellite_anchor.py +│ ├── test_blackout_spoofing.py +│ ├── test_resource_limits.py +│ └── test_security_gates.py +├── mocks/ +│ ├── satellite_cache_stub/ +│ ├── ardupilot_sitl/ +│ └── qgc_observer/ +└── reports/ +``` + +### Layout Rationale + +The test project keeps blackbox/e2e runner code outside product runtime internals. 
Scenario definitions, fixtures, mocks, and reports are separated so tests can reset state between runs and produce release evidence without importing private component modules. + +Test implementation starts only after remediation tasks AZ-240, AZ-241, and AZ-242 close the native VIO, real satellite VPR, and real anchor matching gaps found during autodev verification. + +## Mock Services + +| Mock Service | Replaces | Interfaces | Behavior | +|-------------|----------|------------|----------| +| `satellite_cache_stub` | Offline Azaion Suite Satellite Service cache package | Local COG/manifest/descriptor fixture volume | Serves preloaded valid, stale, unsigned, hash-mismatched, and low-resolution cache fixtures; never performs network fetches during flight-mode tests. | +| `ardupilot_sitl` | ArduPilot Plane flight controller | MAVLink telemetry and `GPS_INPUT` receiving path | Emits generated IMU, attitude, GPS health, spoofing, and failsafe traces; records injected `GPS_INPUT` for assertions. | +| `qgc_observer` | QGroundControl status consumer | MAVLink/tlog parser | Records downsampled `STATUSTEXT`, status, and failsafe messages for rate and content assertions. | + +### Mock Control API + +Each mock or runner fixture must expose deterministic scenario controls for normal replay, stale cache, missing cache, spoofed GPS, blackout, restart, and resource-load modes. Recorded interactions must be queryable after each test run for assertions. 
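One possible shape for the control surface described above is sketched here. The class name `MockControl`, the mode identifiers, and the record layout are assumptions made for illustration; the task spec only requires deterministic scenario controls and queryable recorded interactions.

```python
from dataclasses import dataclass, field

# Mode names mirror the scenario list in the spec; the identifiers are assumed.
SCENARIO_MODES = {
    "normal_replay", "stale_cache", "missing_cache",
    "spoofed_gps", "blackout", "restart", "resource_load",
}


@dataclass
class MockControl:
    """Hypothetical control handle shared by a mock or runner fixture."""
    name: str
    mode: str = "normal_replay"
    interactions: list = field(default_factory=list)

    def set_mode(self, mode):
        # Reject unknown modes so a typo cannot silently run the default scenario.
        if mode not in SCENARIO_MODES:
            raise ValueError(f"unknown scenario mode: {mode}")
        self.mode = mode

    def record(self, kind, payload):
        # A sequence number keeps ordering deterministic without wall-clock time.
        self.interactions.append(
            {"seq": len(self.interactions), "mode": self.mode, "kind": kind, "payload": payload}
        )

    def query(self, kind=None):
        # Return all interactions, or only those of one kind.
        return [i for i in self.interactions if kind is None or i["kind"] == kind]

    def reset(self):
        # Called between scenario groups so state does not leak across runs.
        self.mode = "normal_replay"
        self.interactions.clear()
```

A test would typically call `set_mode("spoofed_gps")`, drive the replay, and then assert on `query("GPS_INPUT")`.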
+ +## Docker Test Environment + +### `docker-compose.test.yml` Structure + +| Service | Image / Build | Purpose | Depends On | +|---------|---------------|---------|------------| +| `gps-denied-service` | Project runtime image or local package mount | System under test | `satellite-cache-stub` | +| `replay-consumer` | Python replay/test harness | Feeds frames, telemetry, cache data, and faults | `gps-denied-service`, mock services | +| `satellite-cache-stub` | Fixture volume/service | Provides offline cache manifests, sidecars, descriptors, and generated invalid variants | none | +| `ardupilot-plane-sitl` | SITL container or local process wrapper | Validates `GPS_INPUT`, spoofing, and failsafe behavior | `gps-denied-service` | +| `qgc-observer` | MAVLink log parser | Verifies GCS-visible status output | `ardupilot-plane-sitl` | + +### Networks and Volumes + +- `replay-net`: connects the runtime, replay consumer, and satellite-cache stub. +- `sitl-net`: connects the runtime, ArduPilot Plane SITL, and QGC observer. +- `input-data`: read-only mount for `_docs/00_problem/input_data/`. +- `expected-results`: read-only mount for expected coordinate and report fixtures. +- `derkachi-replay`: read-only mount for `flight_derkachi.mp4` and `data_imu.csv`. +- `satellite-cache`: fixture cache volume with valid and invalid manifests. +- `fdr-output`: fresh per-run output volume for FDR and report artifacts. + +## Test Runner Configuration + +**Framework**: Python pytest-style replay harness. +**Entry point**: `run-blackbox-replay` or equivalent pytest command that executes scenario groups and writes reports. +**Reports**: CSV summary plus FDR validation Markdown. + +### Fixture Strategy + +| Fixture | Scope | Purpose | +|---------|-------|---------| +| `project_60_still_images` | session | Provides 60 nadir images and expected WGS84 centers. | +| `derkachi_video_telemetry` | session | Provides synchronized video, IMU, and `GLOBAL_POSITION_INT` replay data. 
| +| `cache_integrity_fixtures` | function | Provides valid, stale, unsigned, hash-mismatched, and low-resolution cache variants. | +| `sitl_spoofing_scenarios` | function | Provides generated GPS loss/spoofing and blackout traces. | +| `public_nadir_vio_candidates` | optional/session | Provides public or representative synchronized datasets when available. | + +## Test Data Fixtures + +| Data Set | Source | Format | Used By | +|----------|--------|--------|---------| +| `project_60_still_images` | `_docs/00_problem/input_data/` | JPG + metadata | Still-image accuracy, confidence, latency smoke | +| `expected_frame_centers` | `_docs/00_problem/input_data/coordinates.csv` and expected-results report | CSV/Markdown | Geolocation assertions | +| `derkachi_video_telemetry` | `_docs/00_problem/input_data/flight_derkachi/` | MP4 + CSV | VIO replay, latency, resilience | +| `cache_integrity_fixtures` | generated fixture volume | COG/manifest/sidecar/index fixtures | Cache freshness, poisoning, no-fetch tests | +| `sitl_spoofing_scenarios` | generated by SITL harness | MAVLink/tlog traces | Spoofing, blackout, failsafe, GCS status | +| `public_nadir_vio_candidates` | pinned external fixtures | dataset-specific | Final VIO and satellite-anchor validation | + +### Data Isolation + +Every run uses read-only input fixtures and fresh run-scoped output directories. FDR, generated tiles, tlogs, and reports are written only to per-run output volumes. Mock state and generated fixtures are reset before each scenario group. + +## Test Reporting + +**Format**: CSV summary and Markdown evidence report. +**Output paths**: `test-results/blackbox-report.csv` and `test-results/fdr-validation-summary.md`. +**Required columns**: Test ID, test name, input dataset, execution time, result, error distance, source label, covariance 95% semi-major, `GPS_INPUT.fix_type`, and error message. 
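The required report columns listed above could be enforced by a small writer like the one below. This is a sketch under stated assumptions: the snake_case column names, the unit suffixes, and the `passed` / `failed` / `blocked` result vocabulary are illustrative choices, not names fixed by the task spec.

```python
import csv
import io

# Assumed snake_case spellings of the required columns, in spec order.
REQUIRED_COLUMNS = [
    "test_id", "test_name", "input_dataset", "execution_time_s", "result",
    "error_distance_m", "source_label", "covariance_95_semi_major_m",
    "gps_input_fix_type", "error_message",
]
ALLOWED_RESULTS = {"passed", "failed", "blocked"}


def write_report(rows, stream):
    """Write rows to a CSV stream, rejecting incomplete or mislabeled rows."""
    writer = csv.DictWriter(stream, fieldnames=REQUIRED_COLUMNS)
    writer.writeheader()
    for row in rows:
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"row {row.get('test_id')} is missing columns: {missing}")
        if row["result"] not in ALLOWED_RESULTS:
            # A missing prerequisite must surface as "blocked", never as "passed".
            raise ValueError(f"unexpected result value: {row['result']}")
        writer.writerow(row)


def example_report():
    """Build a one-row report in memory; all values are placeholders."""
    buf = io.StringIO()
    write_report([{
        "test_id": "FT-P-01", "test_name": "still image replay",
        "input_dataset": "project_60_still_images", "execution_time_s": 0.0,
        "result": "blocked", "error_distance_m": "", "source_label": "",
        "covariance_95_semi_major_m": "", "gps_input_fix_type": "",
        "error_message": "prerequisite unavailable",
    }], buf)
    return buf.getvalue()
```

Validating each row before writing keeps a partially populated report from being mistaken for release evidence.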
+ +## Acceptance Criteria + +**AC-1: Test environment starts** +Given the Docker/replay test environment +When the test stack starts +Then the runtime, replay consumer, cache fixture, SITL, and observer services are reachable or report a clear blocked prerequisite. + +**AC-2: External dependency stubs are deterministic** +Given a scenario config for cache, MAVLink, QGC, or fixture behavior +When the replay consumer executes it +Then mocks produce repeatable responses and expose recorded interactions for assertions. + +**AC-3: Test runner executes scenario groups** +Given valid fixtures and a running test environment +When the test runner starts +Then it discovers and executes blackbox, performance, resilience, security, and resource-limit scenario groups. + +**AC-4: Reports are generated** +Given a completed or blocked test run +When reporting finishes +Then CSV and Markdown evidence files are written with the required columns, metrics, artifact paths, and blocked-prerequisite reasons. + +## Non-Functional Requirements + +**Reliability** +- Missing hardware, public datasets, calibration, or SITL prerequisites are reported as `blocked`, not `passed`. + +**Security** +- Fixture stubs must not access external satellite-provider or Suite service networks during in-flight test scenarios. + +**Data Isolation** +- No test may mutate source fixtures or write FDR/generated-tile artifacts outside run-scoped output paths. + +## Constraints + +- The test suite must use public runtime boundaries only: navigation frames, telemetry, offline cache, MAVLink output, QGC status, and FDR outputs. +- The suite must not import private estimator, BASALT, wrapper, or tile-manager internals. +- Hardware-specific Jetson gates remain release-gate tests and may be skipped or blocked in ordinary local replay. + +## Risks & Mitigation + +**Risk 1: Environment prerequisites hide real failures** +- *Risk*: Missing hardware, calibration, or datasets could be treated as success. 
+- *Mitigation*: Report unavailable prerequisites as `blocked` with explicit artifact evidence. + +**Risk 2: Fixture mutation contaminates later runs** +- *Risk*: Generated FDR, cache, or SITL output changes expected input fixtures. +- *Mitigation*: Use read-only fixture mounts and fresh run-scoped output volumes for every execution. diff --git a/_docs/02_tasks/todo/AZ-234_replay_geolocation_confidence_tests.md b/_docs/02_tasks/todo/AZ-234_replay_geolocation_confidence_tests.md new file mode 100644 index 0000000..3a9ee8b --- /dev/null +++ b/_docs/02_tasks/todo/AZ-234_replay_geolocation_confidence_tests.md @@ -0,0 +1,88 @@ +# Replay Geolocation And Confidence Tests + +**Task**: AZ-234_replay_geolocation_confidence_tests +**Name**: Replay Geolocation And Confidence Tests +**Description**: Implement blackbox tests for still-image geolocation, confidence/source-label output, and replay latency smoke. +**Complexity**: 3 points +**Dependencies**: AZ-233_test_infrastructure +**Component**: Blackbox Tests +**Tracker**: AZ-234 +**Epic**: AZ-218 + +## Problem + +The project needs deterministic blackbox evidence that the 60-image replay path emits WGS84 frame-center estimates with required confidence fields and latency metrics. + +## Outcome + +- Still-image replay reports per-frame coordinate error and aggregate threshold results. +- Every emitted estimate includes covariance, source label, and anchor-age fields. +- Replay smoke latency and dropped-frame metrics are captured in the shared report format. + +## Scope + +### Included + +- FT-P-01 Still-Image Frame Center Geolocation. +- FT-P-02 Position Confidence Output Contract. +- NFT-PERF-01 Per-Frame Latency On Project Still Images. +- CSV and Markdown evidence output for these scenarios. + +### Excluded + +- Synchronized VIO video/IMU replay. +- Satellite-anchor VPR/local matching. +- Jetson-only release-gate profiling. 
+ +## Acceptance Criteria + +**AC-1: Still-image coordinates are validated** +Given the 60-image project fixture and expected frame-center coordinates +When the replay test runs +Then per-frame WGS84 error is reported and aggregate 50 m / 20 m thresholds are evaluated. + +**AC-2: Confidence output contract is validated** +Given emitted position estimates from the replay +When the test inspects public output fields +Then each estimate includes WGS84 coordinates, 95% covariance semi-major axis, source label, and anchor age. + +**AC-3: Replay latency is measured** +Given the still-image replay runs at the configured smoke rate +When processing completes +Then capture-to-output latency and dropped-frame rate are recorded with pass/fail or blocked status. + +## Non-Functional Requirements + +**Performance** +- Replay smoke evidence includes p50/p95/p99 latency and dropped-frame rate. + +**Reliability** +- Missing or invalid expected-coordinate fixtures fail fixture validation before scenario execution. 
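The per-frame error and aggregate threshold check in AC-1 could be sketched as follows. The haversine distance is a spherical approximation, not a full WGS84 geodesic. The function names, the inclusive comparison, and the choice to report an empty input as `blocked` are assumptions for illustration. The 80% / 50% fractions are the ones listed for FT-P-01 in this task's Blackbox Tests table.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def evaluate_thresholds(errors_m, min_within_50=0.80, min_within_20=0.50):
    """Evaluate the aggregate 50 m / 20 m thresholds over per-frame errors."""
    if not errors_m:
        # Assumption: no data is a blocked prerequisite, never a pass.
        return {"status": "blocked", "reason": "no per-frame errors available"}
    n = len(errors_m)
    within_50 = sum(e <= 50.0 for e in errors_m) / n
    within_20 = sum(e <= 20.0 for e in errors_m) / n
    ok = within_50 >= min_within_50 and within_20 >= min_within_20
    return {
        "status": "passed" if ok else "failed",
        "within_50_m": within_50,
        "within_20_m": within_20,
    }
```

Returning the measured fractions alongside the status lets the report show the threshold basis with each failure, as the Risks section asks.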
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| AC-1 | Expected-coordinate loader validation | Invalid coordinates are rejected before replay | +| AC-2 | Report field validation | Missing confidence/source fields fail the scenario | +| AC-3 | Latency metric aggregation | p50/p95/p99 and dropped-frame metrics are emitted | + +## Blackbox Tests + +| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References | +|--------|-------------------------|--------------|-------------------|----------------| +| AC-1 | `project_60_still_images`, `expected_frame_centers` | FT-P-01 | >=80% within 50 m and >=50% within 20 m or explicit failure | Reliability | +| AC-2 | Same replay output | FT-P-02 | 100% of emitted estimates include required confidence fields | Reliability | +| AC-3 | Replay smoke run | NFT-PERF-01 | Latency and drop-rate metrics are recorded | Performance | + +## Constraints + +- Tests must use public replay input and output artifacts only. +- Input fixtures must be mounted read-only. +- Blocked prerequisites must be reported as `blocked`, not `passed`. + +## Risks & Mitigation + +**Risk 1: Calibration limits are mistaken for product failure** +- *Risk*: Fixture limits can make absolute accuracy inconclusive. +- *Mitigation*: Report the fixture source and threshold basis with each failure. diff --git a/_docs/02_tasks/todo/AZ-235_vio_replay_performance_tests.md b/_docs/02_tasks/todo/AZ-235_vio_replay_performance_tests.md new file mode 100644 index 0000000..925ba33 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-235_vio_replay_performance_tests.md @@ -0,0 +1,89 @@ +# VIO Replay Performance Tests + +**Task**: AZ-235_vio_replay_performance_tests +**Name**: VIO Replay Performance Tests +**Description**: Implement synchronized video/IMU replay tests for VIO output, covariance evidence, and replay performance metrics. 
+**Complexity**: 5 points +**Dependencies**: AZ-233_test_infrastructure, AZ-240_native_vio_backend_integration +**Component**: Blackbox Tests +**Tracker**: AZ-235 +**Epic**: AZ-218 + +## Problem + +The runtime needs blackbox evidence that synchronized navigation video and flight-controller telemetry can drive VIO/wrapper output with honest confidence and measurable performance. + +This test task must run after AZ-240 so it validates the real native VIO path rather than the deterministic scaffold. + +## Outcome + +- Derkachi video/telemetry fixture alignment is validated before replay. +- Synchronized replay produces frame-by-frame output or a clear blocked/failure reason. +- Latency, completion rate, memory, trajectory comparison, and calibration-gated checks are reported. + +## Scope + +### Included + +- FT-P-03 BASALT VIO Replay With Synchronized Video/Telemetry. +- NFT-PERF-02 BASALT + Wrapper Replay Latency. +- Public/representative dataset prerequisite reporting. + +### Excluded + +- Satellite-anchor local verification. +- SITL spoofing/failsafe scenarios. +- Thermal/endurance release gates. + +## Acceptance Criteria + +**AC-1: Replay fixture alignment is validated** +Given the Derkachi MP4 and telemetry CSV +When fixture validation runs +Then duration, frame-to-telemetry ratio, and timestamp monotonicity are verified before replay. + +**AC-2: Synchronized replay emits estimates** +Given a valid synchronized video/IMU replay fixture +When replay executes +Then estimates are emitted frame-by-frame with source labels, covariance, and segment evidence. + +**AC-3: VIO performance evidence is reported** +Given replay completed or blocked +When reporting finishes +Then latency, completion rate, memory, and calibration/public-dataset prerequisite status are written. + +## Non-Functional Requirements + +**Performance** +- Reports include per-frame latency and memory metrics where the environment can measure them. 
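The fixture alignment check described in AC-1 (duration, frame-to-telemetry ratio, timestamp monotonicity) might look roughly like the sketch below. The function name, the tolerance parameters, and their defaults are hypothetical. The task text does not specify the actual alignment rules, so these would need to be replaced with the documented ones.

```python
def validate_alignment(frame_ts, telemetry_ts, max_duration_gap_s=1.0,
                       expected_ratio=None, ratio_tolerance=0.1):
    """Return a list of alignment problems; an empty list means the fixture is accepted.

    frame_ts / telemetry_ts are timestamps in seconds. All tolerances are
    placeholder values for this sketch, not project thresholds.
    """
    problems = []
    for name, ts in (("frame", frame_ts), ("telemetry", telemetry_ts)):
        if len(ts) < 2:
            problems.append(f"{name} stream has fewer than two timestamps")
        elif any(b <= a for a, b in zip(ts, ts[1:])):
            problems.append(f"{name} timestamps are not strictly increasing")
    if problems:
        # Duration and ratio are meaningless on broken streams, so stop early.
        return problems

    frame_duration = frame_ts[-1] - frame_ts[0]
    telemetry_duration = telemetry_ts[-1] - telemetry_ts[0]
    if abs(frame_duration - telemetry_duration) > max_duration_gap_s:
        problems.append(
            f"duration mismatch: video {frame_duration:.3f}s vs telemetry {telemetry_duration:.3f}s"
        )

    ratio = len(frame_ts) / len(telemetry_ts)
    if expected_ratio is not None and abs(ratio - expected_ratio) > ratio_tolerance * expected_ratio:
        problems.append(f"frame-to-telemetry ratio {ratio:.3f} deviates from {expected_ratio}")
    return problems
```

A non-empty result would block replay before any estimate is produced, which matches the unit-test expectation that invalid duration or timestamp alignment blocks replay.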
+ +**Reliability** +- Calibration-gated absolute accuracy checks must be marked explicitly instead of silently passing. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| AC-1 | Video/telemetry validator | Invalid duration or timestamp alignment blocks replay | +| AC-2 | Replay result parser | Missing per-frame confidence fields fail the scenario | +| AC-3 | Calibration gate reporting | Missing calibration/public data is reported as blocked | + +## Blackbox Tests + +| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References | +|--------|-------------------------|--------------|-------------------|----------------| +| AC-1 | `derkachi_video_telemetry` | FT-P-03 fixture validation | Fixture accepted only when alignment rules pass | Reliability | +| AC-2 | Valid synchronized replay | FT-P-03 output | Continuous estimates for normal overlapping segments or explicit degradation | Reliability | +| AC-3 | Replay performance run | NFT-PERF-02 | Latency, completion rate, and memory evidence are recorded | Performance | + +## Constraints + +- Tests must not import BASALT/OpenVINS/Kimera internals directly. +- Public/representative datasets are optional prerequisites and may produce blocked results. +- Raw input video and telemetry fixtures remain read-only. + +## Risks & Mitigation + +**Risk 1: Hardware or dataset prerequisites are unavailable** +- *Risk*: The scenario cannot produce final accuracy evidence locally. +- *Mitigation*: Emit blocked results with exact missing prerequisite and continue other scenario groups. 
diff --git a/_docs/02_tasks/todo/AZ-236_satellite_anchor_cache_tests.md b/_docs/02_tasks/todo/AZ-236_satellite_anchor_cache_tests.md new file mode 100644 index 0000000..dd7e974 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-236_satellite_anchor_cache_tests.md @@ -0,0 +1,102 @@ +# Satellite Anchor Cache Tests + +**Task**: AZ-236_satellite_anchor_cache_tests +**Name**: Satellite Anchor Cache Tests +**Description**: Implement blackbox, security, and performance tests for satellite-anchor retrieval, local verification, cache integrity, and no in-flight external access. +**Complexity**: 5 points +**Dependencies**: AZ-233_test_infrastructure, AZ-241_real_satellite_vpr_descriptor_retrieval, AZ-242_real_anchor_feature_matching_ransac +**Component**: Blackbox Tests +**Tracker**: AZ-236 +**Epic**: AZ-218 + +## Problem + +Satellite anchors and cache fixtures are safety-critical: invalid, stale, poisoned, or externally fetched data must not become trusted localization output. + +This test task must run after AZ-241 and AZ-242 so it validates real local VPR retrieval and real anchor feature matching rather than scaffold evidence gates. + +## Outcome + +- Accepted anchors include retrieval, matching, geometry, freshness, and provenance evidence. +- Invalid/stale/poisoned cache fixtures cannot produce trusted anchors or trusted generated tiles. +- No in-flight Satellite Service or provider access occurs when cache data is missing. + +## Scope + +### Included + +- FT-P-04 Satellite Service And Anchor Verification. +- FT-N-01 Repetitive Or Low-Texture Imagery. +- FT-N-03 Invalid Or Stale Satellite Cache. +- NFT-PERF-03 Relocalization Trigger Path Latency. +- NFT-RES-04 Tile Cache Freshness Degradation. +- NFT-SEC-01 Signed Cache Manifest Enforcement. +- NFT-SEC-02 Cache Poisoning Write Gate. +- NFT-SEC-04 No In-Flight Satellite Provider Access. +- NFT-RES-LIM-03 Satellite Cache Storage Budget. + +### Excluded + +- VIO synchronized replay. +- MAVLink spoofing/failsafe behavior. 
+- Jetson thermal endurance. + +## Acceptance Criteria + +**AC-1: Verified anchors include evidence** +Given a valid local cache/index fixture and relocalization trigger +When retrieval and verification run +Then accepted anchors include candidate IDs, scores, MRE, inliers, covariance, and tile provenance. + +**AC-2: Unsafe candidates are rejected** +Given low-texture, stale, unsigned, hash-mismatched, or low-resolution fixtures +When anchor/cache tests run +Then no invalid candidate emits a trusted `satellite_anchored` estimate or trusted generated tile. + +**AC-3: No in-flight external access occurs** +Given flight-mode replay with missing cache data +When relocalization is requested +Then the system reports degraded/no-candidate behavior without satellite-provider or Suite service network calls. + +**AC-4: Cache and trigger-path metrics are reported** +Given cache and relocalization scenarios complete +When reporting finishes +Then latency, MRE, trust level, freshness, and storage-budget evidence are written. + +## Non-Functional Requirements + +**Security** +- Invalid cache data must not be trusted or promoted. + +**Performance** +- Trigger-path latency and bounded top-K behavior are measured. 
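One way to approximate the no-in-flight-access assertion from AC-3 inside a Python test process is to guard outbound socket connections for the duration of a scenario. This is a minimal sketch under stated assumptions: `block_external_network` and `ExternalAccessError` are hypothetical names, only `socket.socket.connect` is guarded (DNS lookups and `connect_ex` are not), and it cannot observe traffic from other processes or containers, where a network-isolated environment would be the stronger control.

```python
import contextlib
import socket


class ExternalAccessError(AssertionError):
    """Raised when code under test tries to reach a non-local host."""


@contextlib.contextmanager
def block_external_network(allowed_hosts=("127.0.0.1", "::1", "localhost")):
    """Guard socket.connect for the scenario and record blocked attempts."""
    attempts = []
    original_connect = socket.socket.connect

    def guarded_connect(sock, address):
        # Non-tuple addresses (e.g. Unix domain sockets) are local IPC; pass them through.
        if isinstance(address, tuple) and address[0] not in allowed_hosts:
            attempts.append(address)
            raise ExternalAccessError(f"external connection attempted: {address!r}")
        return original_connect(sock, address)

    socket.socket.connect = guarded_connect
    try:
        yield attempts
    finally:
        socket.socket.connect = original_connect
```

A scenario would run inside `with block_external_network() as attempts:` and then assert `attempts == []`. Recording attempts as well as raising matters because the code under test may catch the exception, and the recorded list still lets the scenario fail.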
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| AC-1 | Anchor evidence parser | Required evidence fields are present | +| AC-2 | Invalid cache fixture generator | Stale/unsigned/hash-mismatched fixtures are produced deterministically | +| AC-3 | Network-block assertion | Unexpected external calls fail the scenario | +| AC-4 | Cache metrics report | Latency, freshness, and storage metrics are present | + +## Blackbox Tests + +| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References | +|--------|-------------------------|--------------|-------------------|----------------| +| AC-1 | Public/cache fixture | FT-P-04 | Accepted anchors meet MRE/evidence requirements | Performance | +| AC-2 | Ambiguous and invalid cache fixtures | FT-N-01, FT-N-03, NFT-SEC-01, NFT-SEC-02 | 0 unsafe trusted outputs | Security | +| AC-3 | Network-blocked flight-mode replay | NFT-SEC-04 | Missing cache causes degraded behavior, not fetch | Security | +| AC-4 | Relocalization/cache runs | NFT-PERF-03, NFT-RES-04, NFT-RES-LIM-03 | Metrics and storage evidence are recorded | Performance | + +## Constraints + +- Tests must use local preloaded cache/index fixtures only. +- External network access during flight-mode scenarios is a failure. +- VPAir and UZH FPV licensing must be respected before use as commercial acceptance evidence. + +## Risks & Mitigation + +**Risk 1: Dataset licensing blocks final anchor evidence** +- *Risk*: Public dataset terms prevent commercial acceptance use. +- *Mitigation*: Mark dataset-specific checks blocked and keep generated cache fixtures for deterministic security coverage. 
diff --git a/_docs/02_tasks/todo/AZ-237_mavlink_blackout_spoofing_tests.md b/_docs/02_tasks/todo/AZ-237_mavlink_blackout_spoofing_tests.md new file mode 100644 index 0000000..ef2b29f --- /dev/null +++ b/_docs/02_tasks/todo/AZ-237_mavlink_blackout_spoofing_tests.md @@ -0,0 +1,94 @@ +# MAVLink Blackout Spoofing Tests + +**Task**: AZ-237_mavlink_blackout_spoofing_tests +**Name**: MAVLink Blackout Spoofing Tests +**Description**: Implement SITL/replay tests for visual blackout, spoofed GPS, MAVLink source validation, degraded covariance, no-fix thresholds, and QGC status. +**Complexity**: 5 points +**Dependencies**: AZ-233_test_infrastructure +**Component**: Blackbox Tests +**Tracker**: AZ-237 +**Epic**: AZ-218 + +## Problem + +The system must prove that spoofed GPS and unauthorized MAVLink messages cannot override estimator state during visual blackout or degraded operation. + +## Outcome + +- Blackout and spoofing traces drive visible degraded-mode transitions. +- Covariance, `GPS_INPUT`, QGC status, and FDR evidence match the safety thresholds. +- Unauthorized MAVLink sources are rejected and recorded. + +## Scope + +### Included + +- FT-N-02 GPS Spoofing During Total Visual Blackout. +- NFT-RES-01 Total Visual Blackout With GPS Spoofing. +- NFT-SEC-03 MAVLink Source And Spoofing Rejection. + +### Excluded + +- Still-image geolocation accuracy. +- Satellite-anchor cache poisoning. +- Cold-start and restart trials. + +## Acceptance Criteria + +**AC-1: Blackout transitions to dead reckoning** +Given a replay/SITL trace with total camera blackout and spoofed GPS +When the scenario runs +Then the system enters `dead_reckoned` mode within the required frame or timing threshold. + +**AC-2: Degraded output thresholds are enforced** +Given blackout continues beyond configured thresholds +When estimates are emitted +Then covariance grows monotonically and `GPS_INPUT` fields degrade to no-fix/failsafe values at the specified limits. 
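The AC-2 assertion could be sketched as a check over emitted samples. The sample layout, the function name, and the `no_fix_after_s` parameter are hypothetical, and reading "grows monotonically" as non-decreasing is an assumption. The sketch relies on the MAVLink `GPS_INPUT.fix_type` convention in which values 0 and 1 mean no fix.

```python
def check_blackout_degradation(samples, no_fix_after_s):
    """Check covariance growth and no-fix degradation during a blackout.

    samples: list of (seconds_since_blackout, covariance_semi_major_m, fix_type),
    ordered by time. no_fix_after_s is a placeholder for the configured limit.
    Returns a list of problems; an empty list means the assertion holds.
    """
    problems = []
    for (t0, c0, _), (t1, c1, _) in zip(samples, samples[1:]):
        if c1 < c0:
            problems.append(
                f"covariance shrank from {c0} to {c1} between t={t0}s and t={t1}s"
            )
    for t, _, fix_type in samples:
        # GPS_INPUT.fix_type 0 or 1 means no fix; anything higher claims a fix.
        if t >= no_fix_after_s and fix_type > 1:
            problems.append(f"fix_type {fix_type} still reported at t={t}s")
    return problems
```

Returning the list of problems instead of a boolean gives the report a concrete error message for each violated threshold.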
+ +**AC-3: Spoofed or unauthorized MAVLink inputs are rejected** +Given spoofed real-GPS measurements or unauthorized MAVLink source IDs +When messages arrive during normal or blackout operation +Then no confident position estimate is produced from those inputs. + +**AC-4: Operator and FDR evidence is visible** +Given degraded-mode transitions occur +When reporting completes +Then QGC status and FDR evidence show promotion, demotion, blackout, and failsafe events at expected rates. + +## Non-Functional Requirements + +**Safety** +- Spoofed GPS must not be promoted during blackout without the documented recovery gates. + +**Reliability** +- Missing SITL prerequisites are reported as blocked with exact setup evidence. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| AC-1 | Scenario trigger builder | Blackout and spoofing events are generated deterministically | +| AC-2 | Threshold assertion logic | Fix type, covariance, and `horiz_accuracy` thresholds are checked | +| AC-3 | MAVLink source filter assertion | Unauthorized source messages fail the scenario | +| AC-4 | Status/FDR parser | Expected status events and rates are validated | + +## Blackbox Tests + +| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References | +|--------|-------------------------|--------------|-------------------|----------------| +| AC-1 | SITL or replay spoofing trace | FT-N-02, NFT-RES-01 | Dead-reckoned transition within timing threshold | Safety | +| AC-2 | Continued blackout | FT-N-02, NFT-RES-01 | Monotonic covariance and no-fix/failsafe fields | Safety | +| AC-3 | Unauthorized/spoofed MAVLink messages | NFT-SEC-03 | No confident estimate from bad source | Safety | +| AC-4 | QGC/FDR outputs | FT-N-02, NFT-SEC-03 | Status and evidence are visible and rate-limited | Reliability | + +## Constraints + +- ArduPilot Plane SITL is the authoritative autopilot target. 
+- v1 asserts `GPS_INPUT` output and intentional absence of ODOMETRY. +- Tests must not depend on Mission Planner or PX4 behavior. + +## Risks & Mitigation + +**Risk 1: SITL setup varies by environment** +- *Risk*: Local runs may not have SITL installed or configured. +- *Mitigation*: Report blocked prerequisites clearly and keep replay-level assertions runnable where possible. diff --git a/_docs/02_tasks/todo/AZ-238_cold_start_restart_tests.md b/_docs/02_tasks/todo/AZ-238_cold_start_restart_tests.md new file mode 100644 index 0000000..c264c86 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-238_cold_start_restart_tests.md @@ -0,0 +1,95 @@ +# Cold Start Restart Tests + +**Task**: AZ-238_cold_start_restart_tests +**Name**: Cold Start Restart Tests +**Description**: Implement tests for cold start, companion restart, sharp-turn/disconnected relocalization, and first-fix resource spikes. +**Complexity**: 5 points +**Dependencies**: AZ-233_test_infrastructure +**Component**: Blackbox Tests +**Tracker**: AZ-238 +**Epic**: AZ-218 + +## Problem + +The test suite must prove that the runtime recovers from disconnected visual segments and companion restarts without hiding missing prerequisites or unsafe degraded behavior. + +## Outcome + +- Sharp-turn/disconnected-segment scenarios trigger relocalization or explicit degraded output. +- Companion restart scenarios measure first valid output timing and FDR evidence. +- Cold-start trials record first-fix latency and resource spikes. + +## Scope + +### Included + +- NFT-RES-02 Sharp Turn And Disconnected Segment Relocalization. +- NFT-RES-03 Companion Computer Restart Mid-Flight. +- NFT-PERF-04 Cold Boot Time To First Fix. +- NFT-RES-LIM-05 Cold Start Resource Spike. + +### Excluded + +- Long thermal endurance. +- FDR 8-hour rollover load. +- Cache poisoning and no-fetch security tests. 
+ +## Acceptance Criteria + +**AC-1: Disconnected segments trigger relocalization** +Given a sharp-turn or disconnected segment fixture +When replay reaches the low-overlap transition +Then relocalization is requested and the system either reconnects via verified anchor or reports degraded status. + +**AC-2: Companion restart recovery is measured** +Given a replay/SITL mission in progress +When the GPS-denied service is restarted +Then first valid output timing, FC-state handoff behavior, and FDR restart evidence are recorded. + +**AC-3: Cold-start trials report first-fix timing** +Given cold-start conditions and local cache/index prerequisites +When 50 trials run or are blocked +Then the p95 time-to-first-fix result or exact blocked prerequisite is reported. + +**AC-4: Cold-start resource spikes are captured** +Given initialization begins +When engines/indexes/cache are loaded +Then peak memory and initialization-stage timing are recorded where measurable. + +## Non-Functional Requirements + +**Reliability** +- Missing calibration, public datasets, or hardware prerequisites must not be treated as passing. + +**Performance** +- First-fix timing and peak memory are reported with percentile summaries where enough trials run. 
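The AC-3 aggregation (p95 time-to-first-fix over 50 trials, or an exact blocked reason) might be summarized as below. The nearest-rank percentile method, the function names, and the decision to report a short run as `blocked` are assumptions. The 30 s limit is the one given for NFT-PERF-04 in this task's Blackbox Tests table.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_first_fix(trial_seconds, required_trials=50, limit_s=30.0,
                        blocked_reason=None):
    """Summarize cold-start trials as passed, failed, or blocked."""
    n = len(trial_seconds)
    if blocked_reason:
        return {"status": "blocked", "reason": blocked_reason, "trials": n}
    if n < required_trials:
        # Assumption: too few trials cannot support a p95 claim, so report blocked.
        return {
            "status": "blocked",
            "reason": f"only {n} of {required_trials} trials completed",
            "trials": n,
        }
    p95 = percentile(trial_seconds, 95)
    return {"status": "passed" if p95 < limit_s else "failed", "p95_s": p95, "trials": n}
```

A smoke mode for PRs could call the same function with a smaller `required_trials`, provided the report labels that run as smoke evidence, not release-gate evidence.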
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| AC-1 | Relocalization trigger assertion | Missing-position thresholds trigger request checks | +| AC-2 | Restart report parser | Restart and first-output events are present | +| AC-3 | Trial aggregation | p95 first-fix summary or blocked reason is emitted | +| AC-4 | Resource metric parser | Peak memory and stage timings are captured | + +## Blackbox Tests + +| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References | +|--------|-------------------------|--------------|-------------------|----------------| +| AC-1 | Sharp-turn/disconnected replay | NFT-RES-02 | Verified relocalization or degraded evidence | Reliability | +| AC-2 | Mission restart trace | NFT-RES-03 | First valid output and FDR restart evidence | Reliability | +| AC-3 | Cold-start harness | NFT-PERF-04 | p95 first fix <30 s or blocked prerequisite | Performance | +| AC-4 | Cold-start resource monitoring | NFT-RES-LIM-05 | Peak memory <8 GB or blocked/failure evidence | Performance | + +## Constraints + +- Restart tests must preserve fixture read-only guarantees. +- Trial loops must be bounded and report partial results if interrupted. +- Hardware-only assertions must be clearly marked when not runnable locally. + +## Risks & Mitigation + +**Risk 1: Long cold-start trials are expensive** +- *Risk*: Full 50-run evidence may not be practical on every PR. +- *Mitigation*: Support smoke mode for PRs and full mode for release gates, with clear report labels. 
diff --git a/_docs/02_tasks/todo/AZ-239_jetson_resource_endurance_tests.md b/_docs/02_tasks/todo/AZ-239_jetson_resource_endurance_tests.md new file mode 100644 index 0000000..5403190 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-239_jetson_resource_endurance_tests.md @@ -0,0 +1,94 @@ +# Jetson Resource Endurance Tests + +**Task**: AZ-239_jetson_resource_endurance_tests +**Name**: Jetson Resource Endurance Tests +**Description**: Implement release-gate resource and endurance tests for Jetson memory, thermal/power behavior, and FDR rollover. +**Complexity**: 5 points +**Dependencies**: AZ-233_test_infrastructure +**Component**: Blackbox Tests +**Tracker**: AZ-239 +**Epic**: AZ-218 + +## Problem + +Release readiness requires hardware/resource evidence that cannot be proven by ordinary unit tests or short local replay runs. + +## Outcome + +- Jetson memory and thermal/power metrics are captured where hardware is available. +- FDR 8-hour synthetic load verifies rollover, storage cap, and retained payload classes. +- Hardware-only prerequisites are reported as blocked when not available. + +## Scope + +### Included + +- NFT-RES-LIM-01 Jetson Memory Budget. +- NFT-RES-LIM-02 Thermal And Power Envelope. +- NFT-RES-LIM-04 Flight Data Recorder Rollover. + +### Excluded + +- Still-image replay accuracy. +- Satellite anchor/cache security tests. +- Cold-start first-fix trials. + +## Acceptance Criteria + +**AC-1: Jetson memory budget is measured** +Given Jetson hardware or equivalent production target is available +When sustained replay and trigger-path workload runs +Then CPU/GPU shared memory, process RSS, CUDA allocations, and OOM/throttle status are recorded. + +**AC-2: Thermal and power endurance is validated or blocked** +Given thermal test prerequisites are available +When the sustained 25 W workload runs +Then throttle flags, temperatures, clocks, and latency are recorded for the required duration; otherwise the run reports blocked prerequisites. 
+ +**AC-3: FDR rollover is validated** +Given an 8-hour synthetic mission load +When FDR output reaches rollover conditions +Then storage remains within the cap, rollover is logged, and no payload class is silently dropped. + +**AC-4: Evidence artifacts are complete** +Given resource/endurance scenarios complete or block +When reporting finishes +Then metrics, duration, environment, status, and artifact paths are written. + +## Non-Functional Requirements + +**Performance** +- Resource evidence must include duration and sampling interval. + +**Reliability** +- Hardware-unavailable results are `blocked`, not `passed`. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| AC-1 | Resource metric parser | Memory and throttle fields are present | +| AC-2 | Blocked prerequisite reporter | Missing hardware/thermal setup records blocked status | +| AC-3 | FDR rollover report parser | Storage, rollover, and payload-class fields are validated | +| AC-4 | Evidence manifest writer | Artifact paths and run metadata are present | + +## Blackbox Tests + +| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References | +|--------|-------------------------|--------------|-------------------|----------------| +| AC-1 | Jetson/prod-equivalent hardware | NFT-RES-LIM-01 | Peak memory <8 GB or explicit failure | Performance | +| AC-2 | Thermal/power test setup | NFT-RES-LIM-02 | No throttle over required duration or blocked/failure | Performance | +| AC-3 | Synthetic 8-hour mission load | NFT-RES-LIM-04 | FDR cap and rollover behavior are evidenced | Reliability | +| AC-4 | Resource/endurance reports | All included scenarios | Complete artifact manifest and status | Reliability | + +## Constraints + +- These tests are release-gate oriented and may be skipped or blocked in ordinary PR mode. +- Raw frames must not be retained during FDR load tests. 
+- Resource tests must not write outside run-scoped output directories. + +## Risks & Mitigation + +**Risk 1: Hardware gates are unavailable during local development** +- *Risk*: Developers cannot run full evidence locally. +- *Mitigation*: Support blocked status and separate PR smoke mode from release-gate execution. diff --git a/_docs/02_tasks/todo/AZ-240_native_vio_backend_integration.md b/_docs/02_tasks/todo/AZ-240_native_vio_backend_integration.md new file mode 100644 index 0000000..ff1b574 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-240_native_vio_backend_integration.md @@ -0,0 +1,95 @@ +# Native VIO Backend Integration + +**Task**: AZ-240_native_vio_backend_integration +**Name**: Native VIO Backend Integration +**Description**: Replace the deterministic VIO placeholder path with a real native backend integration boundary for representative replay. +**Complexity**: 5 points +**Dependencies**: AZ-228_vio_adapter +**Component**: VIO Adapter +**Tracker**: AZ-240 +**Epic**: AZ-213 + +## Problem + +The current VIO adapter satisfies the public contract with deterministic scaffold behavior, but it does not exercise a real native VIO backend for synchronized replay. + +## Outcome + +- A production-capable native VIO bridge is available behind the existing `VioBackend` protocol. +- Backend-specific setup remains isolated from the public VIO adapter boundary. +- Existing timestamp mismatch, tracking-loss, health, and no-WGS84-authority behavior is preserved. + +## Scope + +### Included + +- Native/backend bridge implementation behind `VioBackend`. +- Backend initialization and runtime failure mapping into explicit health/error states. +- Replay-driven relative pose, velocity, bias, tracking quality, and covariance output. +- Tests that prove the real backend path is selected when configured. + +### Excluded + +- Absolute WGS84 authority or safety fusion. +- Satellite-anchor fallback logic. +- Direct test imports of backend internals. 
+ +## Dependencies + +### Document Dependencies + +- `_docs/02_document/components/02_vio_adapter/description.md` +- `_docs/02_document/contracts/shared/runtime_contracts.md` +- `_docs/02_document/contracts/shared/geometry_time_sync.md` +- `_docs/02_document/contracts/shared/config_errors_telemetry.md` + +## Acceptance Criteria + +**AC-1: Native backend path emits VIO state** +Given synchronized replay frames and telemetry +When VIO processing runs with the native backend enabled +Then the adapter emits a relative VIO state packet from the native path. + +**AC-2: Backend failures are explicit** +Given backend initialization or runtime failure +When VIO processing or health reporting runs +Then the adapter surfaces an explicit error and degraded or failed health state. + +**AC-3: Existing safety boundaries remain intact** +Given timestamp mismatch, low tracking quality, or successful native output +When the adapter returns a result +Then degraded behavior, tracking quality, and absence of WGS84 authority remain intact. + +## Non-Functional Requirements + +**Performance** +- Replay execution must expose latency and memory metrics for later Jetson profiling gates. + +**Reliability** +- Backend failures must not be hidden behind deterministic fallback success. 
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| AC-1 | Configured native backend path | Native estimate is used, not deterministic fallback | +| AC-2 | Backend init/runtime failure | Explicit error and degraded/failed health | +| AC-3 | Timestamp/quality boundaries | Existing degraded/no-WGS84 behavior preserved | + +## Blackbox Tests + +| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References | +|--------|-------------------------|--------------|-------------------|----------------| +| AC-1 | Derkachi or representative synchronized replay | Native VIO replay path | Relative estimates are emitted or blocked with a real prerequisite reason | Performance | + +## Constraints + +- Keep backend-specific dependencies behind the `vio_adapter` native boundary. +- Do not make the VIO adapter the safety or WGS84 authority. +- If required native packages are unavailable locally, tests must skip or block with explicit prerequisite evidence rather than passing through the deterministic fallback. + +## Risks & Mitigation + +**Risk 1: Native dependency unavailable in local CI** +- *Risk*: The real backend cannot run on all developer machines. +- *Mitigation*: Provide dependency-gated tests that fail only when the backend is configured but broken, and report blocked prerequisites for full replay gates. 
diff --git a/_docs/02_tasks/todo/AZ-241_real_satellite_vpr_descriptor_retrieval.md b/_docs/02_tasks/todo/AZ-241_real_satellite_vpr_descriptor_retrieval.md new file mode 100644 index 0000000..c3a1436 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-241_real_satellite_vpr_descriptor_retrieval.md @@ -0,0 +1,95 @@ +# Real Satellite VPR Descriptor Retrieval + +**Task**: AZ-241_real_satellite_vpr_descriptor_retrieval +**Name**: Real Satellite VPR Descriptor Retrieval +**Description**: Replace the tuple-similarity satellite retrieval scaffold with the real local descriptor/index retrieval path promised by the Satellite Service design. +**Complexity**: 5 points +**Dependencies**: AZ-230_satellite_service_vpr_retrieval +**Component**: Satellite Service +**Tracker**: AZ-241 +**Epic**: AZ-214 + +## Problem + +The current Satellite Service can load in-memory descriptor records and rank them with local tuple similarity, but it does not yet integrate the real offline descriptor/index retrieval path. + +## Outcome + +- Local mission cache descriptor/index packages can be loaded by the runtime retrieval path. +- Retrieval uses the selected CPU FAISS/DINOv2-VLAD-compatible boundary where available. +- Freshness filtering, bounded top-K output, descriptor-fidelity checks, and no in-flight network behavior remain intact. + +## Scope + +### Included + +- Local descriptor/index package loading from the offline cache boundary. +- Real local VPR retrieval implementation behind the public Satellite Service API. +- Explicit degraded/no-candidate/index failure behavior. +- Tests that distinguish the real retrieval path from the current tuple-similarity scaffold. + +### Excluded + +- Local feature matching, RANSAC, or anchor acceptance. +- In-flight provider or Suite service calls. +- TensorRT/ONNX optimization unless descriptor-fidelity gates are in place. 
+
+## Dependencies
+
+### Document Dependencies
+
+- `_docs/02_document/components/04_satellite_retrieval/description.md`
+- `_docs/02_document/contracts/shared/runtime_contracts.md`
+- `_docs/02_document/contracts/shared/config_errors_telemetry.md`
+- `_docs/02_document/components/06_cache_tile_lifecycle/description.md`
+
+## Acceptance Criteria
+
+**AC-1: Real local index readiness is reported**
+Given a valid local descriptor/index package
+When the Satellite Service loads the package
+Then readiness reflects the real local index and loaded record count.
+
+**AC-2: Real top-K retrieval returns candidates**
+Given a relocalization request and loaded local index
+When retrieval runs
+Then bounded candidates come from the real local descriptor/index path with scores, footprints, and freshness state.
+
+**AC-3: Missing or invalid indexes degrade safely**
+Given missing, corrupt, incompatible, or empty local index data
+When retrieval runs
+Then the result is explicit degraded/no-candidate behavior without unsafe anchors or network calls.
+
+## Non-Functional Requirements
+
+**Performance**
+- Retrieval remains trigger-based and exposes latency metrics for Jetson profiling.
+
+**Security**
+- Retrieval must not perform in-flight provider or Suite service calls.
+
+## Unit Tests
+
+| AC Ref | What to Test | Required Outcome |
+|--------|--------------|------------------|
+| AC-1 | Real index package load | Ready status references loaded real index data |
+| AC-2 | Query against fixture index | Candidates come from the real retrieval path |
+| AC-3 | Missing/corrupt index | Explicit degraded/no-candidate result |
+
+## Blackbox Tests
+
+| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
+|--------|-------------------------|--------------|-------------------|----------------|
+| AC-2 | Public/cache fixture with descriptor index | VPR recall and top-K policy | Candidate bounds, freshness, and latency evidence are reported | Performance |
+
+## Constraints
+
+- Use only local preloaded cache/index data during flight-mode retrieval.
+- Keep optional optimized engines behind descriptor-fidelity gates.
+- Missing native/index prerequisites must be reported as blocked, not silently passed by the scaffold path.
+
+## Risks & Mitigation
+
+**Risk 1: Heavy native/index dependencies do not run in ordinary CI**
+- *Risk*: The real retrieval path needs packages or data unavailable in local CI.
+- *Mitigation*: Keep fast contract tests for package parsing and dependency-gated integration tests for real index execution.
diff --git a/_docs/02_tasks/todo/AZ-242_real_anchor_feature_matching_ransac.md b/_docs/02_tasks/todo/AZ-242_real_anchor_feature_matching_ransac.md
new file mode 100644
index 0000000..737819f
--- /dev/null
+++ b/_docs/02_tasks/todo/AZ-242_real_anchor_feature_matching_ransac.md
@@ -0,0 +1,94 @@
+# Real Anchor Feature Matching And RANSAC
+
+**Task**: AZ-242_real_anchor_feature_matching_ransac
+**Name**: Real Anchor Feature Matching And RANSAC
+**Description**: Replace the precomputed evidence gate-only scaffold with real local feature matching and geometry verification behind the Anchor Verification boundary.
+**Complexity**: 5 points
+**Dependencies**: AZ-231_anchor_verification_matching, AZ-241_real_satellite_vpr_descriptor_retrieval
+**Component**: Anchor Verification
+**Tracker**: AZ-242
+**Epic**: AZ-215
+
+## Problem
+
+The current Anchor Verification component can classify precomputed `MatchEvidence`, but it does not yet run real feature extraction, matching, homography estimation, or RANSAC/USAC geometry checks.
+
+## Outcome
+
+- Approved matcher profiles can compute correspondence evidence from frame imagery and candidate tile data.
+- Geometry verification produces inliers, MRE, homography/provenance, runtime, and rejection reasons.
+- Existing safety gates continue to reject unsafe candidates before any anchor is trusted.
+
+## Scope
+
+### Included
+
+- Matcher bridge for approved ALIKED/DISK + LightGlue and SIFT/ORB baseline profiles where dependencies are available.
+- Homography and RANSAC/USAC evidence generation from local imagery/tile fixtures.
+- Integration with existing `GeometryGatedAnchorVerifier` decision output.
+- Benchmark reporting from actual matching paths.
+
+### Excluded
+
+- VPR candidate ranking.
+- Safety wrapper fusion/promotion policy.
+- Per-frame steady-state VIO hot path execution.
+
+## Dependencies
+
+### Document Dependencies
+
+- `_docs/02_document/components/05_anchor_verification/description.md`
+- `_docs/02_document/contracts/shared/runtime_contracts.md`
+- `_docs/02_document/components/04_satellite_retrieval/description.md`
+
+## Acceptance Criteria
+
+**AC-1: Matching path computes evidence**
+Given a usable frame and fresh candidate tile
+When anchor verification runs
+Then matcher evidence is computed from local imagery and includes inliers, MRE, homography, provenance, and runtime.
+
+**AC-2: Unsafe candidates are rejected**
+Given low inliers, high reprojection error, stale or untrusted provenance, or geometry failure
+When verification runs
+Then no accepted anchor decision is emitted for that candidate.
+
+**AC-3: Real matcher benchmark is reportable**
+Given configured matcher profiles and fixture inputs
+When benchmark runs
+Then runtime and quality metrics are reported from actual matching paths.
+
+## Non-Functional Requirements
+
+**Performance**
+- Learned matching remains trigger-based and benchmarked separately from the VIO hot path.
+
+**Reliability**
+- Missing matcher dependencies or fixture data must be explicit blocked prerequisites, not passing scaffold behavior.
+
+## Unit Tests
+
+| AC Ref | What to Test | Required Outcome |
+|--------|--------------|------------------|
+| AC-1 | Fixture matching path | Evidence is computed from imagery/tile input |
+| AC-2 | Bad geometry/provenance | Candidate is rejected with reason |
+| AC-3 | Matcher benchmark | Runtime and quality metrics come from real path |
+
+## Blackbox Tests
+
+| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
+|--------|-------------------------|--------------|-------------------|----------------|
+| AC-1 | Aerial/cache fixture pair | Anchor verification path | Accepted anchors meet MRE/inlier gates with real evidence | Performance |
+
+## Constraints
+
+- Keep native feature extraction and RANSAC acceleration under `anchor_verification`.
+- Do not trust precomputed evidence in production paths without provenance checks.
+- SuperPoint or other legally restricted models remain excluded unless explicitly approved.
+
+## Risks & Mitigation
+
+**Risk 1: False anchor acceptance**
+- *Risk*: Real cross-domain matching can produce plausible but unsafe geometry.
+- *Mitigation*: Preserve freshness, provenance, inlier, MRE, and downstream safety gates; add negative fixtures for low-texture and stale-cache cases.
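Reviewer note: the rejection behavior in AC-2 of the anchor-matching spec above might look roughly like the sketch below. Everything here is an assumption for illustration: `MatchEvidenceSketch` is not the project's `MatchEvidence` type, the thresholds are placeholders, and MRE is read as mean reprojection error in pixels.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MatchEvidenceSketch:
    """Hypothetical stand-in for matcher output; not the real MatchEvidence."""

    inliers: int
    mre_px: float  # assumed: mean reprojection error, in pixels
    tile_is_fresh: bool
    provenance_trusted: bool
    geometry_ok: bool  # assumed flag: homography estimation succeeded


def rejection_reasons(
    evidence: MatchEvidenceSketch,
    min_inliers: int = 30,  # placeholder threshold, not from the spec
    max_mre_px: float = 2.0,  # placeholder threshold, not from the spec
) -> List[str]:
    """Return every reason a candidate must be rejected; empty means accept."""
    reasons: List[str] = []
    if not evidence.geometry_ok:
        reasons.append("geometry_failure")
    if evidence.inliers < min_inliers:
        reasons.append("low_inliers")
    if evidence.mre_px > max_mre_px:
        reasons.append("high_reprojection_error")
    if not evidence.tile_is_fresh:
        reasons.append("stale_tile")
    if not evidence.provenance_trusted:
        reasons.append("untrusted_provenance")
    return reasons
```

Collecting all reasons, instead of returning on the first failure, keeps the "rejected with reason" unit test informative when a negative fixture trips several gates at once.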
diff --git a/_docs/03_implementation/implementation_report_product_runtime_cycle1.md b/_docs/03_implementation/implementation_report_product_runtime_cycle1.md
new file mode 100644
index 0000000..73d9d9e
--- /dev/null
+++ b/_docs/03_implementation/implementation_report_product_runtime_cycle1.md
@@ -0,0 +1,74 @@
+# Implementation Report
+
+**Feature**: Product runtime
+**Cycle**: 1
+**Date**: 2026-05-04
+**Status**: Superseded — remediation pending
+
+## Summary
+
+Greenfield product implementation completed the initial GPS-denied onboard runtime scaffold and component behavior tasks. Later product verification identified required remediation work before the flow can advance to testability revision.
+
+- Total tasks completed: 14
+- Completed batches: 9
+- Blocked tasks: 0
+- Code review verdicts: PASS for all batch reviews and cumulative review
+- Final test run: 49 passed
+
+## Completed Tasks
+
+| Task | Name | Batch | Status |
+|------|------|-------|--------|
+| AZ-219 | initial_structure | 1 | Done |
+| AZ-220 | shared_runtime_contracts | 2 | Done |
+| AZ-221 | shared_geometry_time_sync | 3 | Done |
+| AZ-222 | runtime_config_errors_telemetry | 3 | Done |
+| AZ-223 | camera_ingest_calibration | 4 | Done |
+| AZ-224 | mavlink_gcs_gateway | 4 | Done |
+| AZ-225 | tile_manager_cache_manifest | 4 | Done |
+| AZ-227 | fdr_event_recorder | 4 | Done |
+| AZ-226 | generated_tile_orthorectification | 5 | Done |
+| AZ-228 | vio_adapter | 6 | Done |
+| AZ-229 | satellite_service_sync | 6 | Done |
+| AZ-230 | satellite_service_vpr_retrieval | 7 | Done |
+| AZ-231 | anchor_verification_matching | 8 | Done |
+| AZ-232 | safety_anchor_state_machine | 9 | Done |
+
+## Batch Outcomes
+
+| Batch | Tasks | Code Review | Tests |
+|-------|-------|-------------|-------|
+| 1 | AZ-219_initial_structure | PASS | 5 passed |
+| 2 | AZ-220_shared_runtime_contracts | PASS | 11 passed |
+| 3 | AZ-221_shared_geometry_time_sync, AZ-222_runtime_config_errors_telemetry | PASS | 17 passed |
+| 4 | AZ-223_camera_ingest_calibration, AZ-224_mavlink_gcs_gateway, AZ-225_tile_manager_cache_manifest, AZ-227_fdr_event_recorder | PASS | 29 passed |
+| 5 | AZ-226_generated_tile_orthorectification | PASS | 32 passed |
+| 6 | AZ-228_vio_adapter, AZ-229_satellite_service_sync | PASS | 38 passed |
+| 7 | AZ-230_satellite_service_vpr_retrieval | PASS | 42 passed |
+| 8 | AZ-231_anchor_verification_matching | PASS | 45 passed |
+| 9 | AZ-232_safety_anchor_state_machine | PASS | 49 passed |
+
+## Acceptance Coverage
+
+All acceptance criteria documented in the product implementation task specs are covered by tests recorded in the batch reports:
+
+- Shared contracts, configuration, errors, telemetry, geometry, and time-sync behavior are validated by shared unit tests.
+- Component runtime boundaries for camera ingest, MAVLink/GCS, tile management, FDR, VIO, Satellite Service, anchor verification, and safety/anchor state management are validated by component unit tests.
+- Safety-critical behavior for explicit errors, no raw-frame retention, no mid-flight Satellite Service calls, conservative generated-tile writes, rejected unsafe anchors, monotonic blackout degradation, and honest covariance is covered by the current unit suite.
+
+## Review Summary
+
+- Batch reviews: `_docs/03_implementation/reviews/batch_01_review.md` through `_docs/03_implementation/reviews/batch_09_review.md`
+- Cumulative review: `_docs/03_implementation/reviews/cumulative_review_batches_01-09_cycle1_report.md`
+- Auto-fix attempts: 0 across all batches
+- Stuck agents: none
+
+## Final Verification
+
+- `.venv/bin/python -m black --check src tests e2e/replay` passed.
+- `.venv/bin/python -m ruff check src tests e2e/replay` passed.
+- `.venv/bin/python -m pytest` passed: 49 tests.
+
+## Next Step
+
+Autodev should remain at Step 7, Implement, until remediation tasks AZ-240 through AZ-242 are implemented and the Product Implementation Completeness Gate produces `_docs/03_implementation/implementation_completeness_cycle1_report.md` without unresolved `FAIL` classifications.
diff --git a/_docs/03_implementation/reviews/cumulative_review_batches_01-09_cycle1_report.md b/_docs/03_implementation/reviews/cumulative_review_batches_01-09_cycle1_report.md
new file mode 100644
index 0000000..652ab04
--- /dev/null
+++ b/_docs/03_implementation/reviews/cumulative_review_batches_01-09_cycle1_report.md
@@ -0,0 +1,65 @@
+# Code Review Report
+
+**Batch**: cumulative batches 01-09, cycle 1
+**Date**: 2026-05-04
+**Verdict**: PASS
+
+## Scope
+
+- Task specs reviewed: AZ-219 through AZ-232.
+- Batch reports reviewed: `_docs/03_implementation/batch_01_cycle1_report.md` through `_docs/03_implementation/batch_09_cycle1_report.md`.
+- Code scope reviewed: `src/`, `tests/`, and `e2e/replay`.
+- Architecture references reviewed: `_docs/02_document/architecture.md` and `_docs/02_document/module-layout.md`.
+
+## Findings
+
+| # | Severity | Category | File:Line | Title |
+|---|----------|----------|-----------|-------|
+| - | - | - | - | No findings |
+
+## Phase Results
+
+### Phase 1: Context Loading
+
+All 14 product implementation tasks, the project restrictions, the solution overview, module layout, architecture, and batch reports were reviewed.
+
+### Phase 2: Spec Compliance
+
+Every task acceptance criterion is covered by the per-batch reports and unit tests. The final full suite passed with 49 tests.
+
+### Phase 3: Code Quality
+
+Formatter and lint checks passed:
+
+- `.venv/bin/python -m black --check src tests e2e/replay`
+- `.venv/bin/python -m ruff check src tests e2e/replay`
+
+No dead imports, style errors, or obvious duplicated component-local contract shapes were found.
+
+### Phase 4: Security Quick-Scan
+
+No hardcoded secrets, `eval`, `exec`, shell subprocess usage, insecure deserialization, or sensitive-data logging patterns were found in `src/`.
+
+### Phase 5: Performance Scan
+
+The implemented code remains lightweight and trigger-oriented for the current scaffold/runtime-contract level. Heavy VPR, matching, Jetson, SITL, and endurance profiling remain release-gate work for later test implementation and deploy phases.
+
+### Phase 6: Cross-Task Consistency
+
+Shared DTOs and component interfaces are consistently consumed through public package surfaces. Batch-level reports show all dependencies were implemented before consumers.
+
+### Phase 7: Architecture Compliance
+
+Observed imports align with the component public API layout:
+
+- Runtime components import shared helpers and contracts through `shared/*` public modules.
+- Cross-component imports use package-level public exports such as `tile_manager`, not internal component files.
+- No component imports from `internal/`, `_*.py`, or native bridge paths owned by another component.
+
+No architecture baseline file exists, so no baseline delta section is required.
+
+## Verification
+
+- `.venv/bin/python -m black --check src tests e2e/replay` passed.
+- `.venv/bin/python -m ruff check src tests e2e/replay` passed.
+- `.venv/bin/python -m pytest` passed: 49 tests.
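Reviewer note: the report does not say how the Phase 4 quick-scan was performed. One way to make the `eval`/`exec` part reproducible is a small AST walk like the sketch below; `find_forbidden_calls` and `FORBIDDEN_CALLS` are hypothetical names, and the scan only catches direct calls by bare name, not aliased or attribute-based use.

```python
import ast
from typing import List, Tuple

# Assumed deny-list for illustration; the review's actual patterns are not given.
FORBIDDEN_CALLS = {"eval", "exec"}


def find_forbidden_calls(source: str) -> List[Tuple[int, str]]:
    """Return (line number, name) for each direct call to a forbidden builtin."""
    hits: List[Tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in FORBIDDEN_CALLS
        ):
            hits.append((node.lineno, node.func.id))
    # ast.walk does not guarantee source order, so sort by line number.
    return sorted(hits)
```

Running this over each file under `src/` and asserting an empty result would turn the "no findings" statement into a check that can be repeated in CI.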
diff --git a/_docs/04_refactoring/01-testability-refactoring/testability_assessment.md b/_docs/04_refactoring/01-testability-refactoring/testability_assessment.md
new file mode 100644
index 0000000..d370a53
--- /dev/null
+++ b/_docs/04_refactoring/01-testability-refactoring/testability_assessment.md
@@ -0,0 +1,56 @@
+# Code Testability Assessment
+
+**Date**: 2026-05-04
+**Autodev step**: Greenfield Step 8 — Code Testability Revision
+**Outcome**: Code is testable — no changes needed
+
+## Scope Reviewed
+
+- Test specifications in `_docs/02_document/tests/`
+- Traceability matrix in `_docs/02_document/tests/traceability-matrix.md`
+- Runtime source under `src/`
+- Existing unit tests under `tests/`
+- Product implementation report `_docs/03_implementation/implementation_report_product_runtime_cycle1.md`
+
+## Testability Result
+
+The implemented product runtime can support the planned tests without a testability-focused refactor.
+
+- Runtime components expose public package-level APIs through `__init__.py`, `types.py`, and `interfaces.py`.
+- Component behavior is expressed through data models and class/protocol boundaries that can be constructed directly in tests.
+- External systems are represented as boundary objects or planned black-box fixtures, not hardwired network calls.
+- No direct filesystem, environment, subprocess, socket, HTTP, global singleton, or wall-clock usage was found in `src/` that would block deterministic tests.
+- Planned hardware, SITL, Jetson, and dataset dependencies belong in test harness tasks and can report `blocked` when prerequisites are unavailable.
+
+## Scenario Review
+
+| Scenario Area | Testability Assessment |
+|---------------|------------------------|
+| Unit/component tests | Current public classes and DTOs are directly constructible and already covered by 49 passing tests. |
+| Black-box replay | The planned harness can drive public frame, telemetry, cache, MAVLink, status, and FDR boundaries without importing runtime internals. |
+| VIO and anchor replay | Heavy BASALT, FAISS, and matcher dependencies can be represented by test harness fixtures or backend boundaries in test tasks. |
+| SITL/MAVLink tests | The MAVLink/GCS gateway exposes validation and status behavior without requiring live hardware for unit-level coverage. |
+| Jetson/resource tests | Hardware-specific release gates are environment-dependent and do not require runtime refactoring before test-task implementation. |
+| Security/cache tests | Cache, freshness, no-fetch, and generated-tile trust behavior is exposed through public component methods. |
+
+## Reviewed Test Artifacts
+
+- `_docs/02_document/tests/blackbox-tests.md`
+- `_docs/02_document/tests/e2e-test-suite.md`
+- `_docs/02_document/tests/environment.md`
+- `_docs/02_document/tests/performance-tests.md`
+- `_docs/02_document/tests/resilience-tests.md`
+- `_docs/02_document/tests/resource-limit-tests.md`
+- `_docs/02_document/tests/security-tests.md`
+- `_docs/02_document/tests/test-data.md`
+- `_docs/02_document/tests/traceability-matrix.md`
+
+## Verification
+
+- `.venv/bin/python -m black --check src tests e2e/replay` passed.
+- `.venv/bin/python -m ruff check src tests e2e/replay` passed.
+- `.venv/bin/python -m pytest` passed: 49 tests.
+
+## Next Step
+
+Proceed to Greenfield Step 9, Decompose Tests.
diff --git a/_docs/_autodev_state.md b/_docs/_autodev_state.md
index 77631bb..d590842 100644
--- a/_docs/_autodev_state.md
+++ b/_docs/_autodev_state.md
@@ -4,11 +4,11 @@
 flow: greenfield
 step: 7
 name: Implement
-status: in_progress
+status: not_started
 tracker: jira
 sub_step:
- phase: 1
- name: batch-loop
- detail: "batch 9: AZ-232_safety_anchor_state_machine"
+ phase: 0
+ name: awaiting-invocation
+ detail: "Product implementation incomplete: AZ-240..AZ-242 remediation tasks are pending. Re-run Step 7 and the Product Implementation Completeness Gate before Step 8 or test tasks."
 retry_count: 0
 cycle: 1