From 880eabcb3fd9ccaa37cfc6b988ec3f70b58a885d Mon Sep 17 00:00:00 2001 From: Oleksandr Bezdieniezhnykh Date: Mon, 11 May 2026 00:39:48 +0300 Subject: [PATCH] Decompose Step 6 snapshot: 140 task specs + contract docs Closes out greenfield Step 6 (Decompose) for all 14 components (C1-C13 + cross-cutting helpers/replay). Covers tasks AZ-266..AZ-446 plus the _dependencies_table.md and component contract documents. State file updated to greenfield Step 7 (Implement), not_started. Co-authored-by: Cursor --- .../input_data/flight_derkachi/camera_info.md | 3 + .../c10_provisioning/cache_provisioner.md | 145 +++++++++ .../c10_provisioning/manifest_verifier.md | 134 ++++++++ .../c11_tilemanager/tile_downloader.md | 115 +++++++ .../c11_tilemanager/tile_uploader.md | 114 +++++++ .../operator_command_transport.md | 106 +++++++ .../contracts/c1_vio/vio_strategy_protocol.md | 166 ++++++++++ .../c2_5_rerank/rerank_strategy_protocol.md | 183 +++++++++++ .../contracts/c2_vpr/vpr_strategy_protocol.md | 214 +++++++++++++ .../conditional_refiner_protocol.md | 170 ++++++++++ .../cross_domain_matcher_protocol.md | 170 ++++++++++ .../c4_pose/pose_estimator_protocol.md | 194 ++++++++++++ .../c5_state/state_estimator_protocol.md | 143 +++++++++ .../c6_tile_cache/descriptor_index.md | 122 ++++++++ .../c6_tile_cache/tile_metadata_store.md | 132 ++++++++ .../contracts/c6_tile_cache/tile_store.md | 166 ++++++++++ .../inference_runtime_protocol.md | 176 +++++++++++ .../c8_fc_adapter/fc_adapter_protocol.md | 181 +++++++++++ .../contracts/replay/replay_protocol.md | 161 ++++++++++ .../composition_root_protocol.md | 83 +++++ .../shared_fdr_client/fdr_client_protocol.md | 107 +++++++ .../shared_fdr_client/fdr_record_schema.md | 107 +++++++ .../shared_helpers/descriptor_normaliser.md | 82 +++++ .../shared_helpers/engine_filename_schema.md | 92 ++++++ .../shared_helpers/imu_preintegrator.md | 82 +++++ .../shared_helpers/lightglue_runtime.md | 93 ++++++ .../contracts/shared_helpers/ransac_filter.md | 
95 ++++++ .../contracts/shared_helpers/se3_utils.md | 78 +++++ .../shared_helpers/sha256_sidecar.md | 77 +++++ .../contracts/shared_helpers/wgs_converter.md | 88 ++++++ .../shared_logging/log_record_schema.md | 84 +++++ _docs/02_tasks/_dependencies_table.md | 267 ++++++++++++++++ _docs/02_tasks/todo/AZ-266_log_module.md | 106 +++++++ _docs/02_tasks/todo/AZ-267_fdr_log_bridge.md | 100 ++++++ .../todo/AZ-268_log_schema_contract_test.md | 68 ++++ _docs/02_tasks/todo/AZ-269_config_loader.md | 104 +++++++ _docs/02_tasks/todo/AZ-270_compose_root.md | 108 +++++++ .../todo/AZ-271_config_precedence_tests.md | 75 +++++ .../02_tasks/todo/AZ-272_fdr_record_schema.md | 123 ++++++++ .../todo/AZ-273_fdr_client_ringbuf.md | 151 +++++++++ .../todo/AZ-274_fdr_overrun_emission.md | 125 ++++++++ _docs/02_tasks/todo/AZ-275_fake_fdr_sink.md | 128 ++++++++ .../02_tasks/todo/AZ-276_imu_preintegrator.md | 138 +++++++++ _docs/02_tasks/todo/AZ-277_se3_utils.md | 152 +++++++++ .../02_tasks/todo/AZ-278_lightglue_runtime.md | 146 +++++++++ _docs/02_tasks/todo/AZ-279_wgs_converter.md | 153 +++++++++ _docs/02_tasks/todo/AZ-280_sha256_sidecar.md | 154 +++++++++ .../todo/AZ-281_engine_filename_schema.md | 158 ++++++++++ _docs/02_tasks/todo/AZ-282_ransac_filter.md | 162 ++++++++++ .../todo/AZ-283_descriptor_normaliser.md | 175 +++++++++++ .../02_tasks/todo/AZ-291_c13_writer_thread.md | 171 ++++++++++ .../todo/AZ-292_c13_flight_header_footer.md | 150 +++++++++ .../todo/AZ-293_c13_capacity_cap_policy.md | 158 ++++++++++ .../AZ-294_c13_mid_flight_tile_snapshot.md | 165 ++++++++++ .../todo/AZ-295_c13_thumbnail_rate_limiter.md | 172 +++++++++++ .../AZ-296_c13_open_error_takeoff_abort.md | 152 +++++++++ .../todo/AZ-297_c7_runtime_protocol.md | 167 ++++++++++ .../todo/AZ-298_c7_tensorrt_runtime.md | 196 ++++++++++++ .../todo/AZ-299_c7_onnxrt_fallback.md | 167 ++++++++++ .../todo/AZ-300_c7_pytorch_baseline.md | 161 ++++++++++ _docs/02_tasks/todo/AZ-301_c7_engine_gate.md | 161 ++++++++++ 
.../todo/AZ-302_c7_thermal_publisher.md | 177 +++++++++++ .../todo/AZ-303_c6_storage_interfaces.md | 211 +++++++++++++ .../todo/AZ-304_c6_postgres_schema.md | 271 ++++++++++++++++ .../AZ-305_c6_postgres_filesystem_store.md | 254 +++++++++++++++ .../todo/AZ-306_c6_faiss_descriptor_index.md | 236 ++++++++++++++ .../02_tasks/todo/AZ-307_c6_freshness_gate.md | 202 ++++++++++++ .../todo/AZ-308_c6_cache_budget_eviction.md | 207 +++++++++++++ .../todo/AZ-316_c11_tile_downloader.md | 229 ++++++++++++++ .../todo/AZ-317_c11_flight_state_gate.md | 171 ++++++++++ _docs/02_tasks/todo/AZ-318_c11_signing_key.md | 205 ++++++++++++ .../02_tasks/todo/AZ-319_c11_tile_uploader.md | 250 +++++++++++++++ .../todo/AZ-320_c11_idempotent_retry.md | 214 +++++++++++++ .../todo/AZ-321_c10_engine_compiler.md | 197 ++++++++++++ .../todo/AZ-322_c10_descriptor_batcher.md | 208 +++++++++++++ .../todo/AZ-323_c10_manifest_builder.md | 243 +++++++++++++++ .../todo/AZ-324_c10_manifest_verifier.md | 243 +++++++++++++++ .../todo/AZ-325_c10_cache_provisioner.md | 233 ++++++++++++++ _docs/02_tasks/todo/AZ-326_c12_cli_app.md | 188 +++++++++++ .../todo/AZ-327_c12_companion_bringup.md | 209 +++++++++++++ .../AZ-328_c12_build_cache_orchestrator.md | 220 +++++++++++++ .../todo/AZ-329_c12_post_landing_upload.md | 216 +++++++++++++ .../todo/AZ-330_c12_operator_reloc_service.md | 205 ++++++++++++ .../todo/AZ-331_c1_vio_strategy_protocol.md | 182 +++++++++++ .../todo/AZ-332_c1_okvis2_strategy.md | 202 ++++++++++++ .../todo/AZ-333_c1_vins_mono_strategy.md | 198 ++++++++++++ .../todo/AZ-334_c1_klt_ransac_strategy.md | 220 +++++++++++++ .../todo/AZ-335_c1_warm_start_recovery.md | 191 ++++++++++++ .../todo/AZ-336_c2_vpr_strategy_protocol.md | 183 +++++++++++ _docs/02_tasks/todo/AZ-337_c2_ultra_vpr.md | 237 ++++++++++++++ _docs/02_tasks/todo/AZ-338_c2_net_vlad.md | 217 +++++++++++++ .../02_tasks/todo/AZ-339_c2_megaloc_mixvpr.md | 207 +++++++++++++ .../AZ-340_c2_selavpr_eigenplaces_salad.md | 218 +++++++++++++ 
.../todo/AZ-341_c2_faiss_retrieve_wiring.md | 197 ++++++++++++ .../AZ-342_c2_5_rerank_strategy_protocol.md | 199 ++++++++++++ .../todo/AZ-343_c2_5_inlier_count_reranker.md | 227 ++++++++++++++ .../todo/AZ-344_c3_matcher_protocol.md | 190 ++++++++++++ .../02_tasks/todo/AZ-345_c3_disk_lightglue.md | 209 +++++++++++++ .../todo/AZ-346_c3_aliked_lightglue.md | 120 ++++++++ _docs/02_tasks/todo/AZ-347_c3_xfeat.md | 140 +++++++++ .../todo/AZ-348_c3_5_refiner_protocol.md | 215 +++++++++++++ .../todo/AZ-349_c3_5_adhop_refiner.md | 209 +++++++++++++ .../02_tasks/todo/AZ-355_c4_pose_protocol.md | 144 +++++++++ .../todo/AZ-358_c4_opencv_gtsam_marginals.md | 194 ++++++++++++ .../todo/AZ-361_c4_jacobian_thermal_hybrid.md | 187 +++++++++++ .../02_tasks/todo/AZ-381_c5_state_protocol.md | 105 +++++++ .../todo/AZ-382_c5_isam2_smoother_wiring.md | 101 ++++++ _docs/02_tasks/todo/AZ-383_c5_factor_adds.md | 87 ++++++ .../todo/AZ-384_c5_marginals_outputs.md | 88 ++++++ .../todo/AZ-385_c5_source_label_spoof_gate.md | 97 ++++++ .../02_tasks/todo/AZ-386_c5_eskf_baseline.md | 96 ++++++ .../todo/AZ-387_c5_smoothed_history_fdr.md | 74 +++++ .../02_tasks/todo/AZ-388_c5_ac52_fallback.md | 85 +++++ .../todo/AZ-389_c5_orthorectifier_c6.md | 88 ++++++ .../todo/AZ-390_c8_adapter_protocol.md | 103 +++++++ .../todo/AZ-391_c8_inbound_subscription.md | 101 ++++++ .../todo/AZ-392_c8_covariance_projector.md | 87 ++++++ .../todo/AZ-393_c8_ardupilot_outbound.md | 104 +++++++ .../02_tasks/todo/AZ-394_c8_inav_outbound.md | 101 ++++++ .../todo/AZ-395_c8_mavlink_signing.md | 105 +++++++ .../todo/AZ-396_c8_source_set_switch.md | 97 ++++++ .../todo/AZ-397_c8_qgc_telemetry_adapter.md | 98 ++++++ .../todo/AZ-398_replay_frame_source_clock.md | 104 +++++++ .../todo/AZ-399_replay_tlog_adapter.md | 109 +++++++ .../02_tasks/todo/AZ-400_replay_jsonl_sink.md | 98 ++++++ _docs/02_tasks/todo/AZ-401_replay_compose.md | 103 +++++++ _docs/02_tasks/todo/AZ-402_replay_cli.md | 101 ++++++ .../todo/AZ-403_replay_dockerfile_ci.md | 
95 ++++++ .../todo/AZ-404_replay_e2e_fixture.md | 103 +++++++ .../02_tasks/todo/AZ-405_replay_auto_sync.md | 105 +++++++ .../todo/AZ-406_test_infrastructure.md | 291 ++++++++++++++++++ .../todo/AZ-407_fixture_builders_static.md | 90 ++++++ ...AZ-408_fixture_builders_synth_injectors.md | 82 +++++ .../AZ-409_ft_p_01_still_image_accuracy.md | 88 ++++++ .../todo/AZ-410_ft_p_02_derkachi_drift.md | 84 +++++ .../todo/AZ-411_ft_p_03_14_schema_wgs84.md | 66 ++++ ...Z-412_ft_p_04_derkachi_f2f_registration.md | 69 +++++ .../todo/AZ-413_ft_p_05_06_sat_anchor_mre.md | 72 +++++ .../todo/AZ-414_ft_p_07_ftn_02_sharp_turn.md | 81 +++++ .../AZ-415_ft_p_08_multi_segment_reloc.md | 70 +++++ .../todo/AZ-416_ft_p_09_ap_signing.md | 74 +++++ _docs/02_tasks/todo/AZ-417_ft_p_09_inav.md | 67 ++++ .../todo/AZ-418_ft_p_10_smoothing_lookback.md | 69 +++++ .../todo/AZ-419_ft_p_11_cold_start_init.md | 71 +++++ _docs/02_tasks/todo/AZ-420_ft_p_12_13_gcs.md | 73 +++++ .../AZ-421_ft_p_15_16_18_cache_offline.md | 79 +++++ .../AZ-422_ft_p_17_ftn_06_mid_flight_tiles.md | 78 +++++ .../todo/AZ-423_ft_p_19_sat_reloc_scale.md | 63 ++++ .../todo/AZ-424_ft_n_01_outlier_tolerance.md | 68 ++++ .../todo/AZ-425_ft_n_03_outage_reloc.md | 71 +++++ .../todo/AZ-426_ft_n_04_blackout_spoof.md | 94 ++++++ .../AZ-427_ft_n_05_stale_tile_rejection.md | 63 ++++ .../todo/AZ-428_nft_perf_01_e2e_latency.md | 85 +++++ .../todo/AZ-429_nft_perf_02_streaming.md | 58 ++++ .../02_tasks/todo/AZ-430_nft_perf_03_ttff.md | 70 +++++ .../AZ-431_nft_perf_04_spoof_promotion.md | 61 ++++ .../AZ-432_nft_res_01_imu_only_fallback.md | 68 ++++ .../AZ-433_nft_res_02_companion_reboot.md | 64 ++++ .../todo/AZ-434_nft_res_03_monte_carlo.md | 66 ++++ .../AZ-435_nft_res_04_blackout_escalation.md | 59 ++++ .../todo/AZ-436_nft_sec_01_cache_poisoning.md | 66 ++++ .../todo/AZ-437_nft_sec_02_05_no_egress.md | 66 ++++ .../todo/AZ-438_nft_sec_03_mavlink_signing.md | 72 +++++ .../todo/AZ-439_nft_sec_04_opencv_cve.md | 63 ++++ 
.../todo/AZ-440_nft_lim_01_jetson_memory.md | 77 +++++ .../todo/AZ-441_nft_lim_02_fdr_size.md | 57 ++++ .../AZ-442_nft_lim_03_05_storage_budget.md | 55 ++++ .../todo/AZ-443_nft_lim_04_thermal.md | 66 ++++ .../todo/AZ-444_tier2_jetson_harness.md | 78 +++++ .../AZ-445_csv_reporter_evidence_bundler.md | 60 ++++ .../todo/AZ-446_csv_reporter_refinements.md | 51 +++ _docs/_autodev_state.md | 41 +-- 172 files changed, 22897 insertions(+), 35 deletions(-) create mode 100644 _docs/00_problem/input_data/flight_derkachi/camera_info.md create mode 100644 _docs/02_document/contracts/c10_provisioning/cache_provisioner.md create mode 100644 _docs/02_document/contracts/c10_provisioning/manifest_verifier.md create mode 100644 _docs/02_document/contracts/c11_tilemanager/tile_downloader.md create mode 100644 _docs/02_document/contracts/c11_tilemanager/tile_uploader.md create mode 100644 _docs/02_document/contracts/c12_operator_tooling/operator_command_transport.md create mode 100644 _docs/02_document/contracts/c1_vio/vio_strategy_protocol.md create mode 100644 _docs/02_document/contracts/c2_5_rerank/rerank_strategy_protocol.md create mode 100644 _docs/02_document/contracts/c2_vpr/vpr_strategy_protocol.md create mode 100644 _docs/02_document/contracts/c3_5_adhop/conditional_refiner_protocol.md create mode 100644 _docs/02_document/contracts/c3_matcher/cross_domain_matcher_protocol.md create mode 100644 _docs/02_document/contracts/c4_pose/pose_estimator_protocol.md create mode 100644 _docs/02_document/contracts/c5_state/state_estimator_protocol.md create mode 100644 _docs/02_document/contracts/c6_tile_cache/descriptor_index.md create mode 100644 _docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md create mode 100644 _docs/02_document/contracts/c6_tile_cache/tile_store.md create mode 100644 _docs/02_document/contracts/c7_inference/inference_runtime_protocol.md create mode 100644 _docs/02_document/contracts/c8_fc_adapter/fc_adapter_protocol.md create mode 100644 
_docs/02_document/contracts/replay/replay_protocol.md create mode 100644 _docs/02_document/contracts/shared_config/composition_root_protocol.md create mode 100644 _docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md create mode 100644 _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md create mode 100644 _docs/02_document/contracts/shared_helpers/descriptor_normaliser.md create mode 100644 _docs/02_document/contracts/shared_helpers/engine_filename_schema.md create mode 100644 _docs/02_document/contracts/shared_helpers/imu_preintegrator.md create mode 100644 _docs/02_document/contracts/shared_helpers/lightglue_runtime.md create mode 100644 _docs/02_document/contracts/shared_helpers/ransac_filter.md create mode 100644 _docs/02_document/contracts/shared_helpers/se3_utils.md create mode 100644 _docs/02_document/contracts/shared_helpers/sha256_sidecar.md create mode 100644 _docs/02_document/contracts/shared_helpers/wgs_converter.md create mode 100644 _docs/02_document/contracts/shared_logging/log_record_schema.md create mode 100644 _docs/02_tasks/_dependencies_table.md create mode 100644 _docs/02_tasks/todo/AZ-266_log_module.md create mode 100644 _docs/02_tasks/todo/AZ-267_fdr_log_bridge.md create mode 100644 _docs/02_tasks/todo/AZ-268_log_schema_contract_test.md create mode 100644 _docs/02_tasks/todo/AZ-269_config_loader.md create mode 100644 _docs/02_tasks/todo/AZ-270_compose_root.md create mode 100644 _docs/02_tasks/todo/AZ-271_config_precedence_tests.md create mode 100644 _docs/02_tasks/todo/AZ-272_fdr_record_schema.md create mode 100644 _docs/02_tasks/todo/AZ-273_fdr_client_ringbuf.md create mode 100644 _docs/02_tasks/todo/AZ-274_fdr_overrun_emission.md create mode 100644 _docs/02_tasks/todo/AZ-275_fake_fdr_sink.md create mode 100644 _docs/02_tasks/todo/AZ-276_imu_preintegrator.md create mode 100644 _docs/02_tasks/todo/AZ-277_se3_utils.md create mode 100644 _docs/02_tasks/todo/AZ-278_lightglue_runtime.md create mode 100644 
_docs/02_tasks/todo/AZ-279_wgs_converter.md create mode 100644 _docs/02_tasks/todo/AZ-280_sha256_sidecar.md create mode 100644 _docs/02_tasks/todo/AZ-281_engine_filename_schema.md create mode 100644 _docs/02_tasks/todo/AZ-282_ransac_filter.md create mode 100644 _docs/02_tasks/todo/AZ-283_descriptor_normaliser.md create mode 100644 _docs/02_tasks/todo/AZ-291_c13_writer_thread.md create mode 100644 _docs/02_tasks/todo/AZ-292_c13_flight_header_footer.md create mode 100644 _docs/02_tasks/todo/AZ-293_c13_capacity_cap_policy.md create mode 100644 _docs/02_tasks/todo/AZ-294_c13_mid_flight_tile_snapshot.md create mode 100644 _docs/02_tasks/todo/AZ-295_c13_thumbnail_rate_limiter.md create mode 100644 _docs/02_tasks/todo/AZ-296_c13_open_error_takeoff_abort.md create mode 100644 _docs/02_tasks/todo/AZ-297_c7_runtime_protocol.md create mode 100644 _docs/02_tasks/todo/AZ-298_c7_tensorrt_runtime.md create mode 100644 _docs/02_tasks/todo/AZ-299_c7_onnxrt_fallback.md create mode 100644 _docs/02_tasks/todo/AZ-300_c7_pytorch_baseline.md create mode 100644 _docs/02_tasks/todo/AZ-301_c7_engine_gate.md create mode 100644 _docs/02_tasks/todo/AZ-302_c7_thermal_publisher.md create mode 100644 _docs/02_tasks/todo/AZ-303_c6_storage_interfaces.md create mode 100644 _docs/02_tasks/todo/AZ-304_c6_postgres_schema.md create mode 100644 _docs/02_tasks/todo/AZ-305_c6_postgres_filesystem_store.md create mode 100644 _docs/02_tasks/todo/AZ-306_c6_faiss_descriptor_index.md create mode 100644 _docs/02_tasks/todo/AZ-307_c6_freshness_gate.md create mode 100644 _docs/02_tasks/todo/AZ-308_c6_cache_budget_eviction.md create mode 100644 _docs/02_tasks/todo/AZ-316_c11_tile_downloader.md create mode 100644 _docs/02_tasks/todo/AZ-317_c11_flight_state_gate.md create mode 100644 _docs/02_tasks/todo/AZ-318_c11_signing_key.md create mode 100644 _docs/02_tasks/todo/AZ-319_c11_tile_uploader.md create mode 100644 _docs/02_tasks/todo/AZ-320_c11_idempotent_retry.md create mode 100644 
_docs/02_tasks/todo/AZ-321_c10_engine_compiler.md create mode 100644 _docs/02_tasks/todo/AZ-322_c10_descriptor_batcher.md create mode 100644 _docs/02_tasks/todo/AZ-323_c10_manifest_builder.md create mode 100644 _docs/02_tasks/todo/AZ-324_c10_manifest_verifier.md create mode 100644 _docs/02_tasks/todo/AZ-325_c10_cache_provisioner.md create mode 100644 _docs/02_tasks/todo/AZ-326_c12_cli_app.md create mode 100644 _docs/02_tasks/todo/AZ-327_c12_companion_bringup.md create mode 100644 _docs/02_tasks/todo/AZ-328_c12_build_cache_orchestrator.md create mode 100644 _docs/02_tasks/todo/AZ-329_c12_post_landing_upload.md create mode 100644 _docs/02_tasks/todo/AZ-330_c12_operator_reloc_service.md create mode 100644 _docs/02_tasks/todo/AZ-331_c1_vio_strategy_protocol.md create mode 100644 _docs/02_tasks/todo/AZ-332_c1_okvis2_strategy.md create mode 100644 _docs/02_tasks/todo/AZ-333_c1_vins_mono_strategy.md create mode 100644 _docs/02_tasks/todo/AZ-334_c1_klt_ransac_strategy.md create mode 100644 _docs/02_tasks/todo/AZ-335_c1_warm_start_recovery.md create mode 100644 _docs/02_tasks/todo/AZ-336_c2_vpr_strategy_protocol.md create mode 100644 _docs/02_tasks/todo/AZ-337_c2_ultra_vpr.md create mode 100644 _docs/02_tasks/todo/AZ-338_c2_net_vlad.md create mode 100644 _docs/02_tasks/todo/AZ-339_c2_megaloc_mixvpr.md create mode 100644 _docs/02_tasks/todo/AZ-340_c2_selavpr_eigenplaces_salad.md create mode 100644 _docs/02_tasks/todo/AZ-341_c2_faiss_retrieve_wiring.md create mode 100644 _docs/02_tasks/todo/AZ-342_c2_5_rerank_strategy_protocol.md create mode 100644 _docs/02_tasks/todo/AZ-343_c2_5_inlier_count_reranker.md create mode 100644 _docs/02_tasks/todo/AZ-344_c3_matcher_protocol.md create mode 100644 _docs/02_tasks/todo/AZ-345_c3_disk_lightglue.md create mode 100644 _docs/02_tasks/todo/AZ-346_c3_aliked_lightglue.md create mode 100644 _docs/02_tasks/todo/AZ-347_c3_xfeat.md create mode 100644 _docs/02_tasks/todo/AZ-348_c3_5_refiner_protocol.md create mode 100644 
_docs/02_tasks/todo/AZ-349_c3_5_adhop_refiner.md create mode 100644 _docs/02_tasks/todo/AZ-355_c4_pose_protocol.md create mode 100644 _docs/02_tasks/todo/AZ-358_c4_opencv_gtsam_marginals.md create mode 100644 _docs/02_tasks/todo/AZ-361_c4_jacobian_thermal_hybrid.md create mode 100644 _docs/02_tasks/todo/AZ-381_c5_state_protocol.md create mode 100644 _docs/02_tasks/todo/AZ-382_c5_isam2_smoother_wiring.md create mode 100644 _docs/02_tasks/todo/AZ-383_c5_factor_adds.md create mode 100644 _docs/02_tasks/todo/AZ-384_c5_marginals_outputs.md create mode 100644 _docs/02_tasks/todo/AZ-385_c5_source_label_spoof_gate.md create mode 100644 _docs/02_tasks/todo/AZ-386_c5_eskf_baseline.md create mode 100644 _docs/02_tasks/todo/AZ-387_c5_smoothed_history_fdr.md create mode 100644 _docs/02_tasks/todo/AZ-388_c5_ac52_fallback.md create mode 100644 _docs/02_tasks/todo/AZ-389_c5_orthorectifier_c6.md create mode 100644 _docs/02_tasks/todo/AZ-390_c8_adapter_protocol.md create mode 100644 _docs/02_tasks/todo/AZ-391_c8_inbound_subscription.md create mode 100644 _docs/02_tasks/todo/AZ-392_c8_covariance_projector.md create mode 100644 _docs/02_tasks/todo/AZ-393_c8_ardupilot_outbound.md create mode 100644 _docs/02_tasks/todo/AZ-394_c8_inav_outbound.md create mode 100644 _docs/02_tasks/todo/AZ-395_c8_mavlink_signing.md create mode 100644 _docs/02_tasks/todo/AZ-396_c8_source_set_switch.md create mode 100644 _docs/02_tasks/todo/AZ-397_c8_qgc_telemetry_adapter.md create mode 100644 _docs/02_tasks/todo/AZ-398_replay_frame_source_clock.md create mode 100644 _docs/02_tasks/todo/AZ-399_replay_tlog_adapter.md create mode 100644 _docs/02_tasks/todo/AZ-400_replay_jsonl_sink.md create mode 100644 _docs/02_tasks/todo/AZ-401_replay_compose.md create mode 100644 _docs/02_tasks/todo/AZ-402_replay_cli.md create mode 100644 _docs/02_tasks/todo/AZ-403_replay_dockerfile_ci.md create mode 100644 _docs/02_tasks/todo/AZ-404_replay_e2e_fixture.md create mode 100644 _docs/02_tasks/todo/AZ-405_replay_auto_sync.md 
create mode 100644 _docs/02_tasks/todo/AZ-406_test_infrastructure.md create mode 100644 _docs/02_tasks/todo/AZ-407_fixture_builders_static.md create mode 100644 _docs/02_tasks/todo/AZ-408_fixture_builders_synth_injectors.md create mode 100644 _docs/02_tasks/todo/AZ-409_ft_p_01_still_image_accuracy.md create mode 100644 _docs/02_tasks/todo/AZ-410_ft_p_02_derkachi_drift.md create mode 100644 _docs/02_tasks/todo/AZ-411_ft_p_03_14_schema_wgs84.md create mode 100644 _docs/02_tasks/todo/AZ-412_ft_p_04_derkachi_f2f_registration.md create mode 100644 _docs/02_tasks/todo/AZ-413_ft_p_05_06_sat_anchor_mre.md create mode 100644 _docs/02_tasks/todo/AZ-414_ft_p_07_ftn_02_sharp_turn.md create mode 100644 _docs/02_tasks/todo/AZ-415_ft_p_08_multi_segment_reloc.md create mode 100644 _docs/02_tasks/todo/AZ-416_ft_p_09_ap_signing.md create mode 100644 _docs/02_tasks/todo/AZ-417_ft_p_09_inav.md create mode 100644 _docs/02_tasks/todo/AZ-418_ft_p_10_smoothing_lookback.md create mode 100644 _docs/02_tasks/todo/AZ-419_ft_p_11_cold_start_init.md create mode 100644 _docs/02_tasks/todo/AZ-420_ft_p_12_13_gcs.md create mode 100644 _docs/02_tasks/todo/AZ-421_ft_p_15_16_18_cache_offline.md create mode 100644 _docs/02_tasks/todo/AZ-422_ft_p_17_ftn_06_mid_flight_tiles.md create mode 100644 _docs/02_tasks/todo/AZ-423_ft_p_19_sat_reloc_scale.md create mode 100644 _docs/02_tasks/todo/AZ-424_ft_n_01_outlier_tolerance.md create mode 100644 _docs/02_tasks/todo/AZ-425_ft_n_03_outage_reloc.md create mode 100644 _docs/02_tasks/todo/AZ-426_ft_n_04_blackout_spoof.md create mode 100644 _docs/02_tasks/todo/AZ-427_ft_n_05_stale_tile_rejection.md create mode 100644 _docs/02_tasks/todo/AZ-428_nft_perf_01_e2e_latency.md create mode 100644 _docs/02_tasks/todo/AZ-429_nft_perf_02_streaming.md create mode 100644 _docs/02_tasks/todo/AZ-430_nft_perf_03_ttff.md create mode 100644 _docs/02_tasks/todo/AZ-431_nft_perf_04_spoof_promotion.md create mode 100644 _docs/02_tasks/todo/AZ-432_nft_res_01_imu_only_fallback.md create 
mode 100644 _docs/02_tasks/todo/AZ-433_nft_res_02_companion_reboot.md create mode 100644 _docs/02_tasks/todo/AZ-434_nft_res_03_monte_carlo.md create mode 100644 _docs/02_tasks/todo/AZ-435_nft_res_04_blackout_escalation.md create mode 100644 _docs/02_tasks/todo/AZ-436_nft_sec_01_cache_poisoning.md create mode 100644 _docs/02_tasks/todo/AZ-437_nft_sec_02_05_no_egress.md create mode 100644 _docs/02_tasks/todo/AZ-438_nft_sec_03_mavlink_signing.md create mode 100644 _docs/02_tasks/todo/AZ-439_nft_sec_04_opencv_cve.md create mode 100644 _docs/02_tasks/todo/AZ-440_nft_lim_01_jetson_memory.md create mode 100644 _docs/02_tasks/todo/AZ-441_nft_lim_02_fdr_size.md create mode 100644 _docs/02_tasks/todo/AZ-442_nft_lim_03_05_storage_budget.md create mode 100644 _docs/02_tasks/todo/AZ-443_nft_lim_04_thermal.md create mode 100644 _docs/02_tasks/todo/AZ-444_tier2_jetson_harness.md create mode 100644 _docs/02_tasks/todo/AZ-445_csv_reporter_evidence_bundler.md create mode 100644 _docs/02_tasks/todo/AZ-446_csv_reporter_refinements.md diff --git a/_docs/00_problem/input_data/flight_derkachi/camera_info.md b/_docs/00_problem/input_data/flight_derkachi/camera_info.md new file mode 100644 index 0000000..b9bcb34 --- /dev/null +++ b/_docs/00_problem/input_data/flight_derkachi/camera_info.md @@ -0,0 +1,3 @@ +Camera model: Topotek KHP20S30 +Daylight Sensor: 1/2.8" CMOS (2.13 MP). + Full HD (1920x1080), 30/60 fps \ No newline at end of file diff --git a/_docs/02_document/contracts/c10_provisioning/cache_provisioner.md b/_docs/02_document/contracts/c10_provisioning/cache_provisioner.md new file mode 100644 index 0000000..fe870ab --- /dev/null +++ b/_docs/02_document/contracts/c10_provisioning/cache_provisioner.md @@ -0,0 +1,145 @@ +# Contract: CacheProvisioner (C10) + +**Type**: Python Protocol (`@runtime_checkable`) — local in-process API. 
+**Producer task**: AZ-325_c10_cache_provisioner +**Consumers**: +- C12 Operator Tooling — orchestrates the F1 build sequence `C11 TileDownloader → CacheProvisioner.build_cache_artifacts` and surfaces the `BuildReport` to the operator (E-C12 / AZ-253). +- C13 FDR — out of scope for build (F1 is offline / pre-flight); F2's verify is owned by the `ManifestVerifier` contract. + +## Purpose + +`CacheProvisioner` is the public top-level surface for the C10 build phase. It composes `EngineCompiler` (AZ-321), `DescriptorBatcher` (AZ-322), and `ManifestBuilder` (AZ-323) into a single idempotent operation that the operator runs after `C11 TileDownloader` has populated C6. The Provisioner enforces D-C10-1 idempotence (skip rebuild when the build-identity hash matches the prior Manifest), D-C10-3 ManifestCoverageError (every shipped artifact under `cache_root` MUST be in the Manifest — no smuggled files), and D-C10-6 hardware-tied engine reuse (delegated to AZ-321). It does NOT touch `satellite-provider` (per epic § Architecture notes); tile I/O is C11's responsibility. + +## Public Surface + +```python +from pathlib import Path +from typing import Protocol, runtime_checkable + + +@runtime_checkable +class CacheProvisioner(Protocol): + """Public top-level orchestrator for C10 cache build. + + Idempotent: if the prior Manifest's build-identity hash matches the + request's, returns `outcome=IDEMPOTENT_NO_OP` without rebuilding. + Otherwise composes engine compile + descriptor population + Manifest + write + coverage check. + """ + + def build_cache_artifacts(self, request: BuildRequest) -> BuildReport: ... + def compile_engines_for_corpus(self, request: EngineCompileRequest) -> tuple[EngineCacheEntry, ...]: ... 
+``` + +### DTOs + +```python +from dataclasses import dataclass +from enum import Enum +from pathlib import Path + + +class SectorClassification(Enum): + ACTIVE_CONFLICT = "active_conflict" + STABLE_REAR = "stable_rear" + + +class BuildOutcome(Enum): + SUCCESS = "success" + FAILURE = "failure" + IDEMPOTENT_NO_OP = "idempotent_no_op" + + +@dataclass(frozen=True) +class Bbox: + lat_min: float + lon_min: float + lat_max: float + lon_max: float + + +@dataclass(frozen=True) +class BuildRequest: + bbox: Bbox + zoom_levels: tuple[int, ...] + sector_class: SectorClassification + calibration_path: Path + cache_root: Path + key_path: Path # operator signing key per C10-ST-01 + + +@dataclass(frozen=True) +class BuildReport: + outcome: BuildOutcome + engines_built: int + engines_reused: int + descriptors_generated: int + manifest_hash: str | None + manifest_path: Path | None + failure_reason: str | None + elapsed_s: float +``` + +(`EngineCompileRequest` and `EngineCacheEntry` are AZ-321's; re-exported for convenience.) + +### Exceptions + +| Exception | When raised | Caller action | +|-----------|------------|---------------| +| `BuildLockHeldError` | Another `build_cache_artifacts` invocation holds the cache_root lockfile (per description.md § 7 race-condition mitigation). | Operator waits / kills the other process; not retried automatically. | +| `ManifestCoverageError` | After build, an orphan file exists under `cache_root` that is not listed in the Manifest. | Build is rolled back to prior-good Manifest (if present); operator inspects the orphan. | +| `EngineBuildError`, `CalibrationCacheError` | Propagated from AZ-321 / AZ-298. | Operator triages GPU / calibration. | +| `DescriptorBatchError` | Propagated from AZ-322. | Operator triages GPU OOM / model. | +| `ManifestWriteError` | Propagated from AZ-323 (key fingerprint mismatch in operator mode, key load failure, atomic-write failure). | Operator inspects key / disk. 
| + +`BuildOutcome.FAILURE` is reserved for soft failures captured in `BuildReport` (missing tiles in C6, coverage warning when configured non-strict). Hard errors raise. + +## Invariants + +| ID | Invariant | Why | +|----|-----------|-----| +| CP-INV-1 | Idempotence: if `Manifest.json` exists at `cache_root` AND its `manifest_hash` equals the build-identity hash for the new request → `outcome=IDEMPOTENT_NO_OP`, ZERO new compiles, ZERO new embeds, ZERO new Manifest writes; the existing Manifest is left untouched. | D-C10-1; warm re-run ≤ 1 min envelope (C10-PT-01). | +| CP-INV-2 | A failed `build_cache_artifacts` does NOT leave the cache in a worse state than at the start: new engines may exist (cache hits) but the Manifest is either the previous-good one OR rolled back; the FAISS index is either the previous-good one OR atomically replaced. | Operators can retry safely. | +| CP-INV-3 | A SUCCESS outcome implies the coverage check passed (no `ManifestCoverageError`): every file under `cache_root` (recursively, excluding the Manifest itself + sidecars + sig) is listed in the Manifest's artifacts. | D-C10-3 — no smuggled artifacts in the takeoff cache. | +| CP-INV-4 | Concurrent `build_cache_artifacts` calls on the same `cache_root` are mutually exclusive via a filesystem lockfile at `cache_root/.c10.lock`. | description.md § 7 race-condition mitigation. | +| CP-INV-5 | `cache_root` must already exist; `build_cache_artifacts` does NOT create the directory tree (operator workflow places it). | Avoids accidental builds in unintended paths. | +| CP-INV-6 | No network calls (no `satellite-provider`, no Postgres TLS to a remote DB beyond the local instance, no metric push). | Epic § Architecture notes: C10 is workstation-local. | +| CP-INV-7 | The operator key file at `request.key_path` is opened exactly once (via AZ-323's signer) and zeroized when out of scope; this contract does NOT cache the key in memory across calls. | Operator key hygiene. 
| + +## Non-Goals + +- Tile fetch from `satellite-provider` — owned by E-C11 / C11 TileDownloader. +- Engine deserialization at takeoff — owned by E-C7 / AZ-298 + C5 takeoff arming. +- Manifest verification — owned by AZ-324's `ManifestVerifier` (separate contract). +- Multi-cache management (rotating between sector caches) — operator runs `build_cache_artifacts` per cache_root. +- Garbage collection of stale engines — explicit operator action; not part of the build flow. +- Resumable build (mid-build process kill → resume from last batch) — out of scope; restart from scratch. + +## Versioning + +- v1.0.0 — initial Protocol surface (this document). +- Breaking changes: changing `BuildRequest` shape, removing a `BuildOutcome`, adding a required field — bump major. +- Additive changes: new optional kwarg, new `BuildOutcome` value, new field on `BuildReport` — bump minor. Consumers MUST handle unknown outcomes gracefully (treat as FAILURE). +- Patch: clarifications, doc edits. + +| Version | Date | Notes | Author | +|---------|------|-------|--------| +| 1.0.0 | 2026-05-10 | Initial contract — produced by AZ-325 (E-C10 decomposition) | autodev | + +## Test Cases (consumer side) + +| ID | Scenario | Expected Outcome | +|----|----------|------------------| +| CP-TC-1 | Cold build with all dependencies satisfied | `outcome=SUCCESS`; counts > 0; Manifest at `cache_root/Manifest.json` | +| CP-TC-2 | Warm build, identical request | `outcome=IDEMPOTENT_NO_OP`; counts all 0; Manifest unchanged on disk | +| CP-TC-3 | Warm build, different bbox | `outcome=SUCCESS`; rebuild happens; new Manifest replaces old (atomic) | +| CP-TC-4 | C6 has zero tiles for the requested scope | `outcome=FAILURE`; `failure_reason` directs operator to run C11 first | +| CP-TC-5 | Concurrent invocation while another build in progress | `BuildLockHeldError`; second invocation does not corrupt state | +| CP-TC-6 | An orphan file exists under `cache_root` after build | `ManifestCoverageError`; rolled 
back to prior Manifest if present | +| CP-TC-7 | Operator key file fingerprint not in allowlist (operator mode) | `ManifestWriteError` (propagated from AZ-323); ZERO file writes | +| CP-TC-8 | `EngineBuildError` mid-compile | Exception propagates; partial cache state consistent (atomic engines on disk for those that succeeded; Manifest NOT updated) | +| CP-TC-9 | `DescriptorBatchError` (persistent CUDA OOM) | Exception propagates; engines may be on disk; Manifest NOT updated | +| CP-TC-10 | Conformance: `isinstance(impl, CacheProvisioner)` | `True` | +| CP-TC-11 | `compile_engines_for_corpus` directly callable for re-compile-only flows | Returns `tuple[EngineCacheEntry, ...]`; no descriptor / Manifest work | +| CP-TC-12 | Cold build wall-clock benchmark on Tier-1 dev workstation, 1k tiles, 3 backbones | ≤ 12 min (NFR C10-PT-01) | +| CP-TC-13 | Warm idempotent re-run benchmark | ≤ 1 min (NFR C10-PT-01) | diff --git a/_docs/02_document/contracts/c10_provisioning/manifest_verifier.md b/_docs/02_document/contracts/c10_provisioning/manifest_verifier.md new file mode 100644 index 0000000..41f180d --- /dev/null +++ b/_docs/02_document/contracts/c10_provisioning/manifest_verifier.md @@ -0,0 +1,134 @@ +# Contract: ManifestVerifier (C10) + +**Type**: Python Protocol (`@runtime_checkable`) — local in-process API. +**Producer task**: AZ-324_c10_manifest_verifier +**Consumers**: +- C5 State Estimator / takeoff-arming gate (F2 phase) — refuses to arm if `verify_manifest` does not return `outcome=pass`. (E-C5 / AZ-249.) +- C12 Operator Tooling — runs verify before flight handoff to surface drift between F1 build time and F2 takeoff (E-C12 / AZ-253). +- C13 FDR — emits a `manifest.verify` record on every airborne verify call (`outcome` field gates downstream). + +## Purpose + +`ManifestVerifier` is the read-only validator for the C10-produced cache Manifest. 
It is the takeoff trust anchor for AC-NEW-1 ("no engine deserialization at takeoff before manifest verify") and D-C10-3 ("SHA-256 content-hash gate over every shipped artifact"). At F2 takeoff, every artifact listed in the Manifest is re-hashed and compared to its recorded digest; any mismatch fails the verdict and prevents arming. The Ed25519 signature over the Manifest is verified against a pinned operator public key before any artifact is touched — defence-in-depth against a spliced Manifest pointing at attacker-chosen content hashes. + +## Public Surface + +```python +from pathlib import Path +from typing import Protocol, runtime_checkable +from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey + + +@runtime_checkable +class ManifestVerifier(Protocol): + """Read-only verifier for a C10-produced Manifest.json. + + Fail-closed: any deviation in signature, schema, or per-artifact hash + yields `VerificationResult(outcome=fail, ...)`. Never raises on a verify + failure — operators / takeoff arming code branch on `outcome`. + + Raises only on resource errors (Manifest.json missing, key file + unreadable) — those are environment problems, not verify outcomes. + """ + + def verify_manifest( + self, + *, + manifest_path: Path, + trusted_public_keys: tuple[Ed25519PublicKey, ...], + ) -> VerificationResult: ... 
+``` + +### DTOs + +```python +from dataclasses import dataclass +from enum import Enum +from pathlib import Path + + +class VerifyOutcome(Enum): + PASS = "pass" + FAIL = "fail" + + +class VerifyFailReason(Enum): + MANIFEST_NOT_FOUND = "manifest_not_found" + SIGNATURE_NOT_FOUND = "signature_not_found" + SIGNATURE_INVALID = "signature_invalid" + UNTRUSTED_PUBLIC_KEY = "untrusted_public_key" + SCHEMA_VIOLATION = "schema_violation" + ARTIFACT_MISSING = "artifact_missing" + ARTIFACT_HASH_MISMATCH = "artifact_hash_mismatch" + TILES_COVERAGE_MISMATCH = "tiles_coverage_mismatch" + MANIFEST_SELF_HASH_MISMATCH = "manifest_self_hash_mismatch" + + +@dataclass(frozen=True) +class ArtifactCheck: + relative_path: str + expected_sha256: str + actual_sha256: str | None # None if file missing + matched: bool + + +@dataclass(frozen=True) +class VerificationResult: + outcome: VerifyOutcome + fail_reasons: tuple[VerifyFailReason, ...] + fail_details: tuple[str, ...] # human-readable diagnostic per reason + signing_public_key_fingerprint: str | None # populated when signature parses, even if untrusted + per_artifact_checks: tuple[ArtifactCheck, ...] + elapsed_ms: int +``` + +## Invariants + +| ID | Invariant | Why | +|----|-----------|-----| +| MV-INV-1 | The verifier is fail-closed: any deviation produces `outcome=FAIL` with at least one `VerifyFailReason`; never returns `PASS` with non-empty `fail_reasons`. | AC-NEW-1 / D-C10-3 — takeoff cannot arm on a partial verify. | +| MV-INV-2 | Signature verification happens BEFORE per-artifact hashing. If the signature is invalid or untrusted, no file content is read beyond the Manifest itself. | Defence-in-depth: a malicious Manifest must not trick the verifier into hashing attacker-chosen file paths. | +| MV-INV-3 | The Manifest's own `Manifest.json.sha256` sidecar (written by AZ-323) must match `sha256(Manifest.json)`; mismatch is `MANIFEST_SELF_HASH_MISMATCH`. 
| The sidecar is the entry point of the chain of trust — drift here means tampering or atomic-write failure. | +| MV-INV-4 | Per-artifact paths are interpreted relative to `manifest_path.parent`; absolute paths in the Manifest are rejected as `SCHEMA_VIOLATION`. | Prevents a malicious Manifest from pointing outside `cache_root`. | +| MV-INV-5 | `tiles_coverage` mismatch is reported separately from `ARTIFACT_HASH_MISMATCH` because tiles are hashed in aggregate (per AZ-323). The verifier re-derives the aggregate hash from a `TileMetadataStore` query if available, OR (in airborne F2 mode) treats the recorded `tiles_coverage_sha256` as authoritative and only verifies the Manifest signature + non-tile artifacts. | Airborne C5 may not load 100k per-tile rows just to arm; the trust chain is signature → manifest_hash → tiles_coverage_sha256. C12 / operator mode does the full re-derivation. | +| MV-INV-6 | The verifier never writes to disk, never opens network sockets, never calls C13. Telemetry is the caller's responsibility. | Read-only contract — composable in airborne C5 + operator C12 contexts without side-effect surprise. | +| MV-INV-7 | `elapsed_ms` is recorded for every call (pass or fail) so operators and C5 can observe drift in verify cost on slow disks. | NFR for C10-PT-01's takeoff load budget. | + +## Non-Goals + +- **Signature production** — owned by AZ-323's `ManifestSigner`. The verifier never signs. +- **Cache repair** — the verifier reports failures; rebuild is owned by AZ-325 (the orchestrator). +- **Trusted-key distribution / revocation** — `trusted_public_keys` is supplied by the caller; this contract does not define a key registry. +- **Coverage check (orphan files in cache_root)** — owned by AZ-325 (`ManifestCoverageError`); the verifier checks "every Manifest entry exists and matches", not "every cache_root file is in the Manifest". +- **Rollback to prior-good Manifest** — out of scope; caller decides next action on `FAIL`. 
+ +## Versioning + +- v1.0.0 — initial Protocol surface (this document). +- Breaking changes — adding a required argument, removing a `VerifyFailReason`, changing semantics of an existing one — bump major. +- Additive changes — new `VerifyFailReason` value, new optional kwarg on `verify_manifest`, new field on `VerificationResult` — bump minor. Consumers MUST handle unknown reasons gracefully (default to FAIL). +- Patch — clarifications, doc edits, bug-fix tests. + +| Version | Date | Notes | Author | +|---------|------|-------|--------| +| 1.0.0 | 2026-05-10 | Initial contract — produced by AZ-324 (E-C10 decomposition) | autodev | + +## Test Cases (consumer side) + +| ID | Scenario | Expected Outcome | +|----|----------|------------------| +| MV-TC-1 | Valid Manifest + trusted key + all artifacts present + hashes match | `outcome=PASS`, empty `fail_reasons`, `per_artifact_checks` all `matched=True` | +| MV-TC-2 | Manifest.json missing | `outcome=FAIL`, `fail_reasons=(MANIFEST_NOT_FOUND,)`; no further work | +| MV-TC-3 | Manifest.json.sig missing | `outcome=FAIL`, `fail_reasons=(SIGNATURE_NOT_FOUND,)`; `signing_public_key_fingerprint=None` | +| MV-TC-4 | Signature does not verify | `outcome=FAIL`, `fail_reasons=(SIGNATURE_INVALID,)`; no per-artifact checks performed | +| MV-TC-5 | Signature verifies but key is not in `trusted_public_keys` | `outcome=FAIL`, `fail_reasons=(UNTRUSTED_PUBLIC_KEY,)`; fingerprint populated | +| MV-TC-6 | Schema violation (missing required key, absolute path, wrong types) | `outcome=FAIL`, `fail_reasons=(SCHEMA_VIOLATION,)` with detail naming the field | +| MV-TC-7 | One engine missing on disk | `outcome=FAIL`, `fail_reasons=(ARTIFACT_MISSING,)`; `per_artifact_checks` shows that engine with `actual_sha256=None, matched=False` | +| MV-TC-8 | One engine present but bytes drifted | `outcome=FAIL`, `fail_reasons=(ARTIFACT_HASH_MISMATCH,)`; offending check has `matched=False` | +| MV-TC-9 | Multiple failures (missing + drifted + signature OK) 
| `fail_reasons` contains BOTH `ARTIFACT_MISSING` and `ARTIFACT_HASH_MISMATCH`; per-artifact checks complete (don't short-circuit on first failure) | +| MV-TC-10 | `Manifest.json.sha256` sidecar mismatch | `outcome=FAIL`, `fail_reasons=(MANIFEST_SELF_HASH_MISMATCH,)`; signature path NOT consulted | +| MV-TC-11 | Tampered Manifest body but matching sidecar | `outcome=FAIL`, `fail_reasons=(SIGNATURE_INVALID,)` (the signature cannot match if body changed even by 1 byte) | +| MV-TC-12 | Conformance: `isinstance(my_impl, ManifestVerifier)` | `True` | +| MV-TC-13 | Tier-2 Tile-coverage check (operator mode with TileMetadataStore) | If recomputed `tiles_coverage_sha256` differs → `TILES_COVERAGE_MISMATCH`; if matches → that part passes | +| MV-TC-14 | Empty `trusted_public_keys` | `outcome=FAIL`, `fail_reasons=(UNTRUSTED_PUBLIC_KEY,)` (every key is untrusted by definition) | +| MV-TC-15 | Pristine Manifest verified inside 100 ms on Tier-2 (excludes per-tile re-walk) | `elapsed_ms ≤ 100` for the signature + non-tile artifact path | diff --git a/_docs/02_document/contracts/c11_tilemanager/tile_downloader.md b/_docs/02_document/contracts/c11_tilemanager/tile_downloader.md new file mode 100644 index 0000000..c29a854 --- /dev/null +++ b/_docs/02_document/contracts/c11_tilemanager/tile_downloader.md @@ -0,0 +1,115 @@ +# Contract: tile_downloader + +**Component**: c11_tilemanager +**Producer task**: AZ-316_c11_tile_downloader +**Consumer tasks**: AZ-253 (E-C12 Operator Pre-flight Tooling — TBD at C12 decompose time) +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +The `TileDownloader` Protocol is C11's operator-side download interface. C12 invokes it during F1 (pre-flight cache build) to fetch satellite tiles from the parent suite's `satellite-provider` GET surface, apply RESTRICT-SAT-4 resolution gating at the C11 boundary, and write accepted tiles into C6. 
Freshness rejections surfacing from C6 (AZ-307) are counted and surfaced in the report. + +C11 is operator-side ONLY; ADR-004 forbids the airborne companion image from importing this module. + +## Shape + +### Function / method API + +```python +from typing import Protocol, runtime_checkable +from pathlib import Path + +@runtime_checkable +class TileDownloader(Protocol): + def download_tiles_for_area(self, request: DownloadRequest) -> DownloadBatchReport: ... + def enumerate_remote_coverage(self, bbox: Bbox, zoom_levels: list[int]) -> list[TileSummary]: ... +``` + +| Name | Signature | Throws / Errors | Blocking? | +|------|-----------|-----------------|-----------| +| `download_tiles_for_area` | `(request: DownloadRequest) -> DownloadBatchReport` | `SatelliteProviderError`, `RateLimitedError`, `ResolutionRejectionError`, `CacheBudgetExceededError`, `TileFsError`, `TileMetadataError` | sync (offline; minutes) | +| `enumerate_remote_coverage` | `(bbox: Bbox, zoom_levels: list[int]) -> list[TileSummary]` | `SatelliteProviderError`, `RateLimitedError` | sync (seconds) | + +### Data DTOs + +```python +@dataclass(frozen=True) +class DownloadRequest: + bbox: Bbox # from c6_tile_cache + zoom_levels: tuple[int, ...] 
+ sector_class: SectorClassification # from c6_tile_cache + satellite_provider_url: str # parent-suite base URL + service_api_key: str # TLS + service-internal + cache_root: Path # operator workstation + flight_id: uuid.UUID # tags downloads in C6 metadata + +@dataclass(frozen=True) +class DownloadBatchReport: + tiles_downloaded: int + tiles_rejected_freshness: int # raised by AZ-307 at C6 boundary + tiles_rejected_resolution: int # rejected by C11 (RESTRICT-SAT-4) + tiles_downgraded: int # stable_rear stale → DOWNGRADED label + freshness_summary: dict[FreshnessLabel, int] + outcome: DownloadOutcome # success | failure | idempotent_no_op + failure_reason: str | None + +@dataclass(frozen=True) +class TileSummary: + tile_id: TileId # from c6_tile_cache + produced_at: datetime + resolution_m_per_px: float + estimated_bytes: int +``` + +| Field | Type | Required | Description | Constraints | +|-------|------|----------|-------------|-------------| +| `DownloadRequest.bbox` | `Bbox` | yes | Operational area | min_lat ≤ max_lat, min_lon ≤ max_lon | +| `DownloadRequest.zoom_levels` | `tuple[int, ...]` | yes | Zoom levels to fetch | each in `[0, 21]`; deduplicated | +| `DownloadRequest.sector_class` | `SectorClassification` | yes | Drives freshness rule applied at C6 | `ACTIVE_CONFLICT \| STABLE_REAR` | +| `DownloadRequest.cache_root` | `Path` | yes | Operator workstation cache dir | must exist; must be writable | +| `DownloadBatchReport.tiles_downloaded` | `int` | yes | Tiles written to C6 successfully | ≥ 0 | +| `DownloadBatchReport.tiles_rejected_resolution` | `int` | yes | Tiles rejected at C11 boundary for < 0.5 m/px | ≥ 0 | +| `DownloadBatchReport.tiles_rejected_freshness` | `int` | yes | Count of `FreshnessRejectionError` raised by C6 (AZ-307) | ≥ 0 | +| `DownloadBatchReport.outcome` | `DownloadOutcome` | yes | Aggregate outcome | enum | + +## Invariants + +- I-1: `tiles_downloaded + tiles_rejected_resolution + tiles_rejected_freshness == sum of attempted tiles`. 
The report accounts for every tile the downloader attempted; no silent drops. +- I-2: A re-run of `download_tiles_for_area` for the same `(bbox, zoom_levels, sector_class, flight_id)` after a successful prior run is idempotent: `outcome = idempotent_no_op` and no GETs are issued. Idempotence is enforced by C11's download-progress journal under `cache_root/.c11/journal/`. +- I-3: Every accepted tile passes BOTH the C11 resolution gate (≥ 0.5 m/px per RESTRICT-SAT-4) AND the C6 freshness gate (AZ-307). A tile that fails either is excluded from `tiles_downloaded`. +- I-4: TLS + service-internal API key authenticate the GET; auth failure surfaces as `SatelliteProviderError` and aborts the run with `outcome = failure`. The downloader does NOT fall back to plaintext or unauthenticated requests. +- I-5: The downloader writes via the AZ-303 `TileStore`/`TileMetadataStore` Protocols; it does NOT touch C6's filesystem layout directly. +- I-6: A `CacheBudgetExceededError` aborts pre-write with no partial write and `outcome = failure`. The C6 cache budget enforcer (AZ-308) drives the headroom check. + +## Non-Goals + +- Not covered: airborne or in-flight downloads (RESTRICT-SAT-1 forbids them; airborne process cannot import this module per ADR-004). +- Not covered: orchestration of when the operator runs F1 — owned by C12. +- Not covered: cache artifact build (descriptors, FAISS index) — owned by C10 after the downloader populates C6. +- Not covered: tile uploads to `satellite-provider` ingest — owned by `TileUploader` (separate contract). +- Not covered: parsing or validation of `satellite-provider`'s authentication payload beyond what `httpx` provides — out of scope for the onboard side. + +## Versioning Rules + +- **Breaking changes** (renamed method, removed required field, changed return type) require a major version bump. C12 is the sole consumer today; coordinate via Choose A/B/C/D when bumping. 
+- **Non-breaking additions** (new optional field on the report, new error variant the consumer already catches via the family) require a minor version bump. + +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| download-happy-path | `DownloadRequest` for Derkachi bbox with mix of fresh active_conflict + stable_rear tiles | `DownloadBatchReport` with `tiles_downloaded > 0`; sum of report counts equals attempt count; tiles present in C6 | C11-IT-01 | +| freshness-rejection-counts | source returns stale tiles in active_conflict sector | `DownloadBatchReport.tiles_rejected_freshness > 0`; matches C6's AZ-307 rejection count for that batch | C11-IT-02 | +| resolution-gate-rejects | source returns tile with `resolution_m_per_px = 0.3` (< 0.5) | tile excluded from `tiles_downloaded`; `tiles_rejected_resolution += 1`; no C6 write attempted | RESTRICT-SAT-4 | +| auth-failure-aborts | invalid `service_api_key` | first GET raises `SatelliteProviderError`; `outcome = failure`; no tiles written | I-4 | +| budget-exceeded-aborts | pre-write check shows insufficient headroom | `CacheBudgetExceededError`; `outcome = failure`; zero partial writes | I-6 | +| idempotent-rerun | second call with identical request after success | `outcome = idempotent_no_op`; zero GETs observed | I-2 | +| rate-limited-honors-retry-after | source returns 429 with `Retry-After: 30` | downloader sleeps ≥ 30s before retry; no `RateLimitedError` raised on success path | RFC 6585 | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract — produced by AZ-316 (E-C11 decomposition) | autodev | diff --git a/_docs/02_document/contracts/c11_tilemanager/tile_uploader.md b/_docs/02_document/contracts/c11_tilemanager/tile_uploader.md new file mode 100644 index 0000000..b83fdb6 --- /dev/null +++ b/_docs/02_document/contracts/c11_tilemanager/tile_uploader.md @@ -0,0 +1,114 @@ +# Contract: 
tile_uploader + +**Component**: c11_tilemanager +**Producer task**: AZ-319_c11_tile_uploader +**Consumer tasks**: AZ-253 (E-C12 Operator Pre-flight Tooling — TBD at C12 decompose time) +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +The `TileUploader` Protocol is C11's operator-side post-landing upload interface. C12 invokes it during F10 (post-landing) to read mid-flight tiles flagged pending-upload from C6 (`source = onboard_ingest`, `voting_status = pending`), package them per the D-PROJ-2 ingest contract sketch, sign each tile payload with the per-flight ephemeral key (AZ-318), and POST to `satellite-provider`'s `/api/satellite/tiles/ingest` endpoint. Acknowledged tiles are marked uploaded in C6. + +The uploader gates on `flight_state == ON_GROUND` (AZ-317) before any network egress. C11 is operator-side ONLY; ADR-004 forbids the airborne companion image from importing this module. + +## Shape + +### Function / method API + +```python +from typing import Protocol, runtime_checkable + +@runtime_checkable +class TileUploader(Protocol): + def upload_pending_tiles(self, request: UploadRequest) -> UploadBatchReport: ... + def enumerate_pending_tiles(self, flight_id: uuid.UUID | None = None) -> list[TileMetadata]: ... + def confirm_flight_state(self) -> FlightStateSignal: ... +``` + +| Name | Signature | Throws / Errors | Blocking? 
| +|------|-----------|-----------------|-----------| +| `upload_pending_tiles` | `(request: UploadRequest) -> UploadBatchReport` | `FlightStateNotOnGroundError`, `SatelliteProviderError`, `RateLimitedError`, `SignatureRejectedError`, `TileMetadataError` | sync (post-landing; minutes) | +| `enumerate_pending_tiles` | `(flight_id: uuid.UUID \| None) -> list[TileMetadata]` | `TileMetadataError` | sync (seconds) | +| `confirm_flight_state` | `() -> FlightStateSignal` | `FlightStateNotOnGroundError` | sync (≤ 1 ms) | + +### Data DTOs + +```python +@dataclass(frozen=True) +class UploadRequest: + flight_id: uuid.UUID | None # None = all flights with pending + batch_size: int # tiles per HTTP POST + satellite_provider_url: str # parent-suite ingest base URL + +@dataclass(frozen=True) +class UploadBatchReport: + batch_uuid: uuid.UUID # assigned by parent-suite ingest + per_tile_status: tuple[PerTileStatus, ...] + retry_count: int + next_retry_at_s: int | None # set when partial-success + outcome: UploadOutcome # success | partial | failure + public_key_fingerprint: str # 16-hex; from AZ-318 + +@dataclass(frozen=True) +class PerTileStatus: + tile_id: TileId # from c6_tile_cache + status: IngestStatus # queued | rejected | duplicate | superseded + rejection_reason: str | None +``` + +| Field | Type | Required | Description | Constraints | +|-------|------|----------|-------------|-------------| +| `UploadRequest.flight_id` | `UUID \| None` | no | Restricts batch to one flight | None = all pending across flights | +| `UploadRequest.batch_size` | `int` | yes | Tiles per HTTP POST | `1 ≤ batch_size ≤ 200` | +| `UploadBatchReport.batch_uuid` | `UUID` | yes | Parent-suite batch identifier | Server-assigned per D-PROJ-2 | +| `UploadBatchReport.per_tile_status` | `tuple[PerTileStatus, ...]` | yes | Per-tile result | Length = number of tiles attempted in this report | +| `UploadBatchReport.outcome` | `UploadOutcome` | yes | Aggregate outcome | `success` (all 
queued/duplicate/superseded) \| `partial` (some rejected/timeout) \| `failure` (gate blocked or full failure) | +| `UploadBatchReport.public_key_fingerprint` | `str` | yes | Identifies the per-flight signing key | 16 hex chars from AZ-318 | +| `PerTileStatus.status` | `IngestStatus` | yes | Server response status | `queued` \| `rejected` \| `duplicate` \| `superseded` | + +## Invariants + +- I-1: `confirm_flight_state` is called by `upload_pending_tiles` BEFORE any C6 read or network egress; if `FlightStateNotOnGroundError` is raised, NO tiles are read, NO POSTs are issued, NO C6 mutation occurs. The gate is closed by default. +- I-2: Every uploaded tile carries a signature produced by the AZ-318 per-flight key manager's `sign(payload)`. The parent suite verifies against the public key it received via the safety officer's pre-flight enrolment OR the `kind="c11.upload.session.key.public"` FDR record. +- I-3: A tile acknowledged as `queued`, `duplicate`, or `superseded` by the parent suite is marked `uploaded` in C6 (`mark_uploaded(tile_id)`); a tile acknowledged as `rejected` is NOT marked uploaded — it remains `pending` for human review. +- I-4: The per-flight signing key is zeroised at the end of `upload_pending_tiles` regardless of success or failure (try/finally in the caller; AZ-318's `end_session()`). +- I-5: A `SignatureRejectedError` from the parent suite triggers an FDR alert (AZ-318's `record_signature_rejection`); it is NEVER silently caught. +- I-6: The uploader writes via the AZ-303 `TileMetadataStore.mark_uploaded` Protocol; it does NOT update the metadata table directly. +- I-7: Partial-success batches are reported (not raised as failures) so the caller can re-invoke for the unacked tiles; idempotent retry behaviour is owned by the AZ-320 decorator that wraps this Protocol's impl. +- I-8: The signed payload includes `capture_timestamp` per the D-PROJ-2 contract sketch; the parent suite's nonce / timestamp validation owns replay defence. 
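Invariants I-1, I-3 and I-4 describe an ordering that is easy to get subtly wrong, so a rough sketch may help. The code below is an assumption-laden illustration rather than the AZ-319 implementation: `FakeKeyManager`, `begin_session`, `upload_with_gate` and `should_mark_uploaded` are hypothetical names (only `end_session()` is mentioned in this contract), and the network call is reduced to a caller-supplied function.

```python
from typing import Callable


class FlightStateNotOnGroundError(Exception):
    """Raised when the flight-state gate is closed (I-1)."""


class FakeKeyManager:
    # Hypothetical stand-in for the AZ-318 per-flight key manager.
    def __init__(self) -> None:
        self.private_key: bytes | None = None

    def begin_session(self) -> None:
        self.private_key = b"\x01" * 32  # placeholder bytes, not a real key

    def end_session(self) -> None:
        self.private_key = None  # drops the reference; real zeroisation is AZ-318's job


def should_mark_uploaded(status: str) -> bool:
    # I-3: queued / duplicate / superseded are marked uploaded; rejected stays pending.
    return status in {"queued", "duplicate", "superseded"}


def upload_with_gate(
    on_ground: bool,
    keys: FakeKeyManager,
    post_batch: Callable[[], list[str]],
) -> list[str]:
    # I-1: check the gate before any C6 read or network egress.
    if not on_ground:
        raise FlightStateNotOnGroundError("flight_state is not ON_GROUND")
    keys.begin_session()
    try:
        return post_batch()  # may raise, for example on a network drop
    finally:
        keys.end_session()  # I-4: runs on success and on failure
```

The `try`/`finally` placement is the point of the sketch: the key is released on every exit path, which is what the `signing-key-zeroised-on-failure` case would need to observe.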
+ +## Non-Goals + +- Not covered: airborne or in-flight uploads (RESTRICT-SAT-1 forbids them; airborne process cannot import this module per ADR-004). +- Not covered: orchestration of when the operator runs F10 — owned by C12. +- Not covered: tile downloads from `satellite-provider` — owned by `TileDownloader` (separate contract). +- Not covered: parent-suite voting / trust-promotion of uploaded tiles — owned by D-PROJ-2 design task #2 (`satellite-provider`). +- Not covered: HSM / TPM-backed key storage — out of scope this cycle (in-memory key with zeroisation). +- Not covered: mid-upload key rotation — one key per session. +- Not covered: idempotent retry across partial-success batches — separate task in this epic decorates this contract. + +## Versioning Rules + +- **Breaking changes** (renamed method, removed required field, changed return type, changed signature contract) require a major version bump. C12 is the sole consumer today; coordinate via Choose A/B/C/D when bumping. +- **Non-breaking additions** (new optional field on the report, new `IngestStatus` enum value the consumer already tolerates via `_ = status`) require a minor version bump. 
+ +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| upload-happy-path | 50 pending tiles, ON_GROUND, parent-suite returns 202 with all `queued` | `UploadBatchReport.outcome = success`; all 50 marked `uploaded` in C6; signature verifies on each | C11-IT-03 | +| flight-state-blocks | `FlightStateSource` returns `IN_FLIGHT` | `FlightStateNotOnGroundError`; zero C6 reads; zero POSTs | C11-IT-04 | +| signature-rejected | Parent suite returns `rejected` for 1 tile with reason `"invalid signature"` | `PerTileStatus.status = rejected`; `outcome = partial`; FDR `c11.upload.signature_rejected` emitted; the tile NOT marked uploaded | I-5 | +| duplicate-acknowledged | Parent suite returns `duplicate` for 5 tiles (already ingested in a prior batch) | All 5 marked `uploaded`; `outcome = success` | I-3 | +| signing-key-zeroised | Run a successful upload, then assert the AZ-318 manager's `_private_key is None` | Always zeroised; FDR `c11.upload.session.key.zeroised` recorded | I-4 | +| signing-key-zeroised-on-failure | Network drop mid-batch raises `SatelliteProviderError`, then assert key zeroised | Always zeroised even on failure | I-4 | +| empty-pending-set | No pending tiles | `outcome = success` with empty `per_tile_status`; zero POSTs; zero key generation | edge case | +| public-key-in-fdr-before-first-post | Capture FDR records | `kind="c11.upload.session.key.public"` precedes any `c11.upload.tile.*` records | safety-officer correlation | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract — produced by AZ-319 (E-C11 decomposition) | autodev | diff --git a/_docs/02_document/contracts/c12_operator_tooling/operator_command_transport.md b/_docs/02_document/contracts/c12_operator_tooling/operator_command_transport.md new file mode 100644 index 0000000..4d62bb4 --- /dev/null +++ 
b/_docs/02_document/contracts/c12_operator_tooling/operator_command_transport.md @@ -0,0 +1,106 @@ +# Contract: operator_command_transport + +**Component**: c12_operator_tooling +**Producer task**: AZ-330 — `_docs/02_tasks/todo/AZ-330_c12_operator_reloc_service.md` +**Consumer tasks**: TBD — a future E-C8 (AZ-261) task implements `MavlinkOperatorCommandTransport` against pymavlink +**Version**: 1.0.0 +**Status**: frozen +**Last Updated**: 2026-05-10 + +## Purpose + +Defines the operator-workstation ↔ companion command channel for AC-3.4 operator-relocalization. C12 owns the Protocol shape; E-C8 (AZ-261) ships the pymavlink-backed concrete implementation that encodes the hint into a MAVLink message and transmits it over the GCS link to the airborne companion. Decoupling the two sides through this Protocol prevents C12 from having to know MAVLink details, and prevents E-C8 from having to know operator-tool internals — they meet at this contract. + +## Shape + +### DTOs + +```python +@dataclass(frozen=True) +class LatLonAlt: + latitude_deg: float # -90 ≤ value ≤ 90 + longitude_deg: float # -180 < value ≤ 180 + altitude_m: float # WGS84 ellipsoidal height; no documented bound + # If shared_helpers/wgs_converter.md already defines LatLonAlt, this contract REUSES that definition. The shape above is the canonical fallback if no shared definition exists. + +@dataclass(frozen=True) +class ReLocHint: + approximate_position_wgs84: LatLonAlt # operator's best guess of current aircraft position + confidence_radius_m: float # > 0; operator's uncertainty radius around the position + reason: str # non-empty; free-text operator note for forensics + # Validates `confidence_radius_m > 0` and `reason != ""` in __post_init__. 
+``` + +| Field | Type | Required | Description | Constraints | +|-------|------|----------|-------------|-------------| +| `LatLonAlt.latitude_deg` | `float` | yes | WGS84 latitude in degrees | `-90 ≤ x ≤ 90` | +| `LatLonAlt.longitude_deg` | `float` | yes | WGS84 longitude in degrees | `-180 < x ≤ 180` | +| `LatLonAlt.altitude_m` | `float` | yes | WGS84 ellipsoidal altitude in metres | no bound | +| `ReLocHint.approximate_position_wgs84` | `LatLonAlt` | yes | Operator's best guess | per `LatLonAlt` constraints | +| `ReLocHint.confidence_radius_m` | `float` | yes | Operator's uncertainty radius | `> 0` strictly | +| `ReLocHint.reason` | `str` | yes | Free-text operator note | non-empty; no length cap; no charset restriction | + +### Protocol + +```python +@runtime_checkable +class OperatorCommandTransport(Protocol): + def send_reloc_hint(self, hint: ReLocHint) -> None: ... +``` + +| Name | Signature | Throws / Errors | Blocking? | +|------|-----------|-----------------|-----------| +| `send_reloc_hint` | `(hint: ReLocHint) -> None` | `GcsLinkError` (any failure to transmit: signal lost, link timeout, framing error, mavlink encode error) | sync | + +### Errors + +```python +class GcsLinkError(Exception): + reason: str # operator-friendly one-line description (e.g. "link signal lost") + wrapped_exception_repr: str | None # repr() of the underlying transport exception, if any + remediation: str = "Check GCS link signal strength; re-issue the re-loc command when the link recovers." +``` + +The transport implementation MUST raise `GcsLinkError` (and only `GcsLinkError`) on any failure to transmit. C12's `OperatorReLocService` catches and re-raises with C12-specific context. + +## Invariants + +- **INV-1 (validation already done)**: when `send_reloc_hint(hint)` is called, the `hint` is already validated (`confidence_radius_m > 0`, `reason != ""`, lat/lon in range). 
The transport MAY skip re-validation but MUST NOT perform a different validation pass that rejects values C12 considers valid. +- **INV-2 (single transmission attempt)**: `send_reloc_hint` MUST attempt transmission exactly once. The transport MUST NOT retry internally — best-effort semantics per description.md § 7 are enforced at the C12 / operator level, not at the transport layer. +- **INV-3 (no return value contract)**: `send_reloc_hint` returning normally means the transport believes the hint left the operator workstation; it does NOT mean the airborne companion received or processed it (no ack mechanism in v1.0.0). +- **INV-4 (preserve `reason` byte-for-byte)**: the transport MUST encode `reason` such that the airborne side decodes the identical UTF-8 byte sequence, up to the MAVLink message's documented field-length limit. If `reason` exceeds the MAVLink message capacity, the transport MUST raise `GcsLinkError(reason="reason field exceeds MAVLink encoding capacity: {actual} bytes > {limit} bytes")` rather than silently truncate. +- **INV-5 (no side effects beyond transmission)**: `send_reloc_hint` MUST NOT write to the local filesystem, emit FDR records, or change any operator-workstation state beyond the network transmission. C12 owns side effects (FDR record, log). +- **INV-6 (thread-safety)**: a single `OperatorCommandTransport` instance MAY be called from at most one thread per session. Concurrent calls from multiple threads are undefined behaviour and MAY raise `GcsLinkError(reason="concurrent send")`. + +## Non-Goals + +- **Acknowledgement / round-trip** — v1.0.0 is fire-and-forget. A future v2.0.0 may add an ack channel via FDR + STATUSTEXT; out of scope here. +- **Encryption / signing of the re-loc payload** — covered by the MAVLink 2.0 message-signing on the wired channel per ADR-009 / D-C8-9; this Protocol does not re-specify it. +- **Multiple companions** — one transport instance addresses one companion; multi-companion broadcast is out of scope. 
+- **Retry / backoff** — best-effort per description.md § 7. The operator decides when to re-issue. +- **Backpressure / flow control** — `send_reloc_hint` is sync and unbounded; if the operator issues 100 re-loc commands in 1 s, the transport sends 100 messages. The MAVLink physical layer's bandwidth is the natural bound. +- **GCS-link health probing** — this Protocol does NOT expose a `is_link_healthy()` method. Liveness is observed via `GcsLinkError` raised by `send_reloc_hint`. + +## Versioning Rules + +- **Breaking changes** (renaming `send_reloc_hint`, removing it, changing its signature, changing `ReLocHint` field types or names, removing `confidence_radius_m`, etc.) require a new major version (v2.0.0). The producer (this contract owner, AZ-330's owner) bumps the version, updates the Change Log, and notifies all consumers via the autodev tracker leftovers mechanism. +- **Non-breaking additions** (new optional kwarg with default, new method on the Protocol that consumers don't need to implement, new optional field in `ReLocHint` with a documented default) require a minor version bump (v1.1.0). Existing implementations remain valid. +- **Patch changes** (clarifying invariants, adding test cases, fixing typos) require a patch version bump (v1.0.1). +- A breaking change requires a deprecation period of at least one Plan cycle (one major release) before consumers may stop supporting the old version. 
+ +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| TC-1 valid-minimal | `ReLocHint(LatLonAlt(49.99, 36.12, 100.0), confidence_radius_m=50.0, reason="lost track at WP3")` + healthy transport | `send_reloc_hint` returns `None`; airborne side decodes `reason="lost track at WP3"` and `confidence_radius_m=50.0` byte-identical | minimal happy path; verifies INV-4 | +| TC-2 invalid-radius | `ReLocHint(..., confidence_radius_m=0.0, ...)` constructed first (raises `ValueError` at DTO `__post_init__`); the transport is NEVER called | `ValueError` at construction; transport spy shows zero calls | producer-side validation (INV-1) — transport is not the gatekeeper | +| TC-3 link-failure | Healthy hint + transport whose underlying link drops mid-encode | `send_reloc_hint` raises `GcsLinkError(reason="link signal lost", wrapped_exception_repr="...")` | INV-2 (single attempt, no internal retry); INV-3 (return semantics) | +| TC-4 reason-too-long | `ReLocHint(..., reason="x" * 10000)` against a transport whose MAVLink encoding capacity is, say, 2000 bytes | `send_reloc_hint` raises `GcsLinkError(reason="reason field exceeds MAVLink encoding capacity: 10000 bytes > 2000 bytes")` | INV-4 enforcement; no silent truncation | +| TC-5 lat-lon-out-of-range | `LatLonAlt(latitude_deg=91.0, ...)` constructed first | `ValueError` at construction; transport never reached | producer-side validation; transport never called | +| TC-6 concurrent-call | Two threads calling `send_reloc_hint` on the same instance simultaneously | EITHER both succeed in some order, OR one raises `GcsLinkError(reason="concurrent send")` | INV-6 — undefined-behaviour-with-bounds; either outcome is contract-conformant; deterministic single-threaded use is the recommended pattern | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract — frozen Protocol shape + DTO + error type + 6 test cases. 
| autodev (AZ-330 decompose) | diff --git a/_docs/02_document/contracts/c1_vio/vio_strategy_protocol.md b/_docs/02_document/contracts/c1_vio/vio_strategy_protocol.md new file mode 100644 index 0000000..a32f30a --- /dev/null +++ b/_docs/02_document/contracts/c1_vio/vio_strategy_protocol.md @@ -0,0 +1,166 @@ +# Contract: VioStrategy Protocol + +**Component**: c1_vio +**Producer task**: AZ-331 — `_docs/02_tasks/todo/AZ-331_c1_vio_strategy_protocol.md` +**Consumer tasks**: +- AZ-332 (OKVIS2 implementation — implements) +- AZ-333 (VINS-Mono implementation — implements) +- AZ-334 (KLT/RANSAC implementation — implements) +- AZ-335 (warm-start + F8 reboot recovery wiring — invokes `reset_to_warm_start`) +- E-C5 state estimator tasks under AZ-260 (consume `VioOutput`) +- E-C13 FDR writer tasks under AZ-248 (consume `VioHealth`) +- `runtime_root` composition under AZ-270 (selects strategy by config) +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +Defines the typed boundary between the on-Jetson visual / visual-inertial odometry runtime and every downstream consumer (C5 state estimator, C13 FDR, runtime_root composition). The Protocol is the single point of contact that lets ADR-001 select between three concrete strategies (OKVIS2 production-default, VINS-Mono research-only, KLT/RANSAC mandatory simple-baseline) at startup without consumers caring which is wired. Per-frame DTOs (`VioOutput`, `VioHealth`) are frozen here so C5 fusion and C13 FDR records do not drift across implementations. + +## Shape + +### Protocol surface + +The Protocol is a `typing.Protocol` subclass (PEP 544 structural typing) decorated with `@runtime_checkable`, so `isinstance` checks verify method presence only, not signatures. + +| Method | Signature | Throws / Errors | Blocking? 
| +|--------|-----------|-----------------|-----------| +| `process_frame` | `(frame: NavCameraFrame, imu: ImuWindow, calibration: CameraCalibration) -> VioOutput` | `VioInitializingError`, `VioDegradedError`, `VioFatalError` | sync (camera-ingest hot path; bound by C1-PT-01 latency budget) | +| `reset_to_warm_start` | `(hint: WarmStartPose) -> None` | `VioFatalError` (only on irrecoverable backend init failure) | sync | +| `health_snapshot` | `() -> VioHealth` | — | sync | +| `current_strategy_label` | `() -> Literal["okvis2", "vins_mono", "klt_ransac"]` | — | sync | + +### DTOs + +`NavCameraFrame`, `ImuSample`, `ImuWindow`, `ImuBias`, `CameraCalibration` are owned by `gps_denied_onboard._types.nav` (AZ-263). This contract owns `WarmStartPose`, `VioOutput`, `VioHealth`, `FeatureQuality`, and the `VioState` enum, all `@dataclass(frozen=True)` (or `enum.Enum`). `VioOutput` and `VioHealth` are placed in `_types/nav.py` for cross-component access; the `VioStrategy` Protocol itself lives in `components/c1_vio/interface.py`. 
+ +```python +from dataclasses import dataclass +from enum import Enum +from typing import Protocol, Literal, runtime_checkable +from gps_denied_onboard._types.nav import ( + NavCameraFrame, ImuWindow, ImuBias, CameraCalibration, +) +from gps_denied_onboard._types.geom import SE3, Vector3, Matrix6 + + +class VioState(str, Enum): + INIT = "init" + TRACKING = "tracking" + DEGRADED = "degraded" + LOST = "lost" + + +@dataclass(frozen=True) +class WarmStartPose: + body_T_world: SE3 + velocity_b: Vector3 + bias: ImuBias + captured_at_ns: int # monotonic_ns when the hint was produced + + +@dataclass(frozen=True) +class FeatureQuality: + tracked: int + new: int + lost: int + mean_parallax: float + mre_px: float + + +@dataclass(frozen=True) +class VioOutput: + frame_id: str # echoes NavCameraFrame.frame_id + relative_pose_T: SE3 + pose_covariance_6x6: Matrix6 + imu_bias: ImuBias + feature_quality: FeatureQuality + emitted_at_ns: int + + +@dataclass(frozen=True) +class VioHealth: + state: VioState + consecutive_lost: int + bias_norm: float + + +@runtime_checkable +class VioStrategy(Protocol): + def process_frame( + self, + frame: NavCameraFrame, + imu: ImuWindow, + calibration: CameraCalibration, + ) -> VioOutput: ... + + def reset_to_warm_start(self, hint: WarmStartPose) -> None: ... + + def health_snapshot(self) -> VioHealth: ... + + def current_strategy_label(self) -> Literal["okvis2", "vins_mono", "klt_ransac"]: ... 
+``` + +### Error hierarchy + +All under `gps_denied_onboard.components.c1_vio.errors`: + +``` +VioError (base; subclasses Exception) +├── VioInitializingError (state == INIT; no VioOutput emitted; C5 falls back to FC IMU prior) +├── VioDegradedError (state == DEGRADED; output IS still emitted with inflated covariance — see Invariants) +└── VioFatalError (state == LOST after configurable consecutive frames; AC-5.2 fallback path) +``` + +`VioDegradedError` is documented but is **not raised** during normal `process_frame` returns when degraded — degraded operation returns a `VioOutput` with inflated covariance and `VioHealth.state = DEGRADED`. The error type exists for the rare case where degradation transitions to fatality and consumer wrappers want to catch the family. + +### Composition-root selection + +```python +def build_vio_strategy(config: Config, *, fdr_client: FdrClient) -> VioStrategy: ... +``` + +Lives at `src/gps_denied_onboard/runtime_root/vio_factory.py`. Selects the strategy by `config.vio.strategy` (`okvis2 | vins_mono | klt_ransac`) and respects compile-time `BUILD_*` gating (`BUILD_OKVIS2`, `BUILD_VINS_MONO`, `BUILD_KLT_RANSAC`). Requesting a strategy whose `BUILD_*` flag is OFF raises `StrategyNotAvailableError` at composition time (NOT at first frame). Lazy-imports the concrete strategy module so a Tier-0 workstation build without OKVIS2 native libs still composes successfully when only KLT/RANSAC is requested. + +## Invariants + +- **6×6 SPD covariance always returned**: `pose_covariance_6x6` is symmetric and positive-definite for every `VioOutput`. Implementations MUST NOT return a "tightened" covariance (smaller Frobenius norm) during a degradation event; honest covariance is the safety floor for AC-NEW-4 and AC-NEW-7. A test (covariance-monotonicity contract test, deferred to Step 9 / E-BBT) asserts this across all three strategies. +- **`frame_id` echo**: `VioOutput.frame_id` equals the input `NavCameraFrame.frame_id`. 
C5 relies on this for time-aligned factor insertion. +- **Single-threaded by contract**: each `VioStrategy` instance is bound to one writer thread (the camera ingest thread). Concurrent calls to `process_frame` on the same instance are undefined behaviour. The composition root binds one instance per ingest thread. +- **`reset_to_warm_start` is destructive**: clears the strategy's keyframe window, IMU integration state, and feature track buffer; subsequent `process_frame` calls re-initialise from the hint. Calling `reset_to_warm_start` mid-flight is allowed (F8 reboot recovery) but must not be issued concurrently with a `process_frame` call on the same instance. +- **`current_strategy_label()` is constant per instance**: returns the same string for the lifetime of the instance and matches `config.vio.strategy` exactly. The label is FDR-stamped on every `VioHealth` event for AC-NEW-3 audit. +- **No ambient state**: implementations MUST NOT read environment variables, wall clock, or filesystem inside `process_frame`; calibration arrives via constructor + per-call argument; logging uses the injected logger only. +- **Error envelope is closed**: `process_frame` raises only members of `VioError` (the family). Lower-level exceptions from OpenCV / OKVIS2 / VINS-Mono / GTSAM MUST be caught and rewrapped. + +## Non-Goals + +- IMU preintegration mathematics — owned by AZ-276 / `helpers.imu_preintegrator`. Strategies feed `ImuWindow` to the helper; they do NOT implement preintegration internally. +- Bias estimation policy — each strategy decides when to update its bias; the contract does not prescribe a schedule. +- WarmStartPose persistence (write to disk after takeoff, read after F8 reboot) — owned by the warm-start + F8 reboot recovery wiring task in this same epic. The contract here only defines the in-memory DTO and the `reset_to_warm_start` method. +- C5 fusion semantics — owned by E-C5; this contract only delivers `VioOutput`. 
+- Multi-camera strategies — out of scope this cycle (single nav-camera per ADR / RESTRICT-UAV-3). + +## Versioning Rules + +- **Breaking changes** (method renamed/removed, parameter type changed, return type changed, invariant relaxed) require a new major version + a deprecation pass through every consumer task in the header. +- **Non-breaking additions** (new optional method, new diagnostic accessor that does not mutate state, new `VioState` enum variant added at the end) require a minor version bump. +- The `VioState` enum is treated as a closed set for switch-style consumer code (C5 fusion); adding a new variant is a minor bump but consumers MUST handle the new state defensively (default branch → treat as LOST). + +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| protocol-conformance | three concrete strategy classes | `isinstance(impl, VioStrategy)` returns True for each | Catches drift between impl and Protocol surface | +| frozen-dto-mutation | a constructed `VioOutput` instance and an attempt to set `.relative_pose_T` | `dataclasses.FrozenInstanceError` raised | Confirms DTOs are immutable | +| error-family-catchable | each of `VioInitializingError`, `VioDegradedError`, `VioFatalError` raised | `except VioError` catches all three; `except ValueError` does NOT | Confirms error envelope | +| factory-build-flag-respected | `config.vio.strategy = "vins_mono"` and `BUILD_VINS_MONO=OFF` | `StrategyNotAvailableError` raised at composition; `sys.modules` has no `vins_mono` entry | Confirms lazy-import gating | +| current-strategy-label-exact-match | each strategy constructed via factory with matching config | `current_strategy_label()` returns the literal config value | AC-NEW-3 audit gate | +| frame-id-echoed | a `NavCameraFrame` with a known UUID fed into `process_frame` | the returned `VioOutput.frame_id` equals the input UUID | C5 alignment invariant | +| covariance-spd | inspect 100 emitted 
`VioOutput.pose_covariance_6x6` matrices | every matrix is symmetric and positive-definite (eigenvalues > 0) | AC-1.4 floor | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract derived from `_docs/02_document/components/01_c1_vio/description.md` § 2 + AZ-254 epic child issue #1 | autodev decompose Step 2 | diff --git a/_docs/02_document/contracts/c2_5_rerank/rerank_strategy_protocol.md b/_docs/02_document/contracts/c2_5_rerank/rerank_strategy_protocol.md new file mode 100644 index 0000000..611378b --- /dev/null +++ b/_docs/02_document/contracts/c2_5_rerank/rerank_strategy_protocol.md @@ -0,0 +1,183 @@ +# Contract: `ReRankStrategy` Protocol + +**Owner**: c2_5_rerank (epic AZ-256 / E-C2.5) +**Producer task**: AZ-342 (`ReRankStrategy` Protocol + factory + composition) +**Consumer tasks**: AZ-343 (`InlierCountReRanker` impl); downstream c3_matcher (epic AZ-257 / E-C3 — TBD at AZ-257 decompose time) which consumes `RerankResult` +**Version**: 1.0.0 +**Status**: draft, awaiting AZ-342 implementation +**Last Updated**: 2026-05-10 +**Module-layout home**: `src/gps_denied_onboard/components/c2_5_rerank/interface.py` (Protocol), `src/gps_denied_onboard/components/c2_5_rerank/__init__.py` (re-exports), `src/gps_denied_onboard/runtime_root/rerank_factory.py` (factory) + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract — Protocol surface, DTOs, error hierarchy, factory signature, 8 invariants, drop-and-continue contract (INV-8) | autodev / decompose Step 2 | + +## Purpose + +Defines the public interface for the C2.5 inlier-based re-rank strategy: `rerank` consumes a C2 `VprResult` (top-K=10) and produces a `RerankResult` (top-N=3) ranked by single-pair LightGlue inlier count against each candidate's tile pixels. 
The re-rank step is the architectural boundary between cheap descriptor retrieval (C2) and expensive cross-domain matching (C3) — it pays a small extra GPU cost so C3 only operates on the most promising candidates. + +`ReRankStrategy` is a Strategy interface with a single concrete implementation today (`InlierCountReRanker`). Future re-rank algorithms (e.g., learned re-rankers) can be added as additional implementations behind the same interface, gated by `BUILD_RERANK_` build flags per ADR-002. + +The shared `LightGlueRuntime` helper (AZ-278 / `helpers.lightglue_runtime`) is constructor-injected — neither C2.5 nor C3 owns the helper. This resolves R14 (apparent C2.5↔C3 cycle) by making both components sibling consumers of the helper. + +## Public API + +### Protocol: `ReRankStrategy` + +```python +from typing import Protocol, runtime_checkable +from gps_denied_onboard._types import NavCameraFrame, CameraCalibration, VprResult, RerankResult + + +@runtime_checkable +class ReRankStrategy(Protocol): + """Single-camera re-rank strategy. Stateless per-frame; the only persistent state is the constructor-injected `LightGlueRuntime` helper handle and the `TileStore` Public API reference.""" + + def rerank( + self, + frame: NavCameraFrame, + vpr_result: VprResult, + n: int, + calibration: CameraCalibration, + ) -> RerankResult: + """Re-rank the top-K candidates from `vpr_result` down to top-N by single-pair LightGlue inlier count. + + For each candidate in `vpr_result.candidates`: + 1. Fetch tile pixels via `TileStore.get_tile_pixels(candidate.tile_id)`. + 2. Run a single-pair LightGlue forward via the shared `LightGlueRuntime` (frame ↔ tile). + 3. Record the inlier count. + Sort candidates descending by inlier count; return the top-N as a `RerankResult`. 
+ + Drop-and-continue semantics: if a per-candidate failure occurs (`TileFetchError` from C6 OR `RerankBackboneError` from LightGlue), the candidate is dropped from the rerank set and a per-candidate ERROR log + FDR record is emitted. Sorting and top-N selection proceed against the surviving candidates. + + If FEWER than N candidates survive, the strategy returns `RerankResult` with whatever it has (length 1..N-1); C3 proceeds with reduced N. If ZERO candidates survive, the strategy raises `RerankAllCandidatesFailedError`; downstream C5 falls back to VIO-only with provenance `visual_propagated` (AC-3.5). + + Raises: + RerankAllCandidatesFailedError: every candidate's LightGlue or tile-fetch failed; no rerank result possible. + """ + ... +``` + +**Invariants** (every implementation MUST guarantee): + +1. **Single-threaded by contract** — each instance is bound to one ingest thread (composition root enforces). The shared `LightGlueRuntime` requires serial access (per description.md § 7); concurrent `rerank` calls on a single instance race the GPU stream. +2. **Stateless per-frame** — no implicit dependency on prior frames; reordering `rerank` calls (which the live path NEVER does, but tests do) MUST yield identical `RerankResult` content (same surviving candidates in same order, given same inputs). +3. **Top-N ordering by inlier count descending** — `RerankResult.candidates` is sorted descending by `inlier_count`. Ties broken deterministically by `descriptor_distance` ascending (carried forward from C2). Stable, reproducible across runs. +4. **`RerankResult.candidates` length is bounded** — `0 < len <= n` when returned (zero raises `RerankAllCandidatesFailedError`); never exceeds `n`; never exceeds `len(vpr_result.candidates)`. +5. **`descriptor_distance` is carried forward unchanged** — re-rank does NOT compute a new descriptor distance; the C2-stage value is preserved on every surviving `RerankCandidate` for FDR provenance. +6. 
**`tile_pixels_handle` is a reference, NOT a copy** — `RerankCandidate.tile_pixels_handle` is the same handle returned by `TileStore.get_tile_pixels` (page-cache backed). Copying tile pixels at re-rank time would defeat AC-4.1's latency budget. +7. **Deterministic per (frame, vpr_result, corpus, helper) tuple** — given identical inputs and an identical `LightGlueRuntime` helper state, two calls return bit-identical `RerankResult` (same inlier counts, same ordering, same surviving candidates). +8. **Drop-and-continue is the ONLY per-candidate failure mode** — a per-candidate exception NEVER propagates out of `rerank` unless every candidate fails. This is the contract that lets C3 absorb partial failures gracefully. + +### DTOs (in `_types/rerank.py`) + +```python +from dataclasses import dataclass +from uuid import UUID +import numpy as np + + +@dataclass(frozen=True, slots=True) +class RerankCandidate: + """One re-rank survivor. Carries the C2-stage descriptor_distance forward for FDR provenance plus the new inlier_count from single-pair LightGlue.""" + + tile_id: tuple # composite (zoomLevel, lat, lon); see C6 TileRecord + inlier_count: int # single-pair LightGlue inliers; > 0 for any survivor + descriptor_distance: float # carried forward from C2's VprCandidate + descriptor_dim: int # carried forward from C2 for sanity assertions + tile_pixels_handle: object # opaque page-cache-backed pixel reference; see C6 TileStore contract + + +@dataclass(frozen=True, slots=True) +class RerankResult: + """Top-N survivors from `ReRankStrategy.rerank`. 
Consumed by C3 CrossDomainMatcher.""" + + frame_id: UUID + candidates: list[RerankCandidate] # 0 < len <= n; sorted descending by inlier_count, ties broken by descriptor_distance ascending + reranked_at: int # monotonic_ns + rerank_label: str # non-empty; matches BUILD_RERANK_ lowercase (e.g., "inlier_count") + candidates_input: int # len(vpr_result.candidates) at entry — for FDR observability + candidates_dropped: int # candidates_input - len(candidates) +``` + +### Error Hierarchy (in `c2_5_rerank/errors.py`) + +```python +class RerankError(Exception): + """Base for all C2.5 re-rank errors. Caught at the runtime root; downstream effect: C5 falls back to VIO-only with provenance `visual_propagated` (AC-3.5) only when `RerankAllCandidatesFailedError` is raised.""" + + +class RerankBackboneError(RerankError): + """Per-candidate LightGlue forward-pass failure (CUDA OOM, TRT engine deserialize mismatch). Logged at ERROR; per-occurrence FDR record. Drop-and-continue: the candidate is dropped from the rerank set, NOT the whole batch.""" + + +class RerankAllCandidatesFailedError(RerankError): + """Every candidate's LightGlue or tile fetch failed; zero survivors. Logged at ERROR; per-occurrence FDR record `kind=rerank.all_failed`. C5 falls back to VIO-only.""" +``` + +`TileFetchError` is owned by C6 (`components.c6_tile_cache`); C2.5 catches it inside the per-candidate loop and treats it identically to `RerankBackboneError` (drop-and-continue + ERROR log + FDR record `kind=rerank.tile_fetch_error`). 
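The drop-and-continue rule and the ordering invariants can be sketched together as follows. This is a simplified illustration under stated assumptions, not the `InlierCountReRanker` implementation: the error classes are local stand-ins for the ones named above, `ScoredCandidate` and `rerank_sketch` are hypothetical names, the `count_inliers` callable stands in for the tile fetch plus LightGlue pass, and the ERROR log and FDR record are reduced to a comment.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


class TileFetchError(Exception):
    """Local stand-in for the C6-owned error."""


class RerankBackboneError(Exception):
    """Local stand-in for the per-candidate LightGlue failure."""


class RerankAllCandidatesFailedError(Exception):
    """Local stand-in: raised when zero candidates survive."""


@dataclass(frozen=True)
class ScoredCandidate:
    tile_id: tuple
    inlier_count: int
    descriptor_distance: float  # carried forward unchanged from C2


def rerank_sketch(
    candidates: Iterable[tuple[tuple, float]],
    count_inliers: Callable[[tuple], int],
    n: int,
) -> list[ScoredCandidate]:
    """Score (tile_id, descriptor_distance) pairs and keep the top n survivors."""
    survivors = []
    for tile_id, distance in candidates:
        try:
            inliers = count_inliers(tile_id)  # assumed: fetch tile + match
        except (TileFetchError, RerankBackboneError):
            # A real implementation would emit an ERROR log + FDR record here.
            continue  # drop this candidate, keep processing the rest
        survivors.append(ScoredCandidate(tile_id, inliers, distance))
    if not survivors:
        raise RerankAllCandidatesFailedError("zero survivors")
    # Inlier count descending; ties broken by descriptor distance ascending.
    survivors.sort(key=lambda c: (-c.inlier_count, c.descriptor_distance))
    return survivors[:n]
```

Sorting on the `(-inlier_count, descriptor_distance)` key gives both the primary order and the tie-break in one pass, and slicing after the sort means fewer than `n` survivors are returned as-is rather than treated as an error.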
+ +## Composition-Root Factory + +```python +# src/gps_denied_onboard/runtime_root/rerank_factory.py + +from gps_denied_onboard.config import Config +from gps_denied_onboard.components.c2_5_rerank import ReRankStrategy +from gps_denied_onboard.components.c6_tile_cache import TileStore +from gps_denied_onboard.helpers.lightglue_runtime import LightGlueRuntime + + +def build_rerank_strategy( + config: Config, + tile_store: TileStore, + lightglue_runtime: LightGlueRuntime, +) -> ReRankStrategy: + """Composition-root factory. Reads `config.rerank.strategy` (currently only `"inlier_count"` is defined; future strategies extend the table); lazy-imports the concrete strategy module gated by its CMake `BUILD_RERANK_` flag; refuses to instantiate a strategy whose flag is OFF (raises `ConfigurationError` pointing at the offending strategy name + missing flag). + + Strategy resolution table: + + | config.rerank.strategy | Implementation | Module | Build flag | + |------------------------|-----------------------|---------------------------------------------------|---------------------------| + | "inlier_count" | InlierCountReRanker | components.c2_5_rerank.inlier_based_reranker | BUILD_RERANK_INLIER_COUNT | + + The shared `LightGlueRuntime` is constructor-injected; the factory does NOT own its lifecycle. The runtime root constructs ONE `LightGlueRuntime` instance and passes the same reference to both this factory (for C2.5) and the C3 matcher factory. + + Returns a fully-constructed strategy ready for `rerank` invocation. The caller (runtime root) is responsible for binding the instance to one ingest thread. + """ + ... +``` + +## Versioning + +- The `ReRankStrategy` Protocol's method signature is part of the cross-component public API. Any change (new method, removed method, parameter rename, return-type change) is a major bump and requires updating every concrete implementation in lockstep. 
+- DTO field additions are minor (frozen dataclasses with new optional fields default to None); field removals are major. +- The drop-and-continue contract (Invariant 8) is non-negotiable; changing it would break C3's tolerance of partial input. + +## Test Cases (protocol conformance — runs against every concrete strategy) + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| INV-1 (single-thread) | Composition root rejects multi-thread binding | `RuntimeError` on second binding attempt | +| INV-2 (stateless) | `rerank(frame_A)` then `rerank(frame_B)` then `rerank(frame_A)` again with the same `vpr_result` | First and third call return identical `RerankResult` (same surviving candidates, same order) | +| INV-3 (top-N order) | Mixed inlier counts (e.g., [412, 198, 287, 0, 153, ...]) on K=10 input with N=3 | Returned candidates sorted descending by inlier_count: [412, 287, 198] | +| INV-3 (tie-break) | Two candidates with identical inlier_count but different descriptor_distance | Lower descriptor_distance ranked first | +| INV-4 (length bound) | N=3 with K=10 input, all 10 succeeding | `len(result.candidates) == 3` | +| INV-4 (length under failure) | N=3 with K=10 input, 8 candidates fail | `len(result.candidates) == 2`; `candidates_dropped == 8` | +| INV-5 (descriptor_distance carried) | Each survivor's `descriptor_distance` | Equals the C2-stage value from `vpr_result.candidates[i].descriptor_distance` | +| INV-6 (handle is reference) | Mutate the underlying tile pixel buffer and re-read via `tile_pixels_handle` | Mutation visible (proves no copy) | +| INV-7 (deterministic) | `rerank(same inputs)` × 3 | All three return bit-identical `RerankResult` (same inlier_counts, same ordering, same surviving tile_ids) | +| INV-8 (drop-and-continue) | One candidate raises `RerankBackboneError`; nine succeed | Result has 3 survivors from the surviving 9; ONE ERROR log per failed candidate; the success path is NOT interrupted | +| 
AC-2.5-IT-01 (top-1 promotion rate) | `rerank` against fixture corpus where C2 top-1 was correct | Top-1 promotion rate ≥ 0.98 (C2's top-1 is preserved as result top-1 in ≥ 98% of frames) | +| AC-2.5-IT-02 (drop-and-continue smoke) | Inject `RerankBackboneError` for one candidate | Drop semantics hold; surviving candidates re-ranked | +| AC-2.5-IT-03 (helper serial-access) | Two `rerank` calls on the same instance from a single thread | Second call sees no `LightGlueRuntime` state corruption from the first; results bit-identical to single-threaded baseline | +| All-fail | Inject `RerankBackboneError` for every candidate | `RerankAllCandidatesFailedError` raised; per-candidate ERROR logs + final `kind=rerank.all_failed` FDR record | + +## Open Questions / Risks + +- **Risk: the shared `LightGlueRuntime` helper's serial-access invariant must be enforced upstream** — by the composition root binding both C2.5 and C3 to the same single ingest thread. *Mitigation*: AZ-278 (helper) ships with an internal assertion on each call that the calling thread matches the binding thread; AZ-342 (this Protocol task) consumes the helper as a constructor dependency and does NOT need to add a per-call check. +- **Risk: `tile_pixels_handle` semantics drift between C6's `TileStore` Public API and C2.5's expectation** — C2.5 expects a page-cache-backed reference, NOT a copy; C6's `get_tile_pixels` MUST guarantee that. *Mitigation*: cross-referenced in AZ-303 (`tile_store` contract) — the contract test for `get_tile_pixels` asserts the returned object is the same identity across two calls within a TTL window. +- **Risk: `n` parameter clamping vs. epic spec** — the epic fixes K=10, N=3; the Protocol leaves `n` parametric for testability. *Mitigation*: composition root binds `n=3` from `config.rerank.top_n` (default 3); the Protocol accepts arbitrary `n` so tests can use smaller values. 
+- **Risk: drop-and-continue can mask a backbone-wide regression** — if every flight has 3/10 candidates failing silently, recall degrades without any single failure being investigated. *Mitigation*: `RerankResult.candidates_dropped` is published per-frame; an FDR aggregate alert (post-flight tooling) flags flights with `candidates_dropped` p95 > 1. diff --git a/_docs/02_document/contracts/c2_vpr/vpr_strategy_protocol.md b/_docs/02_document/contracts/c2_vpr/vpr_strategy_protocol.md new file mode 100644 index 0000000..84be2a3 --- /dev/null +++ b/_docs/02_document/contracts/c2_vpr/vpr_strategy_protocol.md @@ -0,0 +1,214 @@ +# Contract: `VprStrategy` Protocol + `BackbonePreprocessor` Protocol + +**Owner**: c2_vpr (epic AZ-255 / E-C2) +**Producer task**: AZ-336 (`VprStrategy` Protocol + factory + composition) +**Consumer tasks**: AZ-337 (UltraVPR), AZ-338 (NetVLAD baseline), AZ-339 (MegaLoc + MixVPR), AZ-340 (SelaVPR + EigenPlaces + SALAD), AZ-341 (FAISS HNSW retrieve wiring), and downstream c2_5_rerank (AZ-256 / E-C2.5) +**Module-layout home**: `src/gps_denied_onboard/components/c2_vpr/interface.py` (Protocols), `src/gps_denied_onboard/components/c2_vpr/__init__.py` (re-exports), `src/gps_denied_onboard/runtime_root/vpr_factory.py` (factory) +**Status**: draft, awaiting AZ-336 implementation + +## Purpose + +Defines the public interface for every C2 VPR backbone strategy: `embed_query` produces a `VprQuery` from a `NavCameraFrame`, `retrieve_topk` runs the FAISS HNSW lookup against the C6-owned descriptor index, and `descriptor_dim` advertises the embedding dimensionality so the composition root can pre-validate index/strategy compatibility. Every concrete backbone (UltraVPR, NetVLAD, MegaLoc, MixVPR, SelaVPR, EigenPlaces, SALAD) implements this Protocol; the composition root selects exactly one at startup based on `config.vpr.strategy` and refuses to wire a strategy whose `BUILD_VPR_` flag is OFF (ADR-002 + ADR-009). 
+ +`BackbonePreprocessor` is the C2-internal helper Protocol for resize/crop/normalise per backbone's input contract. It lives next to the strategy (NOT in `helpers/`) because preprocessing parameters are tightly coupled to the backbone weights; sharing across backbones is forbidden — each strategy owns its own concrete preprocessor. + +## Public API + +### Protocol: `VprStrategy` + +```python +from typing import Protocol, runtime_checkable +from gps_denied_onboard._types import NavCameraFrame, CameraCalibration, VprQuery, VprResult + + +@runtime_checkable +class VprStrategy(Protocol): + """Single-camera visual place recognition strategy. Stateless per-frame; the only persistent state is the loaded backbone weights and the C6-owned FAISS index handle (passed in via constructor).""" + + def embed_query( + self, + frame: NavCameraFrame, + calibration: CameraCalibration, + ) -> VprQuery: + """Run the backbone forward pass on the provided frame and return a `VprQuery` carrying the descriptor embedding. + + Calibration is consumed for input preprocessing (resize / crop / normalise per the backbone's input contract — owned by the strategy's internal `BackbonePreprocessor`). + + Raises: + VprBackboneError: backbone forward pass failed (CUDA OOM, TRT engine deserialize mismatch, etc.). + """ + ... + + def retrieve_topk(self, query: VprQuery, k: int) -> VprResult: + """Run the FAISS HNSW top-K lookup against the corpus descriptor index. + + The strategy holds the FAISS index handle (constructor-injected from C6's `TileStore` Public API). Top-K candidates are returned in ascending `descriptor_distance` order. + + Raises: + IndexUnavailableError: FAISS index handle invalid (e.g., post-F8 reboot before warm-up, or out-of-band file replacement caught by the underlying mmap defence). + VprBackboneError: descriptor distance computation failed unexpectedly. + """ + ... 
+ + def descriptor_dim(self) -> int: + """Backbone embedding dimensionality (e.g., 512 for UltraVPR, 4096 for NetVLAD-VGG16). Stable for the strategy's lifetime; consumed by the composition root to pre-validate index compatibility (the C6 index file declares its own dim in its sidecar; mismatch → `ConfigurationError` at startup, NOT at first frame).""" + ... +``` + +**Invariants** (every implementation MUST guarantee): + +1. **Single-threaded by contract** — each instance is bound to one ingest thread (composition root enforces; concurrent `embed_query` calls on a single instance race the GPU stream). +2. **Stateless per-frame** — no implicit dependency on prior frames; reordering `embed_query` calls (which the live path NEVER does, but tests do) MUST yield identical embeddings. +3. **L2-normalised embeddings** — the `VprQuery.embedding` MUST be L2-normalised (via `helpers.descriptor_normaliser`) so cosine similarity aligns with Euclidean distance for FAISS HNSW lookup. Strategies that produce raw embeddings (e.g., NetVLAD) MUST normalise before returning. +4. **`retrieve_topk` returns exactly `k` candidates, sorted ascending by `descriptor_distance`** — never fewer, never more, never unordered. If the corpus has fewer than `k` tiles, the strategy raises `IndexUnavailableError` (production deployments stage corpora with ≥1000 tiles; `k=10`). +5. **`backbone_label` is non-empty** — every `VprResult` carries the strategy's name (e.g., `"ultra_vpr"`, `"net_vlad"`) for FDR provenance. This MUST match the `BUILD_VPR_` flag's lowercase form. +6. **`embed_query` and `retrieve_topk` are deterministic** — given the same frame + calibration + corpus, identical embeddings and identical top-K candidates (in identical order). This is required for the C2-IT-02 invariant test and post-flight forensics. +7. **`descriptor_dim()` is stable for the strategy's lifetime** — never changes after construction; the value reflects the loaded weights' output dim, NOT a config knob. 
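Invariant 3 rests on a standard identity: for unit-length vectors, squared Euclidean distance equals `2 - 2 * cosine_similarity`, so ranking by one is equivalent to ranking by the other. A minimal pure-Python sketch of that identity follows; the function names are hypothetical and do not reflect the actual `helpers.descriptor_normaliser` API, which is not shown in this contract.

```python
import math


def l2_normalise(vector: list[float]) -> list[float]:
    """Scale a vector to unit L2 norm; a zero vector cannot be normalised."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vector]


def dot(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit-length."""
    return sum(x * y for x, y in zip(a, b))


def squared_euclidean(a: list[float], b: list[float]) -> float:
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = l2_normalise([3.0, 4.0])
tile = l2_normalise([1.0, 0.0])
# For unit vectors: ||a - b||^2 == 2 - 2 * (a . b)
identity_gap = abs(squared_euclidean(query, tile) - (2.0 - 2.0 * dot(query, tile)))
```

Because the mapping from cosine similarity to squared distance is monotonically decreasing, an index that sorts by ascending Euclidean distance returns the same order as sorting by descending cosine similarity, provided every stored and queried embedding is normalised.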
+ +### DTOs (in `_types/vpr.py`) + +```python +from dataclasses import dataclass +from uuid import UUID +import numpy as np + + +@dataclass(frozen=True, slots=True) +class VprQuery: + """Backbone embedding for a single nav-camera frame. Produced by `VprStrategy.embed_query`; consumed by `VprStrategy.retrieve_topk` (same instance) or — in the C10 corpus-build path — by `DescriptorIndexBuilder` to populate the corpus descriptor matrix.""" + + frame_id: UUID + embedding: np.ndarray # shape (D,), dtype float16 or float32; L2-normalised + produced_at: int # monotonic_ns + + +@dataclass(frozen=True, slots=True) +class VprCandidate: + """One retrieval candidate from the top-K result.""" + + tile_id: tuple # composite (zoomLevel, lat, lon); see C6 TileRecord + descriptor_distance: float # backbone-specific metric (cosine for L2-normalised; Euclidean for raw) + descriptor_dim: int + + +@dataclass(frozen=True, slots=True) +class VprResult: + """Top-K candidates from `VprStrategy.retrieve_topk`. Consumed by C2.5 ReRanker.""" + + frame_id: UUID + candidates: list[VprCandidate] # length == k, sorted ascending by descriptor_distance + retrieved_at: int # monotonic_ns + backbone_label: str # non-empty; matches BUILD_VPR_ lowercase +``` + +### Protocol: `BackbonePreprocessor` (C2-internal; lives in `c2_vpr/_preprocessor.py`) + +```python +from typing import Protocol, runtime_checkable +from gps_denied_onboard._types import NavCameraFrame, CameraCalibration +import numpy as np + + +@runtime_checkable +class BackbonePreprocessor(Protocol): + """Resize / crop / normalise per backbone's input contract. 
Each `VprStrategy` implementation owns its concrete preprocessor (NOT shared across backbones — preprocessing parameters are tightly coupled to weights).""" + + def preprocess( + self, + frame: NavCameraFrame, + calibration: CameraCalibration, + ) -> np.ndarray: + """Return the preprocessed input tensor in the layout the backbone's forward pass expects (e.g., (1, 3, H, W) NCHW float16 for TRT). + + Raises: + VprPreprocessError: input frame violates the backbone's contract (wrong colour channels, calibration mismatch). + """ + ... + + def input_shape(self) -> tuple[int, ...]: + """The (H, W) resize target the backbone expects. Stable for the preprocessor's lifetime; consumed by tests to assert preprocessing fidelity.""" + ... +``` + +### Error Hierarchy (in `c2_vpr/errors.py`) + +```python +class VprError(Exception): + """Base for all C2 VPR errors. Caught at the runtime root; downstream effect: C5 falls back to VIO-only with provenance `visual_propagated` (AC-1.4).""" + + +class VprBackboneError(VprError): + """Backbone forward pass failed (CUDA OOM, TRT engine deserialize mismatch, ONNX runtime IO mismatch). Logged at ERROR; per-occurrence FDR record.""" + + +class VprPreprocessError(VprError): + """Input frame violates backbone's preprocessing contract (wrong colour channels, calibration mismatch). Logged at ERROR; per-occurrence FDR record.""" + + +class IndexUnavailableError(VprError): + """FAISS index handle invalid (post-F8 reboot before warm-up; out-of-band file replacement). Logged at ERROR; recovery: F8 reboot path re-mmaps the index. 
Per C2-ST-01 the strategy MUST raise this rather than return stale candidates.""" +``` + +## Composition-Root Factory + +```python +# src/gps_denied_onboard/runtime_root/vpr_factory.py + +from typing import TYPE_CHECKING +from gps_denied_onboard.config import Config +from gps_denied_onboard.components.c2_vpr import VprStrategy +from gps_denied_onboard.components.c6_tile_cache import TileStore +from gps_denied_onboard.components.c7_inference import InferenceRuntime + + +def build_vpr_strategy( + config: Config, + tile_store: TileStore, + inference_runtime: InferenceRuntime, +) -> VprStrategy: + """Composition-root factory. Reads `config.vpr.strategy` and `config.vpr.backbone_weights_path`; lazy-imports the concrete strategy module gated by its CMake `BUILD_VPR_` flag; refuses to instantiate a strategy whose flag is OFF (raises `ConfigurationError` pointing at the offending strategy name + missing flag). + + Strategy resolution table: + + | config.vpr.strategy | Implementation | Module | Build flag | + |---------------------|----------------------|-----------------------------------------------|-------------------| + | "ultra_vpr" | UltraVprStrategy | components.c2_vpr.ultra_vpr | BUILD_VPR_ULTRA_VPR | + | "net_vlad" | NetVladStrategy | components.c2_vpr.net_vlad | BUILD_VPR_NETVLAD | + | "mega_loc" | MegaLocStrategy | components.c2_vpr.mega_loc | BUILD_VPR_MEGALOC | + | "mix_vpr" | MixVprStrategy | components.c2_vpr.mix_vpr | BUILD_VPR_MIXVPR | + | "sela_vpr" | SelaVprStrategy | components.c2_vpr.sela_vpr | BUILD_VPR_SELAVPR | + | "eigen_places" | EigenPlacesStrategy | components.c2_vpr.eigen_places | BUILD_VPR_EIGENPLACES | + | "salad" | SaladStrategy | components.c2_vpr.salad | BUILD_VPR_SALAD | + + Pre-flight validation: after constructing the strategy, the factory queries `strategy.descriptor_dim()` and asserts it matches the C6 corpus index's declared `descriptor_dim` (read from the FAISS index sidecar). 
Mismatch → `ConfigurationError` at startup, NOT at first frame. + + Returns a fully-constructed strategy ready for `embed_query` / `retrieve_topk` invocation. The caller (runtime root) is responsible for binding the instance to one ingest thread. + """ + ... +``` + +## Versioning + +- The `VprStrategy` Protocol's method signatures are part of the cross-component public API. Any change (new method, removed method, parameter rename, return-type change) is a major bump and requires updating every concrete implementation in lockstep. +- DTO field additions are minor (frozen dataclasses with new optional fields default to None); field removals are major. +- `BackbonePreprocessor` is C2-internal; backwards-compat is per-strategy, not cross-strategy. + +## Test Cases (protocol conformance — runs against every concrete strategy) + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| INV-1 (single-thread) | Concurrent `embed_query` from 2 threads on one instance | Documented as forbidden in test docstring; test asserts composition root rejects multi-thread binding | +| INV-2 (stateless) | `embed_query(frame_A)` then `embed_query(frame_B)` then `embed_query(frame_A)` again | First and third call return identical embeddings (bit-exact for float32 embeddings; ULP-tolerant for float16) | +| INV-3 (L2-normalised) | L2 norm of `VprQuery.embedding` after `embed_query` | Equal to 1.0 ± 1e-3 (tolerance for float16) | +| INV-4 (top-K size + order) | `retrieve_topk(query, k=10)` against a 100-tile fixture corpus | `len(candidates) == 10`; distances are non-decreasing (ties allowed) | +| INV-5 (backbone_label non-empty) | Every `VprResult` from `retrieve_topk` | `backbone_label` is a non-empty string and matches the strategy's `BUILD_VPR_` lowercase | +| INV-6 (deterministic) | `embed_query(same frame)` × 3 then `retrieve_topk(same query)` × 3 | All three pairs return bit-exact embeddings + identical top-K (tile_ids in same order) | +| INV-7 (descriptor_dim
stable) | `descriptor_dim()` × 100 calls | Returns the same value every call | +| AC-2.1b (recall floor) | UltraVPR + NetVLAD on Derkachi normal-segment corpus | UltraVPR recall@10 ≥ 0.95; NetVLAD recall@10 ≥ 0.85 (engine rule check; AZ-338) | +| AC-NEW-7 (poisoned tile) | Top-1 distance to poisoned tile in NFT-SEC-01 corpus | Within AC-NEW-7 relaxed CI | +| C2-ST-01 (stale index) | Out-of-band corpus file replacement | `retrieve_topk` raises `IndexUnavailableError`; no candidates returned | + +## Open Questions / Risks + +- **Risk: backbone weights' descriptor_dim drifts across upstream code drops** (e.g., a new UltraVPR release changes embedding dim from 512 to 768). *Mitigation*: the factory's pre-flight `descriptor_dim()` × C6 sidecar match catches this at startup; the operator must rebuild the C6 corpus before the new weights can be used. +- **Risk: SALAD is mentioned in description.md but NOT in the original epic's child issues** — included here for completeness because module-layout.md `BUILD_VPR_` table lists SALAD. *Decision*: SALAD lives in AZ-340 (with SelaVPR + EigenPlaces). If the team decides SALAD is out of scope this cycle, that task drops one backbone with no other changes needed. 
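The pre-flight mitigation for the descriptor-dimension risk could look roughly like the sketch below. It assumes the index sidecar has already been parsed into an integer; the sidecar format is not specified here. The function name `check_descriptor_dim` and the local `ConfigurationError` class are hypothetical stand-ins for whatever the factory and the project actually use.

```python
class ConfigurationError(Exception):
    """Stand-in for the project's ConfigurationError (assumed to be an Exception subclass)."""


def check_descriptor_dim(strategy_name: str, strategy_dim: int, index_dim: int) -> None:
    """Fail fast at startup when the backbone and the corpus index disagree.

    `strategy_dim` would come from `strategy.descriptor_dim()`; `index_dim`
    from the C6 index sidecar (parsing is out of scope for this sketch).
    """
    if strategy_dim != index_dim:
        raise ConfigurationError(
            f"VPR strategy {strategy_name!r} emits {strategy_dim}-d descriptors, "
            f"but the corpus index declares {index_dim}-d; "
            "rebuild the C6 corpus before using these weights"
        )
```

Putting the remediation (rebuild the corpus) in the error message means the operator sees it at startup, not after a failed first frame.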
diff --git a/_docs/02_document/contracts/c3_5_adhop/conditional_refiner_protocol.md b/_docs/02_document/contracts/c3_5_adhop/conditional_refiner_protocol.md new file mode 100644 index 0000000..620619b --- /dev/null +++ b/_docs/02_document/contracts/c3_5_adhop/conditional_refiner_protocol.md @@ -0,0 +1,170 @@ +# Contract: `ConditionalRefiner` Protocol + +**Owner**: c3_5_adhop (epic AZ-258 / E-C3.5) +**Producer task**: AZ-348 (Protocol + factory + DTOs + composition + `PassthroughRefiner`) +**Consumer tasks**: AZ-349 (`AdHoPRefiner` real refinement); downstream c4_pose (epic AZ-259) which consumes the (possibly refined) `MatchResult` +**Version**: 1.0.0 +**Status**: draft, awaiting Producer task implementation +**Last Updated**: 2026-05-10 +**Module-layout home**: `src/gps_denied_onboard/components/c3_5_adhop/interface.py` (Protocol), `src/gps_denied_onboard/components/c3_5_adhop/__init__.py` (re-exports), `src/gps_denied_onboard/runtime_root/refiner_factory.py` (factory) + +> **Public API symbol naming.** The component's public interface symbol is named `ConditionalRefiner` in `description.md` § 2 and `AdHoPRefinementStrategy` in `module-layout.md` § c3_5_adhop. Both refer to the SAME Protocol; the canonical class name in code is `ConditionalRefiner` — it is the role description-first name and matches the method `refine_if_needed`. The producer task ALSO updates `module-layout.md` to align (`AdHoPRefinementStrategy` → `ConditionalRefiner`) so the two documents agree. + +## Purpose + +Defines the public interface for every C3.5 refinement strategy: `refine_if_needed(frame, mr, residual_threshold_px)` returns a `MatchResult` that is either (a) the input unchanged ("passthrough") OR (b) enriched with refined inlier correspondences from OrthoLoC AdHoP perspective preconditioning. 
The conditional gate is a configurable residual threshold: if the input `MatchResult.reprojection_residual_px` ≤ threshold the refiner returns the input unchanged; otherwise the refiner runs the AdHoP backbone and returns an enriched `MatchResult`. `was_invoked()` exposes the last-call decision for FDR provenance and for NFT-PERF-01 invocation-rate accounting. + +Two concrete strategies are linked into the production binary by default: `AdHoPRefiner` (production-default; conditional invocation) and `PassthroughRefiner` (always passes through; non-conditional baseline used by smoke tests and by IT-12's "no refinement" comparison). Both implementations co-exist at build time per ADR-001 — gating is at runtime via `config.refiner.strategy`. Build-time exclusion (ADR-002) is NOT used here because both strategies are tiny (passthrough is a no-op; AdHoP's backbone is a single TRT engine shared with C7). + +The shared `RansacFilter` helper (AZ-282) is constructor-injected — `c3_5_adhop` imports the SAME helper used by `c3_matcher` and `c4_pose`; the runtime root constructs ONE instance and identity-shares it across all three components. + +## Public API + +### Protocol: `ConditionalRefiner` + +```python +from typing import Protocol, runtime_checkable +from gps_denied_onboard._types import ( + NavCameraFrame, MatchResult, +) + + +@runtime_checkable +class ConditionalRefiner(Protocol): + """Conditional refinement strategy invoked between C3 (matcher) and C4 (pose). Stateless per-frame; the only persistent state is the constructor-injected backbone runtime handle + the last-invocation flag.""" + + def refine_if_needed( + self, + frame: NavCameraFrame, + mr: MatchResult, + residual_threshold_px: float, + ) -> MatchResult: + """If `mr.reprojection_residual_px <= residual_threshold_px` (the steady-state path), return `mr` unchanged AND set `was_invoked()` to False. 
Otherwise, run the strategy's refinement procedure and return an enriched `MatchResult` with `refinement_label` set, AND set `was_invoked()` to True. + + On `RefinerBackboneError` (AdHoP backbone failure during the invoked path), the refiner MUST fall through to passthrough — return `mr` unchanged with `refinement_label = "passthrough"` AND `was_invoked()` = True (the attempt counts towards the invocation rate even on failure). The error is logged at ERROR level + emitted to FDR; downstream pose estimation may then trigger F6 satellite re-localisation if quality gates fail. + + Determinism: same inputs MUST produce the same output. The conditional gate is a `<=` comparison only — no probabilistic gating, no time-based gating. + """ + ... + + def was_invoked(self) -> bool: + """Return True iff the last call to `refine_if_needed` actually entered the refinement procedure (regardless of whether it produced a refined result or fell through to passthrough on backbone error). Reset to False at the start of every `refine_if_needed` call. Used by FDR per-frame provenance and by NFT-PERF-01 / C3.5-IT-03 invocation-rate accounting.""" + ... +``` + +**Invariants**: + +1. **Single-threaded by contract** — each instance is bound to one ingest thread (composition root enforces; same thread as C3 because they share the C-frame ingest path). +2. **Stateless per-frame for `refine_if_needed`** — except for the `was_invoked()` flag, no implicit dependency on prior frames; reordering `refine_if_needed` calls (tests only) MUST yield identical output `MatchResult` content. +3. **Conditional gate is a pure comparison** — `mr.reprojection_residual_px <= threshold` → passthrough; `>` → invoke. No tolerance, no smoothing, no hysteresis. The threshold is a parameter (NOT a hidden internal constant) so operator tooling can tune pre-flight per AC-NEW-5 / R10. +4. 
**Passthrough fall-through on backbone error** — `RefinerBackboneError` raised inside the invoked path is caught by the strategy and converted to passthrough output (input `MatchResult` returned unchanged with `refinement_label = "passthrough"`); the error is logged at ERROR level. The exception is NEVER re-raised out of `refine_if_needed` (downstream pose estimation gets a usable `MatchResult` and decides whether to trigger F6). +5. **Bit-identical correspondences on passthrough** — when `refinement_label == "passthrough"`, every `inlier_correspondences` ndarray in the output equals the input ndarray bit-for-bit (`np.array_equal` AND same dtype). Refinement may NEVER silently rewrite correspondences when the gate decided not to invoke. +6. **`refinement_label` is `"adhop"` OR `"passthrough"`** — exactly one of those two values; matches the strategy's selected variant. The label distinguishes "AdHoP ran successfully" from "passthrough or AdHoP-fell-through-to-passthrough"; readers check `was_invoked()` for the latter discrimination. +7. **`refinement_added_latency_ms` is the STRATEGY-INTERNAL added latency** — not the matcher's or pose estimator's; covers exactly the work done inside `refine_if_needed`. Always ≥ 0; near-zero on passthrough; up to ~90 ms on AdHoP invoke per AC C3.5-PT-01. +8. **`was_invoked()` semantics** — set to True iff the strategy entered the refinement procedure (post-gate, regardless of whether AdHoP succeeded or fell through). On passthrough strategy + every gate-decided-passthrough call: False. +9. **Threshold validation** — the strategy MUST reject `residual_threshold_px <= 0` (raise `ValueError`); the composition root validates the config-loaded threshold at startup so this in-method check is defensive. 
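The gate, the fall-through, and the `was_invoked()` bookkeeping from the invariants above can be illustrated with a small sketch. It is not the `AdHoPRefiner` implementation. `Match` is a hypothetical, trimmed stand-in for `MatchResult`, the backbone is modelled as a plain callable, the `frame` argument is omitted for brevity, and logging and FDR emission are left out.

```python
from dataclasses import dataclass, replace
from typing import Callable


class RefinerBackboneError(Exception):
    """Local stand-in for the C3.5 backbone error."""


@dataclass(frozen=True)
class Match:
    """Hypothetical, trimmed stand-in for MatchResult."""

    reprojection_residual_px: float
    refinement_label: str = "passthrough"


class GateSketch:
    """Illustrates Invariants 3, 4, 8 and 9 only."""

    def __init__(self, backbone: Callable[[Match], Match]) -> None:
        self._backbone = backbone
        self._invoked = False

    def refine_if_needed(self, mr: Match, residual_threshold_px: float) -> Match:
        self._invoked = False  # reset at the start of every call
        if residual_threshold_px <= 0:
            raise ValueError("residual_threshold_px must be > 0")
        if mr.reprojection_residual_px <= residual_threshold_px:
            return mr  # inclusive gate: input returned untouched
        self._invoked = True  # the attempt counts even if the backbone fails
        try:
            refined = self._backbone(mr)
        except RefinerBackboneError:
            # Fall-through: never re-raise; the input already carries the
            # default "passthrough" label. Real code would log at ERROR here.
            return mr
        return replace(refined, refinement_label="adhop")

    def was_invoked(self) -> bool:
        return self._invoked
```

Returning the very same `mr` object on both passthrough paths makes Invariant 5 (bit-identical correspondences) hold trivially, because nothing is copied or rebuilt.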
+ +### DTOs (in `_types/refiner.py` — additions; reuse `MatchResult` from `_types/matcher.py`) + +The output of `refine_if_needed` is a `MatchResult` (same DTO as C3 produces) with the following NEW optional fields populated by C3.5: + +```python +# Additions to existing MatchResult in _types/matcher.py (NOT a new DTO; in-place extension) + +@dataclass(frozen=True, slots=True) +class MatchResult: + # ... existing fields from C3 ... + # NEW (populated by C3.5; default values for non-refined frames): + refinement_label: str = "passthrough" # "adhop" | "passthrough" + refinement_added_latency_ms: float = 0.0 # added latency due to refinement; 0 on pure passthrough +``` + +Rationale: `MatchResult` is produced by C3 and consumed by C3.5 (which may enrich it); since `MatchResult` is a frozen dataclass, C3.5 produces a NEW `MatchResult` instance via `dataclasses.replace(...)` whenever it enriches. The new fields default to the passthrough values so a C3 result that never goes through C3.5 is still a valid downstream-readable `MatchResult`. + +> **Cross-task coordination.** AZ-344 (C3 Protocol task) defines the `MatchResult` DTO with the C3 fields. The C3.5 Producer task (TBD) extends `MatchResult` with the two NEW fields (with their defaults) in the SAME `_types/matcher.py` file. Because the fields default to passthrough values, the addition is backward-compatible for AZ-344's tests; AZ-344's `MatchResult` constructor stays valid. The C3.5 Producer task is responsible for updating AZ-344's frozen-dataclass tests (if any) to assert the new field defaults. + +### Error hierarchy (in `c3_5_adhop/errors.py`) + +```python +class RefinerError(Exception): + """Base class for all C3.5 refinement-strategy errors.""" + + +class RefinerBackboneError(RefinerError): + """AdHoP backbone forward failed (TensorRT exception, OOM, NaN, shape mismatch).
Caught inside `refine_if_needed`; converted to passthrough fall-through; never re-raised out of the strategy.""" + + +class RefinerConfigError(RefinerError): + """Composition-root rejected the refiner config (unknown strategy, invalid threshold). Raised at startup ONLY; never per-frame.""" +``` + +The error hierarchy is intentionally small — drop-and-continue at the C3 matcher level handles per-candidate failures already; at C3.5 the only failure mode is the AdHoP backbone, and it is contained within the strategy via passthrough fall-through (Invariant 4). + +### Composition-root factory + +```python +# In src/gps_denied_onboard/runtime_root/refiner_factory.py + +from gps_denied_onboard._types import config +from gps_denied_onboard.helpers.ransac_filter import RansacFilter +from gps_denied_onboard.components.c7_inference.interface import InferenceRuntime +from gps_denied_onboard.components.c3_5_adhop.interface import ConditionalRefiner + + +def build_refiner_strategy( + config: config.AppConfig, + ransac_filter: RansacFilter, + inference_runtime: InferenceRuntime, +) -> ConditionalRefiner: + """Construct the configured C3.5 strategy at composition-root time. Selects between `AdHoPRefiner` and `PassthroughRefiner` per `config.refiner.strategy`. Both strategies are imported eagerly (no `BUILD_REFINER_*` flag gating — both are linked unconditionally) — runtime selection only. + + Raises: + RefinerConfigError: unknown strategy name OR invalid threshold (≤ 0). + """ + ... +``` + +Strategy resolution table: + +| `config.refiner.strategy` | Module path | Class | Notes | +|---|---|---|---| +| `"adhop"` | `gps_denied_onboard.components.c3_5_adhop.adhop_refiner` | `AdHoPRefiner` | production-default; conditional invocation. | +| `"passthrough"` | `gps_denied_onboard.components.c3_5_adhop.passthrough_refiner` | `PassthroughRefiner` | always-passthrough; baseline / smoke / IT-12 comparison. 
| + +Config-load-time validation (in AZ-269): + +- `config.refiner.strategy` (enum, required): `"adhop"` | `"passthrough"`. +- `config.refiner.residual_threshold_px` (float, default `2.5`): must be > 0. +- `config.refiner.invocation_rate_warn_threshold` (float, default `0.25`): rolling-60s threshold above which a WARN log is emitted (per description.md § 9). Must be in `(0, 1)`. + +## Test expectations summarised by Invariant + +| Invariant | Test name | Assertion | +|---|---|---| +| 1 | thread-binding | composition root binds the strategy to ONE ingest thread; second binding raises `RuntimeError`. | +| 2 | stateless reorder | shuffle 10 frames → output content identical to in-order pass; `was_invoked()` flags identical positionwise. | +| 3 | gate semantics | residual = threshold → passthrough (`<=` is inclusive); residual = threshold + 1e-6 → invoked. | +| 4 | backbone-error fall-through | monkey-patch backbone to raise `RefinerBackboneError`; `refine_if_needed` returns input unchanged with `refinement_label = "passthrough"`; ERROR log emitted; `was_invoked()` is True. | +| 5 | bit-identical on passthrough | when `refinement_label == "passthrough"`, every `inlier_correspondences` array satisfies `np.array_equal(out, in_) and out.dtype == in_.dtype`. | +| 6 | label values | every output's `refinement_label` is in `{"adhop", "passthrough"}`. | +| 7 | added-latency monotonic | every output's `refinement_added_latency_ms >= 0`; passthrough p95 ≤ 0.5 ms; AdHoP-invoked p95 ≤ 90 ms. | +| 8 | `was_invoked()` semantics | gate-passthrough: False; AdHoP-success: True; AdHoP-fall-through: True; PassthroughRefiner: always False. | +| 9 | threshold validation | `residual_threshold_px = 0` → `ValueError` raised by the strategy; `RefinerConfigError` raised by `build_refiner_strategy` at startup. | + +## What this contract does NOT define + +- The AdHoP TRT engine compile path — owned by AZ-321 (engine compiler). 
+- The AdHoP forward pass implementation — owned by C7 `InferenceRuntime` consumers. +- The `RansacFilter` API — owned by AZ-282; this contract only consumes it. +- The downstream pose estimator's behaviour when `refinement_added_latency_ms` is high — owned by E-C4 (D-CROSS-LATENCY-1 hybrid is C4-internal). + +## Producer-task / consumer-task split + +- The Protocol task (TBD) ships: Protocol, DTO extension to `MatchResult`, error hierarchy, composition-root factory, config schema extension, AND the `PassthroughRefiner` (because it is a 1-pt no-op that naturally accompanies the Protocol task and acts as the reference implementation for tests). +- The AdHoPRefiner task (TBD) ships: `AdHoPRefiner` only (TRT engine load, perspective preconditioning, conditional gate, backbone-error fall-through to passthrough). Composition-root wiring path for `config.refiner.strategy = "adhop"`. + +## Versioning + change policy + +- Protocol method-signature changes (signatures of `refine_if_needed` or `was_invoked`) are MAJOR-version bumps. Every concrete strategy must be updated lockstep. +- DTO field additions (e.g., a future `refinement_iterations: int`) are MINOR. Field removals are MAJOR. +- Adding a third strategy (e.g., a learned-conditional refiner) is a feature-cycle change; it adds an entry to the resolution table without changing this contract's surface. 
diff --git a/_docs/02_document/contracts/c3_matcher/cross_domain_matcher_protocol.md b/_docs/02_document/contracts/c3_matcher/cross_domain_matcher_protocol.md new file mode 100644 index 0000000..7525668 --- /dev/null +++ b/_docs/02_document/contracts/c3_matcher/cross_domain_matcher_protocol.md @@ -0,0 +1,170 @@ +# Contract: `CrossDomainMatcher` Protocol + +**Owner**: c3_matcher (epic AZ-257 / E-C3) +**Producer task**: AZ-344 (`CrossDomainMatcher` Protocol + factory + composition) +**Consumer tasks**: AZ-345 (DISK+LightGlue primary), AZ-346 (ALIKED+LightGlue secondary), AZ-347 (XFeat alternate); downstream c3_5_adhop (epic AZ-258) which consumes `MatchResult` +**Version**: 1.0.0 +**Status**: draft, awaiting AZ-344 implementation +**Last Updated**: 2026-05-10 +**Module-layout home**: `src/gps_denied_onboard/components/c3_matcher/interface.py` (Protocol), `src/gps_denied_onboard/components/c3_matcher/__init__.py` (re-exports), `src/gps_denied_onboard/runtime_root/matcher_factory.py` (factory) + +## Purpose + +Defines the public interface for every C3 cross-domain matcher strategy: `match(frame, rerank_result, calibration)` produces a `MatchResult` containing per-candidate inlier counts + RANSAC-filtered correspondences + median reprojection residual; `health_snapshot()` returns rolling matcher health for AC-NEW-7 cache-poisoning detection. Every concrete matcher (DISK+LightGlue, ALIKED+LightGlue, XFeat) implements this Protocol; the composition root selects exactly one at startup based on `config.matcher.strategy` and refuses to wire a strategy whose `BUILD_MATCHER_` flag is OFF (ADR-002 + ADR-009). + +The shared `LightGlueRuntime` helper (AZ-278) is constructor-injected — neither C2.5 nor C3 owns its lifecycle (R14 fix); the runtime root constructs ONE instance and passes the same reference to both. The shared `RansacFilter` helper (AZ-282) is also constructor-injected and consumed by C3, C3.5, and C4. 
+ +## Public API + +### Protocol: `CrossDomainMatcher` + +```python +from typing import Protocol, runtime_checkable +from gps_denied_onboard._types import ( + NavCameraFrame, CameraCalibration, RerankResult, MatchResult, MatcherHealth, +) + + +@runtime_checkable +class CrossDomainMatcher(Protocol): + """Cross-domain (nav-camera ↔ satellite-imagery) matcher strategy. Stateless per-frame; the only persistent state is the constructor-injected backbone runtime handles + the rolling health window.""" + + def match( + self, + frame: NavCameraFrame, + rerank_result: RerankResult, + calibration: CameraCalibration, + ) -> MatchResult: + """Run feature extraction + matching + RANSAC + reprojection-residual computation against each top-N=3 candidate in `rerank_result`. Pick the best candidate by inlier count (deterministic tie-break: lower median residual ranked higher). + + Drop-and-continue per candidate: per-candidate `MatcherBackboneError` (backbone forward failure) → candidate dropped, ERROR log + FDR record, success path continues. If ALL candidates fail OR every candidate's inlier count falls below `config.matcher.min_inliers_threshold`: raise `InsufficientInliersError`; downstream C5 falls back to VIO-only with provenance `visual_propagated` (AC-3.5). + + Raises: + InsufficientInliersError: every candidate failed or every candidate's inlier count is below the configured floor. + """ + ... + + def health_snapshot(self) -> MatcherHealth: + """Return a rolling-window snapshot of matcher health: consecutive low-inlier frames, mean inliers over the last 60 s. Used by C5's spoof-promotion gate (AC-NEW-2 / AC-NEW-7) and by post-flight forensics.""" + ... +``` + +**Invariants**: + +1. **Single-threaded by contract** — each instance is bound to one ingest thread (composition root enforces; same thread as C2.5 because they share `LightGlueRuntime`). +2. 
**Stateless per-frame for `match`** — except for the rolling health window, no implicit dependency on prior frames; reordering `match` calls (tests only) MUST yield identical `MatchResult` content. +3. **Best-candidate selection is deterministic** — `MatchResult.best_candidate_idx == argmax(inlier_count)` over `per_candidate`; ties broken by `per_candidate_residual_px` ascending (lower residual wins). +4. **Drop-and-continue per candidate** — per-candidate exceptions never propagate out of `match` unless every candidate fails. Mirrors C2.5 INV-8. +5. **`per_candidate` length is bounded** — `0 < len <= len(rerank_result.candidates)` (zero raises `InsufficientInliersError`); never exceeds the input N. +6. **`matcher_label` is non-empty** — every `MatchResult` carries the strategy's name (e.g., `"disk_lightglue"`) for FDR provenance. MUST match `BUILD_MATCHER_` lowercase. +7. **`inlier_correspondences` shape contract** — `ndarray[I, 4, dtype=float32]`, columns `(px_query, py_query, px_tile, py_tile)`; rows are RANSAC inliers only; `I == inlier_count`. +8. **`reprojection_residual_px` is the BEST candidate's median residual** — not the mean, not a max; downstream C3.5's threshold gate compares against this value. +9. **`health_snapshot()` is cheap** — O(1); reads the rolling window's pre-computed accumulators. Never recomputes over the window contents. + +### DTOs (in `_types/matcher.py`) + +```python +from dataclasses import dataclass +from uuid import UUID +import numpy as np + + +@dataclass(frozen=True, slots=True) +class CandidateMatchSet: + """Per-candidate matching outcome inside a MatchResult.""" + tile_id: tuple # composite (zoomLevel, lat, lon) + inlier_count: int + inlier_correspondences: np.ndarray # shape (I, 4) float32; (px_query, py_query, px_tile, py_tile) + ransac_outlier_count: int + per_candidate_residual_px: float # median residual on inliers + + +@dataclass(frozen=True, slots=True) +class MatchResult: + """Cross-domain match outcome for one frame. 
Consumed by C3.5 ConditionalRefiner.""" + frame_id: UUID + per_candidate: list[CandidateMatchSet] # 0 < len <= N=3, ranked by inlier_count descending; ties broken by per_candidate_residual_px ascending + best_candidate_idx: int # 0 by construction (sorted) + reprojection_residual_px: float # best candidate's median residual + matched_at: int # monotonic_ns + matcher_label: str # non-empty; matches BUILD_MATCHER_ lowercase + candidates_input: int # len(rerank_result.candidates) at entry + candidates_dropped: int # candidates_input - len(per_candidate) + + +@dataclass(frozen=True, slots=True) +class MatcherHealth: + """Rolling-window matcher health snapshot.""" + consecutive_low_inlier: int # consecutive frames where inlier_count < min_inliers_threshold + mean_inliers_60s: float # rolling 60 s mean of best-candidate inlier_count + backbone_error_count_60s: int # rolling 60 s count of MatcherBackboneError occurrences +``` + +### Error Hierarchy (in `c3_matcher/errors.py`) + +```python +class MatcherError(Exception): + """Base for all C3 matcher errors. Caught at the runtime root; downstream effect: C5 falls back to VIO-only with provenance `visual_propagated` (AC-3.5).""" + + +class MatcherBackboneError(MatcherError): + """Per-candidate backbone forward-pass failure (CUDA OOM, TRT engine deserialize mismatch). Drop-and-continue inside `match`.""" + + +class InsufficientInliersError(MatcherError): + """Every candidate failed OR every candidate's inlier count is below `config.matcher.min_inliers_threshold`. Raised by `match`. C5 falls back to VIO-only.""" +``` + +## Composition-Root Factory + +```python +# src/gps_denied_onboard/runtime_root/matcher_factory.py + +def build_matcher_strategy( + config: Config, + lightglue_runtime: LightGlueRuntime, + ransac_filter: RansacFilter, + inference_runtime: InferenceRuntime, +) -> CrossDomainMatcher: + """Composition-root factory. Reads `config.matcher.strategy` and lazy-imports the concrete module gated by `BUILD_MATCHER_`. 
+ + Strategy resolution table: + + | config.matcher.strategy | Implementation | Module | Build flag | + |-------------------------|----------------------------|-----------------------------------------------|-----------------------------| + | "disk_lightglue" | DiskLightGlueMatcher | components.c3_matcher.disk_lightglue | BUILD_MATCHER_DISK_LIGHTGLUE | + | "aliked_lightglue" | AlikedLightGlueMatcher | components.c3_matcher.aliked_lightglue | BUILD_MATCHER_ALIKED_LIGHTGLUE | + | "xfeat" | XFeatMatcher | components.c3_matcher.xfeat | BUILD_MATCHER_XFEAT | + + The shared `LightGlueRuntime` and `RansacFilter` are constructor-injected; the factory does NOT own their lifecycles. The runtime root constructs ONE `LightGlueRuntime` and passes the SAME reference to both this factory and the C2.5 ReRank factory (per AZ-342 AC-10). + """ + ... +``` + +## Versioning + +- The `CrossDomainMatcher` Protocol's method signatures are part of the cross-component public API. Any change is a major bump and requires updating every concrete implementation in lockstep. +- DTO field additions are minor; field removals are major. The drop-and-continue contract (Invariant 4) is non-negotiable. + +## Test Cases (protocol conformance — runs against every concrete strategy) + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| INV-1 (single-thread) | Composition root rejects multi-thread binding | `RuntimeError` on second binding attempt | +| INV-2 (stateless `match`) | Reorder calls; replay calls | `MatchResult.per_candidate` content is identical (ignoring `matched_at`) | +| INV-3 (best-candidate det.) 
| Mixed inlier counts with one tie | Best candidate is the tied one with lower median residual | +| INV-4 (drop-and-continue) | One candidate's backbone raises | Result has remaining survivors; ERROR log + FDR record per failure | +| INV-5 (length bound) | N=3 input, 2 candidates fail | `len(per_candidate) == 1` | +| INV-6 (matcher_label) | Every MatchResult | `matcher_label` non-empty + matches `BUILD_MATCHER_` lowercase | +| INV-7 (correspondences shape) | Each `CandidateMatchSet` | `inlier_correspondences.shape == (I, 4)`, `dtype == float32`, `I == inlier_count` | +| INV-8 (median residual) | Median of inliers' residual list | `per_candidate_residual_px` matches numpy.median computed independently | +| INV-9 (`health_snapshot` cheap) | Microbench `health_snapshot` × 1000 | p99 ≤ 50 µs | +| AC-1.1 floor | Inlier count p5 across a fixture | ≥ 80 (AC-1.1 partition) | +| All-fail | Every candidate's backbone raises | `InsufficientInliersError`; all-failed FDR record | +| Below-threshold | Every candidate's inlier_count < `config.matcher.min_inliers_threshold` | `InsufficientInliersError` | + +## Open Questions / Risks + +- **Risk: D-C3-1 IT-12 verdict may shift the production-default backbone** from DISK+LightGlue to ALIKED+LightGlue or another. *Mitigation*: every backbone implements the same Protocol; switching is a config change. The contract holds. +- **Risk: `LightGlueRuntime` shared with C2.5** — both must serialise through one ingest thread. *Mitigation*: composition root binds both to the same ingest thread; helper has internal thread-binding assertion (AZ-278). +- **Risk: `min_inliers_threshold` is not yet calibrated** — the AC-1.1 floor (p5 ≥ 80) is the production target; the threshold may need to be lower (e.g., 40) to leave headroom. *Mitigation*: `config.matcher.min_inliers_threshold` is config-driven (default 60); FT-P-19 telemetry will tune it. 
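To make the role of `min_inliers_threshold` concrete, here is a hedged sketch of how the ranking rule (inlier count descending, ties broken by lower median residual) and the floor check might combine. `CandidateSketch` is a hypothetical, trimmed stand-in for `CandidateMatchSet`, `rank_and_gate` is an illustrative name, and the default of 60 mirrors the config default quoted above.

```python
from dataclasses import dataclass


class InsufficientInliersError(Exception):
    """Local stand-in for the C3 error of the same name."""


@dataclass(frozen=True)
class CandidateSketch:
    """Hypothetical, trimmed stand-in for CandidateMatchSet."""

    tile_id: tuple
    inlier_count: int
    per_candidate_residual_px: float


def rank_and_gate(
    survivors: list[CandidateSketch], min_inliers: int = 60
) -> list[CandidateSketch]:
    """Rank surviving candidates and apply the inlier floor.

    Sort key: more inliers first; on a tie, the lower median residual wins.
    The best candidate therefore always ends up at index 0.
    """
    ranked = sorted(
        survivors,
        key=lambda c: (-c.inlier_count, c.per_candidate_residual_px),
    )
    # The top entry has the maximum inlier count, so if it is below the
    # floor, every candidate is below the floor.
    if not ranked or ranked[0].inlier_count < min_inliers:
        raise InsufficientInliersError("no candidate reached the inlier floor")
    return ranked
```

Because the key is a pure function of the candidate fields, the same inputs always yield the same order, which is what the determinism invariant requires.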
diff --git a/_docs/02_document/contracts/c4_pose/pose_estimator_protocol.md b/_docs/02_document/contracts/c4_pose/pose_estimator_protocol.md new file mode 100644 index 0000000..b5e9edc --- /dev/null +++ b/_docs/02_document/contracts/c4_pose/pose_estimator_protocol.md @@ -0,0 +1,194 @@ +# Contract: `PoseEstimator` Protocol + +**Owner**: c4_pose (epic AZ-259 / E-C4) +**Producer task**: AZ-355 (Protocol + DTO + factory + composition) +**Consumer tasks**: AZ-358 (`OpenCVGtsamPoseEstimator` Marginals path), AZ-361 (D-CROSS-LATENCY-1 hybrid: Jacobian fallback + thermal-state-driven mode switch). Downstream c5_state (epic AZ-260) which consumes `PoseEstimate`. +**Version**: 1.0.0 +**Status**: draft, awaiting Producer task implementation +**Last Updated**: 2026-05-10 +**Module-layout home**: `src/gps_denied_onboard/components/c4_pose/interface.py` (Protocol), `src/gps_denied_onboard/components/c4_pose/__init__.py` (re-exports), `src/gps_denied_onboard/runtime_root/pose_factory.py` (factory) + +## Purpose + +Defines the public interface for the C4 pose estimator: `estimate(match_result, calibration, thermal_state) -> PoseEstimate` produces a WGS84 position + 6×6 covariance + provenance label by running OpenCV `solvePnPRansac` (`SOLVEPNP_IPPE`) and recovering the posterior 6×6 covariance via GTSAM `Marginals.marginalCovariance(pose_key)` against C5's shared iSAM2 graph. Under thermal throttle (D-CROSS-LATENCY-1 / ADR-006), the implementation switches per-frame to Jacobian-derived covariance accepting ~5–10% accuracy loss to preserve the AC-4.1 latency budget. `current_covariance_mode()` exposes the per-frame decision for FDR provenance and AC-NEW-5 verification. + +There is exactly ONE concrete implementation (`OpenCVGtsamPoseEstimator`); the Protocol exists for ADR-009 (interface-first DI) so consumers (C5, runtime root) hold a typed reference rather than the concrete class. 
ADR-002 build-time exclusion does NOT apply (one strategy only) — but lazy-import via the factory remains the entry-point pattern for symmetry with C2 / C2.5 / C3 / C3.5. + +The shared `RansacFilter` (AZ-282), `WgsConverter` (AZ-279), and `SE3Utils` (AZ-277) helpers are constructor-injected. The C5 iSAM2 graph handle is constructor-injected from the runtime root; C4 NEVER owns the graph (ADR-003 shared substrate). + +## Public API + +### Protocol: `PoseEstimator` + +```python +from typing import Protocol, runtime_checkable +from gps_denied_onboard._types import ( + MatchResult, CameraCalibration, ThermalState, PoseEstimate, CovarianceMode, +) + + +@runtime_checkable +class PoseEstimator(Protocol): + """Single-pose estimator producing WGS84 + 6×6 covariance + provenance label. Stateless per-frame except for the constructor-injected shared GTSAM substrate (owned by C5).""" + + def estimate( + self, + match_result: MatchResult, + calibration: CameraCalibration, + thermal_state: ThermalState, + ) -> PoseEstimate: + """Run PnP → factor add → covariance recovery. Per-frame thermal decision: `thermal_state.throttle == True` → Jacobian path (cheap, ~5–10% accuracy loss); `False` → Marginals path (production-default). + + Raises: + PnpFailureError: RANSAC convergence failure or degenerate match geometry. C5 falls back to VIO-only with `source_label = "visual_propagated"`. NEVER converted to a fallback PoseEstimate; C5 is the place where the fallback decision is taken. + """ + ... + + def current_covariance_mode(self) -> CovarianceMode: + """Return the mode used for the LAST `estimate` call: `CovarianceMode.MARGINALS` or `CovarianceMode.JACOBIAN`. Used by C5 for FDR provenance and by C4-IT-03 to verify the per-frame switch.""" + ... +``` + +**Invariants**: + +1. **Single-threaded by contract** — bound to the SAME ingest thread as C5 (composition root enforces; shared GTSAM substrate per ADR-003 is non-thread-safe). +2. **Stateless w.r.t. 
flight history for `estimate`** — relies solely on inputs + the shared iSAM2 graph (which carries history but is C5-owned). +3. **Per-frame mode decision** — `thermal_state.throttle` is read at call entry; the choice between Marginals/Jacobian is made on EVERY call independently. NO hysteresis, NO smoothing, NO operator-tooling override at this layer (R10 covers operator tuning at a higher layer via `config`). +4. **Mode-switch latency ≤ 1 frame** — switching from JACOBIAN to MARGINALS or back happens immediately on the next `estimate` call when the thermal flag flips. C4-IT-03 verifies. +5. **`PoseEstimate.covariance_6x6` is always SPD** — both paths produce SPD matrices; non-SPD is a bug. C4-IT-02 verifies. +6. **`PoseEstimate.covariance_mode` matches the path actually taken** — never reports MARGINALS while computing Jacobian. +7. **`source_label` is set by C4 to `"satellite_anchored"`** unconditionally on success; C5 is the component that may downgrade it to `"visual_propagated"` or `"dead_reckoned"` when the gate decides. C4 never emits `"visual_propagated"` from `estimate` directly. +8. **`last_satellite_anchor_age_ms` is provided BY C5 and PASSED THROUGH** — C4 receives the current value via the runtime root + caches it; on emit, the value reflects the time since C5's last anchor add. C4 does not compute this metric independently. +9. **`PnpFailureError` is the ONLY non-warning exception escaping `estimate`** — `CovarianceDegradedWarning` is a Python `Warning` (filterwarnings-compatible), NOT an exception. 
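Invariants 3, 4, 6, and 9 together describe a stateless, per-call choice of covariance path that signals degradation through a warning rather than an exception. A minimal sketch of that decision is shown here, under stated assumptions: `select_covariance_mode` is a hypothetical helper, the `ThermalState` dataclass carries only the `throttle` field named in this contract, and the enum and warning classes are local stand-ins for the real types.

```python
import warnings
from dataclasses import dataclass
from enum import Enum


class CovarianceMode(Enum):
    MARGINALS = "marginals"
    JACOBIAN = "jacobian"


class CovarianceDegradedWarning(Warning):
    """Local stand-in: a Warning subclass, so it is never raised as an exception."""


@dataclass(frozen=True)
class ThermalState:
    """Hypothetical minimal shape; only `throttle` is named in the contract."""
    throttle: bool


def select_covariance_mode(thermal_state: ThermalState) -> CovarianceMode:
    """Choose the covariance path for one frame from the thermal flag alone."""
    # No state is kept between calls, so there is no hysteresis or smoothing:
    # a flipped flag takes effect on the very next call.
    if thermal_state.throttle:
        warnings.warn(
            "Jacobian covariance path engaged under thermal throttle",
            CovarianceDegradedWarning,
            stacklevel=2,
        )
        return CovarianceMode.JACOBIAN
    return CovarianceMode.MARGINALS
```

Keeping the decision in a pure function of `thermal_state` is one way to make the "mode-switch latency ≤ 1 frame" property hold by construction; the returned value is what an implementation would record in `PoseEstimate.covariance_mode`.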
+ +### DTOs (in `_types/pose.py`) + +```python +from dataclasses import dataclass +from enum import Enum +from uuid import UUID +import numpy as np + + +class CovarianceMode(Enum): + MARGINALS = "marginals" + JACOBIAN = "jacobian" + + +class PoseSourceLabel(Enum): + SATELLITE_ANCHORED = "satellite_anchored" + VISUAL_PROPAGATED = "visual_propagated" + DEAD_RECKONED = "dead_reckoned" + + +@dataclass(frozen=True, slots=True) +class LatLonAlt: + """WGS84 position. lat/lon in degrees, alt in metres MSL.""" + lat_deg: float + lon_deg: float + alt_m_msl: float + + +@dataclass(frozen=True, slots=True) +class Quat: + """Unit quaternion (w, x, y, z); scalar-first.""" + w: float + x: float + y: float + z: float + + +@dataclass(frozen=True, slots=True) +class PoseEstimate: + """Pose estimate emitted by C4 to C5.""" + frame_id: UUID + position_wgs84: LatLonAlt + orientation_world_T_body: Quat + covariance_6x6: np.ndarray # shape (6, 6) float64; SPD; position (3x3) | orientation (3x3) blocks + covariance_mode: CovarianceMode + source_label: PoseSourceLabel # C4 always emits SATELLITE_ANCHORED on success + last_satellite_anchor_age_ms: int + emitted_at: int # monotonic_ns +``` + +### Error hierarchy (in `c4_pose/errors.py`) + +```python +class PoseEstimatorError(Exception): + """Base class.""" + + +class PnpFailureError(PoseEstimatorError): + """RANSAC convergence failure or degenerate match geometry. NEVER converted to a fallback PoseEstimate by C4 itself; C5 owns the fallback decision.""" + + +class CovarianceDegradedWarning(Warning): + """Per-frame thermal-state-driven Jacobian-path engagement. NOT an exception. 
Emitted via `warnings.warn(...)` at the start of every Jacobian-path frame; consumers SHOULD rate-limit it to one warning per 60 s window to avoid log flooding. Note that `warnings.simplefilter("once")` alone suppresses repeats for the rest of the process rather than per 60 s window, so a time-based filter is needed.""" +``` + +### Composition-root factory + +```python +# In src/gps_denied_onboard/runtime_root/pose_factory.py + +def build_pose_estimator( + config: AppConfig, + ransac_filter: RansacFilter, + wgs_converter: WgsConverter, + se3_utils: SE3Utils, + isam2_graph_handle: ISam2GraphHandle, # owned by C5, constructor-injected +) -> PoseEstimator: + """Construct the configured C4 estimator at composition-root time. Currently only `"opencv_gtsam"` is defined; the Protocol exists for ADR-009. + + Raises: + PoseEstimatorConfigError: invalid config; missing camera calibration; invalid `isam2_graph_handle`. + """ + ... +``` + +Strategy resolution table: + +| `config.pose.strategy` | Module path | Class | Notes | +|---|---|---|---| +| `"opencv_gtsam"` | `gps_denied_onboard.components.c4_pose.opencv_gtsam_estimator` | `OpenCVGtsamPoseEstimator` | production-default; only strategy. | + +Config-load-time validation: + +- `config.pose.strategy` (enum, default `"opencv_gtsam"`). +- `config.pose.ransac_iterations` (int, default 200). +- `config.pose.ransac_reprojection_threshold_px` (float, default 4.0). +- `config.pose.thermal_throttle_threshold_celsius` (float, default 75.0) — informational only; the actual `ThermalState.throttle` decision is owned by C7, not C4. + +## Test expectations summarised by Invariant + +| Invariant | Test name | Assertion | +|---|---|---| +| 1 | thread-binding | composition root binds to the same thread as C5; second binding raises `RuntimeError`. | +| 2 | stateless reorder | shuffle 10 frames → same outputs (modulo iSAM2 graph state which is C5-owned). | +| 3 | per-frame mode decision | thermal flag flipped between consecutive frames → mode flips immediately. | +| 4 | mode-switch latency | switch happens on the NEXT `estimate` call after the flag changes (no buffering). 
| +| 5 | covariance SPD | every emitted `covariance_6x6` is symmetric AND positive-definite (Cholesky succeeds). | +| 6 | mode reporting honesty | when Jacobian path runs, `covariance_mode == JACOBIAN` AND `current_covariance_mode()` returns `JACOBIAN`. | +| 7 | source_label = SATELLITE_ANCHORED on success | C4 always emits SATELLITE_ANCHORED; downgrade is C5's job. | +| 8 | `last_satellite_anchor_age_ms` pass-through | matches the last value from C5's broadcast. | +| 9 | only `PnpFailureError` escapes | `CovarianceDegradedWarning` is via `warnings.warn` not `raise`. | + +## What this contract does NOT define + +- The OpenCV `solvePnPRansac` configuration — owned by the producer task. +- The GTSAM `Marginals` factor add path — owned by the Marginals task. +- The Jacobian covariance derivation — owned by the hybrid task. +- The C5 iSAM2 graph internals — owned by E-C5 (AZ-260). +- The `ThermalState` source — owned by E-C7 (AZ-249 / AZ-302). + +## Producer-task / consumer-task split + +- **Protocol task (TBD)**: Protocol, `PoseEstimate` + `LatLonAlt` + `Quat` + `CovarianceMode` + `PoseSourceLabel` DTOs, error hierarchy, factory, config schema extension. +- **Marginals task (TBD)**: `OpenCVGtsamPoseEstimator` core (PnP + IPPE + GTSAM `Marginals` factor add against C5's iSAM2 graph). Steady-state path only; fails fast if `thermal_state.throttle` is True (raises `NotImplementedError` until the hybrid task lands). +- **Hybrid task (TBD)**: D-CROSS-LATENCY-1 — Jacobian fallback + per-frame thermal-state-driven mode switch. Adds the JACOBIAN code path; replaces the Marginals task's `NotImplementedError` with the actual Jacobian implementation; verifies AC-NEW-5 (workstation portion). + +## Versioning + change policy + +- Protocol method-signature changes are MAJOR version bumps (lockstep update of consumers). +- DTO field additions are MINOR; field removals are MAJOR. 
+- Adding a third covariance mode (e.g., a learned-prior covariance) is a feature-cycle change; it adds an entry to `CovarianceMode` without changing the Protocol surface. diff --git a/_docs/02_document/contracts/c5_state/state_estimator_protocol.md b/_docs/02_document/contracts/c5_state/state_estimator_protocol.md new file mode 100644 index 0000000..363a143 --- /dev/null +++ b/_docs/02_document/contracts/c5_state/state_estimator_protocol.md @@ -0,0 +1,143 @@ +# Contract: `StateEstimator` Protocol + +**Owner**: c5_state (epic AZ-260 / E-C5) +**Producer task**: AZ-381 (Protocol + DTOs + factory + composition + concrete `ISam2GraphHandle`) +**Consumer tasks**: AZ-382 (iSAM2 + IncrementalFixedLagSmoother wiring), AZ-383 (Factor adds), AZ-384 (Marginals + outputs), AZ-385 (Source-label + spoof gate), AZ-386 (ESKF baseline), AZ-387 (Smoothed history → FDR), AZ-388 (AC-5.2 fallback), AZ-389 (Orthorectifier → C6). +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 +**Module-layout home**: `src/gps_denied_onboard/components/c5_state/interface.py`, `src/gps_denied_onboard/components/c5_state/__init__.py`, `src/gps_denied_onboard/runtime_root/state_factory.py` + +## Purpose + +Defines the public interface for the C5 state estimator: fuses `VioOutput` (C1), `PoseEstimate` (C4), and FC `ImuWindow` (C8 inbound) into the posterior pose with native 6×6 covariance. Two concrete strategies linked at build time per ADR-002: `GtsamIsam2StateEstimator` (production-default; iSAM2 + IncrementalFixedLagSmoother K=10–20 per D-C5-3) and `EskfStateEstimator` (mandatory simple-baseline per IT-12 engine rule). Selected at startup via `config.state.strategy` with `BUILD_STATE_` flag gating per ADR-002. + +C5 owns the GTSAM iSAM2 graph (ADR-003 shared substrate); C4's `OpenCVGtsamPoseEstimator` adds factors to this graph via the `ISam2GraphHandle` Protocol (defined by AZ-355 stub; concrete impl owned by AZ-381 — first child of E-C5). 
Single-writer thread invariant: composition root binds C5 to the same ingest thread as C4. + +The shared `ImuPreintegrator` (AZ-276), `SE3Utils` (AZ-277), and `WgsConverter` (AZ-279) helpers are constructor-injected. + +## Public API + +### Protocol: `StateEstimator` + +```python +@runtime_checkable +class StateEstimator(Protocol): + def add_vio(self, vio: VioOutput) -> None: ... + def add_pose_anchor(self, pose: PoseEstimate) -> None: ... + def add_fc_imu(self, imu_window: ImuWindow) -> None: ... + def current_estimate(self) -> EstimatorOutput: ... + def smoothed_history(self, n_keyframes: int) -> list[EstimatorOutput]: ... + def health_snapshot(self) -> EstimatorHealth: ... +``` + +**Invariants**: + +1. **Single-writer thread** — every `add_*` and `current_estimate`/`smoothed_history` runs on the same ingest thread; ADR-003 GTSAM substrate is non-thread-safe. +2. **`add_*` calls are timestamp-ordered** — composition root provides a merge queue; out-of-order arrivals are rejected with `EstimatorDegradedError`. +3. **`add_pose_anchor(pose)` MUST inspect `pose.covariance_mode`** — `JACOBIAN` mode adds the pose to the running estimate but DOES NOT add an iSAM2 factor (per AZ-361 cross-task interaction); `MARGINALS` mode triggers the full factor add + iSAM2 update. +4. **`current_estimate()` ALWAYS returns a fresh `EstimatorOutput`** — never None on the steady-state path; `EstimatorFatalError` propagates if iSAM2 is unrecoverable. +5. **`source_label` reflects gate state** — `SATELLITE_ANCHORED` only when the spoof-promotion gate confirms (≥10 s `STABLE_NON_SPOOFED` AND visual-consistent next anchor); else `VISUAL_PROPAGATED` or `DEAD_RECKONED`. +6. **`smoothed_history(n)` returns up to K keyframes** — K bounded by `IncrementalFixedLagSmoother` window (D-C5-3 K=10–20); out-of-window keyframes are NOT recoverable. +7. **`smoothed_history(n)` entries have `smoothed=True`** — distinguishes from `current_estimate()` which has `smoothed=False`. +8. 
**Spoof-rejection events ALWAYS land in FDR + GCS STATUSTEXT** — never silent (R07; C5-ST-01). +9. **AC-5.2 fallback on 3 s no-estimate** — if `current_estimate()` would raise OR the keyframe window is empty for ≥3 s, downstream C8 emits FC IMU-only. +10. **`covariance_6x6` is always SPD** — both strategies enforce; on numerical failure raise `EstimatorFatalError`. + +### DTOs (in `_types/state.py`) + +```python +@dataclass(frozen=True, slots=True) +class EstimatorOutput: + frame_id: UUID + position_wgs84: LatLonAlt + orientation_world_T_body: Quat + velocity_world_mps: tuple[float, float, float] + covariance_6x6: np.ndarray + source_label: PoseSourceLabel + last_satellite_anchor_age_ms: int + smoothed: bool + emitted_at: int + + +class IsamState(Enum): + INIT = "init" + TRACKING = "tracking" + DEGRADED = "degraded" + LOST = "lost" + + +@dataclass(frozen=True, slots=True) +class EstimatorHealth: + isam2_state: IsamState + keyframe_count: int + cov_norm_growing_for_s: float + spoof_promotion_blocked: bool +``` + +### Error hierarchy (in `c5_state/errors.py`) + +```python +class StateEstimatorError(Exception): pass +class EstimatorDegradedError(StateEstimatorError): pass # poor convergence; emit degraded estimate +class EstimatorFatalError(StateEstimatorError): pass # numerical failure; AC-5.2 path +class StateEstimatorConfigError(StateEstimatorError): pass # composition-time +``` + +### Composition-root factory + +```python +def build_state_estimator( + config: AppConfig, + imu_preintegrator: ImuPreintegrator, + se3_utils: SE3Utils, + wgs_converter: WgsConverter, + fdr_client: FdrClient, +) -> tuple[StateEstimator, ISam2GraphHandle]: + """Construct the configured state estimator + return the iSAM2 graph handle for C4 to inject. Selects between gtsam_isam2 / eskf via config; ADR-002 BUILD_STATE_ gating.""" + ... 
+``` + +Strategy resolution table: + +| `config.state.strategy` | Module path | Class | +|---|---|---| +| `"gtsam_isam2"` | `gps_denied_onboard.components.c5_state.gtsam_isam2_estimator` | `GtsamIsam2StateEstimator` | +| `"eskf"` | `gps_denied_onboard.components.c5_state.eskf_baseline` | `EskfStateEstimator` | + +Config schema additions: + +- `config.state.strategy` (enum; required) +- `config.state.keyframe_window_size` (int, default 15) — D-C5-3 K=10–20 +- `config.state.spoof_promotion_min_stable_s` (float, default 10.0) — AC-NEW-2 +- `config.state.spoof_promotion_visual_consistency_tol_m` (float, default 30.0) — AC-NEW-8 +- `config.state.no_estimate_fallback_s` (float, default 3.0) — AC-5.2 + +## Test expectations summarised by Invariant + +| Invariant | Test | Assertion | +|---|---|---| +| 1 | Thread-binding | second binding from a different thread → `RuntimeError` | +| 2 | Timestamp ordering | out-of-order `add_*` → `EstimatorDegradedError` | +| 3 | `add_pose_anchor` mode dispatch | JACOBIAN: no iSAM2 factor add; MARGINALS: factor + update | +| 4 | `current_estimate` shape | always returns fresh `EstimatorOutput` on steady state | +| 5 | Spoof gate | label reflects gate state | +| 6 | Smoothed history bounded | `len(smoothed_history(100))` ≤ K | +| 7 | Smoothed flag | every `smoothed_history` entry has `smoothed=True`; `current_estimate` has `smoothed=False` | +| 8 | Spoof-rejection logging | FDR + GCS STATUSTEXT both fire on every gate decision | +| 9 | AC-5.2 timeout | 3 s no estimate → fallback signal emitted | +| 10 | SPD covariance | every emitted `covariance_6x6` is SPD | + +## Producer-task / consumer-task split + +1. **Protocol + composition + ISam2GraphHandle concrete** (Producer; 3 pts): Protocol, DTOs, error hierarchy, factory, config schema, Concrete `ISam2GraphHandle` impl extending the AZ-355 stub. +2. **iSAM2 + IncrementalFixedLagSmoother K wiring** (5 pts): GTSAM graph construction, K=10–20 window, key management. +3. 
**Factor adds (VIO + Pose + IMU)** (5 pts): `BetweenFactorPose3`, `GenericProjectionFactorCal3DS2`, `CombinedImuFactor` per the input DTO. +4. **Marginals + outputs** (3 pts): `current_estimate` / `smoothed_history` / `health_snapshot` body using `Marginals`. +5. **Source-label + spoof-promotion gate** (5 pts): `SourceLabelStateMachine` + AC-NEW-2 / AC-NEW-8 logic. +6. **ESKF baseline** (5 pts): `EskfStateEstimator` mandatory simple-baseline (IT-12 engine rule). +7. **Smoothed history → FDR** (3 pts): writer path + AC-4.5 invariant (NOT to FC). +8. **AC-5.2 fallback** (3 pts): 3 s no-estimate detector + signal emission. +9. **Orthorectifier → C6 mid-flight tile** (3 pts): orthorectifier sub-path. + +Total: 35 pts (within the XL 34–55 band). diff --git a/_docs/02_document/contracts/c6_tile_cache/descriptor_index.md b/_docs/02_document/contracts/c6_tile_cache/descriptor_index.md new file mode 100644 index 0000000..951ecf8 --- /dev/null +++ b/_docs/02_document/contracts/c6_tile_cache/descriptor_index.md @@ -0,0 +1,122 @@ +# Contract: DescriptorIndex Protocol + +**Component**: c6_tile_cache +**Producer task**: AZ-303 — `_docs/02_tasks/todo/AZ-303_c6_storage_interfaces.md` +**Consumer tasks**: +- AZ-TBD-c6-faiss-descriptor-index (implements: FAISS HNSW) +- TBD at decompose time: E-C2 (AZ-255 — sole runtime consumer; per-frame top-K=10 retrieval), E-C10 (AZ-252 — F1 pre-flight index build via `rebuild_from_descriptors`) +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +Defines the typed boundary to the per-flight descriptor retrieval index. C2 VPR queries the index per frame at 3 Hz (top-K=10) to nominate candidate tiles for C2.5 ReRanker. The concrete impl is FAISS HNSW (`FaissDescriptorIndex`), but consumers depend only on this Protocol so a future swap (e.g., ScaNN, custom index) does not ripple. 
C10 CacheProvisioner (`AZ-252`) is the F1 pre-flight write-side caller — it builds the `.index` file once per provisioning; in flight the index is read-only mmap. + +## Shape + +### Protocol surface + +`typing.Protocol` (PEP 544) decorated with `@runtime_checkable`. All methods are sync; the index is held in memory-mapped form. + +| Method | Signature | Throws / Errors | Blocking? | +|--------|-----------|-----------------|-----------| +| `search_topk` | `(query: np.ndarray, k: int) -> list[tuple[TileId, float]]` | `IndexUnavailableError` | sync (HNSW; ≤ 5 ms p95 warm; first call ≤ 1 s cold for mmap page-in) | +| `descriptor_dim` | `() -> int` | — | sync; constant-time | +| `mmap_handle` | `() -> Path` | `IndexUnavailableError` | sync; returns the `.index` file path (consumers needing custom mmap-aware tooling — e.g., operator post-flight inspection — call this) | +| `rebuild_from_descriptors` | `(descriptors: np.ndarray, tile_ids: list[TileId], hnsw_params: HnswParams) -> None` | `IndexBuildError`, `TileFsError` | sync (offline; minutes for a full-area corpus). Atomic file replacement via the AZ-280 sidecar pattern. | +| `index_metadata` | `() -> IndexMetadata` | `IndexUnavailableError` | sync; reads the sidecar metadata block | + +### DTOs + +```python +from dataclasses import dataclass +from datetime import datetime +from pathlib import Path +from typing import Optional + + +@dataclass(frozen=True) +class HnswParams: + """HNSW build hyperparameters. See description.md § 5; defaults from the + FAISS team's HNSW32+M=32 / efConstruction=200 / efSearch=64 baseline.""" + m: int = 32 # number of connections per node + ef_construction: int = 200 # build-time candidate list size + ef_search: int = 64 # query-time candidate list size + metric: str = "L2" # "L2" | "INNER_PRODUCT" + + +@dataclass(frozen=True) +class IndexMetadata: + descriptor_dim: int # dimension of the indexed vectors + n_vectors: int # number of indexed tiles + backbone_label: str # producer backbone — e.g. 
"ultra_vpr_v0" + backbone_sha256_hex: str # producer backbone weights hash (D-C10-3 chain) + built_at: datetime # ISO 8601 UTC + hnsw_params: HnswParams + sidecar_sha256_hex: str # canonical content hash of the .index file + file_path: Path # absolute path to the .index file +``` + +### Numpy contract + +- `query`: shape `(descriptor_dim,)`, dtype `float32`, C-contiguous. The Protocol does NOT auto-pad batches; per-frame is a single query (C2's per-frame call site). +- `descriptors`: shape `(N, descriptor_dim)`, dtype `float32`, C-contiguous; `N == len(tile_ids)`. The Protocol does NOT validate shape mismatch — the impl raises `IndexBuildError` on dtype/shape violation. + +### Errors + +``` +TileCacheError (shared with TileStore / TileMetadataStore) +└── IndexUnavailableError # mmap handle invalid, file missing, or sidecar mismatched + +IndexBuildError # raised only by rebuild_from_descriptors; NOT in the read-side envelope +``` + +`IndexBuildError` is intentionally NOT a subclass of `TileCacheError` — the build path is offline, lives in C10's pre-flight provisioning, and has different fault semantics than the runtime-read path. C2 (the only runtime consumer) catches `IndexUnavailableError`; C10 catches `IndexBuildError`. + +## Invariants + +- **I-1 (immutable in flight):** once an `.index` file is opened via the impl's loader, the file's content MUST NOT change for the lifetime of the impl instance. F1 pre-flight is the only legal write path; a mid-flight rebuild is forbidden (the impl raises `IndexUnavailableError` if it detects a content-hash mismatch on a periodic sidecar re-check — out-of-band tampering signal). +- **I-2 (top-K is best-effort):** `search_topk(query, k=K)` MAY return fewer than K results when the corpus has fewer than K vectors. Consumers (C2) tolerate fewer-than-K results. 
+- **I-3 (descriptor-dim is fixed at build):** `descriptor_dim()` returns the value baked into the `.index` file at build time; if a consumer's query vector dimension does not match, the impl raises `IndexUnavailableError` (NOT a separate `DimensionMismatchError` — keeps the read-side envelope to a single error type). +- **I-4 (no GPU resident memory):** the impl MUST hold the index in CPU mmap'd memory only. FAISS GPU index variants are explicitly excluded — the F3 hot path's GPU is reserved for `c7_inference` engines (per NFT-LIM-01 / D-CROSS-LATENCY-1). +- **I-5 (atomic rebuild):** `rebuild_from_descriptors` MUST write to a temporary path, sync to disk, atomically rename to the target path, write the sidecar `.sha256`, and only then return. A crash mid-rebuild leaves the prior index intact. +- **I-6 (sidecar coherence):** `mmap_handle()` returns a path whose `.sha256` sidecar matches the file's actual content hash; if the sidecar is missing or mismatched, `IndexUnavailableError` is raised on the FIRST `search_topk` of the flight (not lazily on the read that hits the corrupted region). C10's pre-flight gate is the canonical place this is validated; this Protocol carries the runtime-side check as defence-in-depth. +- **I-7 (frozen DTOs):** `HnswParams`, `IndexMetadata` are `@dataclass(frozen=True)`. +- **I-8 (single-thread search):** `search_topk` is NOT re-entrant; the F3 hot path is single-threaded per the description.md assumption. Future multi-threaded callers MUST use a per-thread impl instance (out of scope this cycle). + +## Non-Goals + +- **Not covered: tile pixel I/O.** That's `TileStore`. +- **Not covered: tile metadata bbox queries.** That's `TileMetadataStore`. +- **Not covered: incremental updates / online learning.** F1 pre-flight is full-rebuild only. Future task if needed. +- **Not covered: GPU FAISS variants.** I-4 forbids them this cycle. 
+- **Not covered: cross-flight index sharing.** Each flight provisions its own per-area `.index`; cross-flight is a parent-suite concern (D-PROJ-2). +- **Not covered: descriptor compression / PQ quantisation.** HNSW32 raw float32 is the only supported variant this cycle. Future task if AC-8.3 (10 GB cap) becomes binding. +- **Not covered: backbone retraining.** This Protocol is consumer-facing; the producer side (C10's compile of an UltraVPR engine) lives in E-C7 / E-C10. + +## Versioning Rules + +Same rules as `tile_store.md` § Versioning Rules. Note that `IndexMetadata.backbone_sha256_hex` ties this contract's lifecycle to the C7 engine cache (AZ-298 / AZ-301 / AZ-281): a backbone weights bump invalidates every prior `.index` AND requires a coordinated update — recorded as a major version of THIS contract only when the field's shape changes; backbone-weight refreshes within the existing schema are non-breaking content updates handled by C10. + +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| protocol-conformance-full | A class implementing all 5 methods | `isinstance(impl, DescriptorIndex) == True` | Producer AC-1 | +| protocol-conformance-partial | A class missing `index_metadata` | `isinstance == False` | CI drift gate | +| search-topk-warm | Query vector of correct dim against a 10k-vector index, OS page cache warm | Returns `[(tile_id, distance), ...]` length ≤ k; p95 ≤ 5 ms | I-2 / Consumer C2-PT-01 | +| search-topk-fewer-than-k | k=20 against a 10-vector index | Returns 10 results, ordered by distance ascending | I-2 | +| search-topk-dim-mismatch | Query vector of wrong dim | `IndexUnavailableError` | I-3 | +| search-topk-corrupted-sidecar | Index file present, sidecar missing | First `search_topk` raises `IndexUnavailableError`; subsequent calls also raise (no silent recovery) | I-6 | +| descriptor-dim | After a rebuild with `descriptors.shape == (N, 512)` | `descriptor_dim() == 512` | I-3 | +| 
rebuild-atomic-on-crash | Simulated `os._exit` mid-rebuild | The original `.index` file is intact and still loadable; partial temp file is cleaned up at next start | I-5 | +| rebuild-sidecar-content-hash | Successful rebuild | `.sha256` sidecar matches `sha256(.index)` | I-6 / AZ-280 contract | +| index-metadata | After rebuild | Returns `IndexMetadata` with matching `descriptor_dim`, `n_vectors`, `built_at` (within 1 s of call), `hnsw_params` (mirrors input), `sidecar_sha256_hex` (matches sidecar content) | I-7 | +| frozen-dto-mutation | `HnswParams(m=32, ...).m = 64` | `FrozenInstanceError` | I-7 | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract — 5-method Protocol + HNSW params DTO + IndexMetadata sidecar shape + immutable-in-flight + atomic-rebuild invariants. | autodev (decompose Step 2 of AZ-250 / E-C6) | diff --git a/_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md b/_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md new file mode 100644 index 0000000..6c6a38e --- /dev/null +++ b/_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md @@ -0,0 +1,132 @@ +# Contract: TileMetadataStore Protocol + +**Component**: c6_tile_cache +**Producer task**: AZ-303 — `_docs/02_tasks/todo/AZ-303_c6_storage_interfaces.md` +**Consumer tasks**: +- AZ-TBD-c6-postgres-filesystem-store (implements) +- AZ-TBD-c6-freshness-gate (insert hook + sector classification reader) +- AZ-TBD-c6-cache-budget-eviction (LRU candidate enumeration + delete coordination) +- TBD at decompose time: E-C10 (AZ-252 — manifest + provisioning), E-C11 (AZ-251 — both `TileDownloader` insert and `TileUploader` reader queries), E-C12 (AZ-253 — operator pre-flight tooling) +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +Defines the typed boundary to the Postgres-backed spatial index over `TileMetadata`. 
Concrete impls (today only `PostgresFilesystemStore` — same class also implements `TileStore`) own row insert / bbox query / voting-state transitions. Pre-flight cache builders (C10 / C11 / C12), the F4 mid-flight orthorectifier path (via C5 → C6), and post-landing tooling (C11 `TileUploader`) all consume this surface. + +## Shape + +### Protocol surface + +`typing.Protocol` (PEP 544) decorated with `@runtime_checkable`. All methods are sync; the Postgres connection pool is owned inside the impl. + +| Method | Signature | Throws / Errors | Blocking? | +|--------|-----------|-----------------|-----------| +| `query_by_bbox` | `(bbox: Bbox, zoom: int, *, voting_filter: Optional[VotingStatus] = None, source_filter: Optional[TileSource] = None) -> list[TileMetadata]` | `TileMetadataError` | sync (btree index; ≤ 50 ms typical) | +| `insert_metadata` | `(metadata: TileMetadata) -> None` | `TileMetadataError`, `FreshnessRejectionError` | sync (single-row insert) | +| `update_voting_status` | `(tile_id: TileId, status: VotingStatus) -> None` | `TileMetadataError`, `TileNotFoundError` | sync | +| `mark_uploaded` | `(tile_id: TileId, uploaded_at: datetime) -> None` | `TileMetadataError`, `TileNotFoundError` | sync | +| `pending_uploads` | `() -> list[TileMetadata]` | `TileMetadataError` | sync (filtered query: `source = ONBOARD_INGEST AND uploaded_at IS NULL`) | +| `record_lru_access` | `(tile_id: TileId, accessed_at: datetime) -> None` | `TileMetadataError` | sync (timestamp update only — no row-level read) | +| `lru_candidates` | `(*, max_count: int) -> list[TileMetadata]` | `TileMetadataError` | sync (oldest-`accessed_at`-first; bounded result set) | +| `total_disk_bytes` | `() -> int` | `TileMetadataError` | sync (sum of `disk_bytes` column; ≤ 100 ms even at 100k rows) | +| `get_by_id` | `(tile_id: TileId) -> Optional[TileMetadata]` | `TileMetadataError` | sync; returns `None` if absent (NOT `TileNotFoundError`) | + +### DTOs + +Reuses `TileId`, `TileMetadata`, 
`TileQualityMetadata`, `TileSource`, `FreshnessLabel`, `VotingStatus` from `tile_store.md`. The same DTOs are shared across both Protocols by design (single source of truth in `c6_tile_cache._types`). + +```python +from dataclasses import dataclass +from datetime import datetime +from enum import Enum +from typing import Optional + + +@dataclass(frozen=True) +class Bbox: + """Axis-aligned WGS84 bounding box. Inclusive on min, exclusive on max.""" + min_lat: float + min_lon: float + max_lat: float + max_lon: float +``` + +In addition, `TileMetadata` is extended with three columns owned by the metadata store (`accessed_at`, `uploaded_at`, `disk_bytes`; NOT meaningful to `TileStore`; see Invariants): + +```python +@dataclass(frozen=True) +class TileMetadataPersistent: + metadata: TileMetadata # the read-only DTO from tile_store.md + accessed_at: datetime # LRU clock — last read time + uploaded_at: Optional[datetime] # set when C11 TileUploader has confirmed upload + disk_bytes: int # JPEG body size on disk; tracked for cache-budget enforcement +``` + +The Protocol returns `TileMetadata` from queries. `TileMetadataPersistent` is the in-process view of LRU and disk-budget state, accessible only via `lru_candidates` / `record_lru_access` / `total_disk_bytes`. + +### Sector classification (read-only input to the freshness gate) + +```python +class SectorClassification(str, Enum): + ACTIVE_CONFLICT = "active_conflict" + STABLE_REAR = "stable_rear" + + +@dataclass(frozen=True) +class SectorBoundary: + bbox: Bbox + classification: SectorClassification +``` + +`SectorClassification` is set pre-flight by the operator via C12; the metadata store reads `SectorBoundary` rows from a sibling table (`sector_boundaries`) at insert-time to decide which freshness rule to apply. The Protocol does NOT expose insert-side methods for `SectorBoundary` rows — that surface lives in C12. + +## Invariants + +- **I-1 (composite key uniqueness):** `(zoom_level, lat, lon, source)` is the unique key in the `tiles` table. Re-inserting the same key with a different `content_sha256_hex` raises `TileMetadataError` — no silent overwrite.
+- **I-2 (freshness gate at insert):** `insert_metadata` rejects (raises `FreshnessRejectionError`) iff the tile's `(lat, lon)` falls inside an `ACTIVE_CONFLICT` sector AND `capture_timestamp < now() - active_conflict_max_age`. The freshness rules table is configured per-flight (default 6 months for active_conflict; 12 months for stable_rear which downgrades rather than rejects). +- **I-3 (downgrade marking):** when a tile in a `STABLE_REAR` sector is older than `stable_rear_max_age`, the row is inserted with `freshness_label=DOWNGRADED` (NOT rejected). `query_by_bbox` returns the downgrade flag intact so consumers (C2 / C3 spoof-rejection) can act on it. +- **I-4 (LRU clock):** `record_lru_access` updates `accessed_at = max(current accessed_at, supplied timestamp)`; clock skew never sets `accessed_at` backward. `lru_candidates` returns oldest-first. +- **I-5 (disk-budget invariant):** `total_disk_bytes` MUST equal `SUM(disk_bytes)` over all rows where `voting_status != REJECTED`. Rejected rows are tombstones — they keep the on-disk file deleted but retain the row for the manifest's content-hash check (D-C10-3). +- **I-6 (frozen DTOs):** `Bbox`, `SectorBoundary`, `TileMetadataPersistent` are `@dataclass(frozen=True)`. +- **I-7 (transactional writes):** `insert_metadata` is a single transaction over the `tiles` table; the freshness check + the row insert MUST be atomic (a parallel sector-boundary update MUST NOT race the gate). +- **I-8 (no silent voting-status downgrade):** `update_voting_status` accepts only forward transitions (`PENDING → TRUSTED`, `PENDING → REJECTED`); a backward transition raises `TileMetadataError`. `TRUSTED → REJECTED` is allowed (covers the cache-poisoning recall path). +- **I-9 (`pending_uploads` is the single source for C11 TileUploader):** the uploader MUST NOT scan the filesystem for pending tiles; it MUST drive its loop off `pending_uploads()`. The metadata store is the bookkeeping. 
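As an illustration of I-4 and I-8 above, the following is a minimal in-memory sketch, not the Postgres implementation. The names `ALLOWED_TRANSITIONS`, `check_transition`, and `next_accessed_at` are hypothetical, the error class is a local stand-in for `c6_tile_cache.errors.TileMetadataError`, and rejecting a same-state "transition" is an assumption the contract does not spell out.

```python
from datetime import datetime
from enum import Enum


class VotingStatus(str, Enum):
    PENDING = "pending"
    TRUSTED = "trusted"
    REJECTED = "rejected"


class TileMetadataError(Exception):
    """Local stand-in for c6_tile_cache.errors.TileMetadataError."""


# Transitions that I-8 names as allowed; everything else is refused.
ALLOWED_TRANSITIONS = {
    (VotingStatus.PENDING, VotingStatus.TRUSTED),
    (VotingStatus.PENDING, VotingStatus.REJECTED),
    (VotingStatus.TRUSTED, VotingStatus.REJECTED),  # cache-poisoning recall path
}


def check_transition(current: VotingStatus, requested: VotingStatus) -> VotingStatus:
    """Return the new status, or raise if the move is not listed in I-8."""
    if (current, requested) not in ALLOWED_TRANSITIONS:
        raise TileMetadataError(
            f"voting status transition {current.value} -> {requested.value} not allowed"
        )
    return requested


def next_accessed_at(current: datetime, supplied: datetime) -> datetime:
    """I-4: the LRU clock never moves backward, even if the caller's clock is skewed."""
    return max(current, supplied)
```

In a SQL-backed store the same two rules would more likely live in the `UPDATE` statement itself (for example a `GREATEST(...)` expression and a `WHERE` clause on the current status), so that the check and the write happen in one statement; that design choice is outside what this contract fixes.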
+ +## Non-Goals + +- **Not covered: filesystem JPEG I/O.** That's `TileStore`. +- **Not covered: descriptor index queries.** That's `DescriptorIndex`. +- **Not covered: sector boundary insert / update.** Owned by C12 operator-tooling against a sibling table; this Protocol is read-only on `SectorBoundary` and does NOT expose CRUD. +- **Not covered: cross-flight aggregation / voting threshold computation.** That's `satellite-provider`'s D-PROJ-2 trust layer (parent suite); C6 just stamps the per-row `voting_status`. +- **Not covered: full-text search / arbitrary-WHERE queries.** Only the methods above; ad-hoc queries go through DBA tooling, not this Protocol. +- **Not covered: schema migrations.** Migration scripts live in `c6_tile_cache/_alembic/`; the Protocol is shape-only. + +## Versioning Rules + +Same rules as `tile_store.md` § Versioning Rules. + +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| protocol-conformance-full | A class implementing all 9 methods | `isinstance(impl, TileMetadataStore) == True` | Producer AC-1 | +| query-by-bbox-basic | bbox covering 100 inserted tiles at zoom=18 | Returns exactly the 100 tiles; `voting_filter=None` returns all statuses | Smoke | +| query-by-bbox-voting-filter | Same with `voting_filter=TRUSTED` | Returns only TRUSTED tiles in bbox | Used by C10 manifest builder | +| insert-duplicate-key | Insert (z=18, lat, lon, src=GOOGLEMAPS) twice with different content_sha256 | First succeeds; second raises `TileMetadataError` | I-1 | +| insert-active-conflict-stale | Insert into ACTIVE_CONFLICT sector, capture_timestamp = now - 7 months | `FreshnessRejectionError`; row not committed | I-2 / C6-IT-02 | +| insert-stable-rear-stale | Insert into STABLE_REAR sector, capture_timestamp = now - 13 months | Row inserted with `freshness_label=DOWNGRADED` | I-3 | +| update-voting-status-forward | PENDING → TRUSTED | Succeeds | I-8 | +| update-voting-status-backward | TRUSTED → PENDING | 
`TileMetadataError` | I-8 | +| update-voting-status-trusted-to-rejected | TRUSTED → REJECTED | Succeeds (recall path) | I-8 | +| pending-uploads-empty | No ONBOARD_INGEST tiles | Returns `[]` | I-9 | +| pending-uploads-after-mark | Insert + `mark_uploaded` for half | Returns the unmarked half | I-9 | +| record-lru-access-monotonic | `record_lru_access(t, ts1)` then `record_lru_access(t, ts0 < ts1)` | `accessed_at` stays at `ts1` | I-4 | +| lru-candidates-order | Mixed `accessed_at` for 100 rows; `lru_candidates(max_count=10)` | Returns the 10 oldest in ascending `accessed_at` order | I-4 | +| total-disk-bytes-sum | Insert 5 tiles with known disk_bytes, mark 1 REJECTED | `total_disk_bytes()` excludes the rejected row | I-5 | +| get-by-id-missing | Random tile_id never inserted | Returns `None` (not `TileNotFoundError`) | Documented null-return semantic | +| frozen-dto-mutation | `Bbox(0, 0, 1, 1).min_lat = 5.0` | `FrozenInstanceError` | I-6 | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract — 9-method Protocol + LRU/disk-budget extensions + freshness gate semantics + composite-key uniqueness invariant. 
| autodev (decompose Step 2 of AZ-250 / E-C6) | diff --git a/_docs/02_document/contracts/c6_tile_cache/tile_store.md b/_docs/02_document/contracts/c6_tile_cache/tile_store.md new file mode 100644 index 0000000..4bbaab8 --- /dev/null +++ b/_docs/02_document/contracts/c6_tile_cache/tile_store.md @@ -0,0 +1,166 @@ +# Contract: TileStore Protocol + +**Component**: c6_tile_cache +**Producer task**: AZ-303 — `_docs/02_tasks/todo/AZ-303_c6_storage_interfaces.md` +**Consumer tasks**: +- AZ-TBD-c6-postgres-filesystem-store (implements) +- AZ-TBD-c6-freshness-gate (insert hook collaborator) +- AZ-TBD-c6-cache-budget-eviction (uses `tile_exists` + `delete_tile`) +- TBD at decompose time: E-C2.5 (AZ-256), E-C3 (AZ-257), E-C11 (AZ-251 — both `TileDownloader` and `TileUploader`) +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +Defines the typed boundary between filesystem-resident tile pixel I/O and every component that produces or consumes JPEG tile bytes. Concrete impls (today only `PostgresFilesystemStore`) write JPEGs to a layout byte-identical to `satellite-provider`'s on-disk format so the C11 `TileUploader` post-landing upload (F10) is a straight copy. + +## Shape + +### Protocol surface + +`typing.Protocol` (PEP 544 structural typing) with `runtime_checkable=True`. + +| Method | Signature | Throws / Errors | Blocking? 
| +|--------|-----------|-----------------|-----------| +| `read_tile_pixels` | `(tile_id: TileId) -> TilePixelHandle` | `TileNotFoundError`, `TileFsError` | sync (mmap, ≤ 0.5 ms warm; ≤ 50 ms cold) | +| `write_tile` | `(tile_blob: bytes, metadata: TileMetadata) -> None` | `TileFsError`, `TileMetadataError`, `ContentHashMismatchError`, `FreshnessRejectionError` | sync (atomic fs write + sidecar) | +| `tile_exists` | `(tile_id: TileId) -> bool` | — | sync (page-cache lookup; ≤ 1 ms) | +| `delete_tile` | `(tile_id: TileId) -> bool` | `TileFsError` | sync (returns `True` if a file was removed; `False` if missing — no-error path for the cache-eviction caller) | + +### DTOs + +```python +from dataclasses import dataclass +from datetime import datetime +from enum import Enum +from pathlib import Path +from typing import Optional + + +@dataclass(frozen=True) +class TileId: + zoom_level: int # 0..21 — `satellite-provider` legal range + lat: float # WGS84 centre latitude + lon: float # WGS84 centre longitude + + +class TileSource(str, Enum): + GOOGLEMAPS = "googlemaps" + ONBOARD_INGEST = "onboard_ingest" + + +class FreshnessLabel(str, Enum): + FRESH = "fresh" + STALE_ACTIVE_CONFLICT = "stale_active_conflict" + STALE_REAR = "stale_rear" + DOWNGRADED = "downgraded" + + +class VotingStatus(str, Enum): + PENDING = "pending" + TRUSTED = "trusted" + REJECTED = "rejected" + + +@dataclass(frozen=True) +class TileQualityMetadata: + estimator_label: str # "satellite_anchored" | "visual_propagated" | "dead_reckoned" + covariance_2x2: tuple[tuple[float, float], tuple[float, float]] + last_anchor_age_ms: int + mre_px: float + imu_bias_norm: float + + +@dataclass(frozen=True) +class TileMetadata: + tile_id: TileId + tile_size_meters: float + tile_size_pixels: int + capture_timestamp: datetime # ISO 8601 UTC + source: TileSource + content_sha256_hex: str # canonical sha256 of the JPEG body + freshness_label: FreshnessLabel + flight_id: Optional[str] # uuid; set for ONBOARD_INGEST + 
companion_id: Optional[str] # set for ONBOARD_INGEST + quality_metadata: Optional[TileQualityMetadata] # set for ONBOARD_INGEST + voting_status: VotingStatus # default PENDING for ONBOARD_INGEST + + +class TilePixelHandle: + """Opaque handle: filesystem path + mmap pointer. Consumer MUST NOT copy the bytes + or close the underlying mapping; the handle's lifetime is bounded by the caller's + use-site `with` block.""" + + @property + def filesystem_path(self) -> Path: ... + def __enter__(self) -> memoryview: ... + def __exit__(self, *exc) -> None: ... +``` + +### Error types + +All under `c6_tile_cache.errors`: + +``` +TileCacheError (Exception subclass) +├── TileNotFoundError # tile_id not present on disk +├── TileFsError # I/O error on read/write/rename +├── TileMetadataError # row missing despite file present, or vice-versa (consistency violation) +├── ContentHashMismatchError # supplied JPEG bytes don't match declared content_sha256 +└── FreshnessRejectionError # rejected by the C6 freshness gate (raised on insert in active_conflict) +``` + +`IndexUnavailableError` lives under the same package but is exclusively raised by `DescriptorIndex` — it is not part of `TileStore`'s envelope. + +### Filesystem layout + +JPEG body lands at `/tiles/{zoom_level}/{x}/{y}.jpg` where `(x, y)` is derived from `(lat, lon, zoom_level)` per the same Web-Mercator tile-coordinate function `satellite-provider` uses (see `satellite-provider/README.md`). A sidecar file `/tiles/{zoom_level}/{x}/{y}.jpg.sha256` carries the canonical content hash (produced by `helpers.sha256_sidecar.atomic_write_with_sidecar` per AZ-280 contract). + +## Invariants + +- **I-1 (byte-identity with satellite-provider):** for any `(zoom_level, lat, lon)`, the filesystem path computed by C6 `write_tile` MUST equal the path that `satellite-provider` would compute for the same coordinate; any deviation breaks AC-8.4 / F10 upload. 
+- **I-2 (atomic write + sidecar invariant):** a successful `write_tile` returns only after BOTH the JPEG file AND its `.sha256` sidecar are durable on disk; partial states (file without sidecar or sidecar without file) MUST NOT be observable to readers. +- **I-3 (content-hash gate):** `write_tile` rejects (raises `ContentHashMismatchError`) if `sha256(tile_blob) != metadata.content_sha256_hex`; the cache-poisoning safety budget (D-C10-3 + AC-NEW-7) is bound to this check. +- **I-4 (read mmap is read-only):** `TilePixelHandle.__enter__()` returns a read-only `memoryview`; consumers MUST NOT mutate; a writer that mutates through the mmap is a `Reliability` finding (Critical) at code-review time. +- **I-5 (race-free reads under concurrent F4 writes):** C2 / C2.5 / C3 readers see either the pre-write tile bytes or the post-write tile bytes — never partial bytes. Enforced by `atomicwrites` rename semantics on the writer side. +- **I-6 (idempotent delete):** `delete_tile` returns `False` when the tile is missing; it does NOT raise. The cache-eviction caller relies on this no-error path because it deletes by LRU and may race with a concurrent eviction sweep. +- **I-7 (frozen DTOs):** `TileId`, `TileMetadata`, `TileQualityMetadata` are `@dataclass(frozen=True)`. Mutation raises `FrozenInstanceError`. +- **I-8 (fail-fast on consistency violation):** if a row exists in the metadata store but the JPEG file is missing (or vice-versa), `read_tile_pixels` raises `TileMetadataError` — NOT `TileNotFoundError`. The two errors are the operator's signal that the cache is in a degraded state and needs reprovisioning. + +## Non-Goals + +- **Not covered: tile descriptor index.** Descriptor mmap + HNSW search is `DescriptorIndex` — separate Protocol, separate contract. +- **Not covered: spatial bbox queries.** `query_by_bbox` is on `TileMetadataStore` — separate Protocol. 
+- **Not covered: HTTP transport to satellite-provider.** C11 `TileDownloader` / `TileUploader` own transport; they call `TileStore.write_tile` / `TileStore.read_tile_pixels` for the local-side persistence step. +- **Not covered: eviction policy.** `delete_tile` is the eviction primitive; the LRU policy lives in the cache-budget enforcer (separate task). +- **Not covered: multi-process readers writing concurrently.** Single-process producer/consumer per flight; multi-process scenarios are out of scope this cycle. + +## Versioning Rules + +- **Breaking** (renamed method, removed field, type change, required→optional flip, error class removed from family) requires a major version bump and a coordinated update of every consumer task listed in this header. Producer task MUST surface the change to the user via Choose format before merging. +- **Non-breaking additions** (new optional method via Protocol structural compatibility, new optional field on a DTO with a default, new error variant added to `TileCacheError`) require a minor version bump. +- **Patch** (clarification only, no shape change) is documentation-only. 
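Before the test cases, a small sketch of how the content-hash gate (I-3) and the never-partial read guarantee (I-5) might fit together. It is an assumption-laden illustration: the contract names `atomicwrites` and `helpers.sha256_sidecar.atomic_write_with_sidecar`, whereas this sketch swaps in stdlib `tempfile` plus `os.replace`, omits the `.sha256` sidecar entirely, and uses a hypothetical function name `write_tile_body` with a local stand-in error class.

```python
import hashlib
import os
import tempfile
from pathlib import Path


class ContentHashMismatchError(Exception):
    """Local stand-in for c6_tile_cache.errors.ContentHashMismatchError."""


def write_tile_body(target: Path, tile_blob: bytes, declared_sha256_hex: str) -> None:
    """Verify the declared hash, then publish the JPEG body via an atomic rename."""
    actual = hashlib.sha256(tile_blob).hexdigest()
    if actual != declared_sha256_hex.lower():
        # I-3: reject before anything touches the target directory.
        raise ContentHashMismatchError(f"declared {declared_sha256_hex}, got {actual}")

    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(tile_blob)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Readers see either the old file or the new one, never a partial body.
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Satisfying I-2 in full would additionally require the sidecar to become visible together with the JPEG, which this sketch deliberately leaves to the AZ-280 helper.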
+ +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| protocol-conformance-full | A class implementing all four methods with matching signatures | `isinstance(impl, TileStore) == True` | I-1 / Producer AC-1 | +| protocol-conformance-partial | A class missing `delete_tile` | `isinstance == False` | Drift detection at CI time | +| frozen-dto-mutation | `TileId(zoom_level=18, lat=49.94, lon=36.31).lat = 0.0` | `FrozenInstanceError` | I-7 | +| write-tile-byte-identical | `write_tile(blob, metadata)` for `(zoom=18, lat=49.94, lon=36.31)` | Filesystem path equals `satellite-provider`'s path for same coord; JPEG bytes equal `blob`; sidecar contains `sha256(blob)` | I-1 / I-2 / C6-IT-01 | +| write-tile-content-hash-mismatch | `write_tile(blob, metadata.with(content_sha256_hex="0x00..."))` | `ContentHashMismatchError`; no file written; no sidecar written | I-3 / C6-ST-01 | +| write-tile-freshness-reject | active_conflict sector + stale tile | `FreshnessRejectionError`; no file/row written | Hand-off to freshness-gate task | +| read-tile-pixels-warm | `read_tile_pixels(tile_id)` after a prior write; OS page cache warm | `TilePixelHandle.__enter__()` returns within 0.5 ms; bytes equal the written JPEG body | C6-PT-01 | +| read-tile-pixels-missing | `read_tile_pixels(tile_id)` for never-written tile | `TileNotFoundError` | I-8 (the row-missing-and-file-missing case) | +| read-tile-pixels-row-without-file | metadata row exists; JPEG file deleted out-of-band | `TileMetadataError` (not `TileNotFoundError`) | I-8 | +| concurrent-write-and-read | F4 writer + 9 Hz C2.5 reader on same tile | Reader sees either pre-write or post-write bytes — never partial | I-5 | +| delete-tile-missing | `delete_tile` for a never-written tile | Returns `False`; no exception raised | I-6 | +| delete-tile-existing | `delete_tile` after a prior write | Returns `True`; subsequent `tile_exists` returns `False`; sidecar also removed | I-6 | + +## Change Log + +| 
Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract — Protocol + DTOs + 5-error family + filesystem byte-identity invariant. | autodev (decompose Step 2 of AZ-250 / E-C6) | diff --git a/_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md b/_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md new file mode 100644 index 0000000..adfe569 --- /dev/null +++ b/_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md @@ -0,0 +1,176 @@ +# Contract: InferenceRuntime Protocol + +**Component**: c7_inference +**Producer task**: AZ-297 — `_docs/02_tasks/todo/AZ-297_c7_runtime_protocol.md` +**Consumer tasks**: +- AZ-298 (TensorrtRuntime — implements) +- AZ-299 (OnnxTrtEpRuntime — implements) +- AZ-300 (PytorchFp16Runtime — implements) +- AZ-301 (EngineGate — uses error types) +- AZ-302 (ThermalState publisher — extends `ThermalState` DTO with `is_telemetry_available`) +- TBD at decompose time: E-C2 (AZ-250), E-C2.5 (AZ-251), E-C3 (AZ-252), E-C3.5 (AZ-253), E-C4 (AZ-254 — `ThermalState` consumer), E-C10 (AZ-257 — `compile_engine` caller) +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +Defines the typed boundary between the on-Jetson inference runtime (engine compilation, deserialisation, per-call inference, GPU memory management, thermal-throttle telemetry) and every downstream component that depends on GPU inference. The Protocol is the single point of contact that lets ADR-001 select between three concrete strategies (TensorRT 10.3 production, ONNX Runtime + TRT EP fallback, PyTorch FP16 simple-baseline) at startup without consumers caring which is wired. + +## Shape + +### Protocol surface + +The Protocol is `typing.Protocol` (PEP 544 structural typing) with `runtime_checkable=True`. + +| Method | Signature | Throws / Errors | Blocking? 
| +|--------|-----------|-----------------|-----------| +| `compile_engine` | `(model_path: Path, build_config: BuildConfig) -> EngineCacheEntry` | `EngineBuildError`, `CalibrationCacheError` | sync (offline; minutes for INT8) | +| `deserialize_engine` | `(entry: EngineCacheEntry) -> EngineHandle` | `EngineDeserializeError`, `EngineHashMismatchError`, `EngineSchemaMismatchError`, `EngineSidecarMissingError`, `OutOfMemoryError` | sync | +| `infer` | `(handle: EngineHandle, inputs: dict[str, np.ndarray]) -> dict[str, np.ndarray]` | `InferenceError`, `OutOfMemoryError` | sync (GPU stream sync) | +| `release_engine` | `(handle: EngineHandle) -> None` | — (idempotent) | sync | +| `thermal_state` | `() -> ThermalState` | `TelemetryUnavailableError` (only on cold-start fail; steady-state defaults to `is_telemetry_available=False`) | sync | +| `current_runtime_label` | `() -> Literal["tensorrt", "onnx_trt_ep", "pytorch_fp16"]` | — | sync | + +### DTOs + +All DTOs are stdlib `@dataclass(frozen=True)` (`EngineHandle` is the exception — opaque marker class). + +```python +from dataclasses import dataclass +from enum import Enum +from pathlib import Path +from typing import Optional + + +class PrecisionMode(str, Enum): + FP16 = "fp16" + INT8 = "int8" + MIXED = "mixed" + + +@dataclass(frozen=True) +class OptimizationProfile: + input_name: str + min_shape: tuple[int, ...] + opt_shape: tuple[int, ...] + max_shape: tuple[int, ...] + + +@dataclass(frozen=True) +class BuildConfig: + precision: PrecisionMode + workspace_mb: int + calibration_dataset: Optional[Path] # required for INT8; None for FP16/Mixed + optimization_profiles: tuple[OptimizationProfile, ...] 
+ use_trtexec: bool = False # TRT-only hint; ignored by ORT / PyTorch + + +@dataclass(frozen=True) +class EngineCacheEntry: + engine_path: Path # `.engine` for TRT/ORT; `.onnx` for ORT-direct; `.pt` for PyTorch + sha256_hex: str # canonical sha256 of engine_path + sm: Optional[int] # None for PyTorch (hardware-portable) + jp: Optional[str] # JetPack version, e.g. "6.2" + trt: Optional[str] # TensorRT version, e.g. "10.3" + precision: PrecisionMode + extras: dict[str, str] # implementation-specific (e.g., calibration cache path) + + +class EngineHandle: + """Opaque marker class. Consumers MUST NOT introspect; pass back to the same runtime.""" + pass + + +@dataclass(frozen=True) +class ThermalState: + cpu_temp_c: Optional[float] + gpu_temp_c: Optional[float] + thermal_throttle_active: bool # default False on telemetry unavailability + measured_clock_mhz: Optional[int] + measured_at_ns: int # monotonic_ns of poll + is_telemetry_available: bool # False if the source is hung/absent (default-safe path) +``` + +### Error hierarchy + +All errors live under `c7_inference.errors`: + +``` +RuntimeError (Exception subclass — NOT stdlib RuntimeError) +├── EngineBuildError +├── EngineDeserializeError +├── EngineHashMismatchError +├── EngineSchemaMismatchError +├── EngineSidecarMissingError +├── CalibrationCacheError +├── InferenceError +├── OutOfMemoryError +└── TelemetryUnavailableError + +RuntimeNotAvailableError (composition-root only; NOT a Protocol family error) +ConfigSchemaError (config-load only; NOT a Protocol family error) +``` + +Consumers catch the family with `except c7_inference.errors.RuntimeError as e`. Implementations MUST raise only members of this family from Protocol methods; third-party library errors (TRT C++ exceptions, ORT internal errors, PyTorch CUDA errors) MUST be caught and rewrapped. 
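One way an implementation could honour the rewrap rule is a small decorator on each Protocol method. The sketch below is hypothetical: `rewrap_as` and `infer_stub` are illustrative names, and the base class is called `C7RuntimeError` here only to avoid shadowing the builtin in a standalone snippet (the contract's own name is `c7_inference.errors.RuntimeError`).

```python
import functools


class C7RuntimeError(Exception):
    """Local stand-in for the c7_inference.errors.RuntimeError family base."""


class InferenceError(C7RuntimeError):
    """Local stand-in for c7_inference.errors.InferenceError."""


def rewrap_as(error_cls):
    """Decorator: let family errors through, rewrap anything else as error_cls."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except C7RuntimeError:
                raise  # already a family member; pass through unchanged
            except Exception as exc:
                # Keep the third-party error reachable via __cause__ for debugging.
                raise error_cls(f"{fn.__name__} failed: {exc}") from exc

        return wrapper

    return decorator


@rewrap_as(InferenceError)
def infer_stub(inputs):
    # Simulates a backend library raising its own exception type.
    raise ValueError("backend exploded")
```

A caller that catches the family base then sees an `InferenceError` whose `__cause__` is the original `ValueError`.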
+ +### Composition-root factory + +Defined in `runtime_root/inference_factory.py` (NOT in `c7_inference` itself; the factory is the wiring layer): + +```python +def build_inference_runtime(config: Config) -> InferenceRuntime: + """ + Selects exactly one strategy by config.inference.runtime + BUILD_* flag gating. + Raises RuntimeNotAvailableError if the requested strategy's BUILD_* flag is OFF. + """ +``` + +## Invariants + +- **I-1 (single source of truth for runtime label):** `current_runtime_label()` returns a string equal to `config.inference.runtime`. AC-NEW-3 audit relies on this exact-match property. +- **I-2 (Protocol-family error envelope):** Every Protocol method raises only members of `c7_inference.errors.RuntimeError` family or returns normally. Third-party exceptions are caught and rewrapped. +- **I-3 (frozen DTOs):** `BuildConfig`, `EngineCacheEntry`, `ThermalState`, and `OptimizationProfile` are `@dataclass(frozen=True)`. Mutation attempts raise `FrozenInstanceError`. +- **I-4 (opaque EngineHandle):** Consumers MUST NOT introspect `EngineHandle` fields. Implementations subclass with private state; the Protocol surface is unchanged. +- **I-5 (lazy-import gating):** Concrete strategies are imported only inside the factory's `if BUILD_*:` blocks. The package `__init__.py` exports only the Protocol, DTOs, and errors. A Tier-0 build with `BUILD_TENSORRT_RUNTIME=OFF` MUST NOT load `c7_inference.tensorrt_runtime` (verifiable via `sys.modules`). +- **I-6 (default-safe thermal):** When `ThermalState.is_telemetry_available == False`, `ThermalState.thermal_throttle_active == False` (the steady-state default; consumers may choose to ignore the throttle bit when telemetry is unavailable). +- **I-7 (idempotent release):** `release_engine(handle)` may be called more than once on the same handle; second-and-later calls return silently. 
+- **I-8 (sync-stream `infer`):** `infer` returns only after the GPU stream has synchronised; the returned dict's tensors are host-resident (numpy arrays) and ready for consumer use. + +## Non-Goals + +- **Not covered: multi-stream concurrent inference.** One CUDA stream per Runtime instance this cycle. Future work if the F3 hot path becomes multi-threaded. +- **Not covered: cross-process engine cache reuse.** Engines are per-process; a separate process must deserialise from the on-disk cache. +- **Not covered: per-frame input/output type negotiation.** Inputs / outputs are numpy arrays in named dicts; type / dtype negotiation is per-strategy and per-engine. +- **Not covered: streaming / iterative inference.** `infer` is request/response; no callbacks, no chunked outputs. +- **Not covered: dynamic batch.** `OptimizationProfile` carries `min_shape / opt_shape / max_shape`, but the consumer is responsible for picking the actual runtime shape; the Protocol does not auto-batch. +- **Not covered: engine versioning / hot-reload.** Engines are loaded at takeoff (F2) and held for the flight; a new engine requires a process restart. + +## Versioning Rules + +- **Breaking changes** (renamed method, removed field, type change, required→optional flip, error-class removed from family) require a new major version (`2.0.0`) and a deprecation path for every consumer task listed in the contract header. The change log MUST list each consuming task that needs a coordinated update. +- **Non-breaking additions** (new optional method via Protocol structural compatibility, new optional field on a DTO with a default, new error variant added to the family) require a minor version bump (e.g., `1.1.0`). +- **Patch** (clarification only; no shape change) is documentation-only. + +The current contract is `1.0.0`. It already includes `ThermalState.is_telemetry_available` from AZ-302, which would otherwise have been a `1.1.0` extension; because the field was added before the first freeze, the version stays `1.0.0`.
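The conformance cases in this contract rely on `isinstance` against a runtime-checkable Protocol. A reduced sketch of that behaviour follows, using a hypothetical two-method `MiniRuntime` rather than the full six-method surface. It also shows a caveat worth keeping in mind: a runtime `isinstance` check only looks at whether the members exist, not at their signatures, so signature drift needs a separate check.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class MiniRuntime(Protocol):
    """Hypothetical two-method slice of the InferenceRuntime surface."""

    def infer(self, handle: object, inputs: dict) -> dict: ...
    def release_engine(self, handle: object) -> None: ...


class Full:
    def infer(self, handle, inputs):
        return inputs

    def release_engine(self, handle):
        return None


class Partial:
    # Missing release_engine, so the structural check fails.
    def infer(self, handle, inputs):
        return inputs


class WrongSignature:
    # Both names are present, but the parameters do not match the Protocol.
    def infer(self):
        return {}

    def release_engine(self):
        return None


full_ok = isinstance(Full(), MiniRuntime)
partial_ok = isinstance(Partial(), MiniRuntime)
# Passes the runtime check despite the mismatched signatures.
wrong_sig_ok = isinstance(WrongSignature(), MiniRuntime)
```

This is presumably why a separate introspection-parity case exists alongside the two conformance cases; a static type checker or an `inspect.signature` comparison would be the usual ways to cover the signature side.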
+ +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| protocol-conformance-full | A class implementing all six methods | `isinstance(impl, InferenceRuntime) == True` | AZ-297 AC-1 | +| protocol-conformance-partial | A class missing `thermal_state` | `isinstance == False` | AZ-297 AC-1 | +| frozen-dto-mutation | `BuildConfig(precision=Fp16, ...).precision = Int8` | `FrozenInstanceError` | AZ-297 AC-2 / I-3 | +| error-family-catch-all | Raise each of the nine error subtypes | All caught by `except c7_inference.errors.RuntimeError` | AZ-297 AC-3 / I-2 | +| factory-tensorrt-on | `config.inference.runtime="tensorrt"` + `BUILD_TENSORRT_RUNTIME=ON` | Returns `TensorrtRuntime`; label `"tensorrt"` | AZ-297 AC-4 | +| factory-tensorrt-off | Same config + `BUILD_TENSORRT_RUNTIME=OFF` | `RuntimeNotAvailableError`; `sys.modules` does NOT contain `c7_inference.tensorrt_runtime` | AZ-297 AC-5 / I-5 | +| factory-unknown-runtime | `config.inference.runtime="tensorflow_lite"` | `ConfigSchemaError` at config-load time | AZ-297 AC-6 | +| label-exact-match | Runtime constructed for each of the three strategies | `current_runtime_label()` == `config.inference.runtime` | AZ-297 AC-7 / I-1 | +| contract-introspection-parity | Parse this file's Shape section vs. the runtime Protocol | All methods, fields, errors match | AZ-297 AC-8 | +| thermal-default-safe | `ThermalState(is_telemetry_available=False, thermal_throttle_active=True)` | Implementations MUST NOT construct this — invariant I-6 says `throttle_active=False` whenever `is_telemetry_available=False`. A test asserts the publisher's output respects this. | I-6 | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract — Protocol + 4 DTOs + 9-error family + composition-root factory + lazy-import gating. 
Includes the `ThermalState.is_telemetry_available` field added by AZ-302 (no separate version bump because the field landed before first freeze). | autodev (AZ-297 / AZ-302 coordination) | diff --git a/_docs/02_document/contracts/c8_fc_adapter/fc_adapter_protocol.md b/_docs/02_document/contracts/c8_fc_adapter/fc_adapter_protocol.md new file mode 100644 index 0000000..94e347c --- /dev/null +++ b/_docs/02_document/contracts/c8_fc_adapter/fc_adapter_protocol.md @@ -0,0 +1,181 @@ +# Contract: `FcAdapter` / `GcsAdapter` Protocols + +**Owner**: c8_fc_adapter (epic AZ-261 / E-C8) +**Producer task**: AZ-390 (FcAdapter / GcsAdapter Protocols + DTOs + errors + factories + composition) +**Consumer tasks**: AZ-391 (Inbound subscription + telemetry dispatch), AZ-392 (CovarianceProjector), AZ-393 (PymavlinkArdupilotAdapter outbound), AZ-394 (Msp2InavAdapter outbound), AZ-395 (MAVLink 2.0 signing handshake), AZ-396 (Source-set switch), AZ-397 (GcsAdapter + QgcTelemetryAdapter). +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 +**Module-layout home**: `src/gps_denied_onboard/components/c8_fc_adapter/interface.py`, `src/gps_denied_onboard/components/c8_fc_adapter/__init__.py`, `src/gps_denied_onboard/runtime_root/fc_factory.py` + +## Purpose + +Defines the public interfaces for C8: per-FC inbound telemetry subscription + outbound external-position emission, plus the GCS link. Two production `FcAdapter` strategies linked at build time per ADR-002: `PymavlinkArdupilotAdapter` (ArduPilot Plane via MAVLink 2.0 with signing) and `Msp2InavAdapter` (iNav via MSP2, unsigned per RESTRICT-COMM-2). One production `GcsAdapter` strategy: `QgcTelemetryAdapter` (downsampled 1–2 Hz summary to QGroundControl + operator command ingestion). Selected at startup via `config.fc.adapter` and `config.gcs.adapter` with `BUILD_FC_` / `BUILD_GCS_` flag gating per ADR-002. 
+ +C8 is the **single source** of FC inbound telemetry — C1 (VIO) and C5 (StateEstimator) receive `ImuWindow` / `AttitudeWindow` / `GpsHealth` / `FlightStateSignal` exclusively via a constructor-injected `FcAdapter`. C8 is also the **single sink** of outbound external-position — C5's `EstimatorOutput` is encoded into the per-FC wire format at 5 Hz with honest 6×6 → 2×2 covariance projection. + +Replay extensions (AZ-265 / E-DEMO-REPLAY) live inside the same component but ship under separate `BUILD_TLOG_REPLAY_ADAPTER` / `BUILD_REPLAY_SINK_JSONL` flags; they implement the same Protocols and are out of scope for E-C8 itself. + +The shared `WgsConverter` (AZ-279), `SE3Utils` (AZ-277), and `FdrClient` (AZ-273) helpers are constructor-injected. + +## Public API + +### Protocol: `FcAdapter` + +```python +@runtime_checkable +class FcAdapter(Protocol): + def open(self, port: PortConfig, signing_key: bytes | None) -> None: ... + def close(self) -> None: ... + def subscribe_telemetry( + self, callback: Callable[[FcTelemetryFrame], None] + ) -> Subscription: ... + def emit_external_position(self, output: EstimatorOutput) -> None: ... + def emit_status_text(self, msg: str, severity: Severity) -> None: ... + def request_source_set_switch(self) -> None: ... # AP-only; iNav raises SourceSetSwitchNotSupportedError + def current_flight_state(self) -> FlightStateSignal: ... +``` + +### Protocol: `GcsAdapter` + +```python +@runtime_checkable +class GcsAdapter(Protocol): + def open(self, port: PortConfig) -> None: ... + def close(self) -> None: ... + def emit_summary(self, output: EstimatorOutput) -> None: ... # internally rate-limited to 1–2 Hz + def subscribe_operator_commands( + self, callback: Callable[[OperatorCommand], None] + ) -> Subscription: ... + def emit_status_text(self, msg: str, severity: Severity) -> None: ... +``` + +### DTOs (frozen, slotted) + +```python +@dataclass(frozen=True, slots=True) +class PortConfig: + device: str # e.g. 
/dev/ttyTHS1 + baud: int + fc_kind: FcKind # enum {ARDUPILOT_PLANE, INAV} + +class FcKind(Enum): + ARDUPILOT_PLANE = "ardupilot_plane" + INAV = "inav" + +class Severity(Enum): + INFO = 6 + WARNING = 4 + ERROR = 3 # values mirror MAVLink STATUSTEXT severities + +@dataclass(frozen=True, slots=True) +class FcTelemetryFrame: + kind: TelemetryKind # enum {IMU_SAMPLE, ATTITUDE, GPS_HEALTH, MAV_STATE} + payload: TelemetryPayload # union; see _types/fc.py + received_at: int # monotonic_ns + signed: bool # true ONLY for AP signed frames + +@dataclass(frozen=True, slots=True) +class FlightStateSignal: + state: FlightState # enum {INIT, ARMED, IN_FLIGHT, ON_GROUND, FAILED} + last_valid_gps_hint_wgs84: LatLonAlt | None # for AC-5.1 warm-start + last_valid_gps_age_ms: int | None + captured_at: int # monotonic_ns + +@dataclass(frozen=True, slots=True) +class GpsHealth: + status: GpsStatus # enum {NO_FIX, DEGRADED, STABLE, STABLE_NON_SPOOFED, SPOOFED} + fix_age_ms: int + captured_at: int + +@dataclass(frozen=True, slots=True) +class EmittedExternalPosition: + fc_kind: FcKind + horiz_accuracy_m: float # AP horiz_accuracy / iNav hPosAccuracy (mm internally) + source_label: PoseSourceLabel + emitted_at: int # monotonic_ns + sequence_number: int +``` + +### Error hierarchy + +```python +class FcAdapterError(Exception): ... +class FcOpenError(FcAdapterError): ... +class FcEmitError(FcAdapterError): ... +class SigningHandshakeError(FcAdapterError): ... +class SigningKeyExpiredError(FcAdapterError): ... +class SourceSetSwitchError(FcAdapterError): ... +class SourceSetSwitchNotSupportedError(SourceSetSwitchError): ... +class FcAdapterConfigError(FcAdapterError): ... + +class GcsAdapterError(Exception): ... +class GcsEmitError(GcsAdapterError): ... +class GcsAdapterConfigError(GcsAdapterError): ... 
+``` + +### Composition-root factories + +```python +def build_fc_adapter( + config: AppConfig, + wgs_converter: WgsConverter, + se3_utils: SE3Utils, + covariance_projector: CovarianceProjector, + fdr_client: FdrClient, + clock: Clock, +) -> FcAdapter: ... + +def build_gcs_adapter( + config: AppConfig, + fdr_client: FdrClient, + clock: Clock, +) -> GcsAdapter: ... +``` + +Selection: `config.fc.adapter ∈ {"ardupilot_plane", "inav"}` → corresponding strategy, gated by `BUILD_FC_ARDUPILOT_PLANE` / `BUILD_FC_INAV`. `config.gcs.adapter ∈ {"qgc_mavlink"}` → `QgcTelemetryAdapter`, gated by `BUILD_GCS_QGC_MAVLINK`. Unknown strategy → `FcAdapterConfigError` / `GcsAdapterConfigError` at config load. Build-flag OFF for the requested strategy → same error class with the disabled-flag name in the message. + +## Invariants + +1. **Single open**: `open(...)` MUST be called exactly once per adapter instance. Re-open raises `FcOpenError`. `close()` is idempotent. +2. **Signing key required for AP**: `PymavlinkArdupilotAdapter.open(...)` with `signing_key=None` raises `SigningHandshakeError`. `Msp2InavAdapter.open(...)` MUST reject any non-None `signing_key` with `FcAdapterConfigError` (RESTRICT-COMM-2 — iNav has no signing). +3. **5 Hz periodic emit**: `emit_external_position` is consumed at exactly 5 Hz by the runtime root's emit timer. The adapter does NOT drive its own timer; it only encodes + writes when called. Internal emission rate-limit lives in the runtime root. +4. **Honest covariance projection**: every emitted external-position MUST have `horiz_accuracy_m` derived from the input `EstimatorOutput.covariance_6x6` via the shared `CovarianceProjector` — Frobenius-norm equivalence to the source 3×3 horizontal block within 1% (C8-IT-01). NEVER substitute a constant or downsampled estimate. +5. 
**Source-label propagation**: `EstimatorOutput.source_label` MUST be re-emitted via the per-FC out-of-band channel (AP: `NAMED_VALUE_FLOAT` + STATUSTEXT; iNav: STATUSTEXT only via the MAVLink telemetry side-channel). +6. **Smoothed estimates rejected**: `emit_external_position` MUST raise `FcEmitError` if `output.smoothed == True`. The forward-time invariant (AC-4.5 revised) is enforced at the C8 boundary as a defensive backstop on top of C5's filtering. +7. **Inbound timestamp monotonicity**: `FcTelemetryFrame.received_at` MUST be monotonically non-decreasing per kind. Out-of-order frames are dropped + logged at WARN. +8. **Single-writer thread for outbound**: `emit_external_position`, `emit_status_text`, and `request_source_set_switch` MUST be called from the same thread. Multi-thread write raises `RuntimeError`. Inbound subscribe-callbacks fire on the inbound decode thread; consumers must handle the thread boundary themselves. +9. **iNav signing assertion**: the iNav adapter MUST never emit a MAVLink2 frame with the signed-flag set, even on the side-channel telemetry link. Verified by C8-IT-08. +10. **Per-flight key zeroisation**: at `close()` (or process exit), the AP signing key buffer MUST be overwritten with zeroes before deallocation. The key MUST never be written to disk. Verified by C8-ST-02. +11. **Source-set switch idempotence**: `request_source_set_switch()` is safe to call multiple times in the same flight. Re-entry within 1 s is no-op'd (rate-limited); re-entry after a successful switch logs INFO + sends STATUSTEXT but does not re-issue the command. +12. **GcsAdapter downsampling**: `emit_summary` is invoked at 5 Hz by the runtime; the adapter internally downsamples to 1–2 Hz (configurable; default 2 Hz). Downsampling is rate-based (every Nth call), not selection-based. 
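
The rate-based downsampling in Invariant 12 can be sketched as a call counter. This is a minimal sketch, not the `QgcTelemetryAdapter` implementation: the `EveryNth` class and the choice `N = ceil(input_hz / target_hz)` are assumptions. With a 5 Hz input and the 2 Hz default, that choice gives N = 3, so the effective rate is about 1.67 Hz, which stays inside the 1–2 Hz band.

```python
import math


class EveryNth:
    """Hypothetical helper: pass through every Nth call, drop the rest."""

    def __init__(self, input_hz: float, target_hz: float) -> None:
        if not 0 < target_hz <= input_hz:
            raise ValueError("target_hz must be in (0, input_hz]")
        # Assumption: round N up so the output never exceeds target_hz.
        self._n = math.ceil(input_hz / target_hz)
        self._count = 0

    @property
    def n(self) -> int:
        return self._n

    def should_emit(self) -> bool:
        """Return True on the first call and on every Nth call after it."""
        emit = self._count == 0
        self._count = (self._count + 1) % self._n
        return emit
```

Because the decision depends only on the call count, the same input sequence always yields the same emitted subset, which is what distinguishes rate-based from selection-based downsampling.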
+ +## Producer / Consumer Split + +| Task ID | Scope | +|---------|-------| +| AZ-390 (Producer) | Protocols, DTOs, error hierarchy, factories, composition root extension, `FcKind` / `FlightState` / `GpsStatus` / `Severity` enums, `FcAdapterStub` baseline (test-only no-op accepted by composition). NO concrete production adapter, NO wire encoding. | +| AZ-391 (Consumer 1) | Inbound subscription path: MAVLink 2.0 telemetry decoder (RAW_IMU/ATTITUDE/GPS_RAW_INT/MAV_STATE/HEARTBEAT) for AP + MSP2 telemetry decoder for iNav; produces `FcTelemetryFrame` + bounded ring buffers; emits `ImuWindow` / `AttitudeWindow` / `GpsHealth` / `FlightStateSignal` to subscribers. Backpressure + drop-oldest on overflow. | +| AZ-392 (Consumer 2) | `CovarianceProjector` helper inside C8: 6×6 → 3×3 position sub-matrix → 2×2 horizontal sub-matrix → equivalent_radius (m for AP, mm for iNav). Honest projection per the AC-4.3 formula. | +| AZ-393 (Consumer 3) | `PymavlinkArdupilotAdapter` outbound path: encode `EstimatorOutput` as `GPS_INPUT` (5 Hz); side-channel `NAMED_VALUE_FLOAT` for `source_label` + `STATUSTEXT` mirror; uses the CovarianceProjector. NO signing logic (delivered separately). | +| AZ-394 (Consumer 4) | `Msp2InavAdapter` outbound path: encode `EstimatorOutput` as `MSP2_SENSOR_GPS` (5 Hz) via YAMSPy + INAV-Toolkit; STATUSTEXT mirror via the secondary MAVLink telemetry channel. iNav-specific quirks (mm units, sequence numbers). | +| AZ-395 (Consumer 5) | MAVLink 2.0 per-flight signing handshake (AP only): generate ephemeral key at `open(...)`, complete pymavlink signing handshake, key rotation logging to FDR, key zeroisation on close. D-C8-9 R03 risk; gated for production by IT-3 SITL pass. | +| AZ-396 (Consumer 6) | `MAV_CMD_SET_EKF_SOURCE_SET` D-C8-2 source-set switch (AP only): `request_source_set_switch()` body, ACK handling, `SourceSetSwitchError` on timeout, idempotence per Invariant 11. Wired to C5's spoof-recovery gate via the runtime root. 
| +| AZ-397 (Consumer 7) | `QgcTelemetryAdapter` GcsAdapter: open MAVLink 2.0 channel, downsample 5 Hz → 1–2 Hz `emit_summary`, operator command ingestion (`subscribe_operator_commands`), STATUSTEXT mirror. | + +Tests C8-IT-01..08 + C8-PT-01 + C8-ST-01..02 are deferred to E-BBT (AZ-262) per the project's E-BBT pattern. + +## Constraints + +- `@runtime_checkable` Protocols; DTOs `frozen=True, slots=True`. +- Lazy-import per ADR-002. +- Public API restricted to `interface.py` + `__init__.py` re-exports per `module-layout.md`. +- `pymavlink` is bundled unmodified per D-C8-3. +- Signing key MUST never appear in a log line, FDR record, or stderr trace. +- The `PortConfig.device` and `signing_key` are constructor-time inputs to `open(...)`; they MUST NOT be re-readable from the adapter post-open (no `get_port_config()` accessor). + +## Risks / Mitigations + +- **R03** (MAVLink 2.0 per-flight signing has no operator-deployed precedent): gated by IT-3 SITL pass before flight-test sign-off. +- **R09** (signing key compromise): per-flight ephemeral keys + zeroisation; never persisted; never logged. +- Cross-adapter drift (AP vs iNav contract): the shared `CovarianceProjector` + `FcTelemetryFrame` enforce wire-agnostic semantics. Per-FC quirks (mm units, signing) are quarantined to the variant adapter. diff --git a/_docs/02_document/contracts/replay/replay_protocol.md b/_docs/02_document/contracts/replay/replay_protocol.md new file mode 100644 index 0000000..5a3a637 --- /dev/null +++ b/_docs/02_document/contracts/replay/replay_protocol.md @@ -0,0 +1,161 @@ +# Contract: Replay Mode (`FrameSource` + `ReplaySink` + `Clock` + replay composition) + +**Owner**: replay (epic AZ-265 / E-DEMO-REPLAY) — strategies live inside existing components (`frame_source/`, `c8_fc_adapter/`); only the composition root and CLI are net-new top-level files. 
+**Producer task**: AZ-398 (`FrameSource` Protocol + `VideoFileFrameSource` + `LiveCameraFrameSource` retrofit + `Clock` Protocol) +**Consumer tasks**: AZ-399 (TlogReplayFcAdapter), AZ-400 (ReplaySink + JsonlReplaySink), AZ-401 (compose_replay + Clock injection), AZ-402 (gps-denied-replay CLI), AZ-403 (Dockerfile + CI matrix + SBOM diff), AZ-404 (E2E replay fixture test), AZ-405 (Auto-sync IMU take-off detection). +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 +**Module-layout home**: +- `src/gps_denied_onboard/frame_source/interface.py`, `__init__.py` — `FrameSource` Protocol (Layer 1 cross-cutting per `module-layout.md`). +- `src/gps_denied_onboard/components/c8_fc_adapter/tlog_replay_adapter.py` — `TlogReplayFcAdapter` (gated `BUILD_TLOG_REPLAY_ADAPTER`). +- `src/gps_denied_onboard/components/c8_fc_adapter/replay_sink.py` — `ReplaySink` interface + `JsonlReplaySink` (gated `BUILD_REPLAY_SINK_JSONL`). +- `src/gps_denied_onboard/clock/interface.py`, `__init__.py` — `Clock` Protocol. +- `src/gps_denied_onboard/runtime_root/replay.py` — `compose_replay(config) -> ReplayRoot`. + +## Purpose + +Defines the public interfaces enabling **offline replay mode** per epic AZ-265: run the production C1–C5 pipeline against historical inputs (1–2 min Derkachi-style clip + matching pymavlink `.tlog`) so the parent-suite UI demo has end-to-end fidelity equal to a live flight. Production C1–C5 components MUST remain mode-agnostic — replay-aware logic lives ONLY in the composition root, the new strategies, and the CLI. The replay binary is a fourth Docker image (`gps-denied-replay-cli`) containing C1–C5 + replay strategies but NOT C6/C10/C11/C12 (no operator-side workflows; tile cache is read pre-built). + +This contract defines four Protocols and the replay composition surface: +- **`FrameSource`** — the formalised cross-cutting interface for camera-frame ingestion (previously implicit). 
Two strategies: `LiveCameraFrameSource` (retrofit; existing camera plumbing renamed and put behind the Protocol) and `VideoFileFrameSource` (replay-only, gated `BUILD_VIDEO_FILE_FRAME_SOURCE`). +- **`Clock`** — the wall-clock vs. tlog-derived time abstraction (R-DEMO-4 mitigation). Two strategies: `WallClock` (live/research/operator) and `TlogDerivedClock` (replay only). +- **`ReplaySink`** — the offline `EstimatorOutput` consumer interface. One strategy: `JsonlReplaySink` (one `EstimatorOutput` per JSONL line; gated `BUILD_REPLAY_SINK_JSONL`). +- **`TlogReplayFcAdapter`** — replay-only `FcAdapter` strategy (per AZ-261 `FcAdapter` Protocol from `_docs/02_document/contracts/c8_fc_adapter/fc_adapter_protocol.md`); parses pymavlink `.tlog` and emits `ImuWindow` / `AttitudeWindow` / `GpsHealth` / `FlightStateSignal` at tlog-timestamp cadence (or wall-clock-paced per `--pace`). Gated `BUILD_TLOG_REPLAY_ADAPTER`. + +The shared `WgsConverter` (AZ-279) is constructor-injected into the tlog adapter for tlog-GPS → local-tangent-plane conversion. + +## Public API + +### Protocol: `FrameSource` + +```python +@runtime_checkable +class FrameSource(Protocol): + def next_frame(self) -> NavCameraFrame | None: ... # None on end-of-stream + def close(self) -> None: ... +``` + +### Protocol: `Clock` + +```python +@runtime_checkable +class Clock(Protocol): + def monotonic_ns(self) -> int: ... + def time_ns(self) -> int: ... # wall-clock (UTC) for log timestamps + def sleep_until_ns(self, target_ns: int) -> None: ... # honoured in --pace realtime; no-op in --pace asap +``` + +### Protocol: `ReplaySink` + +```python +@runtime_checkable +class ReplaySink(Protocol): + def emit(self, output: EstimatorOutput) -> None: ... + def close(self) -> None: ... 
+``` + +### Concrete: `TlogReplayFcAdapter` + +```python +class TlogReplayFcAdapter(FcAdapter): + def __init__( + self, + tlog_path: Path, + target_fc_dialect: FcKind, # ARDUPILOT_PLANE | INAV + clock: Clock, + wgs_converter: WgsConverter, + time_offset_ms: int = 0, # auto-detected by AZ-405 auto-sync task or set via --time-offset-ms + pace: ReplayPace = ReplayPace.ASAP, # REALTIME | ASAP + ): ... +``` + +The `TlogReplayFcAdapter` implements the full `FcAdapter` Protocol from AZ-261. `emit_external_position` raises `FcEmitError("replay adapter does not emit to FC")` (replay is read-only on the FC side; downstream consumers use `ReplaySink` instead). `request_source_set_switch` raises `SourceSetSwitchNotSupportedError`. `subscribe_telemetry` is the primary surface — fans out IMU/attitude/GPS-health/flight-state from the tlog at the configured pace. + +### CLI surface + +``` +gps-denied-replay + --video PATH + --tlog PATH + --output results.jsonl + --camera-calibration calib.json + --config config.yaml + [--pace {realtime,asap}] # default asap + [--time-offset-ms N] # overrides auto-sync +``` + +### Composition root extension + +```python +def compose_replay(config: Config) -> ReplayRoot: ... +``` + +`ReplayRoot` is a dataclass holding all wired components plus the `FrameSource`, `TlogReplayFcAdapter`, `ReplaySink`, and `Clock` chosen for the replay run. 
The runtime loop is: +``` +loop: + frame = frame_source.next_frame() + if frame is None: break + c1 = vio.process(frame) # C1 + candidates = vpr.lookup(c1) # C2 + reranked = rerank.rerank(candidates) # C2.5 + matched = matcher.match(reranked) # C3 + refined = refiner.refine_if_needed(matched) # C3.5 + pose = pose_estimator.estimate(refined) # C4 + state.add_pose_anchor(pose) # C5 + state.add_vio(c1.vio_output) # C5 + output = state.current_estimate() + replay_sink.emit(output) +replay_sink.close() +``` + +The tlog adapter's `subscribe_telemetry` callbacks are wired to C5's `add_fc_imu` and to C1's IMU prior on the same threads as in the live binary. + +## Invariants + +1. **Mode-agnostic C1–C5**: production components MUST NOT contain `if replay_mode:` branches. Mode-specific behaviour lives in the strategy (Frame source / FC adapter / Sink / Clock). Verified by an explicit grep guard in CI. +2. **Single `Clock` per process**: the composition root resolves `Clock` exactly once at startup. All time-driven logic (AC-5.2 fallback timer, STATUSTEXT rate-limits, key rotation logging) consumes the injected `Clock` via constructor — never `time.monotonic_ns()` directly. Verified by an AST scan in CI for direct `time.monotonic_ns` / `time.time_ns` references in components. +3. **Frame source ordering**: `next_frame()` returns frames in monotonically non-decreasing `monotonic_ns` order. Out-of-order frames raise `FrameSourceError` (NOT silently dropped — replay must be deterministic). +4. **End-of-stream is None**: `next_frame()` returns `None` ONLY when the stream is permanently exhausted. Transient I/O failures raise `FrameSourceError`. +5. **TlogReplayFcAdapter emit-only-via-sink**: `emit_external_position` and `emit_status_text` raise `FcEmitError("replay adapter does not emit to FC")`. Downstream consumers MUST emit to `ReplaySink` instead. +6. 
**Pace mode honoured by Clock**: `pace=REALTIME` → `Clock.sleep_until_ns(target_ns)` blocks until wall-clock catches up; `pace=ASAP` → no-op. The pace flag is consumed ONLY by the `Clock` and the tlog adapter; components see only the `Clock` Protocol. +7. **JsonlReplaySink one-line-per-emit**: each `emit(output)` writes exactly one JSON object + newline; the file is fsync'd on `close()`. Schema matches `EstimatorOutput` (frozen dataclass serialised via `dataclasses.asdict` + `orjson.dumps`). +8. **Time-offset honoured**: when constructed with `time_offset_ms != 0`, the tlog adapter shifts every emitted timestamp by that offset before passing to subscribers. `time_offset_ms` is set ONCE at construction (no live re-tuning). +9. **Build-flag gating**: `VideoFileFrameSource`, `TlogReplayFcAdapter`, `JsonlReplaySink` MUST refuse construction when their respective `BUILD_*` flag is OFF (per ADR-002: the replay binary has them ON; airborne / research / operator have them OFF). +10. **Determinism**: same `(video, tlog, config, time_offset_ms, pace=ASAP)` input → same JSONL output, with at most 1e-6 float drift in position fields (AC-5). + +## Producer / Consumer Split + +| Task ID | Scope | +|---------|-------| +| AZ-398 (Producer) | `FrameSource` Protocol; `Clock` Protocol; `VideoFileFrameSource` (gated `BUILD_VIDEO_FILE_FRAME_SOURCE`); `LiveCameraFrameSource` retrofit (rename existing camera-ingest plumbing into the Protocol shape; no behaviour change); `WallClock` + `TlogDerivedClock` strategies; composition wiring in the existing `compose_root`/`compose_operator` (Clock = WallClock there). NO tlog parsing, NO sink, NO replay composition.
| +| AZ-399 (Consumer 1) | `TlogReplayFcAdapter`: pymavlink stream-parser (DO NOT materialise; R-DEMO-2 throughput floor); maps tlog message types → `FcTelemetryFrame`; supports both AP and iNav dialects; `subscribe_telemetry` fan-out at the configured pace; respects `time_offset_ms`; honours `Clock` for pacing; fail-fast at startup if required message types absent (R-DEMO-3). | +| AZ-400 (Consumer 2) | `ReplaySink` Protocol + `JsonlReplaySink` (one JSON object per line; orjson serialiser; `close()` fsyncs). | +| AZ-401 (Consumer 3) | `compose_replay(config) -> ReplayRoot`: full strategy resolution for the replay binary; `Clock` strategy selection (TlogDerivedClock for ASAP, WallClock for REALTIME; documented per R-DEMO-4); `FrameSource` = `VideoFileFrameSource`; `FcAdapter` = `TlogReplayFcAdapter`; `Sink` = `JsonlReplaySink`; ALL of C1–C5 wired with the same Public API as the live binary. NO C6/C10/C11/C12. Configuration loading + camera-calibration loading. | +| AZ-402 (Consumer 4) | `gps-denied-replay` CLI entrypoint: argparse, config + calibration loader, runtime loop (the loop body documented in this contract above), structured-error exit codes (0=success, 2=AC-8 sync-impossible, 1=any other error). | +| AZ-403 (Consumer 5) | `gps-denied-replay-cli` Dockerfile (multi-stage; Python + C1–C5 + cpp/* + replay strategies; NO C6/C10/C11/C12; NO HTTP server) + GitHub Actions matrix entry + SBOM diff CI step verifying absence of excluded components per AC-4. | +| AZ-404 (Consumer 6) | E2E replay fixture test: `tests/e2e/replay/test_derkachi_1min.py` — runs the CLI against a 1–2 min Derkachi clip + matching tlog; asserts AC-3 (≤ 100 m for ≥ 80 % of ticks); gated by `RUN_REPLAY_E2E=1` in CI. | +| AZ-405 (Consumer 7) | Auto-sync of video ↔ tlog via IMU take-off detection (AC-7 / AC-8). Take-off pattern: sustained vertical accel > 0.5 g + change in attitude rate > 1 rad/s lasting ≥ 0.5 s (typical quadcopter signature). 
Confidence-scored; falls back to WARN + best-guess if < 80 %; `--time-offset-ms` always overrides; AC-8 hard-fail (exit 2) if neither auto-detect nor manual offset produces > 95 % frame-window match. | + +## Constraints + +- `@runtime_checkable` on all Protocols; DTOs `frozen=True, slots=True`. +- Lazy-import per ADR-002 with the new `BUILD_VIDEO_FILE_FRAME_SOURCE`, `BUILD_TLOG_REPLAY_ADAPTER`, `BUILD_REPLAY_SINK_JSONL` flags. +- C1–C5 components MUST remain mode-agnostic (Invariant 1). +- All time-driven logic in components MUST consume the injected `Clock` (Invariant 2). +- No HTTP server in the replay binary (parent-suite UI shells out to the CLI; defer until subprocess shape is proven insufficient). +- pymavlink bundled unmodified per D-C8-3. +- The tlog parser MUST stream-parse — never materialise the entire tlog into memory (R-DEMO-2; multi-GB tlogs). + +## Risks / Mitigations + +- **R-DEMO-1** (tlog ↔ video timestamp drift / unsynchronised recordings): auto-sync via IMU take-off detection (AC-7) + `--time-offset-ms` manual override. Fixed-wing hand-launch fallback documented. +- **R-DEMO-2** (pymavlink slow on multi-GB tlogs): stream-parse, never materialise. Throughput floor benchmarked + documented in CI. +- **R-DEMO-3** (demo footage missing required FC messages): `TlogReplayFcAdapter.open(...)` fails fast at startup, listing missing message types and the components that need them. +- **R-DEMO-4** (production C1–C5 paths bake real-time-cadence assumptions): `Clock` injection (Invariants 1, 2). Documented as ADR amendment in next architecture-doc cycle. + +## Notes for the Implementer + +- The `LiveCameraFrameSource` retrofit is a no-op restructure: the existing camera-ingest thread becomes a class implementing `FrameSource`. Its behaviour is unchanged. This is what allows C1 to consume `FrameSource` via constructor without becoming replay-aware. 
+- The `TlogReplayFcAdapter`'s `subscribe_telemetry` fan-out runs on a dedicated thread (mirroring the live `PymavlinkArdupilotAdapter` decode-thread semantics). This way C1 and C5 see identical thread boundaries in live and replay. +- The `Clock` Protocol is the SAME interface in live and replay — only the strategy differs. This is the single Liskov-clean line that lets components consume `Clock` without knowing the mode. diff --git a/_docs/02_document/contracts/shared_config/composition_root_protocol.md b/_docs/02_document/contracts/shared_config/composition_root_protocol.md new file mode 100644 index 0000000..20daac2 --- /dev/null +++ b/_docs/02_document/contracts/shared_config/composition_root_protocol.md @@ -0,0 +1,83 @@ +# Contract: composition_root_protocol + +**Component**: shared_config (cross-cutting concern owned by E-CC-CONF / AZ-246) +**Producer tasks**: AZ-269 (config loader + outer Config) and AZ-270 (compose_root + compose_operator + StrategyNotLinkedError) +**Consumer tasks**: every component task that takes a config block; `runtime_root.py` and `operator_tool/__main__.py` (the two composition-root entrypoints) +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +Frozen public surface for the configuration loader and the two composition-root functions. Components depend on these signatures (and the precedence rule) to know how their per-component config arrives at construction time and how they will be wired against their declared interfaces. + +## Shape + +### Function signatures (pythonic; binding is stdlib `dataclasses` / `attrs`-style) + +```python +@frozen +class Config: + """Outer config object. Populated by union of every component's config block. + Each component contributes one immutable nested dataclass field named after its slug + (e.g. config.c2_vpr, config.c5_state). 
Components MUST NOT read other components' blocks + — the composition root is the only consumer of the full Config.""" + +def load_config(env: Mapping[str, str], paths: Sequence[Path]) -> Config: ... + +def compose_root(config: Config) -> RuntimeRoot: ... +def compose_operator(config: Config) -> OperatorRoot: ... + +class StrategyNotLinkedError(RuntimeError): + """Raised by compose_root / compose_operator when the config selects a strategy whose + BUILD_ flag was OFF in the linked binary (ADR-002 enforcement gate #3, after + SBOM diff and runtime self-check).""" + strategy_name: str # the strategy class identifier the config requested + component_slug: str # owning component (e.g. "c1_vio") + available_strategies: list[str] # strategies actually linked into this binary +``` + +| Symbol | Required | Description | Constraints | +|--------|----------|-------------|-------------| +| `Config` | yes | Outer frozen dataclass | One nested field per component slug; nested fields are immutable | +| `load_config` | yes | Builds `Config` from env + YAML files | Precedence: env > YAML > documented defaults | +| `compose_root` | yes | Wires the airborne `RuntimeRoot` | Constructs every component instance, injects dependencies, returns root | +| `compose_operator` | yes | Wires the operator-side `OperatorRoot` | Same contract, different component subset | +| `StrategyNotLinkedError` | yes | Raised on strategy/build-flag mismatch | Carries `strategy_name`, `component_slug`, `available_strategies` | + +## Invariants + +- `load_config` is pure with respect to its inputs: same `env` + same file contents always yields the same `Config`. +- Precedence is **env > YAML > defaults** for every key. Two YAML files merge with later paths winning over earlier ones. +- `compose_root` and `compose_operator` MUST NOT mutate the passed `Config`. 
+- `StrategyNotLinkedError` is the only error type these functions raise on a strategy/build-flag mismatch — never `ValueError`, `KeyError`, or a generic `RuntimeError`. +- Cold-start `load_config` + `compose_root` ≤ 1 s on Tier-2 (counts toward AC-NEW-1's 30 s startup budget). + +## Non-Goals + +- This contract does NOT define the Config dataclass field set — each component owns its own block (defined in its component epic). The contract only fixes the OUTER container's composition rule (one nested field per component slug, frozen). +- This contract does NOT define the YAML schema — that follows from the per-component config blocks. +- This contract does NOT define `RuntimeRoot` / `OperatorRoot` internal structure — only that they are returned from these functions. + +## Versioning Rules + +- **Breaking changes** (function rename, new required positional arg, exception class rename, precedence change) require a new major version + a deprecation pass through every component config block. +- **Non-breaking additions** (new keyword-only arg with default, new optional method on `RuntimeRoot`) require a minor version bump. 
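
The precedence rule stated in the Invariants section (env > YAML > defaults, with later YAML paths winning over earlier ones) can be sketched as a layered merge. This is an illustrative sketch only: `resolve`, the flat dotted-key dictionaries, the `ENV_KEYS` mapping from `LOG_LEVEL` to `log.level`, and the `"WARNING"` default used in the usage note are all assumptions, and real YAML parsing is left out.

```python
from collections.abc import Mapping, Sequence

# Hypothetical mapping from environment variable names to dotted config keys.
ENV_KEYS = {"LOG_LEVEL": "log.level"}


def resolve(
    defaults: Mapping[str, str],
    yaml_layers: Sequence[Mapping[str, str]],
    env: Mapping[str, str],
) -> dict[str, str]:
    """Merge layers so that env > YAML (later file wins) > defaults."""
    merged = dict(defaults)            # lowest precedence
    for layer in yaml_layers:          # later paths override earlier ones
        merged.update(layer)
    for env_name, key in ENV_KEYS.items():
        if env_name in env:            # highest precedence
            merged[key] = env[env_name]
    return merged
```

For example, `resolve({"log.level": "WARNING"}, [{"log.level": "INFO"}], {"LOG_LEVEL": "DEBUG"})` yields `{"log.level": "DEBUG"}`. The function builds a new dict and leaves its inputs untouched, which is in the spirit of the purity invariant for `load_config`.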
+ +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| precedence-env-wins | env sets `LOG_LEVEL=DEBUG`; YAML sets `log.level=INFO` | `config.log.level == "DEBUG"` | env > YAML | +| precedence-yaml-wins | YAML sets `log.level=INFO`; no env entry | `config.log.level == "INFO"` | YAML > defaults | +| precedence-defaults | neither env nor YAML set `log.level` | `config.log.level == ` | defaults baseline | +| compose-root-default-binary | valid Config with default strategies | returns `RuntimeRoot` whose component count matches the airborne profile | reachability proof | +| compose-root-strategy-missing | config selects `vins_mono`; binary built with `BUILD_VINS_MONO=OFF` | raises `StrategyNotLinkedError` with `strategy_name="vins_mono"`, `component_slug="c1_vio"`, `available_strategies=["okvis2", "klt_ransac"]` | ADR-002 enforcement | +| compose-operator-no-airborne | operator-side config | returns `OperatorRoot` containing only operator-tier components (e.g. 
C11, C12) | wrong-tier components excluded | +| load-config-purity | call `load_config(env, paths)` twice with same inputs | identical `Config` objects (or deep-equal) | reproducibility | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract derived from E-CC-CONF epic (AZ-246) | autodev decompose Step 2 | diff --git a/_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md b/_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md new file mode 100644 index 0000000..e31dd36 --- /dev/null +++ b/_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md @@ -0,0 +1,107 @@ +# Contract: fdr_client_protocol + +**Component**: shared_fdr_client (cross-cutting concern owned by E-CC-FDR-CLIENT / AZ-247) +**Producer task**: AZ-273 — `_docs/02_tasks/todo/AZ-273_fdr_client_ringbuf.md` +**Consumer tasks**: every onboard component that emits FDR records (C1–C13), the C13 writer thread (AZ-248 / E-C13), the overrun-policy task (AZ-XX / E-CC-FDR-CLIENT #3), `FakeFdrSink` (AZ-XX / E-CC-FDR-CLIENT #4), and the composition root (`runtime_root.py`) +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +Frozen public surface for the producer-side FDR queue. Every onboard producer holds exactly one `FdrClient(producer_id)`, calls `enqueue(record)`, and never blocks. The C13 writer thread is the sole consumer via `pop_one` / `drain`. The `on_overrun` hook is the documented extension point through which the overrun-policy PBI (next task in this epic) implements drop-oldest + `kind="overrun"` emission — without this hook, overrun behaviour would be hard-coded into the queue and AC-NEW-3 ("no silent drops") would be unobservable from outside. 
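
To make the role of the `on_overrun` hook concrete, here is a minimal single-threaded sketch of a bounded ring with an overrun callback. It is not the AZ-273 implementation and makes no lock-free or allocation-free claims; `TinyRing` and its internals are hypothetical, records are plain objects rather than `FdrRecord`, and the sketch accepts any power-of-two capacity so the example stays small.

```python
from typing import Callable, Optional


class TinyRing:
    """Hypothetical bounded ring: enqueue never blocks, overruns are reported."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._slots: list = [None] * capacity
        self._mask = capacity - 1
        self._head = 0  # total records popped so far
        self._tail = 0  # total records accepted so far
        self.on_overrun: Optional[Callable[[object], None]] = None

    def enqueue(self, record: object) -> str:
        if self._tail - self._head == len(self._slots):
            # Buffer full: report the rejected record once, then return.
            if self.on_overrun is not None:
                self.on_overrun(record)
            return "overrun"
        self._slots[self._tail & self._mask] = record
        self._tail += 1
        return "ok"

    def pop_one(self) -> Optional[object]:
        if self._head == self._tail:
            return None  # empty
        index = self._head & self._mask
        record = self._slots[index]
        self._slots[index] = None  # release the reference
        self._head += 1
        return record
```

Keeping the overrun reaction behind a callback means the ring itself only detects the condition; what happens next (for instance emitting an overrun record) can be decided by whoever wires the hook.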
+ +## Shape + +### Function and method APIs + +```python +from typing import Callable +from .fdr_record_schema import FdrRecord # owned by AZ-272 + +class EnqueueResult: + OK = "ok" + OVERRUN = "overrun" + +class FdrSpscViolationError(RuntimeError): + """Raised when the SPSC contract is violated (concurrent dequeue, multi-producer enqueue).""" + +class FdrClient: + def __init__(self, producer_id: str, capacity: int) -> None: ... + @property + def producer_id(self) -> str: ... + @property + def on_overrun(self) -> Callable[[FdrRecord], None] | None: ... + @on_overrun.setter + def on_overrun(self, hook: Callable[[FdrRecord], None] | None) -> None: ... + + # Producer-side (single-threaded per FdrClient; lock-free; never blocks). + def enqueue(self, record: FdrRecord) -> EnqueueResult: ... + + # Consumer-side (C13 writer; single-threaded per FdrClient; SPSC contract). + def pop_one(self) -> FdrRecord | None: ... + def drain(self, max_records: int) -> list[FdrRecord]: ... + + # Test-only. + def flush(self) -> None: ... + +# Module-level factory; preferred entrypoint for production code. +def make_fdr_client(producer_id: str, config: Config) -> FdrClient: ... 
+``` + +| Symbol | Required | Description | Constraints | +|--------|----------|-------------|-------------| +| `FdrClient(producer_id, capacity)` | yes | Construct a per-producer client; `capacity` MUST be `>= 16` and a power of two (ring-buffer-friendly) | `producer_id` non-empty; raises `ValueError` otherwise | +| `enqueue(record)` | yes | Non-blocking single-producer enqueue | Returns `OK` on success or `OVERRUN` when buffer is full; never raises into the caller; allocation-free in steady state | +| `on_overrun` (property) | yes | Hook invoked exactly once per overrun event with the would-be-enqueued record | Set by the overrun-policy PBI; default is `None` (silently dropping records is NOT acceptable in production; AC-NEW-3 requires the hook to be wired in `compose_root`) | +| `pop_one()` | yes | Single-consumer dequeue; returns the next record or `None` if empty | SPSC: only ONE thread may call `pop_one` / `drain` | +| `drain(max_records)` | yes | Pop up to `max_records` records in a single call | Same SPSC constraint as `pop_one` | +| `flush()` | yes | Test-only: blocks the calling thread until the buffer is empty | Production code MUST NOT call this on the hot path | +| `make_fdr_client(producer_id, config)` | yes | Factory; reads capacity from `config.fdr_client..capacity` with documented default; caches one instance per `producer_id` | Two calls with the same `producer_id` return the same instance | + +## Invariants + +- **Lock-free**: `enqueue` and `pop_one` MUST NOT acquire a lock that any other thread can hold. They MAY use atomic primitives (CAS, single-word reads/writes, memory barriers); these are not "locks" in the queue's sense. +- **Non-blocking enqueue**: `enqueue` completes in O(1) time and never transitions the calling thread to BLOCKED state. When the buffer is full, it returns `OVERRUN` synchronously and invokes `on_overrun(record)` exactly once if the hook is set.
+- **Allocation-free steady state**: `enqueue` for an in-buffer record (slot is free) MUST NOT allocate any heap object. The contract test verifies this with a `tracemalloc` snapshot diff (0 new objects). +- **SPSC**: each `FdrClient` instance has at most ONE producer thread (calls `enqueue`) and at most ONE consumer thread (calls `pop_one` / `drain`). Multi-producer or multi-consumer use is undefined behaviour. The instance includes an opt-in guard that raises `FdrSpscViolationError` on concurrent entry — used by the contract test. +- **One client per producer_id**: `make_fdr_client(producer_id, config)` returns the same cached instance on repeat calls. Tests use `_reset_for_tests()` (private, documented in Non-Goals) to clear the cache. +- **Producer_id stamped on every record**: `enqueue` does NOT mutate `record.producer_id` — the caller is responsible for setting it. The contract test verifies that `enqueue(FdrRecord(..., producer_id="c1_vio"))` lands on the consumer side with `producer_id == "c1_vio"`. +- **Cold-start budget**: constructing all FdrClient instances during `compose_root` is a one-time cost; the contract requires per-instance construction p99 ≤ 1 ms on Tier-2 (so ≤ 13 producers × 1 ms = 13 ms within the 1 s `compose_root` budget from the composition_root_protocol contract). + +## Non-Goals + +- This contract does NOT define the drop-oldest behaviour or what `on_overrun` does — that is the next PBI (AZ-XX) in this epic. The contract only defines the hook signature and the "exactly-once" invariant. +- This contract does NOT define the C13 writer thread, segment files, segment rotation, or 64 GB cap — owned by E-C13 (AZ-248). +- This contract does NOT define the `FdrRecord` schema or its serialisation — owned by AZ-272. +- This contract does NOT define `FakeFdrSink` — owned by the fourth PBI in this epic. `FakeFdrSink` SHOULD conform to `FdrClient`'s public surface so it is a drop-in replacement for component tests. 
+- `_reset_for_tests()` is intentionally private and test-only. Production code calling it is a contract violation. + +## Versioning Rules + +- **Breaking changes** (renaming a public method, changing return types, removing a method, weakening an invariant) → new major version + a deprecation pass through every consumer. +- **Non-breaking additions** (adding a new method, adding an optional kwarg with a default, strengthening an invariant) → minor version bump. +- **Patch changes** (doc clarification, performance budget tightening within tested limits) → patch bump. +- The contract test (`tests/contract/fdr_client_protocol.py`) MUST be updated alongside any version bump. + +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| valid-enqueue-pop-roundtrip | One `enqueue(record)` followed by one `pop_one()` | record returned; second `pop_one()` returns None | basic happy path | +| nonblocking-stalled-consumer | Consumer never calls `pop_one`; producer calls `enqueue` 1025 times into 1024-cap client | every call returns within 50 µs; #1025 returns `OVERRUN` | covers AC-1 | +| allocation-free-steady-state | Warmup, then `tracemalloc` snapshot diff across one `enqueue` | 0 new heap objects | covers AC-2 | +| capacity-from-config | `make_fdr_client("c1_vio", config_with_capacity_4096)` | `client._capacity() == 4096` | covers AC-3 | +| spsc-guard-rejects-multi-consumer | Two threads call `pop_one()` concurrently with guard enabled | `FdrSpscViolationError` raised | covers AC-4 | +| on-overrun-fires-once | Recording closure on `on_overrun`; force one overrun | closure called exactly once with the offending record | covers AC-5 | +| flush-drains | N records buffered, draining consumer thread, call `flush()` | returns only after buffer empty | covers AC-6 | +| empty-producer-id-rejected | `FdrClient(producer_id="")` | `ValueError` mentioning `producer_id` | covers AC-7 | +| invariant-cached-instance | Two `make_fdr_client("c1_vio", 
config)` calls | same instance | NFR-reliability | +| spsc-guard-rejects-multi-producer | Two threads call `enqueue` concurrently on same client with guard enabled | `FdrSpscViolationError` raised | strengthens AC-4 | +| no-mutation-of-producer-id | `enqueue(FdrRecord(producer_id="c1_vio"))` then `pop_one()` | popped record has `producer_id == "c1_vio"` | invariant test | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract derived from E-CC-FDR-CLIENT epic (AZ-247) | autodev decompose Step 2 | diff --git a/_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md b/_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md new file mode 100644 index 0000000..c7ce162 --- /dev/null +++ b/_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md @@ -0,0 +1,107 @@ +# Contract: fdr_record_schema + +**Component**: shared_fdr_client (cross-cutting concern owned by E-CC-FDR-CLIENT / AZ-247) +**Producer task**: AZ-272 — `_docs/02_tasks/todo/AZ-272_fdr_record_schema.md` +**Consumer tasks**: every onboard component that emits FDR records (C1–C13), the C13 writer (AZ-248 / E-C13), post-flight tooling (E-C12 operator side), the FdrClient ring buffer (AZ-XX / E-CC-FDR-CLIENT next task), and `FakeFdrSink` (AZ-XX / E-CC-FDR-CLIENT fourth task) +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +Frozen, versioned wire format for every record written to the Flight Data Recorder. Every onboard producer (logs, VIO ticks, state ticks, tile matches, overruns, rollovers, failed-tile thumbnails, mid-flight tile snapshots, flight headers/footers) MUST round-trip through this schema, and the C13 writer + post-flight tooling MUST be the only readers. The schema enforces forward-compatibility so post-flight tooling pinned at version N keeps working when producers move to N+1. 
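The forward-compatibility goal stated in the Purpose can be pictured with a small tolerant reader. The sketch below is only an illustration under stated assumptions: it uses the stdlib `json` module as a stand-in for the orjson/msgpack serialiser the contract pins elsewhere, it returns a plain `dict` rather than the contract's `FdrRecord`, and the names `parse_tolerant` and `KNOWN_FIELDS` are hypothetical.

```python
import json
from typing import Any

# Top-level envelope fields known to this (hypothetical) reader version.
KNOWN_FIELDS = ("schema_version", "ts", "producer_id", "kind", "payload")


class FdrSchemaError(ValueError):
    """Stand-in for the single exception type the contract exposes."""


def parse_tolerant(buf: bytes) -> dict[str, Any]:
    """Parse one record, keeping unknown top-level fields instead of failing."""
    try:
        raw = json.loads(buf)
    except ValueError as exc:
        # json.JSONDecodeError and UnicodeDecodeError both subclass ValueError;
        # wrap them so no library-specific exception escapes.
        raise FdrSchemaError("record is not valid JSON") from exc
    if not isinstance(raw, dict):
        raise FdrSchemaError("record must be a JSON object")

    version = raw.get("schema_version")
    if isinstance(version, bool) or not isinstance(version, int) or version < 1:
        raise FdrSchemaError("schema_version must be an integer >= 1")

    record = {name: raw.get(name) for name in KNOWN_FIELDS}
    # Fields written by a newer producer land in `extra` rather than being lost.
    record["extra"] = {k: v for k, v in raw.items() if k not in KNOWN_FIELDS}
    return record
```

A reader built this way keeps working when a newer producer adds a top-level field or a new `kind`, and tooling can inspect `extra` to decide whether to skip the record.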
+ +## Shape + +### Outer envelope (one of these per record on the wire) + +```python +# Conceptual dataclass — actual implementation may emit via orjson- or msgpack-backed serialiser pinned at E-BOOT. +from dataclasses import dataclass, field +from typing import Any + +@dataclass(frozen=True) +class FdrRecord: + schema_version: int # MUST be >= 1; reader uses this to pick the right parser branch + ts: str # ISO 8601 UTC, microsecond precision, e.g. "2026-05-10T03:14:15.123456Z" + producer_id: str # non-empty; component slug from module-layout.md (e.g. "c2_vpr") or "shared." for cross-cutting producers + kind: str # one of the v1.0.0 kinds (closed enum below) OR an unknown future tag (preserved opaquely) + payload: dict[str, Any] # kind-specific shape; well-known shapes documented per kind below + + # Forward-compat bucket — populated by parser when the wire bytes carry fields the local schema does not know. + # NEVER set by producers; producers leave it empty. + extra: dict[str, Any] = field(default_factory=dict) +``` + +| Field | Type | Required | Description | Constraints | +|-------|------|----------|-------------|-------------| +| `schema_version` | integer | yes | Schema major version as an integer (1 for 1.x, 2 for 2.x) | `>= 1` | +| `ts` | string (ISO 8601 UTC, µs) | yes | Emit timestamp | RFC 3339 with `Z` suffix | +| `producer_id` | string | yes | Origin producer slug | non-empty; matches a module-layout component slug or `shared.` | +| `kind` | string | yes | Record category | dotted snake_case, max 64 chars; v1.0.0 closed enum below | +| `payload` | object | yes (may be `{}` for kinds whose payload is empty) | Kind-specific data | JSON-safe / msgpack-safe scalars, nested dicts/arrays, no binary blobs >4 KiB | +| `extra` | object | parser-only | Forward-compat bucket for unknown future fields | populated by parser; producers MUST leave empty | + +### v1.0.0 closed enum of `kind` values + +| `kind` | Producer | Payload shape (required keys) | Notes | +|--------|----------|-------------------------------|-------| +| `log` | every 
component (via E-CC-LOG bridge) | `{level, component, frame_id?, kind, msg, kv, exc?}` (matches `log_record_schema` v1.0.0) | Forwarded WARN/ERROR records (per AZ-267 fdr_log_bridge) | +| `vio.tick` | C1 | `{frame_id, R, t, P, last_anchor_age_ms, mre_px?, imu_bias_norm?}` | Per-frame VIO output | +| `state.tick` | C5 | `{frame_id, fused_pose, covariance_2x2, estimator_label}` | Smoothed fused-pose tick | +| `tile_match` | C2.5 / C3 | `{frame_id, tile_id, score, match_count, ransac_inliers}` | Tile-matching diagnostics | +| `overrun` | E-CC-FDR-CLIENT itself | `{producer_id, dropped_count}` (`dropped_count > 0`) | AC-NEW-3: never silent. Emitted by drop-oldest hook | +| `segment_rollover` | E-C13 (writer) | `{old_segment, new_segment, total_bytes_after}` | Emitted on segment rotation, including 64 GB-cap drops | +| `failed_tile_thumbnail` | C6 / C11 | `{frame_id, tile_id, jpeg_bytes_b64}` (≤ 0.1 Hz rate cap) | AC-8.5 forensic exception | +| `mid_flight_tile_snapshot` | C13 (snapshot path) | `{snapshot_path, captured_at}` | AC-8.4 mid-flight snapshot pointer | +| `flight_header` | C13 (writer) | `{flight_id, started_at, schema_version, build_info}` | Single record at flight open | +| `flight_footer` | C13 (writer) | `{flight_id, ended_at, records_written, records_dropped}` | Single record at flight close | + +### Wire bytes + +- `serialise(record: FdrRecord) -> bytes` returns a single self-delimited byte string (length-prefixed if msgpack, single-line UTF-8 if orjson — pinned at E-BOOT in `pyproject.toml`). +- `parse(buf: bytes) -> FdrRecord` is the inverse for a single record. Streaming parser (multi-record) is not part of this contract — C13 writer/reader own that. + +## Invariants + +- `schema_version >= 1` on every record; missing or non-integer values are rejected by `parse` with `FdrSchemaError`. +- `producer_id` is non-empty on every record. Anonymous records on the wire are a contract violation — `serialise` rejects them with `FdrSchemaError`. 
+- For `kind="overrun"`: `payload.producer_id` MUST equal the originating producer's slug, and `payload.dropped_count` MUST be `> 0`. (The OUTER envelope's `producer_id` is `"shared.fdr_client"` because the overrun record is emitted by the FdrClient itself, not by the producer whose enqueue overran.) +- Forward-compatible parser: a record at minor version N+1 carrying fields unknown at version N parses without exception; unknown payload fields land in `payload.extra`; unknown top-level fields land in record-level `extra`. Tooling MAY then choose to skip the record. +- Unknown future `kind` values do NOT raise — `parse` returns an `FdrRecord` with `kind` set to the raw string and `payload` set to whatever decoded; tooling MAY skip. +- Renaming a field, changing a field type, or removing a required field requires a major version bump (schema_version 2.x). +- Embedded binary blobs ≤ 4 KiB only. Bigger payloads (e.g. mid-flight tile JPEGs, ML inference inputs) MUST be referenced by sidecar path on disk; the contract test rejects oversized inline blobs. +- `serialise` and `parse` are pure: same input → byte-identical output (or deep-equal record). +- `FdrSchemaError` is the ONLY exception type either function raises on schema violation; library-specific exceptions (`orjson.JSONDecodeError`, `msgpack.UnpackException`, etc.) MUST be wrapped before crossing the public API. + +## Non-Goals + +- This contract does NOT define the lock-free SPSC ring buffer (`FdrClient`) — owned by the next task in E-CC-FDR-CLIENT. +- This contract does NOT define the writer thread, segment files, or 64 GB cap — owned by E-C13 (AZ-248). +- This contract does NOT define what triggers a record (per-component § 9 logging policies, VIO tick rate, etc. are owned by component epics). +- This contract does NOT define multi-record framing on disk — that is C13's segment file format, owned separately. 
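Several of the invariants above are simple enough to express as a pre-serialisation check. The following is a minimal sketch, not the AZ-272 implementation: `check_invariants` and `MAX_INLINE_BLOB_BYTES` are hypothetical names, only top-level payload values are inspected for oversized blobs, and the error messages are merely chosen to name the offending field as the test cases expect.

```python
from typing import Any, Mapping

MAX_INLINE_BLOB_BYTES = 4 * 1024  # the 4 KiB inline limit from the invariants


class FdrSchemaError(ValueError):
    """Stand-in for the contract's single schema-violation exception."""


def check_invariants(producer_id: str, kind: str, payload: Mapping[str, Any]) -> None:
    """Raise FdrSchemaError if a record breaks one of the checked invariants."""
    if not producer_id:
        raise FdrSchemaError("producer_id must be non-empty")

    if kind == "overrun":
        # The payload names the producer whose enqueue overran; the outer
        # envelope's producer_id belongs to the FdrClient that emits the record.
        if not payload.get("producer_id"):
            raise FdrSchemaError("overrun payload.producer_id is required")
        dropped = payload.get("dropped_count")
        if isinstance(dropped, bool) or not isinstance(dropped, int) or dropped <= 0:
            raise FdrSchemaError("overrun payload.dropped_count must be an integer > 0")

    for key, value in payload.items():
        if isinstance(value, (bytes, bytearray)) and len(value) > MAX_INLINE_BLOB_BYTES:
            raise FdrSchemaError(f"payload.{key} exceeds 4 KiB inline; use sidecar path")
```

Running a check like this inside `serialise` would keep anonymous or malformed records off the wire, so readers never have to handle them.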
+ +## Versioning Rules + +- **Breaking changes** (field renamed/removed, type changed, ordering changed for length-prefixed wire format, library choice changed) → new major version (e.g. 2.0.0) + a deprecation pass through every consumer + a paired major bump on this contract. +- **Non-breaking additions** (new optional payload field appended, new `kind` value, new top-level optional field) → minor version bump. Forward-compat parser tolerates these by design. +- **Patch changes** (clarification, doc-only, no wire change) → patch bump. +- The contract test (`tests/contract/fdr_record_schema.py`) MUST be updated alongside any version bump. + +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| valid-roundtrip-log | `kind="log", payload={"level":"INFO","component":"c2_vpr","kind":"vpr.warmup","msg":"loaded","kv":{"model":"salad"}}` | `parse(serialise(r)) == r` | covers AC-1 | +| valid-roundtrip-overrun | `kind="overrun", producer_id="shared.fdr_client", payload={"producer_id":"c1_vio","dropped_count":42}` | round-trips; both producer_ids preserved | covers AC-1 + AC-5 | +| forward-compat-future-field | wire bytes carry `payload.new_field="x"` (hypothetical v1.1) parsed at v1.0 | record parses; `payload.extra["new_field"] == "x"` | covers AC-2 | +| forward-compat-unknown-kind | `kind="future.kind", payload={"foo":1}` | record parses opaquely; no exception | covers AC-3 | +| invalid-missing-version | bytes missing `schema_version` field | `FdrSchemaError`; message names `schema_version` | covers AC-4 | +| invalid-overrun-missing-dropped-count | `kind="overrun", payload={"producer_id":"c1_vio"}` | `FdrSchemaError`; message names `dropped_count` | covers AC-5 | +| invalid-overrun-zero-dropped-count | `kind="overrun", payload={"producer_id":"c1_vio","dropped_count":0}` | `FdrSchemaError`; message names `dropped_count` | covers AC-5 (must be `> 0`) | +| invalid-empty-producer-id | `producer_id=""` on serialise | `FdrSchemaError`; 
message names `producer_id` | covers AC-6 | +| invalid-oversized-blob | `payload={"jpeg":<8 KiB bytes>}` | `FdrSchemaError`; message says "use sidecar path" | invariant: ≤ 4 KiB inline | +| pure-determinism | call `serialise(r)` twice | byte-identical outputs | NFR-reliability | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract derived from E-CC-FDR-CLIENT epic (AZ-247) | autodev decompose Step 2 | diff --git a/_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md b/_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md new file mode 100644 index 0000000..35abe38 --- /dev/null +++ b/_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md @@ -0,0 +1,82 @@ +# Contract: descriptor_normaliser + +**Component**: shared_helpers / `helpers.descriptor_normaliser` (cross-cutting concern owned by E-CC-HELPERS / AZ-264) +**Producer task**: AZ-283 — `_docs/02_tasks/todo/AZ-283_descriptor_normaliser.md` +**Consumer tasks**: every C2 task that produces a query embedding before FAISS lookup; every C2.5 task that pre-processes descriptors for re-rank; every C3 task that pre-processes descriptors for cross-domain matching; every C10 task that builds the corpus side of the FAISS index during pre-flight provisioning +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +L2-normalise descriptors so cosine similarity aligns with FAISS's Euclidean / inner-product metric. Required because FAISS HNSW operates on Euclidean / inner-product spaces but the upstream backbones (UltraVPR, MegaLoc, MixVPR, etc.) emit raw cosine-similar embeddings. The same normalisation MUST be applied at both the **corpus** side (C10 during F1 provisioning) and the **query** side (C2 at runtime) — otherwise the index returns garbage. Centralising the helper guarantees they don't drift apart. 
Per `_docs/02_document/common-helpers/08_helper_descriptor_normaliser.md`. + +## Shape + +### For function / method APIs + +```python +class DescriptorNormaliser: + @staticmethod + def l2_normalise(descriptor: np.ndarray) -> np.ndarray: ... # shape (D,) + @staticmethod + def l2_normalise_batch(descriptors: np.ndarray) -> np.ndarray: ... # shape (N, D) + @staticmethod + def descriptor_metric() -> str: ... # always "inner_product" +``` + +| Name | Signature | Throws / Errors | Blocking? | +|------|-----------|-----------------|-----------| +| `l2_normalise` | `(descriptor: (D,)) -> (D,)` | `DescriptorNormaliserError` if shape is not 1-D, `D < 1`, or dtype is not `float16` / `float32` | sync, hot-path | +| `l2_normalise_batch` | `(descriptors: (N, D)) -> (N, D)` | `DescriptorNormaliserError` if shape is not 2-D, `N < 1`, `D < 1`, or dtype is not `float16` / `float32` | sync, hot-path | +| `descriptor_metric` | `() -> str` | none | sync, in-memory, returns `"inner_product"` | + +Numpy arrays. dtype contract: `float16` in → `float16` out; `float32` in → `float32` out (no silent up-cast). The helper does NOT mutate inputs in place — it returns a new array. + +## Invariants + +- **Stateless**: no module-level state; static methods only. Stateless static-only design satisfies `coderule.mdc`. +- **dtype-preserving**: `float16` in → `float16` out; `float32` in → `float32` out. The helper does NOT silently up-cast or down-cast. Other dtypes (e.g., `float64`, `int8`) are rejected. +- **Zero-norm vector handling**: a zero-norm input vector is returned as the zero vector (no division-by-zero, no exception). Callers must filter or accept that such descriptors will match nothing on FAISS lookup. Documented invariant. +- **No in-place mutation**: every call returns a new numpy array; the input is never modified. +- **Single source of truth for metric**: `descriptor_metric()` always returns `"inner_product"`. 
C6's `DescriptorIndex.search_topk` and C10's index-build code MUST call this helper for the FAISS index distance metric — never hard-code `"l2"` or `"cosine"`. +- **L2 idempotence**: `l2_normalise(l2_normalise(x)) == l2_normalise(x)` byte-equal for non-zero `x`. Re-normalising an already-normalised vector is a no-op (within `atol=0` for `float32`; within `atol=1e-3` for `float16` due to half-precision rounding). +- **No upward imports** (Layer 1): the module imports ONLY from `_types`, numpy, and stdlib. No `gps_denied_onboard.components.*` imports. + +## Non-Goals + +- Whitening / mean-subtraction — out of scope; consumers that need it apply it before / after this helper. +- PCA / dimensionality reduction — owned elsewhere (or out of scope entirely). +- GPU-accelerated normalisation — out of scope for v1.0.0; numpy / numpy-CUDA is fine for descriptor vector sizes (≤ 8192 dims) at the per-frame rate. +- Quantisation (PQ, IVF) — owned by C6 / C10 around the FAISS index, not by this helper. +- Auto-detection of descriptor dim — the helper is shape-agnostic for any `D >= 1`; consumers ensure the corpus and query side use the same `D`. + +## Versioning Rules + +- **Breaking changes** (function renamed/removed, signature changed, dtype contract relaxed, return value of `descriptor_metric()` changed) require a new major version + a re-build of every FAISS index built with the previous version (since the index metric is baked into the corpus-side normalisation). +- **Non-breaking additions** (new helper function, new optional kwarg with safe default) require a minor version bump. +- Changing `descriptor_metric()` return value is ALWAYS a major version because it forces every downstream FAISS index to be rebuilt. 
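The reason both sides must share one normalisation and the `inner_product` metric can be checked with a few lines of arithmetic. The sketch below is deliberately pure Python on lists, not the numpy-based helper this contract specifies, so it says nothing about dtype preservation; the function names are illustrative only.

```python
import math


def l2_normalise(vec: list[float]) -> list[float]:
    """Return a new unit-length copy of vec; a zero vector stays zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0] * len(vec)  # no division by zero, no exception
    return [x / norm for x in vec]


def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


query = [3.0, 4.0]   # raw query-side descriptor
corpus = [1.0, 0.0]  # raw corpus-side descriptor

# After normalising BOTH sides, the inner product equals the cosine similarity.
score = inner_product(l2_normalise(query), l2_normalise(corpus))
```

If only one side were normalised, the inner product would be scaled by the other vector's length and the ranking returned by the index would no longer follow cosine similarity.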
+ +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| valid-unit-vector | `np.array([3.0, 4.0], dtype=float32)` | `np.array([0.6, 0.8], dtype=float32)`; norm ≈ 1.0 within `atol=1e-6` | Round-trip happy path | +| valid-batch | `np.array([[3.0, 4.0], [1.0, 0.0]], dtype=float32)` | rows `[0.6, 0.8]` and `[1.0, 0.0]`; each row's norm ≈ 1.0 | Batch path | +| valid-fp16-roundtrip | random `float16` descriptor of dim 512 | `result.dtype == float16`; norm ≈ 1.0 within `atol=1e-3` | dtype preservation | +| valid-fp32-roundtrip | random `float32` descriptor of dim 512 | `result.dtype == float32`; norm ≈ 1.0 within `atol=1e-6` | dtype preservation | +| valid-zero-vector | `np.zeros(128, dtype=float32)` | returned as `np.zeros(128, dtype=float32)`; no exception, no NaN | Zero-norm invariant | +| valid-idempotent-fp32 | `l2_normalise(l2_normalise(x))` for `float32` `x` | byte-equal to `l2_normalise(x)` | Idempotence (fp32) | +| valid-idempotent-fp16 | `l2_normalise(l2_normalise(x))` for `float16` `x` | matches within `atol=1e-3` | Idempotence (fp16, looser due to half-precision) | +| valid-no-mutation | call `l2_normalise(x)`; check `x` afterward | `x` is bit-identical to its original value | No in-place mutation | +| valid-metric | `descriptor_metric()` | returns the string `"inner_product"` | Single source of truth | +| invalid-dtype-float64 | `np.array([1.0, 2.0], dtype=float64)` | `DescriptorNormaliserError` mentions `float16` / `float32` only | dtype contract | +| invalid-shape-2d-on-single | `np.zeros((2, 3), dtype=float32)` passed to `l2_normalise` (single) | `DescriptorNormaliserError` mentions 1-D shape required | Shape contract (single) | +| invalid-shape-1d-on-batch | `np.zeros(128, dtype=float32)` passed to `l2_normalise_batch` | `DescriptorNormaliserError` mentions 2-D shape required | Shape contract (batch) | +| no-upward-imports | static import scan | only `_types`, numpy, stdlib | Layer 1 invariant | + +## Change Log + 
+| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract derived from `_docs/02_document/common-helpers/08_helper_descriptor_normaliser.md` | autodev decompose Step 2 | diff --git a/_docs/02_document/contracts/shared_helpers/engine_filename_schema.md b/_docs/02_document/contracts/shared_helpers/engine_filename_schema.md new file mode 100644 index 0000000..1b15cff --- /dev/null +++ b/_docs/02_document/contracts/shared_helpers/engine_filename_schema.md @@ -0,0 +1,92 @@ +# Contract: engine_filename_schema + +**Component**: shared_helpers / `helpers.engine_filename_schema` (cross-cutting concern owned by E-CC-HELPERS / AZ-264) +**Producer task**: AZ-281 — `_docs/02_tasks/todo/AZ-281_engine_filename_schema.md` +**Consumer tasks**: every C7 task that writes / reads `.engine` files via the inference runtime; every C10 task that compiles engines through C7 and writes them to the cache root +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +Self-describing `.engine` filename schema per D-C10-7. TensorRT engines are NOT portable across `(SM, JetPack, TRT, precision)` tuples; encoding the tuple in the filename makes mismatch instantly visible at takeoff load (F2) so refusing-to-deserialize-on-mismatch becomes trivial. Per `_docs/02_document/common-helpers/06_helper_engine_filename_schema.md`. + +## Shape + +### For function / method APIs + +```python +class EngineFilenameSchema: + @staticmethod + def build(model_name: str, sm: int, jetpack: str, trt: str, precision: str) -> str: ... + @staticmethod + def parse(filename: str) -> EngineCacheKey: ... + @staticmethod + def matches_host(filename: str, host_capabilities: HostCapabilities) -> bool: ... +``` + +| Name | Signature | Throws / Errors | Blocking? 
| +|------|-----------|-----------------|-----------| +| `build` | `(model_name, sm, jetpack, trt, precision) -> str` | `EngineFilenameSchemaError` if any input fails validation (see Invariants) | sync, pure | +| `parse` | `(filename) -> EngineCacheKey` | `EngineFilenameSchemaError` if filename does not match the format | sync, pure | +| `matches_host` | `(filename, host_capabilities) -> bool` | `EngineFilenameSchemaError` only if the filename itself is malformed (returns False on tuple mismatch — that's the expected "not a match" path) | sync, pure | + +`EngineCacheKey` and `HostCapabilities` are imported from `gps_denied_onboard._types.manifests`. The `EngineCacheKey` Protocol exposes: `model_name: str`, `sm: int`, `jetpack: str`, `trt: str`, `precision: str` (where `precision in {"fp16", "int8", "mixed"}`). + +### Filename format + +``` +{model}__sm{SM}_jp{JP_dotted}_trt{TRT_dotted}_{precision}.engine +``` + +Example: `ultravpr__sm87_jp6.2_trt10.3_fp16.engine` + +## Invariants + +- **Stateless**: no module-level state; static methods only. The static-only design satisfies the coderule.mdc constraint ("only use static methods for pure self-contained computations") because filename parsing is a pure mathematical function of its arguments. +- **Format strictness**: filenames MUST follow `{model}__sm{SM}_jp{JP}_trt{TRT}_{precision}.engine` exactly. The double underscore (`__`) after `model` is intentional — it is the field separator that lets `model` itself contain single underscores (e.g., `ultra_vpr__sm87_...`). +- **Field validation**: + - `model_name`: non-empty, only `[a-z0-9_]` characters (no double underscores), max 64 chars. + - `sm`: positive integer (e.g., 87 for Jetson Orin Nano Super; 86 for Orin AGX; 72 for Xavier). + - `jetpack`: dotted version string `.` (e.g., `6.2`); each segment is a non-negative integer. + - `trt`: dotted version string `.` (e.g., `10.3`); same rules as `jetpack`. + - `precision`: strictly one of `"fp16"`, `"int8"`, `"mixed"`. 
+ - The dotted-version format must round-trip cleanly through filesystems — no `/` or `\` in `model_name` or version segments. +- **`matches_host` is exact-match**: returns True iff every tuple element matches exactly (`sm == current_sm`, `jetpack == current_jetpack`, `trt == current_trt`). Precision and model_name do not affect host-matching but ARE preserved in the parsed key. +- **Round-trip identity**: `parse(build(*args)) == EngineCacheKey(*args)` for any valid args. `build(**parse(filename)._asdict())` returns the same filename for any valid filename. +- **No upward imports** (Layer 1): the module imports ONLY from `_types`, `re`, and stdlib. No `gps_denied_onboard.components.*` imports. + +## Non-Goals + +- Versioning of the schema itself — there is no `schema_version` field. Adding a new tuple dimension is a Plan-phase carryforward (see Caveats in `_docs/02_document/common-helpers/06_helper_engine_filename_schema.md`). +- Engine compilation / compatibility resolution — owned by C7. +- Hot-loading engines / lazy materialisation — owned by C7. +- Filename collision detection across cache roots — owned by C10's Manifest. + +## Versioning Rules + +- **Breaking changes** (filename format changed, separator changed, new mandatory field added, precision enum reduced) require a new major version + a re-write pass over every existing `.engine` filename in the cache root. +- **Non-breaking additions** (new accessor function, new optional kwarg with safe default, new `precision` enum value appended) require a minor version bump. 
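The filename format and its validation rules can be sketched with a single regular expression. This is a hedged illustration rather than the AZ-281 implementation: `EngineKey` and `FilenameError` are hypothetical stand-ins for `EngineCacheKey` and `EngineFilenameSchemaError`, `matches_host` takes the host tuple as plain arguments instead of a `HostCapabilities` object, and the assumption that a version is exactly two dot-separated integers is inferred from the `6.2` (valid) and `6.2.1` (invalid) examples.

```python
import re
from typing import NamedTuple


class EngineKey(NamedTuple):  # hypothetical stand-in for EngineCacheKey
    model_name: str
    sm: int
    jetpack: str
    trt: str
    precision: str


class FilenameError(ValueError):  # stand-in for EngineFilenameSchemaError
    pass


_PRECISIONS = ("fp16", "int8", "mixed")
_MODEL_RE = re.compile(r"[a-z0-9_]{1,64}")
_VERSION_RE = re.compile(r"[0-9]+\.[0-9]+")  # assumed: exactly two segments
_FILENAME_RE = re.compile(
    r"(?P<model_name>[a-z0-9_]{1,64})__sm(?P<sm>[1-9][0-9]*)"
    r"_jp(?P<jetpack>[0-9]+\.[0-9]+)_trt(?P<trt>[0-9]+\.[0-9]+)"
    r"_(?P<precision>fp16|int8|mixed)\.engine"
)


def build(model_name: str, sm: int, jetpack: str, trt: str, precision: str) -> str:
    if not _MODEL_RE.fullmatch(model_name) or "__" in model_name:
        raise FilenameError("model_name must match [a-z0-9_]{1,64} without the reserved '__' separator")
    if isinstance(sm, bool) or not isinstance(sm, int) or sm <= 0:
        raise FilenameError("sm must be a positive integer")
    for label, value in (("jetpack", jetpack), ("trt", trt)):
        if not _VERSION_RE.fullmatch(value):
            raise FilenameError(f"{label} must be two dot-separated integers, e.g. '6.2'")
    if precision not in _PRECISIONS:
        raise FilenameError(f"precision must be one of {_PRECISIONS}")
    return f"{model_name}__sm{sm}_jp{jetpack}_trt{trt}_{precision}.engine"


def parse(filename: str) -> EngineKey:
    m = _FILENAME_RE.fullmatch(filename)
    if m is None or "__" in m["model_name"]:
        raise FilenameError(f"not a valid engine filename: {filename!r}")
    return EngineKey(m["model_name"], int(m["sm"]), m["jetpack"], m["trt"], m["precision"])


def matches_host(filename: str, host_sm: int, host_jetpack: str, host_trt: str) -> bool:
    key = parse(filename)  # malformed names raise; a tuple mismatch returns False
    return (key.sm, key.jetpack, key.trt) == (host_sm, host_jetpack, host_trt)
```

Because the key is a named tuple in this sketch, the second round-trip direction is written as `build(**parse(name)._asdict())`.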
+ +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| valid-build-ultravpr | `("ultravpr", 87, "6.2", "10.3", "fp16")` | `"ultravpr__sm87_jp6.2_trt10.3_fp16.engine"` | Reference example from helper doc | +| valid-roundtrip | `parse(build(*args))` for 10 random valid tuples | each round-trip returns deep-equal `EngineCacheKey` | Round-trip invariant | +| valid-matches-host-true | filename built for `(sm=87, jp=6.2, trt=10.3)`, host with same | `matches_host` returns True | Exact match | +| valid-matches-host-false-sm | filename built for `sm=87`, host with `sm=72` | `matches_host` returns False (no exception) | Tuple mismatch | +| valid-matches-host-false-trt | filename built for `trt=10.3`, host with `trt=10.4` | `matches_host` returns False | Minor-version mismatch is still a mismatch | +| invalid-precision-enum | `build(..., precision="bf16")` | `EngineFilenameSchemaError` mentions allowed enum | Precision strictness | +| invalid-model-uppercase | `build("UltraVPR", ...)` | `EngineFilenameSchemaError` mentions `[a-z0-9_]` | Model-name strictness | +| invalid-model-double-underscore | `build("ultra__vpr", ...)` | `EngineFilenameSchemaError` mentions reserved separator | Separator collision guard | +| invalid-jetpack-format | `jetpack="6.2.1"` | `EngineFilenameSchemaError` mentions dotted `.` format | Version strictness | +| invalid-parse-malformed | `parse("not_an_engine_file.bin")` | `EngineFilenameSchemaError` raised | Parse strictness | +| invalid-parse-missing-suffix | `parse("ultravpr__sm87_jp6.2_trt10.3_fp16")` (no `.engine`) | `EngineFilenameSchemaError` raised | Suffix required | +| no-upward-imports | static import scan | only `_types`, `re`, stdlib | Layer 1 invariant | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract derived from `_docs/02_document/common-helpers/06_helper_engine_filename_schema.md` | autodev decompose Step 2 
| diff --git a/_docs/02_document/contracts/shared_helpers/imu_preintegrator.md b/_docs/02_document/contracts/shared_helpers/imu_preintegrator.md new file mode 100644 index 0000000..77ff511 --- /dev/null +++ b/_docs/02_document/contracts/shared_helpers/imu_preintegrator.md @@ -0,0 +1,82 @@ +# Contract: imu_preintegrator + +**Component**: shared_helpers / `helpers.imu_preintegrator` (cross-cutting concern owned by E-CC-HELPERS / AZ-264) +**Producer task**: AZ-276 — `_docs/02_tasks/todo/AZ-276_imu_preintegrator.md` +**Consumer tasks**: every C1 VIO task that consumes IMU windows; every C5 state-estimator task that builds GTSAM `CombinedImuFactor`s +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +Centralise GTSAM `CombinedImuFactor` preintegration so C1 (VIO) and C5 (StateEstimator) cannot drift into two slightly-different IMU integrations of the same FC IMU window. The helper owns the GTSAM `PreintegrationCombinedParams` + `PreintegratedCombinedMeasurements` lifecycle; consumers feed samples and read closed factors. Per `_docs/02_document/common-helpers/01_helper_imu_preintegrator.md`. + +## Shape + +### For function / method APIs + +```python +class ImuPreintegrator: + def __init__(self, params: PreintegrationCombinedParams) -> None: ... + def reset_with_bias(self, bias: ImuBias) -> None: ... + def integrate_sample(self, sample: ImuSample) -> None: ... + def integrate_window(self, window: ImuWindow) -> None: ... + def current_preintegration(self) -> CombinedImuFactor: ... + def reset_for_new_keyframe(self) -> CombinedImuFactor: ... +``` + +| Name | Signature | Throws / Errors | Blocking? | +|------|-----------|-----------------|-----------| +| `reset_with_bias` | `(bias: ImuBias) -> None` | none | sync, in-memory | +| `integrate_sample` | `(sample: ImuSample) -> None` | `ImuPreintegrationError` if `sample.ts_ns` is not strictly monotonic vs. 
last sample | sync, hot-path | +| `integrate_window` | `(window: ImuWindow) -> None` | `ImuPreintegrationError` on monotonicity violation | sync, hot-path | +| `current_preintegration` | `() -> CombinedImuFactor` | `ImuPreintegrationError` if zero samples integrated since last reset | sync | +| `reset_for_new_keyframe` | `() -> CombinedImuFactor` | `ImuPreintegrationError` if zero samples integrated since last reset | sync; clears internal state | + +`ImuSample`, `ImuWindow`, `ImuBias` types are imported from `gps_denied_onboard._types.nav`. `CombinedImuFactor` is the GTSAM-native factor type (re-exported from `helpers.imu_preintegrator` so consumers do not import GTSAM directly). + +### Construction + +```python +def make_imu_preintegrator(calibration: CameraCalibration) -> ImuPreintegrator: ... +``` + +`make_imu_preintegrator` reads gyro/accel noise covariances from `CameraCalibration` (which carries the IMU noise model per-deployment per `_docs/02_document/components/01_c1_vio/description.md`) and returns an instance with the right `PreintegrationCombinedParams`. Composition root binds one instance per writer thread. + +## Invariants + +- **Single-threaded by design**: no internal lock. The composition root binds ONE preintegrator instance to ONE writer thread; concurrent calls from multiple threads are undefined behaviour. The contract test asserts the helper does not acquire any locks. +- **Strict monotonic timestamps**: every sample fed through `integrate_sample` / `integrate_window` MUST have `ts_ns` strictly greater than the previously-integrated sample's `ts_ns`. Violations raise `ImuPreintegrationError`; the preintegrator state is NOT mutated by a rejected sample. +- **Bias drift is the consumer's responsibility**: the preintegrator never re-estimates bias internally. Consumers (C1, C5) call `reset_with_bias(...)` whenever their bias estimate changes; until then, integration uses the last-set bias. 
+- **No clock ownership**: every IMU sample carries its own monotonic timestamp. The preintegrator never reads a wall clock and never injects timestamps. +- **Consumers receive GTSAM types**: `current_preintegration()` and `reset_for_new_keyframe()` return GTSAM `CombinedImuFactor` instances that consumers attach to their factor graphs. The factor object is owned by the caller after return (no lingering references inside the helper). +- **`reset_for_new_keyframe` is destructive**: it returns the closed factor AND resets internal accumulators. Callers MUST capture the return value or lose the integration. + +## Non-Goals + +- Bias estimation / re-bias logic — owned by C1 and C5. +- Multi-threaded sample feeding — out of scope; helper is single-thread by contract. +- IMU sample acquisition / FC adapter integration — owned by C8. +- Serialising preintegrated factors to FDR records — owned by C13 / E-CC-FDR-CLIENT. + +## Versioning Rules + +- **Breaking changes** (method renamed/removed, parameter type changed, return type changed, monotonicity invariant relaxed) require a new major version + a deprecation pass through C1 and C5. +- **Non-breaking additions** (new optional method, new diagnostic accessor that does not mutate state) require a minor version bump. 
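The strict-monotonic invariant above (a rejected sample must leave the preintegrator state untouched) can be sketched as a validate-then-mutate guard. This is a minimal illustration only: `MonotonicGuard` and the stand-in `ImuSample` dataclass are hypothetical names, not part of the contract, and the real helper would forward accepted samples to GTSAM rather than just tracking timestamps.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


class ImuPreintegrationError(Exception):
    """Raised on a monotonicity violation or an empty preintegration."""


@dataclass(frozen=True)
class ImuSample:
    # Hypothetical minimal stand-in for the real `ImuSample` type.
    ts_ns: int
    accel: Tuple[float, float, float]
    gyro: Tuple[float, float, float]


class MonotonicGuard:
    """Tracks the last accepted timestamp and validates before mutating."""

    def __init__(self) -> None:
        self._last_ts_ns: Optional[int] = None
        self._count = 0

    def accept(self, sample: ImuSample) -> int:
        """Return dt in nanoseconds since the previous sample (0 for the first)."""
        last = self._last_ts_ns
        if last is not None and sample.ts_ns <= last:
            # Reject BEFORE touching any state, so a bad sample leaves
            # the guard exactly as it was.
            raise ImuPreintegrationError(
                f"ts_ns {sample.ts_ns} is not strictly greater than {last}"
            )
        dt_ns = 0 if last is None else sample.ts_ns - last
        self._last_ts_ns = sample.ts_ns
        self._count += 1
        return dt_ns

    def require_samples(self) -> None:
        """Guard for reads: refuse to produce a factor from zero samples."""
        if self._count == 0:
            raise ImuPreintegrationError("no samples since reset")
```

Checking the timestamp before any assignment is what makes the "next valid sample integrates as if the rejected sample never came" behaviour fall out without rollback logic.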
+ +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| valid-monotonic-sequence | 100 samples with strictly increasing `ts_ns`, then `current_preintegration()` | factor returned with `deltaTij` matching the time span; non-zero `delta_pose` | Round-trip happy path | +| valid-window-then-keyframe | one `integrate_window(N samples)` then `reset_for_new_keyframe()` | factor returned; subsequent `current_preintegration()` raises `ImuPreintegrationError` (state cleared) | Confirms destructive reset | +| invalid-non-monotonic-sample | sample with `ts_ns < last_ts_ns` | `ImuPreintegrationError` raised; internal state unchanged (next valid sample integrates as if rejected sample never came) | Strict-monotonic invariant | +| valid-rebias | `reset_with_bias(bias_a)`, integrate 50 samples, `reset_with_bias(bias_b)`, integrate 50 more, `current_preintegration()` | factor reflects bias_b applied to second half | Re-bias mid-window | +| invalid-empty-preintegration | `current_preintegration()` after `reset_for_new_keyframe()` with no further samples | `ImuPreintegrationError` mentions "no samples since reset" | Guard against empty factor | +| determinism | same `(bias, samples)` integrated twice into two instances | deep-equal `CombinedImuFactor` outputs | Pure-function determinism | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract derived from `_docs/02_document/common-helpers/01_helper_imu_preintegrator.md` | autodev decompose Step 2 | diff --git a/_docs/02_document/contracts/shared_helpers/lightglue_runtime.md b/_docs/02_document/contracts/shared_helpers/lightglue_runtime.md new file mode 100644 index 0000000..a429255 --- /dev/null +++ b/_docs/02_document/contracts/shared_helpers/lightglue_runtime.md @@ -0,0 +1,93 @@ +# Contract: lightglue_runtime + +**Component**: shared_helpers / `helpers.lightglue_runtime` (cross-cutting concern owned by 
E-CC-HELPERS / AZ-264) +**Producer task**: AZ-278 — `_docs/02_tasks/todo/AZ-278_lightglue_runtime.md` +**Consumer tasks**: C2.5 InlierBasedReranker (single-pair LightGlue inlier counter); C3 CrossDomainMatcher (heavier matching pass) +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +Single owner of the LightGlue inference engine. C2.5 does single-pair LightGlue matching for inlier counting on K=10 candidates per frame; C3 does the heavier matching pass on the surviving N=3 candidates. Both consume the SAME LightGlue engine — sharing avoids paying the engine-build / GPU-memory cost twice and structurally prevents the C2.5 ↔ C3 import cycle (R14 fix in `_docs/02_document/epics.md`). Per `_docs/02_document/common-helpers/03_helper_lightglue_runtime.md`. + +## Shape + +### For function / method APIs + +```python +class LightGlueRuntime: + def __init__(self, engine_handle: EngineHandle) -> None: ... + def descriptor_dim(self) -> int: ... + def match( + self, + features_a: KeypointSet, + features_b: KeypointSet, + ) -> CorrespondenceSet: ... + def match_batch( + self, + features_a_list: list[KeypointSet], + features_b_list: list[KeypointSet], + ) -> list[CorrespondenceSet]: ... +``` + +| Name | Signature | Throws / Errors | Blocking? | +|------|-----------|-----------------|-----------| +| `__init__` | `(engine_handle: EngineHandle) -> None` | `LightGlueRuntimeError` if `engine_handle` is None or `descriptor_dim < 1` | sync, one-time | +| `descriptor_dim` | `() -> int` | none | sync, in-memory | +| `match` | `(KeypointSet, KeypointSet) -> CorrespondenceSet` | `LightGlueRuntimeError` if descriptor dims mismatch the engine's expected dim; `LightGlueConcurrentAccessError` if a concurrent caller tries to enter | sync, GPU-bound | +| `match_batch` | `(list[KeypointSet], list[KeypointSet]) -> list[CorrespondenceSet]` | same as `match` | sync, GPU-bound | + +`EngineHandle`, `KeypointSet`, and `CorrespondenceSet` are imported from `gps_denied_onboard._types`. 
`EngineHandle` is a Protocol (NOT a concrete class) so this helper does not import any Layer 2+ component; the production handle is created by C7's `InferenceRuntime.deserialize_engine` and injected by the composition root. + +### Construction + +The composition root constructs the runtime once at takeoff: + +```python +engine_handle = inference_runtime.deserialize_engine(LIGHTGLUE_ENGINE_CACHE_ENTRY) +runtime = LightGlueRuntime(engine_handle) +# inject the SAME instance into both consumers +c2_5_reranker = InlierBasedReranker(..., lightglue_runtime=runtime, ...) +c3_matcher = CrossDomainMatcher(..., lightglue_runtime=runtime, ...) +``` + +## Invariants + +- **Serial-access invariant** (R14 cross-component): the runtime owns ONE CUDA stream. Concurrent calls to `match` / `match_batch` from multiple threads are FORBIDDEN. The composition root binds the runtime to the single F3 hot-path thread (per `_docs/02_document/epics.md` R14 entry). The helper's contract test asserts a guard exists that rejects concurrent entry with `LightGlueConcurrentAccessError`. +- **Backbone consistency**: features fed in MUST come from the same backbone as the LightGlue engine was trained for (DISK in production-default; ALIKED / XFeat alternates). Mixing backbones is a runtime error caught by the input shape check (`descriptor_dim` mismatch raises `LightGlueRuntimeError`). The helper does NOT silently coerce dimensions. +- **No shared mutable state**: the runtime exposes no `set_*` / `update_*` methods. Once constructed with an `engine_handle`, its behaviour is fixed for its lifetime. +- **No upward imports** (Layer 1): the module imports ONLY from `_types`, numpy, and stdlib. NO `gps_denied_onboard.components.*` imports — neither C2.5 nor C3 nor C7 — under any circumstance. This is the structural fix for R14: the helper sits below the components in the layering, so the C2.5 ↔ C3 cycle becomes impossible to express. 
+- **Engine handle is opaque**: the helper does not know whether the handle wraps a TensorRT engine, an ONNX session, or a PyTorch model. It calls a fixed Protocol surface (`forward(...)`, `descriptor_dim`); the implementation owner is C7. + +## Non-Goals + +- Engine compilation / serialisation — owned by C7 (via `EngineFilenameSchema` + the inference runtime). +- Engine cache management / takeoff load — owned by C10 (`CacheProvisioner`). +- Backbone-specific feature extraction (DISK, ALIKED, XFeat) — owned by C3 / C7. +- Multi-GPU sharding — out of scope; production target is single-GPU Tier-2. +- Mixed-backbone matching (cross-DISK-ALIKED) — out of scope; consumers ensure backbone consistency before calling. + +## Versioning Rules + +- **Breaking changes** (method renamed/removed, signature changed, `EngineHandle` Protocol changed, serial-access invariant relaxed) require a new major version + a deprecation pass through C2.5 and C3. +- **Non-breaking additions** (new optional kwarg with safe default, new diagnostic accessor) require a minor version bump. +- Changing the underlying engine format (TensorRT → ONNX) is NOT a contract change because the helper's surface treats the handle as opaque — but it IS a C7 contract change and must follow C7's versioning rules. 
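The serial-access invariant above can be sketched as a fail-fast guard built on a non-blocking lock acquire. This is a hedged illustration, not the helper's actual implementation: `SerialAccessGuard` and its `enter()` method are hypothetical names, and only `LightGlueConcurrentAccessError` is taken from the contract.

```python
import threading
from contextlib import contextmanager


class LightGlueConcurrentAccessError(RuntimeError):
    """Raised when a second caller enters while a match is in flight."""


class SerialAccessGuard:
    """Hypothetical guard: rejects concurrent entry instead of queueing."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    @contextmanager
    def enter(self):
        # Non-blocking acquire: a second caller fails immediately rather
        # than waiting, which surfaces a wiring bug in the composition root.
        if not self._lock.acquire(blocking=False):
            raise LightGlueConcurrentAccessError(
                "match() is already running; the runtime is single-stream"
            )
        try:
            yield
        finally:
            self._lock.release()
```

A `match` body would then run inside `with guard.enter():`. Because `threading.Lock` is not re-entrant, this sketch also rejects nested entry from the same thread, which is stricter than the invariant requires.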
+ +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| valid-single-pair | two `KeypointSet`s of matching descriptor dim | `CorrespondenceSet` returned with `len > 0` for a synthetic-overlap pair | Round-trip happy path (C2.5 use) | +| valid-batch-3 | three pairs of `KeypointSet`s | three `CorrespondenceSet`s returned in order | Batch path (C3 use) | +| invalid-dim-mismatch | features with `descriptor_dim` not matching the engine | `LightGlueRuntimeError` mentions the expected vs actual dim | Backbone-consistency invariant | +| invalid-concurrent-access | two threads call `match` simultaneously | `LightGlueConcurrentAccessError` raised in the second-entering thread | R14 serial-access invariant | +| invalid-empty-handle | `LightGlueRuntime(engine_handle=None)` | `LightGlueRuntimeError` raised at construction | Construction guard | +| no-upward-imports | static import scan | only `_types`, numpy, stdlib — no `components.*` | R14 structural fix | +| determinism-given-engine | same `(features_a, features_b)` matched twice with the same engine handle | byte-equal `CorrespondenceSet` outputs | Pure-function determinism downstream of the engine | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract derived from `_docs/02_document/common-helpers/03_helper_lightglue_runtime.md` (R14 fix) | autodev decompose Step 2 | diff --git a/_docs/02_document/contracts/shared_helpers/ransac_filter.md b/_docs/02_document/contracts/shared_helpers/ransac_filter.md new file mode 100644 index 0000000..3df593f --- /dev/null +++ b/_docs/02_document/contracts/shared_helpers/ransac_filter.md @@ -0,0 +1,95 @@ +# Contract: ransac_filter + +**Component**: shared_helpers / `helpers.ransac_filter` (cross-cutting concern owned by E-CC-HELPERS / AZ-264) +**Producer task**: AZ-282 — `_docs/02_tasks/todo/AZ-282_ransac_filter.md` +**Consumer tasks**: every C2.5 task that runs RANSAC 
over single-pair LightGlue matches; every C3 task that runs RANSAC over 2D-2D correspondences for the per-candidate inlier count; every C3.5 task that recomputes residual after AdHoP refinement; every C4 task that computes the per-frame final reprojection residual for FDR provenance +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +Thin, deterministic wrapper around OpenCV's RANSAC + reprojection-residual computation. Keeps the four call sites (C2.5, C3, C3.5, C4) on one canonical inlier-filtering algorithm and one canonical residual definition (median pixel residual). Per `_docs/02_document/common-helpers/07_helper_ransac_filter.md`. + +## Shape + +### For function / method APIs + +```python +class RansacFilter: + @staticmethod + def filter_correspondences( + correspondences: np.ndarray, # shape (N, 4): [x_a, y_a, x_b, y_b] + ransac_threshold_px: float, + min_inliers: int, + ) -> RansacResult: ... + + @staticmethod + def compute_reprojection_residual( + correspondences: np.ndarray, # shape (I, 4): inlier set + K: np.ndarray, # shape (3, 3): camera intrinsics + distortion: np.ndarray, # shape (5,) or (8,): OpenCV distortion model + pose: SE3, + ) -> float: ... +``` + +`RansacResult` is a frozen dataclass: + +```python +@dataclass(frozen=True) +class RansacResult: + inlier_correspondences: np.ndarray # shape (I, 4) + inlier_count: int # I + outlier_count: int # N - I + median_residual_px: float +``` + +| Name | Signature | Throws / Errors | Blocking? 
| +|------|-----------|-----------------|-----------| +| `filter_correspondences` | `(correspondences, ransac_threshold_px, min_inliers) -> RansacResult` | `RansacFilterError` if `correspondences` shape != `(N, 4)`, `ransac_threshold_px <= 0`, `min_inliers < 0`, or `N < 4` (RANSAC needs ≥4 points for homography) | sync, CPU | +| `compute_reprojection_residual` | `(correspondences, K, distortion, pose) -> float` | `RansacFilterError` on shape / dtype mismatch (correspondences must be `(I, 4)`, `K` must be `(3, 3)`, distortion must be `(5,)` or `(8,)`); returns `NaN` if `I == 0` | sync, CPU | + +`SE3` is the type alias from `helpers.se3_utils` (re-exported GTSAM `Pose3`). All numpy arrays use `dtype=float64`. + +## Invariants + +- **Stateless**: no module-level state; static methods only. Stateless static-only design satisfies `coderule.mdc` ("only use static methods for pure self-contained computations"). +- **Deterministic given fixed seed**: `cv2.findHomography(..., cv2.RANSAC)` is non-deterministic by default. The helper sets `cv2.setRNGSeed(0)` (or uses the explicit `seed` kwarg where the OpenCV API supports it) so the same input correspondences always produce the same `RansacResult`. Deterministic behaviour is part of the contract. +- **Median residual semantics**: `compute_reprojection_residual` returns the MEDIAN reprojection residual in pixels (NOT the mean — outliers in the 2D residual distribution should not bias the consumer's quality signal). Returns `NaN` if `correspondences.shape[0] == 0`. +- **OpenCV-internal RANSAC ownership note**: for C4's `solvePnPRansac` (2D-3D RANSAC), OpenCV does its own internal RANSAC. THIS helper's `filter_correspondences` is for the standalone 2D-2D case (C3, C2.5, C3.5). C4 uses ONLY `compute_reprojection_residual` from this helper. +- **Min-inliers semantics**: `min_inliers` is informational — `RansacResult.inlier_count` may be less than `min_inliers`. 
The helper does NOT raise when the count falls short; the consumer decides whether to proceed (`InsufficientInliersError` etc. live in the consuming components). +- **No upward imports** (Layer 1): the module imports ONLY from `_types`, `helpers.se3_utils` (allowed — same Layer 1), `cv2`, `numpy`, and stdlib. No `gps_denied_onboard.components.*` imports. + +## Non-Goals + +- 2D-3D RANSAC inside `solvePnPRansac` — OpenCV does it internally; this helper does not wrap it. +- Per-component RANSAC threshold defaults — they are documented per-component in C2.5, C3, C3.5, C4 specs. This helper takes the threshold as a parameter; defaults belong to the consumers. +- Adaptive RANSAC (PROSAC, USAC) — out of scope for v1.0.0. +- GPU-accelerated RANSAC — out of scope for v1.0.0. +- Confidence / iteration-count tuning of the underlying `cv2.findHomography` call — exposed only via the `ransac_threshold_px` parameter; if a future consumer needs to tune iterations, that's a minor-version contract addition. + +## Versioning Rules + +- **Breaking changes** (function renamed/removed, signature changed, return shape changed, residual statistic changed from median to mean) require a new major version + a deprecation pass through C2.5, C3, C3.5, C4. +- **Non-breaking additions** (new optional kwarg with safe default, new accessor on `RansacResult`) require a minor version bump. 
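The median-residual semantics above can be illustrated with a stdlib-only sketch. It assumes the projected pixel positions have already been computed (the real helper derives them from `K`, `distortion`, and `pose`), and `median_residual_px` is a hypothetical free function, not the contract's `compute_reprojection_residual`.

```python
import math
import statistics
from typing import Sequence, Tuple

Point = Tuple[float, float]


def median_residual_px(observed: Sequence[Point], projected: Sequence[Point]) -> float:
    """Median Euclidean pixel distance between observed and projected points."""
    if len(observed) != len(projected):
        raise ValueError("observed and projected must have the same length")
    if not observed:
        # Empty inlier set: NaN rather than an exception, as the contract states.
        return math.nan
    distances = [
        math.hypot(ox - px, oy - py)
        for (ox, oy), (px, py) in zip(observed, projected)
    ]
    return statistics.median(distances)
```

With four residuals of 0.5 px and one gross outlier of 50 px, the median stays at 0.5 px while the mean would be 10.4 px, which is why the contract fixes the statistic to the median.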
+ +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| valid-clean-correspondences | 100 perfect homography correspondences | `inlier_count == 100`, `outlier_count == 0`, `median_residual_px ≈ 0.0` | Round-trip happy path | +| valid-mixed | 80 inliers + 20 outlier correspondences with threshold 1.5 px | `inlier_count` ∈ `[78, 82]` (RANSAC noise tolerance), `outlier_count == 100 - inlier_count` | Mixed-quality input | +| valid-determinism | same input run twice through `filter_correspondences` | byte-equal `RansacResult` outputs | Deterministic-seed invariant | +| valid-residual-zero-on-clean | 4 perfect 2D-2D correspondences with known pose | `median_residual_px ≈ 0.0` | Clean residual | +| valid-residual-nan-on-empty | empty inlier array | returns `NaN` (no exception) | Empty-input semantics | +| invalid-shape | `correspondences.shape = (10, 3)` | `RansacFilterError`; mentions `(N, 4)` shape | Shape contract | +| invalid-threshold | `ransac_threshold_px = -1.0` | `RansacFilterError`; mentions positive threshold | Threshold guard | +| invalid-too-few-points | `correspondences.shape = (3, 4)` | `RansacFilterError`; mentions minimum 4 points | RANSAC point-count guard | +| invalid-K-shape | `K.shape = (4, 4)` in residual call | `RansacFilterError`; mentions `(3, 3)` shape | K shape contract | +| no-upward-imports | static import scan | only `_types`, `helpers.se3_utils`, `cv2`, `numpy`, stdlib | Layer 1 invariant | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract derived from `_docs/02_document/common-helpers/07_helper_ransac_filter.md` | autodev decompose Step 2 | diff --git a/_docs/02_document/contracts/shared_helpers/se3_utils.md b/_docs/02_document/contracts/shared_helpers/se3_utils.md new file mode 100644 index 0000000..e6373d8 --- /dev/null +++ b/_docs/02_document/contracts/shared_helpers/se3_utils.md @@ -0,0 +1,78 @@ +# Contract: 
se3_utils + +**Component**: shared_helpers / `helpers.se3_utils` (cross-cutting concern owned by E-CC-HELPERS / AZ-264) +**Producer task**: AZ-277 — `_docs/02_tasks/todo/AZ-277_se3_utils.md` +**Consumer tasks**: every C1 VIO task that produces relative poses, every C2.5 / C3 / C3.5 task that handles 4x4 → SE(3) conversion, every C4 task that converts `solvePnPRansac` output into a GTSAM factor, every C5 task that builds iSAM2 graph keys, every C8 task that encodes pose for FC emission +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +Centralise SE(3) ↔ 4×4-matrix conversion and Lie-algebra exponential / logarithm / adjoint so every component that crosses the matrix-vs-pose boundary uses the same numerical convention. Per `_docs/02_document/common-helpers/02_helper_se3_utils.md`. Backed by GTSAM `Pose3` primitives where available; pure numpy fallback otherwise. + +## Shape + +### For function / method APIs + +```python +def matrix_to_se3(T_4x4: np.ndarray) -> SE3: ... +def se3_to_matrix(pose: SE3) -> np.ndarray: ... +def exp_map(xi: np.ndarray) -> SE3: ... # xi shape (6,) +def log_map(pose: SE3) -> np.ndarray: ... # returns shape (6,) +def adjoint(pose: SE3) -> np.ndarray: ... # returns shape (6, 6) +def is_valid_rotation(R_3x3: np.ndarray, *, atol: float = 1e-6) -> bool: ... +``` + +| Name | Signature | Throws / Errors | Blocking? 
| +|------|-----------|-----------------|-----------| +| `matrix_to_se3` | `(T_4x4) -> SE3` | `Se3InvalidMatrixError` if shape != (4,4), bottom row != [0,0,0,1], or rotation is not orthogonal within `atol` | sync, pure | +| `se3_to_matrix` | `(SE3) -> np.ndarray (4,4)` | none | sync, pure | +| `exp_map` | `(xi: (6,)) -> SE3` | `Se3InvalidMatrixError` if shape != (6,) | sync, pure | +| `log_map` | `(SE3) -> np.ndarray (6,)` | none | sync, pure | +| `adjoint` | `(SE3) -> np.ndarray (6,6)` | none | sync, pure | +| `is_valid_rotation` | `(R_3x3) -> bool` | none (returns False for any invalid input) | sync, pure | + +`SE3` is a type alias for the GTSAM `Pose3` (re-exported from `helpers.se3_utils` so consumers do not import GTSAM directly). All numpy arrays use `dtype=float64`; passing `float32` raises `Se3InvalidMatrixError`. + +## Invariants + +- **Stateless**: no module-level state; every function is pure. The same input always produces the same output (deep-equal). +- **Right-handed convention**: rotation order is right-handed; `T_4x4` follows the standard `[[R, t], [0, 1]]` block layout. +- **Orthogonal-rotation guarantee on the way in**: callers MUST orthogonalise their rotation matrices before `matrix_to_se3`. The helper rejects matrices whose `R^T R` deviates from `I` by more than `atol`. The helper does NOT silently re-orthogonalise. +- **Positive-determinant rotation**: `det(R) ≈ +1`. Mirror matrices (`det(R) ≈ -1`) are rejected. +- **Round-trip identity**: `se3_to_matrix(matrix_to_se3(T)) == T` for any valid `T` within numerical tolerance (`np.allclose(..., atol=1e-9)`). +- **Lie-algebra round-trip**: `exp_map(log_map(p)) == p` for any non-degenerate `p` within `atol=1e-9`. Near-identity edge cases (twist norm < 1e-10) MUST not raise — the implementation falls back to the small-angle Taylor expansion documented in GTSAM. +- **No upward imports** (Layer 1): the module imports ONLY from `_types`, GTSAM, numpy, and stdlib. 
No `gps_denied_onboard.components.*` imports. + +## Non-Goals + +- Quaternion utilities (`Rotation` / `Quaternion`) — out of scope; consumers that need a quaternion are expected to convert inline via SciPy's `Rotation.from_matrix` / `Rotation.from_quat`. +- SE(2) / planar pose helpers — out of scope. +- Pose interpolation / Slerp — out of scope (consumers that need it implement it locally on top of `exp_map` / `log_map`). +- Manifold operators richer than exp/log/adjoint (e.g., parallel transport, twist composition Jacobians) — out of scope; revisit when a consumer needs them. + +## Versioning Rules + +- **Breaking changes** (function renamed/removed, signature changed, error type changed, dtype contract relaxed) require a new major version + a deprecation pass through C1, C2.5, C3, C3.5, C4, C5, C8. +- **Non-breaking additions** (new helper function, new optional kwarg with safe default) require a minor version bump. + +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| valid-roundtrip-4x4 | random valid `T_4x4` | `np.allclose(se3_to_matrix(matrix_to_se3(T)), T, atol=1e-9)` | Round-trip happy path | +| valid-roundtrip-lie | random `xi` of norm ≈ 1.0 | `np.allclose(log_map(exp_map(xi)), xi, atol=1e-9)` | Lie-algebra round-trip | +| valid-near-identity | `xi = [1e-12]*6` | `exp_map(xi)` returns identity within `atol=1e-9`; no exception | Small-angle stability | +| invalid-non-orthogonal | `T_4x4` whose `R` has `R^T R - I` of norm 1e-3 | `Se3InvalidMatrixError` raised; helper does NOT silently re-orthogonalise | Strict caller-orthogonalisation rule | +| invalid-mirror | `T_4x4` with `det(R) = -1` | `Se3InvalidMatrixError` raised | Positive-det invariant | +| invalid-bottom-row | `T_4x4` with bottom row `[0,0,0,2]` | `Se3InvalidMatrixError` raised | Block-layout guard | +| invalid-dtype | `T_4x4` with `dtype=float32` | `Se3InvalidMatrixError` raised mentioning dtype | dtype contract | +| determinism | same `T_4x4` through `matrix_to_se3 → 
se3_to_matrix` twice | byte-equal numpy outputs | Pure-function determinism | +| no-upward-imports | static import scan of `helpers.se3_utils` | only `_types`, GTSAM, numpy, stdlib | Layer 1 invariant | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract derived from `_docs/02_document/common-helpers/02_helper_se3_utils.md` | autodev decompose Step 2 | diff --git a/_docs/02_document/contracts/shared_helpers/sha256_sidecar.md b/_docs/02_document/contracts/shared_helpers/sha256_sidecar.md new file mode 100644 index 0000000..5ef1de7 --- /dev/null +++ b/_docs/02_document/contracts/shared_helpers/sha256_sidecar.md @@ -0,0 +1,77 @@ +# Contract: sha256_sidecar + +**Component**: shared_helpers / `helpers.sha256_sidecar` (cross-cutting concern owned by E-CC-HELPERS / AZ-264) +**Producer task**: AZ-280 — `_docs/02_tasks/todo/AZ-280_sha256_sidecar.md` +**Consumer tasks**: every C6 task that writes the FAISS index / descriptor sidecar; every C7 task that writes engine cache files + INT8 calibration cache; every C10 task that writes the Manifest; every C11 task that verifies tile artifacts before serving them +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +Centralise the atomic-write + SHA-256 content-hash sidecar pattern (D-C10-3). Every persistent artifact that takeoff-load (F2) must verify gets written atomically AND has a `.sha256` sidecar that the verifier can independently recompute. Without a shared helper, C6 / C7 / C10 / C11 each grow their own slightly-different implementation; the takeoff-load gate breaks the moment one of them drifts. Per `_docs/02_document/common-helpers/05_helper_sha256_sidecar.md`. + +## Shape + +### For function / method APIs + +```python +class Sha256Sidecar: + @staticmethod + def write_atomic(path: Path, payload: bytes) -> str: ... 
# returns hex digest + @staticmethod + def write_atomic_and_sidecar(path: Path, payload: bytes) -> str: ... # returns hex digest + @staticmethod + def verify(path: Path) -> bool: ... # checks payload hash against sidecar + @staticmethod + def aggregate_hash(paths: list[Path]) -> str: ... # for Manifest covering many files +``` + +| Name | Signature | Throws / Errors | Blocking? | +|------|-----------|-----------------|-----------| +| `write_atomic` | `(path, payload) -> str` | `Sha256SidecarError` if parent dir missing or filesystem rejects rename; underlying `OSError` is wrapped | sync, I/O | +| `write_atomic_and_sidecar` | `(path, payload) -> str` | same as `write_atomic` plus failure to write the sidecar atomically | sync, I/O | +| `verify` | `(path) -> bool` | `Sha256SidecarError` if `path` exists but `path.sha256` is missing or malformed (returns `False` if `path` itself is missing) | sync, I/O | +| `aggregate_hash` | `(list[Path]) -> str` | `Sha256SidecarError` if any path is missing | sync, I/O | + +`Path` is `pathlib.Path`. Hex digests are lowercase 64-char strings. + +## Invariants + +- **Atomic write**: `write_atomic` writes to a temp file in the same directory as `path` and renames to `path` once the bytes are flushed. The rename is filesystem-level — partial files NEVER appear at `path`. +- **Sidecar format**: `write_atomic_and_sidecar` writes `.sha256` containing ONLY the lowercase hex digest, no JSON wrapper, no trailing newline. Keeps verification trivial (`open(...).read().strip() == expected`). +- **Verify is independent**: `verify(path)` recomputes the digest from the file's bytes and compares to the sidecar; it does NOT trust the sidecar's value alone. +- **Aggregate hash is order-deterministic**: `aggregate_hash` sorts the input paths first (case-sensitive, full path) so two runs that read the same files always yield the same aggregate. The aggregate is the SHA-256 of the concatenation of `\0\n` lines (in sorted order). 
+- **No upward imports** (Layer 1): the module imports ONLY from `_types`, `atomicwrites`, `hashlib`, `pathlib`, and stdlib. No `gps_denied_onboard.components.*` imports. +- **Production filesystem requirement**: the atomic rename is filesystem-level — works on POSIX local filesystems, not on NFS / SMB / overlayfs. The cache root MUST live on a local filesystem in production. Documented in the contract's Caveats section; not enforced at runtime (it would require an OS-specific check that adds no value when the deployment is locked). + +## Non-Goals + +- Cryptographic signing — the sidecar protects against accidental corruption + file-replacement-after-staging, NOT against an attacker with write access. Threat model treats the operator workstation as trusted; the companion's write access is restricted to F4 (mid-flight tile gen) which has its own per-flight signing key path (out of scope for this helper). +- Streaming hashing of files larger than RAM — the helper's API takes `payload: bytes`, so the entire payload is in memory at write time. Files larger than RAM are out of scope (and outside the operational constraints of the cache root anyway). +- Compression / on-disk encoding — payload is written verbatim. +- Sidecar format versioning — there is no version byte; if the format ever changes, the verifier rejects the old format and forces a re-write. + +## Versioning Rules + +- **Breaking changes** (sidecar format changed, function renamed/removed, return type changed, atomicity invariant relaxed) require a new major version + a deprecation pass through C6, C7, C10, C11. +- **Non-breaking additions** (new helper function, new optional kwarg with safe default) require a minor version bump. 
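The atomic-write, sidecar, and independent-verify invariants above can be sketched with the standard library alone. This is an assumption-laden illustration: it uses `tempfile.mkstemp` plus `os.replace` instead of the `atomicwrites` package the contract names, places the sidecar at `<name>.sha256` next to the payload, and omits the `Sha256SidecarError` wrapping, so a missing sidecar surfaces here as a plain `FileNotFoundError`.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def _sidecar_path(path: Path) -> Path:
    # Assumption: the sidecar lives next to the payload as "<name>.sha256".
    return path.with_name(path.name + ".sha256")


def _write_atomic(path: Path, data: bytes) -> None:
    # Temp file in the SAME directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)  # atomic rename on POSIX local filesystems
    except BaseException:
        # Never leave a partial temp file behind on failure.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def write_atomic_and_sidecar(path: Path, payload: bytes) -> str:
    """Write payload and its digest sidecar; return the lowercase hex digest."""
    digest = hashlib.sha256(payload).hexdigest()
    _write_atomic(path, payload)
    _write_atomic(_sidecar_path(path), digest.encode("ascii"))
    return digest


def verify(path: Path) -> bool:
    """Recompute the digest from the file bytes and compare to the sidecar."""
    if not path.exists():
        return False
    expected = _sidecar_path(path).read_text(encoding="ascii").strip()
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    return actual == expected
```

Writing the payload before the sidecar means a crash between the two renames leaves a payload without a matching sidecar, which `verify` treats as a failure rather than a false positive.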
+ +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| valid-write-and-verify | random 1 MiB payload, write to tmp path, then `verify` | `verify` returns True; sidecar contains the hex digest of the payload | Round-trip happy path | +| valid-aggregate-deterministic | 3 files written with the helper, then `aggregate_hash` called twice with paths in different order | both calls return the same hex digest | Order-deterministic invariant | +| valid-atomic-no-partial | inject a fault between temp write and rename (e.g., raise `OSError` mid-write); call `verify` afterward | `path` does NOT exist (or pre-existing version unchanged); no partial file at the target name | Atomicity invariant | +| invalid-sidecar-mismatch | manually overwrite `path` with different bytes after the sidecar was written | `verify(path)` returns False | Independent verification | +| invalid-missing-sidecar | `verify` on a path whose `.sha256` was deleted | `Sha256SidecarError` raised mentioning the missing sidecar | Strict sidecar requirement | +| invalid-malformed-sidecar | sidecar contains `not a hex digest` | `Sha256SidecarError` raised mentioning malformed digest | Sidecar format strictness | +| invalid-missing-file-in-aggregate | `aggregate_hash` on a list including a non-existent path | `Sha256SidecarError` raised mentioning the missing path | Aggregate input validation | +| no-upward-imports | static import scan | only `_types`, `atomicwrites`, `hashlib`, `pathlib`, stdlib | Layer 1 invariant | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract derived from `_docs/02_document/common-helpers/05_helper_sha256_sidecar.md` | autodev decompose Step 2 | diff --git a/_docs/02_document/contracts/shared_helpers/wgs_converter.md b/_docs/02_document/contracts/shared_helpers/wgs_converter.md new file mode 100644 index 0000000..3c7a529 --- /dev/null +++ 
b/_docs/02_document/contracts/shared_helpers/wgs_converter.md @@ -0,0 +1,88 @@ +# Contract: wgs_converter + +**Component**: shared_helpers / `helpers.wgs_converter` (cross-cutting concern owned by E-CC-HELPERS / AZ-264) +**Producer task**: AZ-279 — `_docs/02_tasks/todo/AZ-279_wgs_converter.md` +**Consumer tasks**: every C4 pose-estimation task that compares pose-in-WGS to pose-in-ENU; every C5 state-estimator task that initialises the iSAM2 graph from a WGS origin; every C6 task that maps a tile bbox to lat/lon; every C8 task that encodes pose for FC emission; every C10 / C11 task that resolves a bbox to a tile-id list; every C12 task where the operator enters a bbox +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +Centralise WGS84 ↔ local-tangent-plane (ENU) ↔ tile-pixel coordinate conversions. Required by every component that interacts with geographic positions. Per `_docs/02_document/common-helpers/04_helper_wgs_converter.md`. Backed by `pyproj` for the geodesy primitives; tile_xy math uses the standard slippy-map convention so it matches `satellite-provider`'s on-disk layout. + +## Shape + +### For function / method APIs + +```python +class WgsConverter: + @staticmethod + def latlonalt_to_ecef(p: LatLonAlt) -> np.ndarray: ... # shape (3,) + @staticmethod + def ecef_to_latlonalt(p_ecef: np.ndarray) -> LatLonAlt: ... + @staticmethod + def latlonalt_to_local_enu(origin: LatLonAlt, p: LatLonAlt) -> np.ndarray: ... # shape (3,) + @staticmethod + def local_enu_to_latlonalt(origin: LatLonAlt, p_enu: np.ndarray) -> LatLonAlt: ... + @staticmethod + def latlon_to_tile_xy(zoom: int, lat: float, lon: float) -> tuple[int, int]: ... + @staticmethod + def tile_xy_to_latlon_bounds(zoom: int, x: int, y: int) -> BoundingBox: ... +``` + +| Name | Signature | Throws / Errors | Blocking? 
| +|------|-----------|-----------------|-----------| +| `latlonalt_to_ecef` | `(LatLonAlt) -> np.ndarray (3,)` | `WgsConversionError` if lat / lon / alt are out of range | sync, pure | +| `ecef_to_latlonalt` | `(np.ndarray (3,)) -> LatLonAlt` | `WgsConversionError` on shape mismatch | sync, pure | +| `latlonalt_to_local_enu` | `(origin, p) -> np.ndarray (3,)` | `WgsConversionError` on origin / point validation | sync, pure | +| `local_enu_to_latlonalt` | `(origin, p_enu) -> LatLonAlt` | `WgsConversionError` on origin / shape | sync, pure | +| `latlon_to_tile_xy` | `(zoom, lat, lon) -> (int, int)` | `WgsConversionError` if zoom < 0 or > 22, lat out of `[-85.0511, 85.0511]`, lon out of `[-180, 180]` | sync, pure | +| `tile_xy_to_latlon_bounds` | `(zoom, x, y) -> BoundingBox` | `WgsConversionError` if `x` or `y` out of `[0, 2^zoom)` | sync, pure | + +`LatLonAlt` and `BoundingBox` are imported from `gps_denied_onboard._types`. Numpy arrays use `dtype=float64`. `WgsConversionError` is the only exception type the public surface raises. + +## Invariants + +- **Stateless**: no module-level state; static methods only. The static-only design satisfies the coderule.mdc constraint ("only use static methods for pure self-contained computations") because every operation is a pure mathematical function of its arguments. +- **WGS84 ellipsoid only**: all conversions use the WGS84 ellipsoid; no datum-shift logic. If a future deployment needs alternative datum support, switch to an instance-based factory then. +- **Slippy-map tile convention**: `latlon_to_tile_xy` matches OSM / `satellite-provider`'s on-disk `{zoom}/{x}/{y}.jpg` layout. Latitude is clamped to the Web-Mercator-valid range `[-85.0511, 85.0511]`; values outside raise `WgsConversionError`. +- **ENU sign convention**: `latlonalt_to_local_enu` returns `(east, north, up)` in metres. Origin altitude IS used (height above ellipsoid); zero altitude is NOT silently substituted. 
+- **Round-trip identity**: `local_enu_to_latlonalt(origin, latlonalt_to_local_enu(origin, p)) ≈ p` to within ~1 m horizontally (lat/lon) and ~1 cm vertically (alt) for `p` within 100 km of `origin`. Beyond 100 km the tangent-plane approximation degrades — the contract documents this limit. +- **Zoom-level dependence**: `tile_xy_to_latlon_bounds` and `latlon_to_tile_xy` are sensitive to `zoom`; callers MUST pass the right zoom for the tile in question (typically `zoomLevel` from `TileMetadata`). +- **No upward imports** (Layer 1): the module imports ONLY from `_types`, `pyproj`, numpy, and stdlib. NO `gps_denied_onboard.components.*` imports. + +## Non-Goals + +- Datum-shift logic / non-WGS84 datums — out of scope for v1.0.0. +- UTM / MGRS conversions — out of scope. +- Geoid-height corrections (orthometric vs. ellipsoidal altitude) — out of scope; consumers using altitude do so under the ellipsoid convention or apply geoid correction themselves. +- Vincenty / great-circle distance helpers — out of scope. +- Coordinate transforms involving rotation (body-frame ↔ ECEF) — owned by `helpers.se3_utils` plus the per-deployment `CameraCalibration`. + +## Versioning Rules + +- **Breaking changes** (function renamed/removed, signature changed, ENU sign convention flipped, return shape changed) require a new major version + a deprecation pass through C4, C5, C6, C8, C10, C11, C12. +- **Non-breaking additions** (new helper function, new optional kwarg with safe default) require a minor version bump. +- Adding a new datum requires a major version bump (the static-only design assumes WGS84). 
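The slippy-map invariant above can be illustrated with a minimal sketch. It assumes the standard OSM tile formula and uses only `math`; the real helper is specified as static methods on `WgsConverter` returning a `BoundingBox`, so the free functions, the plain `(south, west, north, east)` tuple, and the local `WgsConversionError` class below are stand-ins for illustration, not the project's implementation.

```python
import math

MAX_MERCATOR_LAT = 85.0511  # Web-Mercator latitude limit quoted in the contract
MAX_ZOOM = 22               # zoom guard quoted in the contract


class WgsConversionError(ValueError):
    """Illustrative stand-in for the contract's exception type."""


def latlon_to_tile_xy(zoom: int, lat: float, lon: float) -> tuple[int, int]:
    """Map a WGS84 lat/lon to slippy-map tile indices at the given zoom."""
    if not 0 <= zoom <= MAX_ZOOM:
        raise WgsConversionError(f"zoom {zoom} outside [0, {MAX_ZOOM}]")
    if not -MAX_MERCATOR_LAT <= lat <= MAX_MERCATOR_LAT:
        raise WgsConversionError(f"lat {lat} outside Web-Mercator latitude range")
    if not -180.0 <= lon <= 180.0:
        raise WgsConversionError(f"lon {lon} outside [-180, 180]")
    n = 1 << zoom  # number of tiles along each axis
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # lon == 180 (or lat at the southern limit) can land exactly on n; clamp.
    return min(x, n - 1), min(max(y, 0), n - 1)


def tile_xy_to_latlon_bounds(zoom: int, x: int, y: int) -> tuple[float, float, float, float]:
    """Return (south, west, north, east) in degrees for one tile."""
    if not 0 <= zoom <= MAX_ZOOM:
        raise WgsConversionError(f"zoom {zoom} outside [0, {MAX_ZOOM}]")
    n = 1 << zoom
    if not (0 <= x < n and 0 <= y < n):
        raise WgsConversionError(f"tile ({x}, {y}) outside [0, {n})")

    def lat_of_row(row: int) -> float:
        # Inverse Web-Mercator: latitude of the northern edge of tile row `row`.
        return math.degrees(math.atan(math.sinh(math.pi * (1.0 - 2.0 * row / n))))

    west = x / n * 360.0 - 180.0
    east = (x + 1) / n * 360.0 - 180.0
    return lat_of_row(y + 1), west, lat_of_row(y), east
```

A caller can check the `valid-tile-roundtrip-z18` case by converting a point to `(x, y)` and asserting that the returned bounds contain the original lat/lon.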
+ +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| valid-roundtrip-ecef | `LatLonAlt(50.0, 30.0, 100.0)` | `ecef_to_latlonalt(latlonalt_to_ecef(p))` matches `p` within `atol=1e-9 deg, 1e-6 m` | Round-trip happy path | +| valid-roundtrip-enu | origin + point ~10 km away | `local_enu_to_latlonalt(origin, latlonalt_to_local_enu(origin, p))` matches `p` within 1 m horizontal + 1 cm vertical | ENU round-trip | +| valid-tile-roundtrip-z18 | `(zoom=18, lat=50.45, lon=30.52)` | `latlon_to_tile_xy` returns valid `(x, y)`; `tile_xy_to_latlon_bounds(zoom, x, y)` contains the input lat/lon | Slippy-map convention | +| valid-tile-bounds-z18 | `(zoom=18, x=148000, y=89400)` | bounds returned with non-zero area; corners at expected slippy-map lat/lon | Tile bounds | +| invalid-lat-out-of-range | lat = 95.0 in `latlon_to_tile_xy` | `WgsConversionError` mentions Web-Mercator latitude range | Slippy-map invariant | +| invalid-zoom-too-high | zoom = 25 | `WgsConversionError` mentions zoom range `[0, 22]` | Zoom guard | +| invalid-tile-xy-out-of-range | `(zoom=18, x=2^18, y=0)` | `WgsConversionError` mentions tile-xy range | Tile-xy guard | +| invalid-shape | `ecef_to_latlonalt(np.array([1.0, 2.0]))` (shape (2,)) | `WgsConversionError` mentions expected shape (3,) | Shape contract | +| no-upward-imports | static import scan | only `_types`, `pyproj`, numpy, stdlib | Layer 1 invariant | +| determinism | same input through any function twice | byte-equal outputs | Pure-function determinism | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract derived from `_docs/02_document/common-helpers/04_helper_wgs_converter.md` | autodev decompose Step 2 | diff --git a/_docs/02_document/contracts/shared_logging/log_record_schema.md b/_docs/02_document/contracts/shared_logging/log_record_schema.md new file mode 100644 index 0000000..4e1f19a --- /dev/null +++ 
b/_docs/02_document/contracts/shared_logging/log_record_schema.md @@ -0,0 +1,84 @@ +# Contract: log_record_schema + +**Component**: shared_logging (cross-cutting concern owned by E-CC-LOG / AZ-245) +**Producer task**: AZ-266 — `_docs/02_tasks/todo/AZ-266_log_module.md` +**Consumer tasks**: every component task that emits logs (C1–C13 components, plus C12 operator tooling) +**Version**: 1.0.0 +**Status**: draft +**Last Updated**: 2026-05-10 + +## Purpose + +Frozen, machine-parseable JSON envelope for every log record emitted by any onboard component. Stable field set + ordering is a hard requirement for FDR analysis tooling (`kind="log"` records are post-flight queryable) and for the contract test that verifies field-name + ordering invariants. + +## Shape + +### One JSON object per log line, UTF-8, no trailing comma, newline-terminated + +```python +# Conceptual dataclass — actual implementation may emit via orjson / python-json-logger +@frozen +class LogRecord: + ts: str # ISO 8601 UTC, microsecond precision, e.g. "2026-05-10T03:14:15.123456Z" + level: str # one of {"DEBUG", "INFO", "WARN", "ERROR"}; Python stdlib levelnames, except that "WARNING" is emitted as "WARN" + component: str # component slug from module-layout.md, e.g. "c2_vpr", "c5_state", "shared.logging" + frame_id: int | None # monotonic per-flight frame counter; None for non-frame-correlated records (startup, shutdown, periodic) + kind: str # categorical tag, e.g. 
"vio.tick", "vpr.query", "fdr.write", "log.diag" + msg: str # human-readable short message, no PII, no stack traces (those go in `exc`) + kv: dict[str, Any] # arbitrary structured key-value payload, JSON-safe scalars + nested dict/list only + exc: str | None # optional formatted exception traceback for ERROR/WARN; None otherwise +``` + +| Field | Type | Required | Description | Constraints | +|-------|------|----------|-------------|-------------| +| `ts` | string (ISO 8601 UTC, µs) | yes | Emit timestamp | RFC 3339 with `Z` suffix | +| `level` | string | yes | Log level | strictly one of `DEBUG`, `INFO`, `WARN`, `ERROR` | +| `component` | string | yes | Origin component slug | snake_case, must match a module-layout entry or `shared.` | +| `frame_id` | integer or null | no | Per-flight monotonic frame index | non-negative when present | +| `kind` | string | yes | Record category tag | dotted snake_case, max 64 chars | +| `msg` | string | yes | Human message | no embedded newlines (use `kv` for multi-line context) | +| `kv` | object | yes (may be `{}`) | Structured key-value payload | JSON-safe scalars + nested objects/arrays | +| `exc` | string or null | no | Exception traceback for ERROR/WARN | absent or `null` for INFO/DEBUG | + +### Field ordering (REQUIRED — verified by contract test) + +`ts, level, component, frame_id, kind, msg, kv, exc` — formatter MUST emit keys in this order. Re-ordering breaks downstream column-aligned parsers used by FDR tooling. + +## Invariants + +- Every record is a single JSON object on a single line (newline-terminated, no embedded newlines in any field value). +- `level` value uses `WARN` not `WARNING` (intentional, simpler grep target). +- `frame_id` is `null`, never an invented value, when the emitter has no current frame context. +- `kv` values must be JSON-serialisable without custom encoders; binary payloads are base64-encoded strings within `kv`. 
+- `exc` is present only for `level in {WARN, ERROR}` records that originated from an exception; otherwise it is `null` or absent. +- The schema is strictly additive — no field is ever removed or renamed without a major version bump and a matching FDR record-schema migration in E-CC-FDR-CLIENT. + +## Non-Goals + +- This contract does not define WHAT to log (per-component § 9 sections own that). +- This contract does not define log routing (stdout vs journald vs FDR — see handler topology in E-CC-LOG epic). +- This contract does not define structured event types — `kind` is a free-form tag, not a closed enum. + +## Versioning Rules + +- **Breaking changes** (field renamed/removed, type changed, ordering changed, level enum reduced) require a new major version + a deprecation pass through every consumer. +- **Non-breaking additions** (new optional field appended at the end of the order, new `kind` tag, new `level` value) require a minor version bump. +- The contract test (`tests/contract/log_schema.py`) MUST be updated alongside any version bump. 
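The ordering and single-line invariants above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the AZ-266 formatter: `format_record`, its `now` parameter, and the local `LogSchemaError` class are hypothetical names, and the sketch picks the "reject" option for multi-line `msg` where the contract also allows escaping.

```python
import json
from datetime import datetime, timezone
from typing import Any, Optional

FIELD_ORDER = ("ts", "level", "component", "frame_id", "kind", "msg", "kv", "exc")
LEVELS = {"DEBUG", "INFO", "WARN", "ERROR"}


class LogSchemaError(ValueError):
    """Illustrative stand-in for the contract's rejection error."""


def format_record(
    level: str,
    component: str,
    kind: str,
    msg: str,
    kv: Optional[dict[str, Any]] = None,
    frame_id: Optional[int] = None,
    exc: Optional[str] = None,
    now: Optional[datetime] = None,  # timezone-aware; injectable for tests
) -> str:
    """Return one JSON line whose keys follow FIELD_ORDER."""
    if level not in LEVELS:
        raise LogSchemaError(f"level must be one of {sorted(LEVELS)}, got {level!r}")
    if "\n" in msg:
        raise LogSchemaError("msg must not contain embedded newlines")
    stamp = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # dicts preserve insertion order, and json.dumps keeps it unless sort_keys=True.
    record = {
        "ts": stamp.strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
        "level": level,
        "component": component,
        "frame_id": frame_id,
        "kind": kind,
        "msg": msg,
        "kv": kv if kv is not None else {},
        "exc": exc,
    }
    try:
        return json.dumps(record, ensure_ascii=False)
    except (TypeError, ValueError) as err:
        raise LogSchemaError(f"kv is not JSON-serialisable: {err}") from err
```

Building the dict in contract order keeps the emitted key order independent of how the caller passed its arguments, which is the property the `ordering-stable` case checks.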
+ +## Test Cases + +| Case | Input | Expected | Notes | +|------|-------|----------|-------| +| valid-info-no-frame | `level=INFO, component="c2_vpr", kind="vpr.warmup", msg="loaded model", kv={"model": "salad"}` | accepted; `frame_id=null`, `exc=null`; field order matches spec | Startup-time INFO record | +| valid-warn-with-frame | `level=WARN, component="c5_state", frame_id=4321, kind="state.cov_spike", msg="covariance jumped 5x", kv={"jump_factor": 5.2}` | accepted; key order locked; FDR bridge MUST forward this record | Cross-cuts AC: WARN flows into FDR | +| valid-error-with-exc | `level=ERROR, component="c11_tilemanager", kind="tile.upload_fail", msg="HTTP 503", kv={"tile": "z18/x12345/y67890"}, exc="Traceback (most recent call last):..."` | accepted; `exc` present and non-null; FDR bridge MUST forward | Cross-cuts AC: ERROR + exc captured | +| invalid-bad-level | `level="WARNING"` | rejected with `LogSchemaError` (or formatter logs at ERROR and drops record) | Contract test enforces `WARN` not `WARNING` | +| invalid-multiline-msg | `msg="line1\nline2"` | rejected OR newline replaced with `\\n` literal — formatter must guarantee single-line output | One JSON object per line invariant | +| invalid-non-serialisable-kv | `kv={"obj": }` | rejected with `LogSchemaError` (caller must convert to list before passing) | JSON-safe-only invariant | +| ordering-stable | any valid record | emitted JSON keys appear in `ts, level, component, frame_id, kind, msg, kv, exc` order regardless of construction order | Contract test parses raw bytes and asserts key order | + +## Change Log + +| Version | Date | Change | Author | +|---------|------|--------|--------| +| 1.0.0 | 2026-05-10 | Initial contract derived from E-CC-LOG epic (AZ-245) | autodev decompose Step 2 | diff --git a/_docs/02_tasks/_dependencies_table.md b/_docs/02_tasks/_dependencies_table.md new file mode 100644 index 0000000..0b4a591 --- /dev/null +++ b/_docs/02_tasks/_dependencies_table.md @@ -0,0 +1,267 @@ +# 
Dependencies Table + +**Date**: 2026-05-10 (refreshed after E-BBT decomposition) +**Total Tasks**: 140 (99 product + 41 blackbox-test) +**Total Complexity Points**: 472 (339 product + 133 blackbox-test) + +Dependencies columns list only the tracker-ID portion (descriptive tail +text in each task spec is omitted here for table-readability). The +authoritative dependency narrative — including "co-developed", "forward +dependency", and helper-vs-Protocol distinctions — lives in each task's +own `Dependencies:` field. The graph is a strict DAG: a topological +traversal visits all 140 tasks. The 13 forward edges (dep ID > task ID) +are all declared and documented below under **Cycle Check**. + +| Task | Name | Complexity | Dependencies | Epic | +|--------|----------------------------------------------------------------------------------------------|------------|-------------------------------------------------------------------------------------------------------|--------| +| AZ-263 | Initial Structure | 5 | None | AZ-244 | +| AZ-266 | Shared Logging Module | 3 | AZ-263 | AZ-245 | +| AZ-267 | FDR Log Bridge | 2 | AZ-266, AZ-272 (forward) | AZ-245 | +| AZ-268 | Log Schema Contract Test | 2 | AZ-266, AZ-267 | AZ-245 | +| AZ-269 | Config Loader | 3 | AZ-263 | AZ-246 | +| AZ-270 | Composition Root | 3 | AZ-269 | AZ-246 | +| AZ-271 | Config Precedence Tests | 2 | AZ-269, AZ-270 | AZ-246 | +| AZ-272 | FdrRecord Schema | 3 | AZ-263, AZ-266 | AZ-247 | +| AZ-273 | FdrClient Ring Buffer | 5 | AZ-263, AZ-272, AZ-269, AZ-266 | AZ-247 | +| AZ-274 | FDR Overrun Policy | 2 | AZ-272, AZ-273 | AZ-247 | +| AZ-275 | FakeFdrSink | 2 | AZ-272, AZ-273 | AZ-247 | +| AZ-276 | ImuPreintegrator Helper | 2 | AZ-263 | AZ-264 | +| AZ-277 | SE3Utils Helper | 2 | AZ-263 | AZ-264 | +| AZ-278 | LightGlueRuntime Helper | 3 | AZ-263 | AZ-264 | +| AZ-279 | WgsConverter Helper | 2 | AZ-263 | AZ-264 | +| AZ-280 | Sha256Sidecar Helper | 2 | AZ-263 | AZ-264 | +| AZ-281 | EngineFilenameSchema Helper | 2 | 
AZ-263 | AZ-264 | +| AZ-282 | RansacFilter Helper | 2 | AZ-263, AZ-277 | AZ-264 | +| AZ-283 | DescriptorNormaliser Helper | 2 | AZ-263 | AZ-264 | +| AZ-291 | C13 Writer Thread | 5 | AZ-263, AZ-272, AZ-273, AZ-266, AZ-269 | AZ-248 | +| AZ-292 | C13 Flight Header/Footer + Accounting | 3 | AZ-291, AZ-272, AZ-263, AZ-269, AZ-266 | AZ-248 | +| AZ-293 | C13 Capacity Cap Policy | 5 | AZ-291, AZ-292, AZ-272, AZ-263, AZ-269, AZ-266 | AZ-248 | +| AZ-294 | C13 Mid-Flight Tile Snapshot Path | 3 | AZ-291, AZ-272, AZ-263, AZ-269 | AZ-248 | +| AZ-295 | C13 AC-8.5 Forbidden-Kind + Thumbnail Rate Cap | 3 | AZ-291, AZ-272, AZ-263, AZ-269, AZ-266 | AZ-248 | +| AZ-296 | C13 Takeoff Abort on FdrOpenError | 2 | AZ-291, AZ-292, AZ-263, AZ-266 | AZ-248 | +| AZ-297 | C7 InferenceRuntime Protocol | 3 | AZ-263, AZ-269, AZ-266, AZ-280, AZ-281 | AZ-249 | +| AZ-298 | C7 TensorrtRuntime | 5 | AZ-297, AZ-301, AZ-280, AZ-281, AZ-263, AZ-269, AZ-266 | AZ-249 | +| AZ-299 | C7 OnnxTrtEpRuntime | 3 | AZ-297, AZ-301, AZ-280, AZ-281, AZ-263, AZ-269, AZ-266 | AZ-249 | +| AZ-300 | C7 PytorchFp16Runtime | 2 | AZ-297, AZ-263, AZ-269, AZ-266 | AZ-249 | +| AZ-301 | C7 EngineGate | 3 | AZ-297, AZ-280, AZ-281, AZ-266 | AZ-249 | +| AZ-302 | C7 ThermalState Publisher | 3 | AZ-297, AZ-263, AZ-269, AZ-266, AZ-273 | AZ-249 | +| AZ-303 | C6 Storage Interfaces | 3 | AZ-263, AZ-269, AZ-266, AZ-280 | AZ-250 | +| AZ-304 | C6 Postgres Schema | 2 | AZ-303, AZ-263, AZ-269, AZ-266 | AZ-250 | +| AZ-305 | C6 PostgresFilesystemStore | 5 | AZ-303, AZ-304, AZ-280, AZ-279, AZ-263, AZ-269, AZ-266, AZ-273 | AZ-250 | +| AZ-306 | C6 FaissDescriptorIndex | 5 | AZ-303, AZ-280, AZ-263, AZ-269, AZ-266 | AZ-250 | +| AZ-307 | C6 Freshness Gate | 2 | AZ-303, AZ-304, AZ-305, AZ-263, AZ-269, AZ-266, AZ-273 | AZ-250 | +| AZ-308 | C6 Cache Budget Eviction | 3 | AZ-303, AZ-305, AZ-263, AZ-269, AZ-266, AZ-273 | AZ-250 | +| AZ-316 | C11 TileDownloader | 5 | AZ-263, AZ-269, AZ-266, AZ-303, AZ-305, AZ-307, AZ-308 | AZ-251 | +| AZ-317 | C11 
Flight-State Gate | 2 | AZ-263, AZ-269, AZ-266 | AZ-251 | +| AZ-318 | C11 Per-Flight Signing Key | 3 | AZ-263, AZ-269, AZ-266, AZ-273 | AZ-251 | +| AZ-319 | C11 TileUploader | 5 | AZ-263, AZ-269, AZ-266, AZ-273, AZ-303, AZ-305, AZ-317, AZ-318 | AZ-251 | +| AZ-320 | C11 Idempotent Retry Decorator | 3 | AZ-263, AZ-269, AZ-266, AZ-273, AZ-303, AZ-319 | AZ-251 | +| AZ-321 | C10 Engine Compiler | 5 | AZ-263, AZ-269, AZ-266, AZ-280, AZ-281, AZ-298 | AZ-252 | +| AZ-322 | C10 Descriptor Batcher | 3 | AZ-263, AZ-269, AZ-266, AZ-303, AZ-306, AZ-321 | AZ-252 | +| AZ-323 | C10 Manifest Builder | 3 | AZ-263, AZ-269, AZ-266, AZ-280, AZ-281, AZ-303 | AZ-252 | +| AZ-324 | C10 ManifestVerifier | 3 | AZ-263, AZ-269, AZ-266, AZ-280, AZ-281 | AZ-252 | +| AZ-325 | C10 CacheProvisioner | 3 | AZ-263, AZ-269, AZ-266, AZ-303, AZ-321, AZ-322, AZ-323 | AZ-252 | +| AZ-326 | C12 CLI App | 3 | AZ-263, AZ-269, AZ-266 | AZ-253 | +| AZ-327 | C12 Companion Bringup | 3 | AZ-263, AZ-269, AZ-266 | AZ-253 | +| AZ-328 | C12 Build-Cache Orchestrator | 5 | AZ-326, AZ-327, AZ-316, AZ-325, AZ-263, AZ-269, AZ-266 | AZ-253 | +| AZ-329 | C12 Post-Landing Upload | 3 | AZ-326, AZ-319, AZ-272, AZ-263, AZ-269, AZ-266 | AZ-253 | +| AZ-330 | C12 OperatorReLocService | 3 | AZ-326, AZ-273, AZ-263, AZ-269, AZ-266 | AZ-253 | +| AZ-331 | C1 VioStrategy Protocol | 3 | AZ-263, AZ-269, AZ-266, AZ-270, AZ-272, AZ-276, AZ-277 | AZ-254 | +| AZ-332 | C1 OKVIS2 Strategy | 5 | AZ-331, AZ-263, AZ-269, AZ-266, AZ-276, AZ-277, AZ-272, AZ-273 | AZ-254 | +| AZ-333 | C1 VINS-Mono Strategy | 5 | AZ-331, AZ-263, AZ-269, AZ-266, AZ-276, AZ-277, AZ-272, AZ-273 | AZ-254 | +| AZ-334 | C1 KLT/RANSAC Strategy | 5 | AZ-331, AZ-263, AZ-269, AZ-266, AZ-276, AZ-277, AZ-282, AZ-272, AZ-273 | AZ-254 | +| AZ-335 | C1 Warm-Start + F8 Reboot Recovery | 3 | AZ-331, AZ-332, AZ-333, AZ-334, AZ-263, AZ-269, AZ-266, AZ-270, AZ-280, AZ-272 | AZ-254 | +| AZ-336 | C2 VprStrategy Protocol + Factory + Composition | 3 | AZ-263, AZ-269, AZ-270, AZ-303, AZ-297, 
AZ-266 | AZ-255 | +| AZ-337 | C2 UltraVPR Primary Backbone (TRT) | 5 | AZ-336, AZ-263, AZ-269, AZ-298, AZ-303, AZ-283, AZ-281, AZ-321, AZ-266, AZ-272 | AZ-255 | +| AZ-338 | C2 NetVLAD Mandatory Simple-Baseline | 3 | AZ-336, AZ-263, AZ-269, AZ-300, AZ-303, AZ-283, AZ-266, AZ-272 | AZ-255 | +| AZ-339 | C2 MegaLoc + MixVPR Secondary Backbones (Research-only) | 5 | AZ-336, AZ-263, AZ-269, AZ-298, AZ-303, AZ-283, AZ-281, AZ-321, AZ-266, AZ-272 | AZ-255 | +| AZ-340 | C2 SelaVPR + EigenPlaces + SALAD Secondary Backbones (Research-only) | 5 | AZ-336, AZ-263, AZ-269, AZ-298, AZ-303, AZ-283, AZ-281, AZ-321, AZ-266, AZ-272 | AZ-255 | +| AZ-341 | C2 FAISS HNSW Retrieve Wiring | 3 | AZ-336, AZ-263, AZ-269, AZ-303, AZ-305, AZ-306, AZ-266, AZ-272 | AZ-255 | +| AZ-342 | C2.5 ReRankStrategy Protocol + Factory + Composition | 2 | AZ-263, AZ-269, AZ-270, AZ-278, AZ-303, AZ-266 | AZ-256 | +| AZ-343 | C2.5 InlierCountReRanker (drop-and-continue) | 3 | AZ-342, AZ-263, AZ-269, AZ-278, AZ-303, AZ-266, AZ-272 | AZ-256 | +| AZ-344 | C3 CrossDomainMatcher Protocol + Factory + Composition | 3 | AZ-263, AZ-269, AZ-270, AZ-278, AZ-282, AZ-297, AZ-266 | AZ-257 | +| AZ-345 | C3 DISK+LightGlue Primary Matcher | 5 | AZ-344, AZ-263, AZ-269, AZ-278, AZ-282, AZ-298, AZ-299, AZ-303, AZ-281, AZ-321, AZ-266, AZ-272 | AZ-257 | +| AZ-346 | C3 ALIKED+LightGlue Secondary Matcher | 3 | AZ-344, AZ-263, AZ-269, AZ-278, AZ-282, AZ-298, AZ-299, AZ-303, AZ-281, AZ-321, AZ-266, AZ-272 | AZ-257 | +| AZ-347 | C3 XFeat Alternate Lightweight Matcher | 3 | AZ-344, AZ-263, AZ-269, AZ-282, AZ-298, AZ-299, AZ-303, AZ-281, AZ-321, AZ-266, AZ-272 | AZ-257 | +| AZ-348 | C3.5 ConditionalRefiner Protocol + Factory + PassthroughRefiner + Composition | 3 | AZ-263, AZ-269, AZ-270, AZ-282, AZ-297, AZ-344, AZ-266 | AZ-258 | +| AZ-349 | C3.5 AdHoPRefiner — production-default conditional refiner | 5 | AZ-348, AZ-263, AZ-269, AZ-282, AZ-298, AZ-299, AZ-281, AZ-321, AZ-266, AZ-272 | AZ-258 | +| AZ-355 | C4 PoseEstimator Protocol + 
Factory + DTOs + Composition | 3 | AZ-263, AZ-269, AZ-270, AZ-282, AZ-279, AZ-277, AZ-266 | AZ-259 | +| AZ-358 | C4 OpenCVGtsamPoseEstimator (steady-state path) | 5 | AZ-355, AZ-381, AZ-282, AZ-279, AZ-277, AZ-269, AZ-266, AZ-272, AZ-263 | AZ-259 | +| AZ-361 | C4 D-CROSS-LATENCY-1 hybrid — Jacobian + thermal-driven mode switch | 3 | AZ-358, AZ-355, AZ-302, AZ-277, AZ-279, AZ-269, AZ-266, AZ-272, AZ-263 | AZ-259 | +| AZ-381 | C5 StateEstimator Protocol + Factory + DTOs + Composition + concrete ISam2GraphHandle | 3 | AZ-263, AZ-269, AZ-270, AZ-276, AZ-277, AZ-279, AZ-273, AZ-355, AZ-266 | AZ-260 | +| AZ-382 | C5 GtsamIsam2StateEstimator skeleton — iSAM2 + IncrementalFixedLagSmoother wiring | 5 | AZ-381, AZ-263, AZ-269, AZ-266, AZ-272 | AZ-260 | +| AZ-383 | C5 GtsamIsam2StateEstimator — add_vio / add_pose_anchor / add_fc_imu factor add bodies | 5 | AZ-382, AZ-381, AZ-276, AZ-358, AZ-263, AZ-269, AZ-266, AZ-272 | AZ-260 | +| AZ-384 | C5 GtsamIsam2StateEstimator — Marginals + output methods | 3 | AZ-383, AZ-382, AZ-381, AZ-279, AZ-277, AZ-263, AZ-269, AZ-266, AZ-272 | AZ-260 | +| AZ-385 | C5 SourceLabelStateMachine + spoof-promotion gate | 5 | AZ-384, AZ-381, AZ-382, AZ-383, AZ-263, AZ-269, AZ-266, AZ-272, AZ-391, AZ-397 | AZ-260 | +| AZ-386 | C5 EskfStateEstimator — mandatory simple-baseline | 5 | AZ-381, AZ-276, AZ-277, AZ-279, AZ-263, AZ-269, AZ-266, AZ-272 | AZ-260 | +| AZ-387 | C5 smoothed past-keyframe → FDR path (AC-4.5 revised) | 3 | AZ-384, AZ-386, AZ-273, AZ-272, AZ-263, AZ-269, AZ-266 | AZ-260 | +| AZ-388 | C5 AC-5.2 fallback path — 3 s no-estimate detector + downstream signal | 3 | AZ-384, AZ-386, AZ-273, AZ-272, AZ-390, AZ-397, AZ-263, AZ-269, AZ-266 | AZ-260 | +| AZ-389 | C5 internal orthorectifier — produces mid-flight tile candidates for C6 | 3 | AZ-384, AZ-385, AZ-303, AZ-263, AZ-269, AZ-266, AZ-272 | AZ-260 | +| AZ-390 | C8 FcAdapter + GcsAdapter Protocols + DTOs + errors + composition factories | 3 | AZ-263, AZ-269, AZ-270, AZ-273, AZ-277, AZ-279, 
AZ-266 | AZ-261 | +| AZ-391 | C8 inbound subscription — IMU/attitude/GPS-health/MAV_STATE producer | 5 | AZ-390, AZ-263, AZ-269, AZ-266, AZ-272, AZ-273, AZ-276 | AZ-261 | +| AZ-392 | C8 CovarianceProjector — honest 6×6 → 2×2 → equivalent_radius helper | 3 | AZ-390, AZ-263, AZ-269, AZ-266, AZ-272 | AZ-261 | +| AZ-393 | C8 PymavlinkArdupilotAdapter outbound — GPS_INPUT 5 Hz + provenance side-channel | 5 | AZ-390, AZ-392, AZ-279, AZ-273, AZ-263, AZ-269, AZ-266, AZ-272 | AZ-261 | +| AZ-394 | C8 Msp2InavAdapter outbound — MSP2_SENSOR_GPS 5 Hz | 3 | AZ-390, AZ-392, AZ-279, AZ-273, AZ-263, AZ-269, AZ-266, AZ-272 | AZ-261 | +| AZ-395 | C8 AP MAVLink 2.0 per-flight signing — handshake + key rotation + zeroisation | 5 | AZ-393, AZ-390, AZ-273, AZ-272, AZ-263, AZ-269, AZ-266 | AZ-261 | +| AZ-396 | C8 AP D-C8-2 source-set switch — MAV_CMD_SET_EKF_SOURCE_SET + spoof-recovery wiring | 3 | AZ-393, AZ-390, AZ-385, AZ-273, AZ-272, AZ-263, AZ-269, AZ-266 | AZ-261 | +| AZ-397 | C8 QgcTelemetryAdapter — downsampled 1–2 Hz summary out + operator command in | 3 | AZ-390, AZ-392, AZ-279, AZ-273, AZ-263, AZ-269, AZ-266 | AZ-261 | +| AZ-398 | FrameSource Protocol + Clock Protocol + LiveCameraFrameSource retrofit + VideoFileFrameSource| 3 | AZ-263, AZ-269, AZ-270, AZ-266, AZ-272 | AZ-265 | +| AZ-399 | TlogReplayFcAdapter — replay-only FcAdapter parsing pymavlink .tlog | 5 | AZ-398, AZ-390, AZ-391, AZ-279, AZ-273, AZ-263, AZ-269, AZ-266, AZ-272 | AZ-265 | +| AZ-400 | ReplaySink Protocol + JsonlReplaySink impl | 3 | AZ-263, AZ-269, AZ-270, AZ-381, AZ-266, AZ-272 | AZ-265 | +| AZ-401 | compose_replay(config) -> ReplayRoot + Clock injection across C1–C5 | 3 | AZ-398, AZ-399, AZ-400, AZ-269, AZ-270, AZ-263, AZ-266, AZ-272, AZ-390 | AZ-265 | +| AZ-402 | gps-denied-replay CLI entrypoint + argparse + camera-calibration loader | 3 | AZ-401, AZ-269, AZ-270, AZ-263, AZ-266, AZ-272, AZ-273 | AZ-265 | +| AZ-403 | gps-denied-replay-cli Dockerfile + GitHub Actions matrix entry + SBOM diff | 3 | AZ-402, 
AZ-398, AZ-399, AZ-400, AZ-401, AZ-263, AZ-269, AZ-266 | AZ-265 | +| AZ-404 | E2E replay fixture test — Derkachi 1–2 min clip + tlog | 5 | AZ-402, AZ-403, AZ-401, AZ-263, AZ-269, AZ-266, AZ-272, AZ-273 | AZ-265 | +| AZ-405 | Auto-sync of video ↔ tlog via IMU take-off detection | 5 | AZ-402, AZ-399, AZ-398, AZ-263, AZ-269, AZ-266, AZ-272 | AZ-265 | +| AZ-406 | Blackbox Test Infrastructure Bootstrap (Tier-1 + Tier-2 harness scaffold) | 5 | AZ-263 | AZ-262 | +| AZ-407 | Static fixture builders — tile-cache, age-injector, cold-boot, MAVLink passkey, CVE JPEG | 3 | AZ-406 | AZ-262 | +| AZ-408 | Runtime synthetic-injection fixture builders — outlier, blackout-spoof, multi-segment | 3 | AZ-406, AZ-407 | AZ-262 | +| AZ-409 | FT-P-01 — Still-image set-60 frame-center accuracy | 3 | AZ-406, AZ-407 | AZ-262 | +| AZ-410 | FT-P-02 — Cumulative drift between satellite anchors on Derkachi | 3 | AZ-406, AZ-407 | AZ-262 | +| AZ-411 | FT-P-03 + FT-P-14 — Estimate output schema + WGS84 coordinate validation | 2 | AZ-406, AZ-407 | AZ-262 | +| AZ-412 | FT-P-04 — Frame-to-frame registration ≥95% on normal Derkachi segments | 3 | AZ-406, AZ-407 | AZ-262 | +| AZ-413 | FT-P-05 + FT-P-06 — Cross-domain matcher MRE budgets | 3 | AZ-406, AZ-407, AZ-412 | AZ-262 | +| AZ-414 | FT-P-07 + FT-N-02 — Sharp-turn recovery via satellite reference | 3 | AZ-406, AZ-407 | AZ-262 | +| AZ-415 | FT-P-08 — ≥3 disconnected segments via satellite-reference re-localization | 3 | AZ-406, AZ-407, AZ-408 | AZ-262 | +| AZ-416 | FT-P-09-AP — ArduPilot Plane GPS_INPUT contract + MAVLink 2.0 signing handshake | 5 | AZ-406, AZ-407 | AZ-262 | +| AZ-417 | FT-P-09-iNav — iNav MSP2_SENSOR_GPS contract conformance | 3 | AZ-406, AZ-407 | AZ-262 | +| AZ-418 | FT-P-10 — GTSAM smoothing-loop look-back accuracy | 3 | AZ-406, AZ-407 | AZ-262 | +| AZ-419 | FT-P-11 — Cold-start initialization from FC EKF | 3 | AZ-406, AZ-407 | AZ-262 | +| AZ-420 | FT-P-12 + FT-P-13 — GCS downsample + GCS-originated re-loc command | 3 | AZ-406, 
AZ-407 | AZ-262 | +| AZ-421 | FT-P-15 + FT-P-16 + FT-P-18 — Tile cache + offline + no-raw-retention | 3 | AZ-406, AZ-407 | AZ-262 | +| AZ-422 | FT-P-17 + FT-N-06 — Mid-flight tile generation + freshness | 3 | AZ-406, AZ-407 | AZ-262 | +| AZ-423 | FT-P-19 — Satellite-relocalization scale-ratio + scene-change PARTIAL | 3 | AZ-406, AZ-407 | AZ-262 | +| AZ-424 | FT-N-01 — 350 m outlier injection tolerance | 3 | AZ-406, AZ-407, AZ-408 | AZ-262 | +| AZ-425 | FT-N-03 — Extended outage triggers OPERATOR_RELOC_REQUEST | 3 | AZ-406, AZ-407, AZ-408 | AZ-262 | +| AZ-426 | FT-N-04 — Visual blackout + spoofed GPS combined failsafe | 5 | AZ-406, AZ-407, AZ-408 | AZ-262 | +| AZ-427 | FT-N-05 — Stale-tile rejection on freshness violation | 2 | AZ-406, AZ-407 | AZ-262 | +| AZ-428 | NFT-PERF-01 — End-to-end latency p95 ≤ 400 ms on Tier-2 | 5 | AZ-406, AZ-407, AZ-444 (forward) | AZ-262 | +| AZ-429 | NFT-PERF-02 — Frame-by-frame streaming, no batching | 2 | AZ-406, AZ-407 | AZ-262 | +| AZ-430 | NFT-PERF-03 — Cold-start TTFF ≤ 30 s on Tier-2 | 5 | AZ-406, AZ-407, AZ-444 (forward) | AZ-262 | +| AZ-431 | NFT-PERF-04 — Spoofing-promotion latency p95 ≤ 600 ms | 3 | AZ-406, AZ-407, AZ-408 | AZ-262 | +| AZ-432 | NFT-RES-01 — IMU-only fallback drift bound | 3 | AZ-406, AZ-407, AZ-408 | AZ-262 | +| AZ-433 | NFT-RES-02 — Companion mid-flight reboot recovery | 3 | AZ-406, AZ-407 | AZ-262 | +| AZ-434 | NFT-RES-03 — 100-iteration Monte Carlo statistical envelope | 5 | AZ-406, AZ-407, AZ-408 | AZ-262 | +| AZ-435 | NFT-RES-04 — 35 s blackout-with-spoof full escalation ladder | 3 | AZ-406, AZ-407, AZ-408, AZ-426 | AZ-262 | +| AZ-436 | NFT-SEC-01 — Cache-poisoning safety probability ≤ 1e-6/flight | 5 | AZ-406, AZ-407 | AZ-262 | +| AZ-437 | NFT-SEC-02 + NFT-SEC-05 — No-egress + DNS-blackhole defense-in-depth | 3 | AZ-406, AZ-407 | AZ-262 | +| AZ-438 | NFT-SEC-03 — AP rejects unsigned/wrong-key/replayed messages | 3 | AZ-406, AZ-407 | AZ-262 | +| AZ-439 | NFT-SEC-04 — OpenCV CVE-2025-53644 + 
AddressSanitizer fuzz | 5 | AZ-406, AZ-407, AZ-444 (forward, optional) | AZ-262 | +| AZ-440 | NFT-LIM-01 — Jetson memory budget | 3 | AZ-406, AZ-407, AZ-444 (forward) | AZ-262 | +| AZ-441 | NFT-LIM-02 — 8h-extrapolated FDR size ≤ 50 GB | 2 | AZ-406, AZ-407 | AZ-262 | +| AZ-442 | NFT-LIM-03 + NFT-LIM-05 — Aggregate storage + thumbnail-log budget | 2 | AZ-406, AZ-407 | AZ-262 | +| AZ-443 | NFT-LIM-04 — Jetson thermal envelope @ workstation ambient (AC-NEW-5 PARTIAL) | 2 | AZ-406, AZ-407, AZ-444 (forward) | AZ-262 | +| AZ-444 | Tier-2 Jetson harness wrapper — run-tier2.sh, ssh provisioning, systemd, ASan-fuzz | 5 | AZ-406 | AZ-262 | +| AZ-445 | CSV reporter + evidence bundler — per-NFR machine-readable outputs + traceability-status.json | 2 | AZ-406 | AZ-262 | +| AZ-446 | CSV reporter refinements — trend-line + acceptance-band annotations + Monte Carlo CI | 2 | AZ-406, AZ-445 | AZ-262 | + +## Notes + +- **Forward dependency on AZ-272** in AZ-267: the FDR Log Bridge task + declares a forward dependency on the FdrRecord Schema (AZ-272). Both + tasks ship together; the bridge ships against the published schema + contract once AZ-272 is stable. It is one of the 13 forward edges + documented under **Cycle Check**. +- **C4 ↔ C5 co-development (ADR-003)**: AZ-358 depends on AZ-381's + concrete `ISam2GraphHandle`; AZ-383 depends on AZ-358's + `PoseEstimate.covariance_mode`. Both tasks ship in lockstep — the + shared `ISam2GraphHandle` Protocol stub is owned by AZ-355, the + concrete impl by AZ-381. +- **C5 ↔ C8 co-development**: AZ-385 depends on C8 `GpsHealth` + (AZ-391) and `QgcTelemetryAdapter` (AZ-397); AZ-388 depends on + AZ-390 / AZ-397; AZ-396 depends on AZ-385. Each side ships against + the AZ-390 Protocol contract until the consumer task lands. 
+- **AZ-401 (compose_replay)** intentionally depends on the C1–C5 epic + IDs (AZ-254 … AZ-260) at the documentation level — concrete strategy + task IDs flow in through each component's composition factory, not + through this composition root directly. +- **E-BBT (AZ-262) forward dependencies on AZ-444 (Tier-2 harness)**: + AZ-428, AZ-430, AZ-440, AZ-443 declare hard forward deps on AZ-444; + AZ-439 declares an optional forward dep on AZ-444 (Tier-2 ASan-fuzz + variant). These tasks ship in a tight loop with AZ-444 — the harness + must exist before any Tier-2 NFT scenario can run. Tier-1 sub-cases + of NFT-PERF-02, NFT-RES-*, NFT-SEC-*, NFT-LIM-02, NFT-LIM-03 do not + require AZ-444 and remain independently runnable. +- **E-BBT scenario chains within AZ-262**: + - AZ-413 (FT-P-05+06) depends on AZ-412 (FT-P-04) — FT-P-06 is a + piggyback assertion over FT-P-04 + FT-P-05 evidence. + - AZ-435 (NFT-RES-04) depends on AZ-426 (FT-N-04) — both consume + `blackout_spoof.py`; NFT-RES-04 is the focused 35 s escalation + scenario while FT-N-04 covers the 5 s / 15 s / 35 s ladder. + - AZ-446 depends on AZ-445 — refinements layer over the bundler. +- **All E-BBT tasks depend on AZ-406 (test infrastructure)**; this is + by design — AZ-406 is the foundation every blackbox test depends on + (analogous to AZ-263 for the product side). 
+ +## Coverage Verification (Implementation Mode) + +- **Every product interface in `architecture.md` has implementation task coverage.** + - C1 `VioStrategy` → AZ-331 (Protocol) + AZ-332/333/334 (concrete) + - C2 `VprStrategy` → AZ-336 (Protocol) + AZ-337/338/339/340 (concrete) + - C2.5 `ReRankStrategy` → AZ-342 (Protocol) + AZ-343 (concrete) + - C3 `CrossDomainMatcher` → AZ-344 (Protocol) + AZ-345/346/347 (concrete) + - C3.5 `ConditionalRefiner` → AZ-348 (Protocol + Passthrough) + AZ-349 (AdHoP) + - C4 `PoseEstimator` → AZ-355 (Protocol) + AZ-358/361 (concrete) + - C5 `StateEstimator` → AZ-381 (Protocol) + AZ-382..AZ-389 (concrete) + - C6 `TileStore` / `DescriptorIndex` → AZ-303 (Interfaces) + AZ-304/305/306/307/308 + - C7 `InferenceRuntime` → AZ-297 (Protocol) + AZ-298/299/300/301/302 + - C8 `FcAdapter` / `GcsAdapter` → AZ-390 (Protocols) + AZ-391..AZ-397 + - C10 Provisioning → AZ-321/322/323/324/325 + - C11 Tile Manager → AZ-316/317/318/319/320 + - C12 Operator Tooling → AZ-326/327/328/329/330 + - C13 FDR Writer → AZ-291..AZ-296 + +- **Cross-cutting product modules**: + - Logging → AZ-266/267/268 + - Config + Composition Root → AZ-269/270/271 + - FDR Client → AZ-272..AZ-275 + - Shared helpers (IMU preintegrator, SE3, LightGlue runtime, WGS, + SHA-256 sidecar, engine filename schema, RANSAC, descriptor + normaliser) → AZ-276..AZ-283 + - Frame source + Clock → AZ-398 + - Replay sink → AZ-400 + - Replay composition + CLI + auto-sync → AZ-401/402/405 + +- **No unresolved `AZ-?` placeholders** in any task file (verified by grep on Step 4 close-out). + +- **E-BBT (AZ-262 / blackbox tests) coverage** vs `traceability-matrix.md`: + - **All 35 Covered ACs** map to ≥1 scenario task (AZ-409..AZ-443). + - **All 3 PARTIAL ACs** carry the PARTIAL annotation in their pass + criteria: AC-8.6 → AZ-423; AC-NEW-5 → AZ-443. + - **All 3 NOT COVERED ACs** (AC-7.1, AC-7.2, RESTRICT-CAM-2) are + handled by the conftest skip-rule embedded in AZ-406, not by + a dedicated task. 
+ - **Fixture coverage**: 5 static fixtures (AZ-407) + 3 synthetic + injectors (AZ-408) + cold-boot snapshot (AZ-407) cover every + scenario's data needs. + - **Tier-2-only scenarios**: AZ-428, AZ-430, AZ-440, AZ-443 (and + optionally AZ-439's ASan-fuzz mode) all SKIP cleanly on Tier-1 + via the conftest tier-guard. + - **Reporting**: AZ-445 + AZ-446 produce per-NFR JSONs, + `traceability-status.json`, and `regression-baseline.json` for + every scenario. + +## Cycle Check + +A static dependency-graph traversal (Kahn topological sort) visits all +140 nodes — no cycles. The 13 forward edges (dep ID > task ID) are all +declared, bounded, and documented: + +- **AZ-267 → AZ-272** (FDR Log Bridge → FdrRecord Schema; shipped in + lockstep). +- **AZ-298 → AZ-301**, **AZ-299 → AZ-301** (TensorRT / ONNX-RT runtimes + → engine gate; runtime ships against the gate's published contract). +- **AZ-358 → AZ-381** (C4 OpenCV/GTSAM marginals → C5 SAM2 graph + handle; ADR-003 co-development against the AZ-355 Protocol stub). +- **AZ-385 → AZ-391, AZ-397**, **AZ-388 → AZ-390, AZ-397** (C5 ↔ C8 + co-development; each side ships against the AZ-390 Protocol contract + until the consumer task lands). +- **AZ-428, AZ-430, AZ-440, AZ-443 → AZ-444** (Tier-2 NFT scenarios + → Tier-2 harness wrapper; AZ-439 carries the same forward dep + optionally for the ASan-fuzz mode). AZ-444 is therefore scheduled + as the first Tier-2 E-BBT deliverable; the dependent scenarios land + on top of it. + +The graph is therefore a strict DAG once these documented forward +edges are accounted for, and remains sortable by tracker ID modulo +those edges. 
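
The cycle check described above can be sketched as follows. This is a minimal illustration, assuming the dependency table has already been parsed into a mapping from numeric tracker ID to the set of IDs it depends on. `kahn_order`, `forward_edges`, and the `SAMPLE` edges are hypothetical names and an illustrative subset, not the project's actual tooling or full table.

```python
from collections import deque


def kahn_order(deps):
    """Return a topological order of all tasks, or raise if a cycle exists.

    deps maps a task ID to the set of task IDs it depends on.
    """
    nodes = set(deps) | {d for ds in deps.values() for d in ds}
    indegree = {n: 0 for n in nodes}
    dependents = {n: [] for n in nodes}
    for task, ds in deps.items():
        for d in ds:
            indegree[task] += 1
            dependents[d].append(task)

    # Start from tasks with no unmet dependencies.
    ready = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for task in sorted(dependents[node]):
            indegree[task] -= 1
            if indegree[task] == 0:
                ready.append(task)

    # Any node never released sits on (or behind) a cycle.
    if len(order) != len(nodes):
        stuck = sorted(n for n in nodes if indegree[n] > 0)
        raise ValueError(f"dependency cycle among tasks: {stuck}")
    return order


def forward_edges(deps):
    """List (task, dependency) pairs where the dependency ID exceeds the task ID."""
    return sorted((task, d) for task, ds in deps.items() for d in ds if d > task)


# Illustrative subset only; the real table has 140 nodes.
SAMPLE = {266: {263}, 267: {266, 272}, 272: {263}, 298: {297, 301}, 301: {297}}
order = kahn_order(SAMPLE)
print(order)
print(forward_edges(SAMPLE))
```

A forward edge does not create a cycle on its own; it only means the graph cannot be scheduled in plain tracker-ID order, which is why the edges are listed separately from the cycle result.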
diff --git a/_docs/02_tasks/todo/AZ-266_log_module.md b/_docs/02_tasks/todo/AZ-266_log_module.md new file mode 100644 index 0000000..66d8a08 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-266_log_module.md @@ -0,0 +1,106 @@ +# Shared Structured Logging Module + +**Task**: AZ-266_log_module +**Name**: Shared Logging Module +**Description**: Provide the `get_logger(component_id)` entrypoint, a stable JSON formatter that emits records matching the log_record_schema contract, and the stdout / journald handlers used by Tier-1 and Tier-2 deployments. +**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure +**Component**: shared.logging (cross-cutting; epic AZ-245 / E-CC-LOG) +**Tracker**: AZ-266 +**Epic**: AZ-245 (E-CC-LOG) + +## Problem + +Every onboard component must emit structured JSON logs at DEBUG / INFO / WARN / ERROR with a stable, machine-parseable shape so post-flight analysis (FDR tooling, blackbox scenario checks, traceability matrix verification) can correlate events across components. Without one shared logger, format drift is guaranteed within a few weeks of parallel component development. + +## Outcome + +- A single `get_logger(component_id)` call is the only logging entrypoint any onboard module ever uses. +- Every emitted record is a single-line JSON object whose key set, key order, and value types match the `log_record_schema` contract version 1.0.0. +- Tier-1 deployments capture logs via Docker stdout; Tier-2 deployments capture logs via journald — switched by config, not by code. + +## Scope + +### Included + +- `get_logger(component_id: str) -> Logger` factory backed by Python stdlib `logging`. +- A JSON formatter that emits the schema's 8 fields in the contract-mandated order, regardless of construction order. Implementation may use `python-json-logger` or `orjson`-backed formatter — whichever is already pinned in the project's lockfile from AZ-263. +- A stdout handler for Tier-1 (Docker) and a journald handler for Tier-2 (Jetson). 
Selection is config-driven via the structured-logging entry of the cross-cutting config epic (AZ-246 / E-CC-CONF). +- Per-frame structured-logging helpers for the documented per-component shapes referenced in epic AZ-245 (`vio.tick`, `vpr.query`, etc.) so component code can emit one-liner logs without rebuilding the kv dict. +- Public interface contract published at `_docs/02_document/contracts/shared_logging/log_record_schema.md`. + +### Excluded + +- The FDR bridge that forwards ERROR + WARN records into the Flight Data Recorder — owned by the next task (`AZ-267_fdr_log_bridge`, parented to the same epic). +- Per-component log call sites (each component epic owns its own logging call sites). +- Log schema versioning beyond 1.0.0 — handled by future change-log entries on the contract file. + +## Acceptance Criteria + +**AC-1: Single logger entrypoint** +Given any onboard Python module that imports the shared logging package +When the module calls `get_logger("c2_vpr")` +Then it receives a `Logger` whose every record passes the schema contract test (no other logger configuration is required by the caller) + +**AC-2: Field order is stable** +Given a logger configured with the JSON formatter +When a component calls `logger.info(msg, extra={"frame_id": 42, "kind": "vpr.query", "kv": {...}})` +Then the emitted bytes parse as a single-line JSON object whose keys appear in the order `ts, level, component, frame_id, kind, msg, kv, exc`, regardless of the order the caller passed the fields + +**AC-3: Level normalisation** +Given a logger receiving a record at level `WARNING` (Python stdlib name) +When the formatter emits the JSON record +Then the `level` field reads `WARN` (per contract), not `WARNING` + +**AC-4: Handler topology selection** +Given the structured-logging config block selects `tier=1` (or `tier=2`) +When `runtime_root.py` initialises logging +Then exactly one stdout handler (or journald handler) is attached, with no duplicate handlers and no handler from the
wrong tier + +**AC-5: Non-frame records emit a null frame_id** +Given a startup or shutdown log call that does not pass a `frame_id` +When the record is emitted +Then `frame_id` appears as JSON `null` (never as a synthesised value, never absent from the key list) + +## Non-Functional Requirements + +**Performance** +- Per-record formatter latency p99 ≤ 0.2 ms on Tier-2 (Jetson Orin Nano Super) for a record with `len(kv) ≤ 8` scalar entries. Validated by a microbenchmark in unit tests. +- DEBUG records on the steady-state hot path allocate at most one new string (the formatted JSON line); no transient dict copies of `kv` are permitted. + +**Reliability** +- Formatter never raises into the caller. A serialisation failure logs an internal `WARN` with `kind="log.format_error"` and drops the offending record's `kv` payload (replaces it with `{"_format_error": "..."}`); the rest of the record is still emitted. +- No global mutable state outside the standard `logging` module's own logger registry; multiple `get_logger("c2_vpr")` calls return the same cached `Logger` instance. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | `get_logger("c2_vpr")` returns a Logger with the JSON formatter attached | Logger instance present; formatter produces valid contract record | +| AC-2 | Emit a record with kwargs in shuffled order | Parsed JSON keys appear in the contract's mandated order | +| AC-3 | Log at `logging.WARNING` level | Emitted JSON `level` field equals `"WARN"` | +| AC-4 | Initialise logging twice with the same tier-1 config | Exactly one stdout handler attached; no duplicates | +| AC-5 | Log a startup INFO without `frame_id` | Emitted JSON contains `"frame_id": null` | +| NFR-perf | Microbenchmark formatter on a record with 8 scalar kv entries | p99 ≤ 0.2 ms over 10k iterations | +| NFR-reliability | Pass a non-JSON-serialisable object in `kv` (e.g.
a class instance) | Formatter emits the record with `kv={"_format_error": "..."}`; caller does not see an exception | + +## Constraints + +- Public interface frozen by `_docs/02_document/contracts/shared_logging/log_record_schema.md` v1.0.0 — any change requires a contract version bump. +- Stdlib `logging` is the only allowed underlying logging mechanism (per epic AZ-245 architecture note: "no third-party log aggregator"). +- No new dependency beyond what AZ-263 / E-BOOT already pinned in `pyproject.toml`. + +## Risks & Mitigation + +**Risk 1: Formatter performance regression** +- *Risk*: Naïve `json.dumps` on each record exceeds the 0.2 ms p99 budget on Jetson. +- *Mitigation*: Bench against `orjson`-backed formatter as a fallback if stdlib `json` misses budget; choice is reversible because the contract is the public surface, not the formatter implementation. + +**Risk 2: Handler duplication on hot-reload** +- *Risk*: Re-initialising logging during integration tests stacks duplicate handlers, multiplying every emitted record. +- *Mitigation*: `get_logger` checks for existing handlers on the named logger before adding new ones; integration test fixture asserts handler count after teardown. + +## Contract + +This task produces the contract at `_docs/02_document/contracts/shared_logging/log_record_schema.md`. +Consumers MUST read that file — not this task spec — to discover the interface. 
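
The acceptance criteria of this task (fixed key order, `WARNING` to `WARN` normalisation, null `frame_id`, no exception on a bad `kv`, no duplicate handlers) could be met by a formatter along these lines. This is a sketch under assumptions, not the contract: the `ts` format, the use of the logger name as `component`, the content of the `_format_error` value, and the class name `ContractJsonFormatter` are all hypothetical, and it uses stdlib `json` without regard to the latency budget.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Key order taken from AC-2 of this task.
FIELD_ORDER = ("ts", "level", "component", "frame_id", "kind", "msg", "kv", "exc")


class ContractJsonFormatter(logging.Formatter):
    """Emit one JSON object per record with keys in FIELD_ORDER (sketch)."""

    def format(self, record):
        kv = getattr(record, "kv", None) or {}
        try:
            json.dumps(kv)  # probe serialisability so the caller never sees an error
        except (TypeError, ValueError) as err:
            kv = {"_format_error": type(err).__name__}  # assumed payload shape
        fields = {
            # ISO-8601 UTC timestamp is an assumption; the contract file decides.
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": "WARN" if record.levelname == "WARNING" else record.levelname,
            "component": record.name,
            "frame_id": getattr(record, "frame_id", None),
            "kind": getattr(record, "kind", None),
            "msg": record.getMessage(),
            "kv": kv,
            "exc": self.formatException(record.exc_info) if record.exc_info else None,
        }
        # Rebuild the dict in contract order regardless of how it was assembled.
        return json.dumps({key: fields[key] for key in FIELD_ORDER})


def get_logger(component_id):
    """Return the cached stdlib logger, attaching the JSON handler only once."""
    logger = logging.getLogger(component_id)
    if not any(isinstance(h.formatter, ContractJsonFormatter) for h in logger.handlers):
        handler = logging.StreamHandler(sys.stdout)  # Tier-1 style stdout handler
        handler.setFormatter(ContractJsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.DEBUG)
        logger.propagate = False
    return logger
```

Because `logging.getLogger` already caches by name, repeated `get_logger("c2_vpr")` calls return the same instance; the handler check is what prevents the duplication described in Risk 2.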
diff --git a/_docs/02_tasks/todo/AZ-267_fdr_log_bridge.md b/_docs/02_tasks/todo/AZ-267_fdr_log_bridge.md new file mode 100644 index 0000000..f97c7b7 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-267_fdr_log_bridge.md @@ -0,0 +1,100 @@ +# FDR Log Bridge (ERROR + WARN forwarding) + +**Task**: AZ-267_fdr_log_bridge +**Name**: FDR Log Bridge +**Description**: Subscribe a logging Handler to the shared logger that forwards every ERROR and WARN record into the Flight Data Recorder via the FDR producer client, tagged `kind="log"` so post-flight tooling can correlate log events with the rest of the recorded telemetry. +**Complexity**: 2 points +**Dependencies**: AZ-266_log_module, AZ-247 (forward — FDR producer + record schema not yet decomposed; this task's contract surface is satisfied once AZ-247's record schema contract is published) +**Component**: shared.logging (cross-cutting; epic AZ-245 / E-CC-LOG) +**Tracker**: AZ-267 +**Epic**: AZ-245 (E-CC-LOG) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — log envelope this bridge consumes (produced by AZ-266). +- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — FDR record schema this bridge writes into (produced by AZ-247; document does not yet exist — Step 4 cross-verification will catch the forward reference). + +## Problem + +The acceptance criterion "ERROR + WARN records appear in FDR with `kind = \"log\"` and a back-reference to the originating component" requires a bridge between the shared Python `logging` machinery and the FDR producer client. Without this bridge, post-flight tools cannot correlate a `c5_state` ERROR log with the surrounding telemetry frames captured at the same flight time. + +## Outcome + +- Every emitted log record at level WARN or ERROR is enqueued into the FDR producer queue with `kind="log"` and the originating component slug preserved. 
+- INFO and DEBUG records are NEVER enqueued into FDR (verified by the contract test in PBI #3 of this epic). +- The bridge never blocks the calling thread — it uses the FDR producer client's drop-oldest semantics so a saturated queue cannot stall a `logger.error(...)` call on the hot path. + +## Scope + +### Included + +- A logging Handler subclass installed onto the root onboard logger (or each `get_logger(...)` instance, whichever the AZ-266 implementation chose) that subscribes to records at WARN and ERROR. +- Translation logic from `LogRecord` (per `log_record_schema` v1.0.0) into the FDR record envelope expected by the FDR producer client, with `kind="log"` and a `component` back-reference. +- Wire-up in the composition root (consumed from AZ-246 / E-CC-CONF) so the bridge is attached exactly once, after the logger and the FDR client are both initialised. + +### Excluded + +- The FDR producer client itself — owned by AZ-247 / E-CC-FDR-CLIENT. +- The on-disk FDR segment writer thread — owned by AZ-248 / E-C13. +- The contract test that verifies "DEBUG + INFO never reach FDR" — owned by PBI #3 of this epic (next task). +- Per-component log call sites — owned by each component epic. 
+ +## Acceptance Criteria + +**AC-1: WARN records reach FDR** +Given the bridge is installed and the FDR client's queue is below capacity +When any component emits `logger.warning(...)` via the shared logger +Then a single FDR record with `kind="log"`, `level="WARN"`, and `component` set to the originating component slug is enqueued + +**AC-2: ERROR records reach FDR with traceback when applicable** +Given the bridge is installed +When a component emits `logger.exception(...)` from inside an `except` clause +Then the enqueued FDR record's `exc` field carries the formatted traceback string from the `LogRecord` + +**AC-3: INFO and DEBUG never reach FDR** +Given the bridge is installed +When any component emits `logger.info(...)` or `logger.debug(...)` +Then no FDR record is enqueued for that log call (verified by both unit tests here and the contract test in the next task) + +**AC-4: Backpressure is non-blocking** +Given the FDR producer queue is at its drop-oldest threshold +When a component emits `logger.error(...)` on the hot path +Then the call returns within the same latency budget as a stdout-only WARN call (no blocking on the queue), and the FDR client's existing drop counter is incremented + +**AC-5: Single attachment** +Given `compose_root(config)` runs at process start +When the bridge wire-up is invoked +Then exactly one bridge Handler is attached to the logger; reinitialising the composition root in tests does not stack duplicates + +## Non-Functional Requirements + +**Performance** +- The bridge adds ≤ 0.05 ms p99 latency on top of the formatter's 0.2 ms budget (i.e. logger.error → bridge enqueue total p99 ≤ 0.25 ms on Tier-2). + +**Reliability** +- A failure to enqueue (queue full + drop-oldest already saturated) MUST NOT raise into the caller; it MUST log a single internal `WARN` record (via stdout only — recursion into the bridge is short-circuited by a thread-local flag) once every N occurrences, where N is at least 1000.
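
A bridge with the properties listed above might look like the following sketch. It assumes the FDR producer client exposes an `enqueue(record)` method that accepts a plain dict; that signature, the dict keys beyond `kind`, `level`, `component` and `exc`, and the names `FdrLogBridge`, `attach_bridge` and `enqueue_failures` are hypothetical until the FDR record schema contract is published. The rate-limited internal WARN is left out for brevity.

```python
import logging
import threading


class FdrLogBridge(logging.Handler):
    """Forward WARN and ERROR records to an FDR client (sketch)."""

    def __init__(self, fdr_client):
        # Handler-level filter: INFO and DEBUG never reach emit().
        super().__init__(level=logging.WARNING)
        self._fdr = fdr_client
        self._guard = threading.local()
        self.enqueue_failures = 0

    def emit(self, record):
        # Thread-local flag stops logging calls made inside the bridge
        # from re-entering it.
        if getattr(self._guard, "active", False):
            return
        self._guard.active = True
        try:
            exc = None
            if record.exc_info:
                exc = logging.Formatter().formatException(record.exc_info)
            self._fdr.enqueue({
                "kind": "log",
                "level": "WARN" if record.levelno < logging.ERROR else "ERROR",
                "component": record.name,
                "msg": record.getMessage(),
                "exc": exc,
            })
        except Exception:
            # Never raise into the caller; just count the failure.
            self.enqueue_failures += 1
        finally:
            self._guard.active = False


def attach_bridge(logger, fdr_client):
    """Attach the bridge once; later calls return the existing handler."""
    for handler in logger.handlers:
        if isinstance(handler, FdrLogBridge):
            return handler
    bridge = FdrLogBridge(fdr_client)
    logger.addHandler(bridge)
    return bridge
```

Filtering by handler level rather than inside `emit` keeps the INFO/DEBUG path as cheap as possible, since the stdlib skips the handler before any translation work is done.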
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Emit a WARN through the shared logger with the bridge installed | Stub FDR queue receives one record with `kind="log"`, `level="WARN"`, `component` matching origin | +| AC-2 | Inside an `except` block, call `logger.exception("boom")` | Stub FDR queue's record carries non-empty `exc` traceback string | +| AC-3 | Emit INFO and DEBUG records | Stub FDR queue receives zero records | +| AC-4 | Pre-fill stub FDR queue to drop-oldest threshold, then emit an ERROR | Caller returns under 0.5 ms wall clock; FDR client's drop counter increments | +| AC-5 | Call `compose_root` twice with the same config in a single process | Logger has exactly one bridge Handler attached after the second call | + +## Constraints + +- The bridge has a forward dependency on AZ-247 (FDR producer client + record schema). It cannot pass its own AC tests until AZ-247 is implemented; Step 4 cross-verification will record this temporal dependency in `_dependencies_table.md`. +- The bridge's record translation MUST consume only the public surface of `log_record_schema` v1.0.0 — no peeking into formatter internals. + +## Risks & Mitigation + +**Risk 1: Recursion via internal `WARN` on enqueue failure** +- *Risk*: The "queue full" internal WARN itself goes through the bridge, recurses, and corrupts the queue further. +- *Mitigation*: Thread-local "in-bridge" flag short-circuits any logging call originating from the bridge itself; verified by a unit test that fills the queue and asserts no infinite loop. + +**Risk 2: Forward dependency on AZ-247 contract not yet written** +- *Risk*: The FDR record schema is described in epic AZ-247's text but not yet a contract file; this task's expectations may drift from AZ-247's eventual contract. 
+- *Mitigation*: AZ-247's first PBI MUST publish `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` before AZ-247's other PBIs; this task's implementation begins only after that contract exists. Step 4 cross-verification flags the temporal dependency. diff --git a/_docs/02_tasks/todo/AZ-268_log_schema_contract_test.md b/_docs/02_tasks/todo/AZ-268_log_schema_contract_test.md new file mode 100644 index 0000000..ceef4e2 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-268_log_schema_contract_test.md @@ -0,0 +1,68 @@ +# Log Schema Contract Test + +**Task**: AZ-268_log_schema_contract_test +**Name**: Log Schema Contract Test +**Description**: A standalone test module that verifies every shared logger emission conforms to `log_record_schema` v1.0.0 — field names, field ordering, required keys, and the "INFO + DEBUG never reach FDR" invariant. +**Complexity**: 2 points +**Dependencies**: AZ-266_log_module, AZ-267_fdr_log_bridge +**Component**: shared.logging (cross-cutting; epic AZ-245 / E-CC-LOG) +**Tracker**: AZ-268 +**Epic**: AZ-245 (E-CC-LOG) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — the contract this test verifies. + +## Problem + +The shared logging contract (v1.0.0) declares a strict 8-field set with mandated ordering. Without an automated test that parses raw emitted bytes and asserts the contract, formatter changes can silently drift the schema and break post-flight FDR analysis tools that depend on stable column ordering. + +## Outcome + +- A single test module under `tests/contract/log_schema.py` runs in unit-test scope, fails CI fast on any schema drift, and is the single authority that enforces the contract at code-review time. +- "DEBUG + INFO never reach FDR" is verified by a paired test case that wires a stub FDR queue and asserts zero records after a fixed batch of INFO/DEBUG calls. 
+ +## Scope + +### Included + +- One test file (`tests/contract/log_schema.py` per epic AZ-245 AC-4) with cases for every row in the contract's "Test Cases" table (valid-info-no-frame, valid-warn-with-frame, valid-error-with-exc, invalid-bad-level, invalid-multiline-msg, invalid-non-serialisable-kv, ordering-stable). +- A "DEBUG + INFO never reach FDR" case that uses a stub FDR queue. +- A pytest marker (`contract`) so CI can run contract tests as a discrete stage if desired. + +### Excluded + +- Integration-level "every component logs at least one record" tests — owned by per-component test specs in their own epics (Step 9 Decompose Tests). +- Performance microbenchmarks for the formatter — owned by the AZ-266 unit tests. + +## Acceptance Criteria + +**AC-1: Contract cases all pass** +Given the AZ-266 + AZ-267 implementations are complete +When `pytest tests/contract/log_schema.py` runs +Then all test cases listed in `_docs/02_document/contracts/shared_logging/log_record_schema.md § Test Cases` pass + +**AC-2: Schema drift fails fast** +Given a hypothetical formatter change that re-orders the JSON keys +When `pytest tests/contract/log_schema.py` runs +Then the `ordering-stable` case fails with a diff showing actual vs. expected key order + +**AC-3: FDR-suppression invariant verified** +Given a stub FDR queue wired into the bridge +When the test emits 100 INFO + 100 DEBUG records +Then the stub queue reports zero records received + +**AC-4: Contract version pinned** +Given the test imports the contract version constant +When the contract is bumped to a new major version +Then the test fails until updated, preventing accidental coupling to an unreviewed contract change + +## Non-Functional Requirements + +**Reliability** +- The test never depends on real FDR I/O — only on the documented `enqueue` interface of the FDR producer client. 
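
One way the contract cases could inspect raw emitted lines is sketched below. It relies on Python's `json.loads` preserving key order in the resulting dict, so the parsed key list reflects the order on the wire. The helper name `check_contract_line` and the timestamp value used in the usage note are hypothetical; the key order and level names come from the AZ-266 task spec, and the contract file remains the authority.

```python
import json

EXPECTED_KEYS = ["ts", "level", "component", "frame_id", "kind", "msg", "kv", "exc"]
ALLOWED_LEVELS = {"DEBUG", "INFO", "WARN", "ERROR"}


def check_contract_line(raw):
    """Assert that one emitted line matches the key set, order and level rules."""
    body = raw[:-1] if raw.endswith("\n") else raw
    assert "\n" not in body, "record spans more than one line"
    record = json.loads(body)
    actual = list(record)  # dict preserves the key order found in the JSON text
    assert actual == EXPECTED_KEYS, (
        f"key order drift: expected {EXPECTED_KEYS}, got {actual}"
    )
    assert record["level"] in ALLOWED_LEVELS, f"bad level: {record['level']!r}"
    return record
```

A case such as `ordering-stable` would then reduce to feeding a captured line to `check_contract_line` and letting the assertion message show expected versus actual key order.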
+ +## Constraints + +- Test file path is fixed at `tests/contract/log_schema.py` per epic AZ-245 AC-4 (allows the `traceability-matrix` reference to remain stable). +- Contract version constant must be sourced from a single location (the contract file or a generated constant) — never duplicated across the test and the formatter. diff --git a/_docs/02_tasks/todo/AZ-269_config_loader.md b/_docs/02_tasks/todo/AZ-269_config_loader.md new file mode 100644 index 0000000..f380d25 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-269_config_loader.md @@ -0,0 +1,104 @@ +# Config Loader + Outer Config Container + +**Task**: AZ-269_config_loader +**Name**: Config Loader +**Description**: Implement `load_config(env, paths) -> Config` and the outer frozen `Config` dataclass. Merges env vars + one or more YAML files + documented defaults with strict precedence (env > YAML > defaults), returning an immutable container that holds one nested dataclass field per component slug. +**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure +**Component**: shared.config (cross-cutting; epic AZ-246 / E-CC-CONF) +**Tracker**: AZ-269 +**Epic**: AZ-246 (E-CC-CONF) + +## Problem + +ADR-001 (runtime selection by config) and ADR-009 (composition root) both require a single source of truth for configuration. Without a shared loader with explicit precedence rules, components silently fall back to defaults, the composition root grows local config-parsing logic, and operators cannot reliably override settings via env in CI or by YAML in the field. + +## Outcome + +- `load_config(env, paths)` is the only function any onboard process uses to materialise its `Config` at startup. +- Precedence is deterministic and observable: env > YAML > defaults; later YAML files win over earlier ones; missing keys fall to defaults. +- The returned `Config` is frozen end-to-end (every nested component block is also frozen) so accidental mutation by component code raises `FrozenInstanceError`.
+ +## Scope + +### Included + +- `load_config(env: Mapping[str, str], paths: Sequence[Path]) -> Config` per the composition_root_protocol contract. +- Outer frozen `Config` dataclass with one nested field per component slug. The OUTER container is owned by this task; the per-component nested dataclasses are owned by each component's epic and registered into the outer Config via a documented extension mechanism (a registry function called from `runtime_root.py`). +- Documented default values for cross-cutting blocks only (logging level, FDR queue size, etc.). Per-component defaults live in their own component epics. +- Friendly error messages when a required env var is missing (per AZ-263 AC-8): the error names the offending variable and points to `.env.example`. + +### Excluded + +- `compose_root` and `compose_operator` — owned by the next PBI in this epic. +- Per-component config blocks — owned by each component epic. +- The runtime self-check that strategies are linked — owned by the next PBI (StrategyNotLinkedError). 
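
The precedence rule at the heart of this task can be illustrated with a small merge over already-parsed layers. This is a sketch only: it takes YAML documents as plain dicts to avoid assuming a particular YAML library, and `resolve_config`, `ENV_KEYS`, the two nested dataclasses, and the default values shown are hypothetical stand-ins rather than the documented defaults or the real `load_config` signature.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class LogConfig:
    level: str = "INFO"  # hypothetical default


@dataclass(frozen=True)
class FdrConfig:
    queue_size: int = 4096  # hypothetical default


@dataclass(frozen=True)
class Config:
    log: LogConfig
    fdr: FdrConfig


# Assumed mapping from env var names to (section, key) pairs.
ENV_KEYS = {"LOG_LEVEL": ("log", "level"), "FDR_QUEUE_SIZE": ("fdr", "queue_size")}


def resolve_config(env, yaml_docs):
    """Merge layers with precedence env > YAML (later file wins) > defaults."""
    merged = {"log": {}, "fdr": {}}
    for doc in yaml_docs:  # later documents overwrite earlier ones
        for section, values in doc.items():
            merged.setdefault(section, {}).update(values)
    for var, (section, key) in ENV_KEYS.items():  # env overrides YAML
        if var in env:
            merged[section][key] = env[var]
    # Env values arrive as strings, so coerce the integer field.
    fdr = {k: int(v) for k, v in merged["fdr"].items()}
    # Keys absent from every layer fall back to the dataclass defaults.
    return Config(log=LogConfig(**merged["log"]), fdr=FdrConfig(**fdr))
```

Because both the outer and nested dataclasses are frozen, an assignment such as `cfg.log.level = "DEBUG"` raises `dataclasses.FrozenInstanceError`, and two calls with the same inputs compare equal through the generated `__eq__`.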
+ +## Acceptance Criteria + +**AC-1: Precedence env > YAML > defaults** +Given env sets `LOG_LEVEL=DEBUG` and YAML sets `log.level=INFO` +When `load_config(env, [yaml_path])` runs +Then `config.log.level == "DEBUG"` + +**AC-2: YAML > defaults when env is silent** +Given env has no `LOG_LEVEL` and YAML sets `log.level=INFO` +When `load_config(env, [yaml_path])` runs +Then `config.log.level == "INFO"` + +**AC-3: Defaults fill gaps** +Given env has no `LOG_LEVEL` and YAML omits `log.level` +When `load_config(env, [yaml_path])` runs +Then `config.log.level` equals the documented default + +**AC-4: Multi-file YAML merge order** +Given two YAML paths where the second sets `fdr.queue_size=8192` and the first sets it to `4096` +When `load_config(env, [first, second])` runs +Then `config.fdr.queue_size == 8192` (later file wins) + +**AC-5: Frozen end-to-end** +Given a loaded `Config` +When component code attempts `config.log.level = "DEBUG"` +Then a `TypeError` (or `FrozenInstanceError`) is raised + +**AC-6: Required-var missing fails fast with pointer** +Given a required env var is unset and no YAML override or default exists +When `load_config(env, paths)` runs +Then it raises an error whose message names the missing var and points to `.env.example` + +## Non-Functional Requirements + +**Performance** +- Cold-start `load_config` ≤ 250 ms on Tier-2 (leaving 750 ms of the 1 s startup budget for `compose_root`). + +**Reliability** +- Loader is pure: same env + same file contents always yields a deep-equal `Config`. Verified by the NFR-reliability unit test. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | env vs. YAML for `log.level` | env value wins | +| AC-2 | YAML vs.
default | YAML value wins | +| AC-3 | All-default for `log.level` | documented default returned | +| AC-4 | Two YAML files, conflicting key | later file wins | +| AC-5 | Mutation attempt on loaded Config | TypeError / FrozenInstanceError | +| AC-6 | Missing required env var | error message names the var + points to `.env.example` | +| NFR-perf | Microbenchmark `load_config` over a representative config | p99 ≤ 250 ms on Tier-2 | +| NFR-reliability | Call `load_config` twice with same args | deep-equal `Config` instances | + +## Constraints + +- Public surface frozen by `_docs/02_document/contracts/shared_config/composition_root_protocol.md` v1.0.0. +- No new dependency beyond what AZ-263 / E-BOOT pinned (stdlib + the YAML library already in `pyproject.toml`). + +## Risks & Mitigation + +**Risk 1: Per-component defaults drift across components** +- *Risk*: Without a documented registration mechanism, two components may both claim a `log.level` default and conflict. +- *Mitigation*: Defaults registry is keyed by component slug + key; collisions raise at registration time, not at load time. + +## Contract + +This task produces (jointly with AZ-270 compose_root) the contract at `_docs/02_document/contracts/shared_config/composition_root_protocol.md`. +Consumers MUST read that file — not this task spec — to discover the interface. diff --git a/_docs/02_tasks/todo/AZ-270_compose_root.md b/_docs/02_tasks/todo/AZ-270_compose_root.md new file mode 100644 index 0000000..cde479d --- /dev/null +++ b/_docs/02_tasks/todo/AZ-270_compose_root.md @@ -0,0 +1,108 @@ +# Composition Root + StrategyNotLinkedError + +**Task**: AZ-270_compose_root +**Name**: Composition Root +**Description**: Implement `compose_root(config) -> RuntimeRoot` for the airborne process and `compose_operator(config) -> OperatorRoot` for the operator-side tooling.
Both functions construct every component instance, inject dependencies against component interfaces, and refuse to start when the config selects a strategy whose `BUILD_*` flag was OFF in the linked binary (raises `StrategyNotLinkedError`). +**Complexity**: 3 points +**Dependencies**: AZ-269_config_loader +**Component**: shared.config (cross-cutting; epic AZ-246 / E-CC-CONF) +**Tracker**: AZ-270 +**Epic**: AZ-246 (E-CC-CONF) + +## Problem + +Per ADR-009 (interface-first DI), only ONE place in the codebase may import concrete component implementations — the composition root. Without a single, tested composition function, components grow direct cross-imports and the build-time exclusion gate (ADR-002) loses its third enforcement point at runtime. + +## Outcome + +- A single `compose_root(config)` call returns a fully-wired airborne `RuntimeRoot` whose component graph matches the `Config`-selected strategies. +- Strategy/build-flag mismatch raises `StrategyNotLinkedError` with a clear message naming the missing strategy, the owning component, and the strategies actually linked into this binary. +- `compose_operator(config)` returns the operator-side `OperatorRoot` with only operator-tier components (e.g. C11 TileManager, C12 operator tooling) — and refuses to wire C1–C5 / C7 / C13 (airborne-only) even if asked. +- `runtime_root.py` exits with code 0 on a valid Config when no components do work (reachability proof per epic AC-4). + +## Scope + +### Included + +- `compose_root(config: Config) -> RuntimeRoot` per the composition_root_protocol contract. +- `compose_operator(config: Config) -> OperatorRoot` per the same contract. +- `StrategyNotLinkedError` exception with `strategy_name`, `component_slug`, `available_strategies` payload. +- Strategy/build-flag consistency check that runs at the start of both compose functions; ADR-002 enforcement gate #3.
+- Component construction order respects the dependency graph in `_docs/02_document/architecture.md` (foundational components first). +- Composition-root code is the ONLY allowed importer of concrete component classes; module-layout.md's Layout Rule 6 is enforced at code-review time. + +### Excluded + +- The `RuntimeRoot` and `OperatorRoot` internal class definitions — owned by E-BOOT (AZ-263) for the skeleton; per-component `add_to_root` registration logic lives in each component epic. +- Per-component config blocks — owned by each component epic. +- Per-component strategy registration — each component epic registers its strategies into a discovery map; this task only wires what's been registered. + +## Acceptance Criteria + +**AC-1: Default deployment composes** +Given a default-deployment-binary `Config` and a binary built with the deployment `BUILD_*` flag set +When `compose_root(config)` runs +Then it returns a `RuntimeRoot` whose every component slot is populated by the strategy declared in `Config` + +**AC-2: Strategy/build-flag mismatch rejected** +Given a `Config` selects `vins_mono` for `c1_vio` and the binary was built with `BUILD_VINS_MONO=OFF` +When `compose_root(config)` runs +Then it raises `StrategyNotLinkedError` with `strategy_name="vins_mono"`, `component_slug="c1_vio"`, `available_strategies` listing the strategies actually linked + +**AC-3: Operator-side excludes airborne** +Given an operator `Config` accidentally references an airborne-only component (e.g. 
`c1_vio`) +When `compose_operator(config)` runs +Then it raises `StrategyNotLinkedError` (or a clearly-named subclass) noting the component is airborne-only + +**AC-4: Reachability proof** +Given a valid `Config` with all components stubbed to do nothing +When `runtime_root.py` runs `compose_root(config)` and exits +Then exit code is 0 and no exception is raised + +**AC-5: Construction order respects dependencies** +Given `Config` selects `c5_state` (depends on `c1_vio`, `c4_pose`) +When `compose_root(config)` constructs the graph +Then `c1_vio` and `c4_pose` instances exist before `c5_state` is constructed (verified by an order-tracing fake) + +**AC-6: Single import point enforced** +Given the codebase +When the architecture lint check (added under code-review skill, Phase 7) runs +Then only `compose_root` and `compose_operator` import from `components..` — every other module imports only from `components.` (Public API) + +## Non-Functional Requirements + +**Performance** +- `compose_root(config)` ≤ 750 ms on Tier-2 (combined with AZ-269's 250 ms loader budget for the 1 s total). + +**Reliability** +- Composition is deterministic: same `Config` → same component graph (verified by structural equality on the fake recorder). +- A failure mid-composition leaves no partially-constructed singletons (composition is all-or-nothing; on error, every constructed instance is closed). 
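
The up-front strategy check and the all-or-nothing reliability requirement can be sketched together. This is an illustration under assumptions: the discovery map is modelled as a nested dict of factories, each factory receives the instances built so far, and components expose an optional `close()`. The function name `compose`, the parameter names, and that factory shape are hypothetical; only the exception name and its three payload fields come from this task spec.

```python
class StrategyNotLinkedError(RuntimeError):
    """Raised when the config selects a strategy that is not linked in."""

    def __init__(self, strategy_name, component_slug, available_strategies):
        self.strategy_name = strategy_name
        self.component_slug = component_slug
        self.available_strategies = tuple(available_strategies)
        linked = ", ".join(self.available_strategies) or "none"
        super().__init__(
            f"strategy {strategy_name!r} for component {component_slug!r} "
            f"is not linked into this binary; linked strategies: {linked}"
        )


def compose(selection, registry, order):
    """Build components in dependency order, closing everything on failure.

    selection: component slug -> chosen strategy name
    registry:  component slug -> {strategy name: factory(built) -> instance}
    order:     component slugs, foundational components first
    """
    # Consistency check runs before anything is constructed.
    for slug in order:
        linked = registry.get(slug, {})
        if selection[slug] not in linked:
            raise StrategyNotLinkedError(selection[slug], slug, sorted(linked))

    built = {}
    try:
        for slug in order:
            built[slug] = registry[slug][selection[slug]](built)
    except Exception:
        # All-or-nothing: close already-built instances in reverse order.
        for instance in reversed(list(built.values())):
            close = getattr(instance, "close", None)
            if callable(close):
                close()
        raise
    return built
```

Running the check before any construction means a mismatch never leaves instances to clean up; the `try`/`except` only has to cover failures raised by a factory itself.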
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Default Config + deployment-flag binary | Every component slot populated | +| AC-2 | Config selects unlinked strategy | `StrategyNotLinkedError` with full payload | +| AC-3 | Operator Config references airborne-only component | `StrategyNotLinkedError` (or subclass) noting tier mismatch | +| AC-4 | `runtime_root.py` smoke run with stubbed components | exit code 0 | +| AC-5 | `compose_root` with construction-order recorder | dependency order respected | +| AC-6 | Architecture lint over the codebase | Only compose_root / compose_operator import concrete strategies | +| NFR-perf | Microbench `compose_root` over a representative Config | p99 ≤ 750 ms on Tier-2 | +| NFR-reliability | Force a mid-composition failure (one strategy raises in `__init__`) | No partial state; every prior instance closed | + +## Constraints + +- Public surface frozen by `_docs/02_document/contracts/shared_config/composition_root_protocol.md` v1.0.0. +- Composition-root code is the ONLY place concrete strategy classes may be imported. Code-review Phase 7 emits an Architecture finding (High) on any other importer. + +## Risks & Mitigation + +**Risk 1: Component registration not fully discoverable at compose time** +- *Risk*: A component epic forgets to register its strategies into the discovery map, leaving `compose_root` unable to construct it. +- *Mitigation*: A startup self-check enumerates required components from the architecture spec and asserts every one has at least one registered strategy; missing → loud error at compose start. + +## Contract + +This task produces (jointly with AZ-269 config loader) the contract at `_docs/02_document/contracts/shared_config/composition_root_protocol.md`. +Consumers MUST read that file — not this task spec — to discover the interface. 
diff --git a/_docs/02_tasks/todo/AZ-271_config_precedence_tests.md b/_docs/02_tasks/todo/AZ-271_config_precedence_tests.md new file mode 100644 index 0000000..b95294f --- /dev/null +++ b/_docs/02_tasks/todo/AZ-271_config_precedence_tests.md @@ -0,0 +1,75 @@ +# Config Precedence Unit Tests + +**Task**: AZ-271_config_precedence_tests +**Name**: Config Precedence Tests +**Description**: A focused unit-test module that verifies the env > YAML > defaults precedence rule for at least 3 keys per layer (per epic AZ-246 AC-3) and the multi-file YAML merge order (later wins). Companion to AZ-269 / AZ-270. +**Complexity**: 2 points +**Dependencies**: AZ-269_config_loader, AZ-270_compose_root +**Component**: shared.config (cross-cutting; epic AZ-246 / E-CC-CONF) +**Tracker**: AZ-271 +**Epic**: AZ-246 (E-CC-CONF) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — the contract whose precedence invariant this test suite verifies. + +## Problem + +The composition_root_protocol contract declares a hard precedence rule. Without explicit, tabular tests covering at least 3 keys per layer, regressions in the loader silently flip precedence and cause field bugs that only show up in production deployments where env overrides matter most. + +## Outcome + +- A pytest module with clearly-named cases covering precedence for ≥3 keys at each layer. +- A multi-file YAML merge case proving later paths win over earlier paths. +- Failure messages name the layer (env / YAML / defaults) so on-call engineers can triage fast. + +## Scope + +### Included + +- Precedence cases for ≥3 keys at each of: env > YAML, YAML > defaults, multi-file YAML merge order. +- One reachability case driving `compose_root` end-to-end with a stub Config to prove the loader and composition functions integrate cleanly. +- Test fixtures (in-memory YAML strings + env dict, no real files needed for precedence cases; one tmp_path for the multi-file case). 
+ +### Excluded + +- Performance microbench — owned by AZ-269. +- Strategy/build-flag mismatch tests — owned by AZ-270. +- Per-component config block tests — owned by each component epic. + +## Acceptance Criteria + +**AC-1: env > YAML for at least 3 keys** +Given env sets `LOG_LEVEL`, `FDR_QUEUE_SIZE`, `MAVLINK_BAUD` and YAML sets all three to different values +When `load_config(env, [yaml])` runs +Then all three resolved values match env + +**AC-2: YAML > defaults for at least 3 keys** +Given env is empty and YAML sets `log.level`, `fdr.queue_size`, `mavlink.baud` +When `load_config(env, [yaml])` runs +Then all three resolved values match YAML (not the documented defaults) + +**AC-3: Defaults apply for at least 3 keys** +Given env is empty and YAML omits the three keys +When `load_config(env, [yaml])` runs +Then all three resolved values match the documented defaults + +**AC-4: Multi-file YAML — later wins** +Given two YAML paths setting the same key to different values +When `load_config(env, [first, second])` runs +Then the resolved value matches the second file + +**AC-5: Failure messages name the layer** +Given a precedence assertion fails +When pytest reports the failure +Then the assertion message names which layer's value was expected and which was found + +## Non-Functional Requirements + +**Reliability** +- Tests are hermetic: no real env vars consulted, no real YAML files outside tmp_path. + +## Constraints + +- Test file path is fixed at `tests/unit/shared/config/test_precedence.py` (mirrors the `tests/unit//` convention from module-layout.md Layout Rule 7 — `shared/config` is the component slug). +- Cases use the SAME 3 keys per layer to make the test matrix comparable across layers. 
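The precedence rule these cases verify can be sketched as a layered resolve. Everything below is an assumption for illustration: `resolve_config` stands in for the real `load_config`, the default values are invented, and YAML layers are passed as already-parsed flat dicts to avoid a YAML dependency. Each value is returned together with the layer that supplied it, which is one way a test could name the layer in its failure message (AC-5).

```python
from typing import Any, Dict, Mapping, Sequence, Tuple

# Hypothetical defaults; the documented defaults live in the contract.
DEFAULTS: Dict[str, Any] = {
    "log.level": "INFO",
    "fdr.queue_size": 1024,
    "mavlink.baud": 57600,
}

# Mapping from env variable names (as used in AC-1) to config keys.
ENV_TO_KEY: Dict[str, str] = {
    "LOG_LEVEL": "log.level",
    "FDR_QUEUE_SIZE": "fdr.queue_size",
    "MAVLINK_BAUD": "mavlink.baud",
}


def resolve_config(
    env: Mapping[str, str],
    yaml_layers: Sequence[Mapping[str, Any]],
) -> Dict[str, Tuple[Any, str]]:
    """Resolve each key as (value, layer) with env > YAML > defaults."""
    resolved = {key: (value, "defaults") for key, value in DEFAULTS.items()}
    for layer in yaml_layers:  # later layers overwrite earlier ones
        for key, value in layer.items():
            if key in DEFAULTS:
                resolved[key] = (value, "YAML")
    for env_name, key in ENV_TO_KEY.items():
        if env_name in env:
            # Env values arrive as strings; coerce to the default's type.
            caster = type(DEFAULTS[key])
            resolved[key] = (caster(env[env_name]), "env")
    return resolved
```

A test can then assert on both parts, for example `resolve_config({"MAVLINK_BAUD": "115200"}, [{"mavlink.baud": 9600}])["mavlink.baud"] == (115200, "env")`.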
diff --git a/_docs/02_tasks/todo/AZ-272_fdr_record_schema.md b/_docs/02_tasks/todo/AZ-272_fdr_record_schema.md new file mode 100644 index 0000000..09a04c2 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-272_fdr_record_schema.md @@ -0,0 +1,123 @@ +# FdrRecord Schema + Versioned Serialiser + +**Task**: AZ-272_fdr_record_schema +**Name**: FdrRecord Schema +**Description**: Define the `FdrRecord` versioned schema (one record kind per payload class — `log`, `vio.tick`, `state.tick`, `tile_match`, `overrun`, `segment_rollover`, `failed_tile_thumbnail`, `mid_flight_tile_snapshot`, etc.) and the matching serialiser/parser pair so every onboard producer emits and post-flight tooling reads the same wire format. Library choice (orjson or msgpack) is pinned at E-BOOT; the schema layer is library-agnostic. +**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure, AZ-266_log_module +**Component**: shared.fdr_client (cross-cutting; epic AZ-247 / E-CC-FDR-CLIENT) +**Tracker**: AZ-272 +**Epic**: AZ-247 (E-CC-FDR-CLIENT) + +## Problem + +C13 (FdrWriter) and every onboard producer must agree on a single, versioned wire format for FDR records. Without one frozen schema: +- Producers drift in field naming over time, breaking post-flight analysis. +- Forward-compatible parsing is impossible — a new field added in version N+1 silently breaks tooling pinned at version N. +- The cross-component "no silent drops" guarantee (AC-NEW-3) is unenforceable because the `kind=overrun` record has no canonical shape. + +## Outcome + +- A single `FdrRecord` definition is the only record type any onboard process emits, and the only one C13's writer thread + post-flight tooling parses. +- The schema carries a top-level `schema_version` integer; the parser is forward-compatible — a record at version N is readable by tooling pinned at version N-1 with documented field-set degradation rules. 
+- Adding a new record `kind` is a minor version bump; renaming or removing a field is a major version bump (covered by `Versioning Rules` in the contract). + +## Scope + +### Included + +- `FdrRecord` outer envelope: `schema_version: int`, `ts: str (ISO 8601 UTC, µs)`, `producer_id: str`, `kind: str`, `payload: object`. +- A closed enum of supported `kind` values for v1.0.0 covering: `log`, `vio.tick`, `state.tick`, `tile_match`, `overrun`, `segment_rollover`, `failed_tile_thumbnail`, `mid_flight_tile_snapshot`, `flight_header`, `flight_footer`. Per-`kind` payload shape is documented in the contract. +- A `serialise(record: FdrRecord) -> bytes` and `parse(buf: bytes) -> FdrRecord` pair. Library is pinned at E-BOOT (orjson or msgpack); the public API hides the choice. +- A forward-compat parser: unknown future fields inside `payload` are preserved on read (deserialised into a generic `extra: dict[str, Any]` bucket); unknown future `kind` values surface as `FdrRecord(kind=, payload=)` so tooling can skip them rather than crash. +- Public interface contract published at `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md`. + +### Excluded + +- The lock-free ring buffer (`FdrClient.enqueue`) — owned by AZ-XX (next task in this epic). +- Drop-oldest + overrun-emission policy — owned by the third task in this epic. +- The `FakeFdrSink` test double — owned by the fourth task in this epic. +- The C13 writer thread, segment files, and 64 GB cap — owned by E-C13 (AZ-248). 
+ +## Acceptance Criteria + +**AC-1: One envelope, every kind** +Given any of the v1.0.0 `kind` values +When a producer constructs an `FdrRecord(kind=, payload=)` and `serialise` is called +Then the resulting bytes parse back to a deep-equal `FdrRecord` via `parse` + +**AC-2: Forward-compatible parser** +Given a record serialised at schema_version 1.1 (a hypothetical future minor) with an additional payload field `new_field` +When tooling pinned at schema_version 1.0 calls `parse` +Then the record parses successfully; `new_field` is preserved under `payload.extra["new_field"]` + +**AC-3: Unknown kind tolerated** +Given a record whose `kind` is not in the v1.0.0 closed enum +When `parse` runs +Then a valid `FdrRecord` is returned with `kind` set to the raw string and `payload` set to the raw decoded object — no exception is raised + +**AC-4: Schema version is mandatory** +Given a serialised record missing `schema_version` (or with a non-integer value) +When `parse` runs +Then `FdrSchemaError` is raised with a message naming the offending field + +**AC-5: Overrun record shape is canonical** +Given a `kind="overrun"` record +When constructed by any producer +Then `payload` MUST contain `producer_id: str` and `dropped_count: int (>0)` — schema validation rejects payloads missing either field + +**AC-6: Producer ID is required on every record** +Given any `FdrRecord` with an empty or missing top-level `producer_id` +When `serialise` runs +Then `FdrSchemaError` is raised — there are no anonymous records on the wire + +## Non-Functional Requirements + +**Performance** +- `serialise` p99 ≤ 20 µs on Tier-2 for a record with `len(payload) ≤ 16` scalar entries (this is the budget that lets the enqueue path stay within its 5 µs hot-path target — serialisation may run on the writer thread instead of the producer; the contract test asserts the producer path does not call `serialise`). +- `parse` p99 ≤ 50 µs on the same record shape. 
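The envelope and the validation rules from AC-4, AC-5 and AC-6 can be sketched as follows. This is an illustration under stated assumptions: the standard-library `json` module stands in for the pinned library (orjson or msgpack), per-kind payload shapes are not modelled, and the payload is kept as a raw dict so unknown fields survive a round trip untouched rather than being moved into an `extra` bucket.

```python
import json
from dataclasses import dataclass
from typing import Any

ENVELOPE_FIELDS = ("schema_version", "ts", "producer_id", "kind", "payload")


class FdrSchemaError(Exception):
    """The only exception type this sketch raises on schema violations."""


@dataclass(frozen=True)
class FdrRecord:
    schema_version: int
    ts: str
    producer_id: str
    kind: str
    payload: Any


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _validate(record: FdrRecord) -> None:
    if not _is_int(record.schema_version):
        raise FdrSchemaError("schema_version must be an integer")
    if not isinstance(record.producer_id, str) or not record.producer_id:
        raise FdrSchemaError("producer_id must be a non-empty string")
    if record.kind == "overrun":
        payload = record.payload if isinstance(record.payload, dict) else {}
        if not isinstance(payload.get("producer_id"), str):
            raise FdrSchemaError("overrun payload requires producer_id")
        count = payload.get("dropped_count")
        if not _is_int(count) or count <= 0:
            raise FdrSchemaError("overrun payload requires dropped_count > 0")


def serialise(record: FdrRecord) -> bytes:
    _validate(record)
    doc = {name: getattr(record, name) for name in ENVELOPE_FIELDS}
    try:
        # sort_keys keeps the output byte-identical for equal inputs.
        text = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    except (TypeError, ValueError) as exc:
        raise FdrSchemaError(f"payload is not serialisable: {exc}") from exc
    return text.encode("utf-8")


def parse(buf: bytes) -> FdrRecord:
    try:
        doc = json.loads(buf)
    except ValueError as exc:  # covers JSON and UTF-8 decode errors
        raise FdrSchemaError("buffer is not a valid record") from exc
    if not isinstance(doc, dict):
        raise FdrSchemaError("record envelope must be an object")
    for name in ENVELOPE_FIELDS:
        if name not in doc:
            raise FdrSchemaError(f"missing field: {name}")
    record = FdrRecord(**{name: doc[name] for name in ENVELOPE_FIELDS})
    _validate(record)  # unknown kinds pass through with their raw payload
    return record
```

Because `kind` is only inspected for the `overrun` case, a record with an unrecognised `kind` parses into a valid `FdrRecord` carrying the raw string and raw payload, which is the behaviour AC-3 asks for.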
+ +**Reliability** +- `serialise` and `parse` are pure functions: same input → byte-identical output (or deep-equal record). +- `FdrSchemaError` is the ONLY exception type either function raises on schema violation; no `KeyError`, `ValueError`, or library-specific exceptions leak to callers. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Round-trip every v1.0.0 kind | `parse(serialise(r)) == r` for every kind | +| AC-2 | Parse a synthetic v1.1 record (extra field added) at v1.0 | record parses; extra field preserved in `payload.extra` | +| AC-3 | Parse a record with `kind="future.kind"` | record parses; `kind` and `payload` opaque | +| AC-4 | Missing `schema_version` | `FdrSchemaError` mentions `schema_version` | +| AC-5 | `overrun` record missing `dropped_count` | `FdrSchemaError` mentions `dropped_count` | +| AC-6 | Serialise with empty `producer_id` | `FdrSchemaError` mentions `producer_id` | +| NFR-perf | Microbench `serialise` and `parse` on a 16-entry payload | p99 within budget over 10k iterations | +| NFR-reliability | Call `serialise` twice with same input | byte-identical outputs | + +## Constraints + +- Public surface frozen by `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` v1.0.0. +- Underlying serialisation library (orjson vs. msgpack) is pinned in `pyproject.toml` at E-BOOT and must NOT leak through the public API — the contract talks about bytes in/out only. +- No new dependency beyond what AZ-263 / E-BOOT pinned. + +## Risks & Mitigation + +**Risk 1: Library choice (orjson vs msgpack) changes after consumers exist** +- *Risk*: Switching from JSON-bytes to msgpack-bytes after producers are wired forces a wire-format migration. +- *Mitigation*: The library is pinned in `pyproject.toml` at AZ-263 / E-BOOT; the contract's `Shape` section documents the chosen wire format and its `magic` prefix so post-flight tooling can detect a wrong-format file fast. 
+ +**Risk 2: `payload.extra` preserves bytes-blob fields and bloats memory** +- *Risk*: A future schema adds a large binary field; old tooling preserves it under `extra` and balloons in-memory record size. +- *Mitigation*: The forward-compat rule documents that fields larger than 4 KiB MUST be referenced by sidecar path, not embedded — enforced in the contract `Invariants` section. + +## Runtime Completeness + +- **Named capability**: `FdrRecord` versioned schema + serialiser/parser pair (architecture / E-CC-FDR-CLIENT / AC-NEW-3, AC-8.5). +- **Production code that must exist**: real schema enforcement (every required field validated), real round-trip tested against the chosen library. +- **Allowed external stubs**: none — the library (orjson/msgpack) is the production dependency. +- **Unacceptable substitutes**: hand-rolled `repr()` -> `eval()` round-trip; "for now we just store dicts and worry about schema later"; serialiser that drops unknown fields silently (breaks forward-compat AC-2). + +## Contract + +This task produces the contract at `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md`. +Consumers MUST read that file — not this task spec — to discover the interface. diff --git a/_docs/02_tasks/todo/AZ-273_fdr_client_ringbuf.md b/_docs/02_tasks/todo/AZ-273_fdr_client_ringbuf.md new file mode 100644 index 0000000..3398416 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-273_fdr_client_ringbuf.md @@ -0,0 +1,151 @@ +# FdrClient Lock-Free SPSC Ring Buffer + Public API + +**Task**: AZ-273_fdr_client_ringbuf +**Name**: FdrClient Ring Buffer +**Description**: Implement the producer-side `FdrClient(producer_id)` and its lock-free single-producer / single-consumer (SPSC) ring buffer. `enqueue` is non-blocking even when the C13 writer thread is stalled. Capacity is configurable per producer via the cross-cutting Config block. 
The buffer exposes a hook the overrun-policy task (next PBI) plugs into; this task does NOT implement the drop-oldest emission itself. +**Complexity**: 5 points +**Dependencies**: AZ-263_initial_structure, AZ-272_fdr_record_schema, AZ-269_config_loader, AZ-266_log_module +**Component**: shared.fdr_client (cross-cutting; epic AZ-247 / E-CC-FDR-CLIENT) +**Tracker**: AZ-273 +**Epic**: AZ-247 (E-CC-FDR-CLIENT) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — the record envelope this client enqueues. +- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — the Config object that carries this client's capacity setting. +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — diagnostic logs emitted by this client (NOT on the steady-state hot path). + +## Problem + +Every onboard component needs to publish FDR records in real time without blocking on the writer thread, the disk, or any other producer. AC-NEW-3 ("no silent drops") and the steady-state `enqueue` p99 ≤ 5 µs budget rule out: +- Any lock-acquiring queue (Python `queue.Queue`, `threading.Lock`-protected list, asyncio queue). +- Any allocation on the steady-state path (no `dict.copy()`, no `list.append` that may resize, no `dataclasses.replace`). +- Any blocking I/O. + +Without a shared, contract-frozen client, every component would re-implement its own queue, drift on overrun semantics, and break the AC-NEW-3 guarantee within weeks of parallel development. + +## Outcome + +- A single `FdrClient(producer_id)` is the only handle any onboard producer ever holds; constructed by the composition root and injected into each component. +- `enqueue` p99 ≤ 5 µs on Tier-2 with no allocation on the steady-state path (pre-sized buffers; reused slots). +- `enqueue` NEVER blocks, regardless of writer-thread state. 
When the buffer is full, control returns to the caller in O(1); the overrun policy (drop-oldest + emit `kind="overrun"`) is implemented by the next PBI via the buffer's documented hook. +- The dequeue side (`pop_one` / iterator) is consumed exclusively by the C13 writer thread; the contract documents it as SPSC — multi-consumer is undefined behaviour and rejected by the contract test. + +## Scope + +### Included + +- `FdrClient(producer_id: str, capacity: int)` constructor + module-level `make_fdr_client(producer_id, config) -> FdrClient` factory that reads capacity from the cross-cutting `config.fdr_client..capacity` block (with documented default). +- `FdrClient.enqueue(record: FdrRecord) -> EnqueueResult` — lock-free, non-blocking, allocation-free on the steady-state path. Returns `EnqueueResult.OK` or `EnqueueResult.OVERRUN` (the next PBI consumes `OVERRUN`). +- A documented `on_overrun: Callable[[FdrRecord], None] | None` hook the overrun-policy PBI populates with the drop-oldest + record-emit closure. +- Single-consumer dequeue API for the C13 writer: `pop_one() -> FdrRecord | None` and `drain(max_records: int) -> list[FdrRecord]`. +- `flush() -> None` test-only method that blocks until the buffer is empty (used by `FakeFdrSink` and contract tests; production callers MUST NOT call this on the hot path). +- Diagnostic INFO log on construction (one-time, NOT on the steady-state hot path) via the shared logger. +- Public interface contract published at `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md`. + +### Excluded + +- The drop-oldest behaviour and the `kind="overrun"` record emission — owned by the next PBI in this epic. +- The C13 writer thread itself, segment files, segment rotation, 64 GB cap — owned by E-C13 (AZ-248). +- The `FakeFdrSink` for tests — owned by the fourth PBI in this epic. +- Multi-producer / multi-consumer ring buffer — out of scope; the contract is SPSC. 
+- The actual `FdrRecord` schema and serialiser — owned by AZ-272. + +## Acceptance Criteria + +**AC-1: Lock-free, never blocks** +Given an FdrClient with capacity 1024 and a writer thread that is stalled (does not dequeue) +When the producer calls `enqueue(record)` 1025 times in rapid succession +Then every call returns within 50 µs (no thread state ever transitions to BLOCKED), and the 1025th call returns `EnqueueResult.OVERRUN` + +**AC-2: Allocation-free steady-state** +Given an FdrClient warmed up with one prior `enqueue` +When the producer calls `enqueue(record)` for an in-buffer record (slot is free) +Then the call performs zero heap allocations (verified via `tracemalloc` snapshot diff: 0 new objects on the hot path) + +**AC-3: Capacity is config-driven** +Given the cross-cutting Config block sets `config.fdr_client..capacity = 4096` +When `make_fdr_client(producer_id, config)` runs +Then the returned client's internal buffer length is 4096 (verified via the test-only `_capacity()` introspection method) + +**AC-4: SPSC dequeue contract** +Given two threads concurrently call `pop_one()` +When both calls race +Then the contract test detects undefined behaviour (asserted via a contract test that wraps `pop_one` in a guard which raises `FdrSpscViolationError` on concurrent entry — the guard is opt-in for tests but documents the SPSC invariant) + +**AC-5: Overrun hook is wired** +Given an `FdrClient` with `on_overrun` set to a recording closure +When the buffer fills and the next `enqueue` would overrun +Then `on_overrun` is invoked exactly once per overrun event with the would-be-enqueued record (the closure decides what to do — drop-oldest + emit, log only, etc.; this PBI does NOT define that behaviour) + +**AC-6: flush() drains buffer** +Given an FdrClient with N records buffered and a consumer thread draining +When the test calls `flush()` +Then `flush()` returns only after `pop_one()` has been called N times (no records left in the buffer) + +**AC-7: 
producer_id is non-empty and stamped on every record** +Given a constructor call `FdrClient(producer_id="")` (empty string) +When construction runs +Then `ValueError` is raised — anonymous producers are forbidden + +## Non-Functional Requirements + +**Performance** +- `enqueue` p99 ≤ 5 µs on Tier-2 (Jetson Orin Nano Super) for a record carrying a `payload` dict of ≤ 16 scalar entries. Validated by a microbenchmark (10k iterations, warm cache). +- `pop_one` p99 ≤ 10 µs on Tier-2 under steady-state. +- Memory: per-producer ring buffer ≤ `capacity * sizeof(slot)` bytes; no unbounded growth. Pre-sized at construction. + +**Reliability** +- `enqueue` never raises into the caller. Schema violations from `FdrRecord` are caught and forwarded to the same `on_overrun` hook with a synthetic flag (the overrun-policy PBI decides what to do); the producer's hot path stays clean. +- Multiple `make_fdr_client(producer_id, config)` calls with the same `producer_id` return the same cached instance — there is exactly one FdrClient per producer_id per process. + +**Concurrency** +- SPSC: ONE producer thread MAY call `enqueue`, ONE consumer thread MAY call `pop_one` / `drain`. Multi-producer or multi-consumer use is undefined behaviour and detected by the contract guard (AC-4). 
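The index discipline behind an SPSC ring can be sketched in pure Python. This is only an illustration of the shape of the API: `RingSketch` is a hypothetical name, the code makes no claim to meet the allocation-free or 5 µs requirements, and it relies on the producer being the sole writer of `_tail` and the consumer the sole writer of `_head`. Unlike the requirement above, exceptions from `on_overrun` are not caught here.

```python
from enum import Enum
from typing import Any, Callable, List, Optional


class EnqueueResult(Enum):
    OK = "ok"
    OVERRUN = "overrun"


class RingSketch:
    """Pre-sized single-producer / single-consumer ring (illustration only)."""

    def __init__(
        self,
        producer_id: str,
        capacity: int,
        on_overrun: Optional[Callable[[Any], None]] = None,
    ) -> None:
        if not producer_id:
            raise ValueError("producer_id must be non-empty")
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.producer_id = producer_id
        self.on_overrun = on_overrun
        # One extra slot stays empty so "full" and "empty" are distinguishable.
        self._slots: List[Any] = [None] * (capacity + 1)
        self._head = 0  # next slot to read; written only by the consumer
        self._tail = 0  # next slot to write; written only by the producer

    def enqueue(self, record: Any) -> EnqueueResult:
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:  # buffer full: return at once, never wait
            if self.on_overrun is not None:
                self.on_overrun(record)
            return EnqueueResult.OVERRUN
        self._slots[self._tail] = record
        self._tail = nxt  # publish only after the slot has been written
        return EnqueueResult.OK

    def pop_one(self) -> Any:
        if self._head == self._tail:
            return None  # empty; records themselves are never None
        record = self._slots[self._head]
        self._slots[self._head] = None  # release the reference for reuse
        self._head = (self._head + 1) % len(self._slots)
        return record

    def drain(self, max_records: int) -> List[Any]:
        out: List[Any] = []
        while len(out) < max_records:
            record = self.pop_one()
            if record is None:
                break
            out.append(record)
        return out
```

The slot list is sized once at construction and slots are reused, which mirrors the pre-sized buffer requirement. Whether a pure-Python version like this can hit the latency budget is exactly what Risk 1 leaves to the microbenchmark.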
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Stalled consumer + 1025 enqueues into a 1024-capacity client | Every call returns within 50 µs; #1025 returns `OVERRUN` | +| AC-2 | `tracemalloc` snapshot diff across one `enqueue` after warmup | Zero new objects allocated | +| AC-3 | `make_fdr_client("c1_vio", config_with_capacity_4096)` | `client._capacity() == 4096` | +| AC-4 | Two threads call `pop_one()` concurrently with the SPSC guard enabled | `FdrSpscViolationError` raised | +| AC-5 | Wire a recording `on_overrun`; force overrun | Closure invoked exactly once with the offending record | +| AC-6 | Enqueue N records, start a draining consumer, call `flush()` | `flush()` returns only after buffer is empty | +| AC-7 | `FdrClient(producer_id="")` | `ValueError` | +| NFR-perf | Microbench `enqueue` over 10k iterations on Tier-2 | p99 ≤ 5 µs | +| NFR-perf-pop | Microbench `pop_one` over 10k iterations | p99 ≤ 10 µs | +| NFR-reliability | Two `make_fdr_client("c1_vio", config)` calls | same instance returned | + +## Constraints + +- Public surface frozen by `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md` v1.0.0. +- SPSC only — multi-producer / multi-consumer is out of scope and the contract test asserts the SPSC guard exists. +- The lock-free implementation MAY use `multiprocessing.shared_memory`, `cffi`-backed atomics, a Cython extension, or pure Python with `array.array` + a single CAS-like primitive — implementation choice is internal to this PBI but MUST satisfy the allocation-free + non-blocking ACs above. Prefer the simplest working option that hits the budget; document the choice in the implementation report. +- No new dependency beyond what AZ-263 / E-BOOT pinned. + +## Risks & Mitigation + +**Risk 1: Pure-Python SPSC ring cannot hit the 5 µs p99 budget on Tier-2** +- *Risk*: CPython's GIL + dict operations push p99 above 5 µs on the Jetson. 
+- *Mitigation*: Bench against a `cffi` or Cython-backed SPSC ring as a fallback; the contract is library-agnostic so the implementation can swap without breaking consumers. Decision is taken inside this PBI's implementation phase with the microbench as the oracle. + +**Risk 2: Overrun hook called with record that holds a reference to caller-mutable state** +- *Risk*: Producer mutates `record.payload` after `enqueue`; the overrun closure sees the mutated value. +- *Mitigation*: `FdrRecord` is `@frozen` (per AZ-272 contract); the contract test verifies a producer cannot legally mutate a constructed record. Documented in the contract `Invariants`. + +**Risk 3: Cached FdrClient leaks across test cases** +- *Risk*: A pytest test mutates the module-level cache; subsequent tests get a stale FdrClient. +- *Mitigation*: A `_reset_for_tests()` private function (documented as test-only in the contract `Non-Goals`) clears the cache; integration test fixture calls it on teardown. + +## Runtime Completeness + +- **Named capability**: lock-free SPSC ring buffer + `FdrClient` public API (architecture / E-CC-FDR-CLIENT / AC-NEW-3, NFR `enqueue` p99 ≤ 5 µs). +- **Production code that must exist**: real lock-free SPSC primitive (no Python `queue.Queue`, no lock-acquiring fallback); real allocation-free hot path; real `on_overrun` hook plumbing. +- **Allowed external stubs**: none — the queue is the production runtime capability. +- **Unacceptable substitutes**: `queue.Queue`, `threading.Lock`-guarded list, `collections.deque` with a lock, "for now we just `time.sleep(0)` on overrun", or any implementation that allocates on the steady-state path. These would all silently break AC-NEW-3 the moment the writer thread stalls for >100 ms. + +## Contract + +This task produces the contract at `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md`. +Consumers MUST read that file — not this task spec — to discover the interface. 
diff --git a/_docs/02_tasks/todo/AZ-274_fdr_overrun_emission.md b/_docs/02_tasks/todo/AZ-274_fdr_overrun_emission.md new file mode 100644 index 0000000..22e3608 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-274_fdr_overrun_emission.md @@ -0,0 +1,125 @@ +# Drop-Oldest Policy + `kind="overrun"` Record Emission + +**Task**: AZ-274_fdr_overrun_emission +**Name**: FDR Overrun Policy +**Description**: Wire the producer-side overrun policy on top of the FdrClient ring buffer. When a producer's enqueue would overflow, the policy drops the OLDEST queued record from that producer's buffer to make room for the new record AND synthesises a `FdrRecord(kind="overrun", payload={producer_id, dropped_count})` that lands on the same queue. This is the production-side enforcement of AC-NEW-3 ("no silent drops"). +**Complexity**: 2 points +**Dependencies**: AZ-272_fdr_record_schema, AZ-273_fdr_client_ringbuf +**Component**: shared.fdr_client (cross-cutting; epic AZ-247 / E-CC-FDR-CLIENT) +**Tracker**: AZ-274 +**Epic**: AZ-247 (E-CC-FDR-CLIENT) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — defines the canonical shape of `kind="overrun"` records (consumed: `payload.producer_id` + `payload.dropped_count`). +- `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md` — defines the `on_overrun` hook this task implements + the "exactly-once" invariant. + +## Problem + +AZ-273 (FdrClient ring buffer) leaves the `on_overrun` hook unwired by default. In production, an unwired hook means the buffer silently drops `OVERRUN` events — directly violating AC-NEW-3 and breaking C13's invariant that every dropped record is recoverable from a `kind="overrun"` record on the FDR. This task closes that gap by providing the canonical drop-oldest hook and registering it via the composition root for every onboard producer. 
+ +## Outcome + +- A single, contract-frozen drop-oldest hook is the only `on_overrun` callable any production FdrClient is wired to. Tests MAY substitute their own. +- For every burst that exceeds capacity, a coalesced `kind="overrun"` record is enqueued on the SAME producer's buffer carrying the originating producer's slug + `dropped_count` reflecting how many records were dropped in the burst (coalescing keeps the overrun record from itself triggering further overruns when bursts are sustained). +- The composition root wires the hook on every FdrClient created via `make_fdr_client` — consumers (component code) do not interact with the hook directly. + +## Scope + +### Included + +- A `default_overrun_policy(client: FdrClient) -> Callable[[FdrRecord], None]` factory that returns the canonical drop-oldest closure for the given client. +- Drop-oldest semantics: when `enqueue` returns `OVERRUN`, the closure pops one record from the buffer's tail (oldest), discards it, retries the new record's enqueue (one retry only), and arranges for a `kind="overrun"` record to land on the same buffer. If the retry also fails, the policy logs an ERROR via the shared logger (`kind="fdr.overrun_retry_failed"`) — this is rare; it implies the consumer is making zero progress. +- Coalescing: while a burst of consecutive overruns is in flight (consecutive `OVERRUN` returns within the same producer "tick"), the policy increments `dropped_count` on the in-flight overrun record instead of synthesising a new one per drop. The overrun record itself is enqueued at the END of the burst (next successful `enqueue` slot). +- Composition-root wiring: `make_fdr_client` is updated (or a new `wire_fdr_client_overrun(client)` helper is exposed and called inside `make_fdr_client`) so every production FdrClient is constructed with this policy attached. Tests that explicitly construct `FdrClient(...)` directly opt out by leaving `on_overrun` as `None`. 
+- Diagnostic ERROR log only when the retry-after-drop also fails (NOT on every overrun — overruns are normal under bursty load and would flood the log). + +### Excluded + +- The buffer itself, the `on_overrun` hook plumbing, and the SPSC contract — owned by AZ-273. +- The `FdrRecord` schema and the `kind="overrun"` payload definition — owned by AZ-272. +- The C13 writer thread's behaviour upon receiving an `overrun` record (it just logs it like any other record) — owned by E-C13 (AZ-248). +- `FakeFdrSink` — owned by the next PBI in this epic. + +## Acceptance Criteria + +**AC-1: Drop-oldest produces canonical overrun record** +Given an FdrClient with capacity 4 wired with `default_overrun_policy`, fully buffered with 4 user records +When the producer calls `enqueue` for a 5th record +Then the consumer side observes (in order): the 5th user record, then a `kind="overrun"` record whose `payload.producer_id` matches the originating producer and `payload.dropped_count == 1` + +**AC-2: Coalescing across a burst** +Given an FdrClient with capacity 4, fully buffered, and the consumer is stalled +When the producer calls `enqueue` 10 times in a row (8 of them overrun) +Then exactly ONE `kind="overrun"` record is emitted at the end of the burst with `payload.dropped_count == 8` + +**AC-3: Overrun record carries originating producer_id** +Given an FdrClient(producer_id="c1_vio") wired with the default policy +When the buffer overruns +Then the emitted overrun record has `payload.producer_id == "c1_vio"` (NOT `"shared.fdr_client"` — the OUTER envelope's `producer_id` may be `"shared.fdr_client"` per the schema contract, but the payload identifies the originating producer) + +**AC-4: Composition root wires every FdrClient** +Given a production process initialised via `compose_root(config)` +When the test inspects every constructed `FdrClient` in the resulting `RuntimeRoot` +Then every client has a non-None `on_overrun` set to a callable from `default_overrun_policy` + +**AC-5: 
Retry-after-drop failure logs ERROR** +Given a contrived test that monkey-patches the buffer so retry-after-drop ALSO returns `OVERRUN` (simulating a frozen consumer mid-policy) +When an overrun is triggered +Then exactly one ERROR log record is emitted with `kind="fdr.overrun_retry_failed"`; the policy does not loop indefinitely; the overrun record is dropped (test asserts no overrun record on the buffer in this pathological case) + +**AC-6: No log flood under sustained overruns** +Given an FdrClient under sustained overrun (1000 consecutive overruns) +When the policy runs +Then the shared logger receives at most 1 ERROR record per second related to overruns (rate cap on the diagnostic log; the FDR record itself is the canonical record of overruns) + +## Non-Functional Requirements + +**Performance** +- Steady-state overhead: when `on_overrun` is set but the buffer is NOT full (so the hook is never invoked), `enqueue` overhead from this PBI's wiring is ≤ 0.5 µs (effectively a single null-check per call). The 5 µs `enqueue` p99 budget MUST still hold. +- Overrun path overhead: the drop-oldest + retry sequence completes within 20 µs p99 on Tier-2 (it runs only on the cold path; cold-path budget is generous). + +**Reliability** +- The policy NEVER loops indefinitely on retry. One retry only; then ERROR-log + drop. +- The policy NEVER raises into the producer's `enqueue` caller. Any exception inside the closure is logged via `kind="fdr.overrun_policy_error"` and swallowed; the producer's hot path stays clean. 
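One possible reading of the drop-oldest and coalescing rules, sketched single-threaded. `TinyBuffer` and `DropOldestProducer` are hypothetical stand-ins, records are plain dicts rather than `FdrRecord` instances, and logging, rate limiting and the real `on_overrun` hook wiring are left out. A pending overrun record is flushed at the start of the next `enqueue` if a slot is free, or explicitly through `flush_pending()`.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


class TinyBuffer:
    """Bounded FIFO standing in for the real ring buffer (illustration only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: List[Record] = []

    def try_put(self, record: Record) -> bool:
        if len(self.items) >= self.capacity:
            return False
        self.items.append(record)
        return True

    def pop_oldest(self) -> Optional[Record]:
        return self.items.pop(0) if self.items else None


class DropOldestProducer:
    def __init__(self, producer_id: str, buffer: TinyBuffer) -> None:
        self.producer_id = producer_id
        self.buffer = buffer
        self.pending_drops = 0  # drops not yet reported on the buffer

    def flush_pending(self) -> None:
        """Emit one coalesced overrun record if a slot is free."""
        if self.pending_drops == 0:
            return
        overrun = {
            "producer_id": "shared.fdr_client",  # outer envelope
            "kind": "overrun",
            "payload": {
                "producer_id": self.producer_id,  # originating producer
                "dropped_count": self.pending_drops,
            },
        }
        if self.buffer.try_put(overrun):
            self.pending_drops = 0

    def enqueue(self, record: Record) -> bool:
        self.flush_pending()  # a free slot marks the end of a burst
        if self.buffer.try_put(record):
            return True
        dropped = self.buffer.pop_oldest()
        if dropped is not None:
            if dropped.get("kind") == "overrun":
                # Fold an evicted overrun record back in so no count is lost.
                self.pending_drops += dropped["payload"]["dropped_count"]
            else:
                self.pending_drops += 1
        # One retry only; the real policy logs an ERROR if this also fails.
        return self.buffer.try_put(record)
```

With a capacity of 4, twelve consecutive enqueues against a stalled consumer leave `pending_drops == 8` and no overrun record on the buffer. Once the consumer frees one slot, a single record with `dropped_count == 8` is emitted.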
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Capacity-4 buffer fully filled, then 5th enqueue with `default_overrun_policy` | Consumer sees 5th record + canonical overrun record (`dropped_count == 1`) | +| AC-2 | 10 consecutive overruns in one burst | Exactly one overrun record with `dropped_count == 8` | +| AC-3 | Overrun on FdrClient(producer_id="c1_vio") | Emitted overrun record `payload.producer_id == "c1_vio"` | +| AC-4 | Boot a stub composition root with 3 producers; inspect all FdrClients | Every client has `on_overrun != None` | +| AC-5 | Monkey-patched retry-after-drop also fails | Exactly one ERROR log; no overrun record on buffer; no infinite loop | +| AC-6 | 1000 consecutive overruns | Logger receives ≤ 1 ERROR/sec related to overruns | +| NFR-perf-steady | Microbench `enqueue` with hook set but not invoked | p99 overhead ≤ 0.5 µs vs unhooked | +| NFR-perf-overrun | Microbench drop-oldest + retry sequence | p99 ≤ 20 µs | +| NFR-reliability | Inject an exception into the closure; trigger overrun | Producer call returns normally; ERROR logged | + +## Constraints + +- The policy plugs into AZ-273's `on_overrun` hook ONLY — no other extension point. Behavioural deviation requires a new contract. +- Coalescing window is bounded by "until the next successful enqueue" — NOT by wall-clock time. Rationale: the buffer is the only synchronisation point; the writer thread drains it; once it drains one slot, the producer's next enqueue succeeds and that is the natural emission point for the overrun record. +- The overrun record's OUTER envelope `producer_id` is `"shared.fdr_client"` (per schema contract); the originating producer's slug is in `payload.producer_id`. + +## Risks & Mitigation + +**Risk 1: Overrun record itself causes another overrun** +- *Risk*: At the moment of overflow, enqueueing the synthesised overrun record might also fail. 
+- *Mitigation*: The drop-oldest sequence is "drop one → retry the user record → if successful, then enqueue the overrun record at the next slot the consumer drains". The overrun record is emitted at the END of the burst, on a slot known to be free. If the buffer is so degenerate that one drop is insufficient, the AC-5 ERROR-log path catches it. + +**Risk 2: Coalescing hides individual overruns under steady degradation** +- *Risk*: A long-stalled consumer produces one `dropped_count=10000` record at flush time; tooling cannot reconstruct fine-grained timing. +- *Mitigation*: The coalescing scope is "consecutive overruns until next successful enqueue". As soon as the consumer drains one slot, the overrun record is emitted with the count up to that point. Tooling can correlate against the drained record's `ts` to reconstruct timing windows. Documented in the schema contract's invariants. + +**Risk 3: Composition-root wiring drift** +- *Risk*: A future component constructs `FdrClient(...)` directly instead of using `make_fdr_client(...)`, ending up with `on_overrun = None` and silent drops in production. +- *Mitigation*: AC-4's contract test scans the constructed `RuntimeRoot` for any FdrClient with `on_overrun is None` and fails. Documented as a code-review Phase 2 (Spec Compliance) check tied to the fdr_client_protocol contract. + +## Runtime Completeness + +- **Named capability**: drop-oldest + `kind="overrun"` record emission policy (architecture / E-CC-FDR-CLIENT / AC-NEW-3). +- **Production code that must exist**: real drop-oldest closure, real overrun-record synthesis, real composition-root wiring of every producer. +- **Allowed external stubs**: tests MAY replace `on_overrun` with a recording closure; production wiring MUST NOT. 
+- **Unacceptable substitutes**: `pass` as the hook ("for now we just log a warning"), in-memory counter without record emission ("we'll add the record later"), or relying on the C13 writer to synthesise overrun records (it cannot — only the producer side knows the burst boundary). diff --git a/_docs/02_tasks/todo/AZ-275_fake_fdr_sink.md b/_docs/02_tasks/todo/AZ-275_fake_fdr_sink.md new file mode 100644 index 0000000..d742c2f --- /dev/null +++ b/_docs/02_tasks/todo/AZ-275_fake_fdr_sink.md @@ -0,0 +1,128 @@ +# FakeFdrSink for Component-Level Tests + +**Task**: AZ-275_fake_fdr_sink +**Name**: FakeFdrSink +**Description**: An in-process, in-memory test double for `FdrClient` that conforms to the `fdr_client_protocol` contract's public surface and lets component-level tests assert on every record their code emits to the FDR. Drop-in replacement for `FdrClient` everywhere it is injected; no writer thread, no segment files, no real ring buffer — just a list-of-records the test inspects. +**Complexity**: 2 points +**Dependencies**: AZ-272_fdr_record_schema, AZ-273_fdr_client_ringbuf +**Component**: shared.fdr_client (cross-cutting; epic AZ-247 / E-CC-FDR-CLIENT) +**Tracker**: AZ-275 +**Epic**: AZ-247 (E-CC-FDR-CLIENT) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md` — the public surface this fake conforms to. +- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — the record envelope this fake stores in memory. + +## Problem + +Component-level tests (every component under `tests/unit/components//` and `tests/integration//`) must assert on what their code writes to the FDR. Without a fake: +- Tests would have to spin up the C13 writer thread + a tmp segment file just to read records back — slow, brittle, cross-component coupling. +- Tests would all reach into `FdrClient`'s private buffer state, freezing internal layout into every test and blocking future implementation changes. 
+ +A simple, contract-conforming `FakeFdrSink` lets each component's test assert on records via a stable public API — and crucially, the same API every other component test uses, so test infrastructure does not fork per component. + +## Outcome + +- Tests inject `FakeFdrSink(producer_id="c1_vio")` wherever production code expects an `FdrClient`. The component code is unchanged; the test reads `sink.records` after exercising the component. +- Every assertion the contract test of `fdr_client_protocol` makes against a real `FdrClient` ALSO holds against `FakeFdrSink` — except the lock-free / allocation-free / SPSC-guard NFRs (those are real-buffer concerns and are explicitly out of scope for the fake). +- Tests can opt in to drop-oldest semantics (`FakeFdrSink(capacity=N, with_default_overrun_policy=True)`) when verifying overrun behaviour, or leave it disabled and rely on unbounded list mode for general assertions. + +## Scope + +### Included + +- `FakeFdrSink(producer_id: str, capacity: int | None = None, with_default_overrun_policy: bool = False)` constructor implementing the `FdrClient` public surface from `fdr_client_protocol.md`: + - `enqueue`, `pop_one`, `drain`, `flush`, `producer_id`, `on_overrun` getter/setter. +- A `FakeFdrSink.records: list[FdrRecord]` property returning the records currently in-buffer in FIFO order. Tests use this directly for assertions. +- A `FakeFdrSink.all_records_ever: list[FdrRecord]` property returning every record ever enqueued, INCLUDING records dropped by the overrun policy when it is active. This lets tests assert on what the producer TRIED to send vs. what the buffer KEPT. +- Behaviour parity with `FdrClient` for the contract-relevant subset: + - Returns `EnqueueResult.OVERRUN` when `capacity` is set and the buffer is full. + - Invokes `on_overrun` exactly once per overrun event when wired. + - Stamps `producer_id` correctly per the protocol (does NOT mutate `record.producer_id`).
+- A pytest fixture (`fake_fdr_sink`) under `tests/conftest.py` that constructs a default-configuration sink and yields it to tests. + +### Excluded + +- The lock-free SPSC ring buffer, allocation-free hot path, and SPSC guards — owned by AZ-273 (this is a fake; real concurrency primitives are explicitly NOT replicated). +- The drop-oldest closure itself — owned by AZ-274; the fake imports and reuses it when the user opts in via `with_default_overrun_policy=True`. +- The `FdrRecord` schema — owned by AZ-272. +- The C13 writer thread, segment files, etc. — owned by E-C13 (AZ-248). +- A "fake C13 writer" that drains the sink — out of scope. Tests that need the drained side use `pop_one` / `drain` directly on the fake. + +## Acceptance Criteria + +**AC-1: Drop-in for FdrClient public surface** +Given any production code that takes an `FdrClient` parameter (e.g. `Vio(fdr=fdr_client, ...)`) +When the test passes a `FakeFdrSink` instead +Then the production code's calls (`enqueue`, `flush`) work identically; no AttributeError, no signature mismatch + +**AC-2: records reflects in-buffer state** +Given a `FakeFdrSink` with no capacity limit +When the producer enqueues 3 records, then the test calls `pop_one()` once +Then `sink.records` returns the 2 remaining records in FIFO order + +**AC-3: all_records_ever captures dropped records** +Given a `FakeFdrSink(capacity=2, with_default_overrun_policy=True)` filled to capacity +When the producer enqueues a 3rd record (drop-oldest fires) +Then `sink.records` has 2 entries (newest 2) AND `sink.all_records_ever` has 3 entries (all of them, including the dropped one) + +**AC-4: Overrun policy parity with real FdrClient** +Given a `FakeFdrSink(capacity=4, with_default_overrun_policy=True)` +When the test reproduces AC-1 from AZ-274 (overflow + canonical overrun record) +Then the same assertion that holds against real `FdrClient` holds against `FakeFdrSink` — same overrun record shape, same coalescing across bursts + +**AC-5: pytest 
fixture available** +Given a test file imports the standard project conftest +When the test signature is `def test_x(fake_fdr_sink): ...` +Then pytest injects a default-configuration `FakeFdrSink` and yields it; teardown clears the sink + +**AC-6: producer_id is preserved** +Given `FakeFdrSink(producer_id="c2_vpr")` and an enqueued record carrying `producer_id="c2_vpr"` +When the test inspects `sink.records[0]` +Then `records[0].producer_id == "c2_vpr"` (the fake does NOT rewrite producer_id) + +## Non-Functional Requirements + +**Performance** +- `enqueue` p99 ≤ 100 µs on Tier-2 (developer machines + CI). The fake is not in the production critical path; the budget exists only to keep tests fast (10k assertions in a long fixture should add < 1 s). + +**Reliability** +- The fake is single-threaded only. Concurrent `enqueue` / `pop_one` is undefined behaviour and not tested. Documented in the docstring. + +**Compatibility** +- The fake's public surface mirrors the `fdr_client_protocol.md` contract version it conforms to. The fake's docstring records the contract version. Bumping the protocol contract major version requires bumping the fake's surface in lock-step. 
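
A minimal sketch of the fake described in this task, under stated assumptions: the `FdrRecord` fields, the `EnqueueResult.OK` member name, and the one-argument `on_overrun(record)` hook shape are hypothetical (the real definitions belong to AZ-272 and AZ-273). The `with_default_overrun_policy` flag is left out because the AZ-274 closure must be reused, not re-implemented.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, List, Optional


class EnqueueResult(Enum):
    OK = auto()       # hypothetical name for the success member
    OVERRUN = auto()  # named in this spec


@dataclass(frozen=True)
class FdrRecord:
    """Hypothetical minimal envelope; the real schema is owned by AZ-272."""
    producer_id: str
    kind: str


class FakeFdrSink:
    """Single-threaded, in-memory test double (sketch only)."""

    def __init__(self, producer_id: str, capacity: Optional[int] = None):
        self.producer_id = producer_id
        # Plain attribute standing in for the on_overrun getter/setter.
        self.on_overrun: Optional[Callable[[FdrRecord], None]] = None
        self._capacity = capacity
        self._buffer: List[FdrRecord] = []
        self._ever: List[FdrRecord] = []

    def enqueue(self, record: FdrRecord) -> EnqueueResult:
        self._ever.append(record)  # remember the attempt even if it is refused
        if self._capacity is not None and len(self._buffer) >= self._capacity:
            if self.on_overrun is not None:
                self.on_overrun(record)  # exactly once per overrun event
            return EnqueueResult.OVERRUN
        self._buffer.append(record)  # stored as-is; producer_id is not rewritten
        return EnqueueResult.OK

    def pop_one(self) -> Optional[FdrRecord]:
        return self._buffer.pop(0) if self._buffer else None

    def drain(self) -> List[FdrRecord]:
        drained, self._buffer = self._buffer, []
        return drained

    def flush(self) -> None:
        pass  # nothing to flush: there is no writer thread behind the fake

    @property
    def records(self) -> List[FdrRecord]:
        return list(self._buffer)  # copy, so tests cannot mutate internal state

    @property
    def all_records_ever(self) -> List[FdrRecord]:
        return list(self._ever)
```

Returning copies from `records` and `all_records_ever` is a design choice of this sketch: it keeps test assertions from accidentally editing the sink's state.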
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Inject `FakeFdrSink` into a stub component that expects `FdrClient` | No AttributeError; calls succeed | +| AC-2 | 3 enqueues + 1 pop on unbounded sink | `len(sink.records) == 2` in FIFO order | +| AC-3 | Capacity-2 sink with overrun policy + 3 enqueues | `len(sink.records) == 2`, `len(sink.all_records_ever) == 3` | +| AC-4 | Re-run AZ-274 AC-1 + AC-2 against the fake | Same overrun record shape; same coalescing | +| AC-5 | A trivial test using `fake_fdr_sink` fixture | Fixture provides a clean sink per test | +| AC-6 | Construct sink + enqueue with explicit producer_id | producer_id preserved on the popped record | + +## Constraints + +- Public surface is fixed by `fdr_client_protocol.md` v1.0.0. The fake is allowed to expose ADDITIONAL test-only attributes (`records`, `all_records_ever`) — these are documented as fake-only and never appear on the real `FdrClient` (so production code accidentally using them fails the type checker). +- The fake lives at `src/gps_denied_onboard/fdr_client/fakes.py` — a separate module from the production code so production imports never pick it up. Tests import `from gps_denied_onboard.fdr_client.fakes import FakeFdrSink`. +- The fake reuses `default_overrun_policy` from AZ-274 verbatim; it does NOT re-implement the policy. + +## Risks & Mitigation + +**Risk 1: Fake drift from real client** +- *Risk*: Engineers add a method to `FdrClient` and forget to mirror it on `FakeFdrSink`; tests pass against the fake but production fails. +- *Mitigation*: A contract test (`tests/contract/fdr_client_fake_parity.py`) iterates over every public method on `FdrClient` and asserts the same method exists on `FakeFdrSink` with a compatible signature. Failure mode is loud and immediate. 
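
The parity check in the Risk 1 mitigation could look roughly like the sketch below. It is an illustration, not the contents of `tests/contract/fdr_client_fake_parity.py`: the helper name `missing_or_mismatched` is hypothetical, and "compatible signature" is simplified here to "same parameter names in the same order".

```python
import inspect


def missing_or_mismatched(real_cls, fake_cls):
    """Return a list of human-readable parity problems (empty means parity)."""
    problems = []
    for name, member in inspect.getmembers(real_cls, callable):
        if name.startswith("_"):
            continue  # only the public surface is part of the contract
        fake_member = getattr(fake_cls, name, None)
        if fake_member is None or not callable(fake_member):
            problems.append(f"{name}: missing on {fake_cls.__name__}")
            continue
        real_params = list(inspect.signature(member).parameters)
        fake_params = list(inspect.signature(fake_member).parameters)
        if real_params != fake_params:
            problems.append(f"{name}: parameters {fake_params} != {real_params}")
    return problems
```

A contract test would then assert that `missing_or_mismatched(FdrClient, FakeFdrSink)` is empty. Properties are not callable and so are not covered by this sketch; a fuller check would compare them separately.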
+ +**Risk 2: Tests reach into `_buffer` private state, freezing implementation** +- *Risk*: A test does `sink._buffer[3]` instead of `sink.records[3]`; a later refactor breaks the test. +- *Mitigation*: `records` and `all_records_ever` are the documented public accessors; pyright/mypy mark `_buffer` as private via its `_` prefix; code review catches private-state access. + +## Runtime Completeness + +- **Named capability**: `FakeFdrSink` test double — it is NOT a runtime capability; it is test infrastructure. Production code MUST NOT import from `fakes.py` (verified by import-linter rule in the project's `pyproject.toml`). +- **Production code that must exist**: import-linter rule preventing `src/gps_denied_onboard/**/*.py` (excluding `tests/`) from importing `gps_denied_onboard.fdr_client.fakes`. Otherwise none — this PBI's deliverable is test infrastructure. +- **Allowed external stubs**: this IS the stub. It is allowed in tests only. +- **Unacceptable substitutes**: production code wiring `FakeFdrSink` instead of `FdrClient` (would silently disable real FDR writes); per-test ad-hoc fakes that drift from the contract. diff --git a/_docs/02_tasks/todo/AZ-276_imu_preintegrator.md b/_docs/02_tasks/todo/AZ-276_imu_preintegrator.md new file mode 100644 index 0000000..b37cbfe --- /dev/null +++ b/_docs/02_tasks/todo/AZ-276_imu_preintegrator.md @@ -0,0 +1,138 @@ +# ImuPreintegrator Helper Module + +**Task**: AZ-276_imu_preintegrator +**Name**: ImuPreintegrator Helper +**Description**: Implement the shared `ImuPreintegrator` helper that wraps GTSAM's `PreintegrationCombinedParams` + `PreintegratedCombinedMeasurements` so C1 (VIO) and C5 (StateEstimator) consume one canonical preintegration of every FC IMU window. Single-threaded by design; one instance per writer thread, bound by the composition root. Bias drift remains the consumer's responsibility.
+**Complexity**: 2 points +**Dependencies**: AZ-263_initial_structure +**Component**: shared.helpers.imu_preintegrator (cross-cutting; epic AZ-264 / E-CC-HELPERS) +**Tracker**: AZ-276 +**Epic**: AZ-264 (E-CC-HELPERS) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_helpers/imu_preintegrator.md` — frozen public interface this task produces. +- `_docs/02_document/common-helpers/01_helper_imu_preintegrator.md` — design rationale and consumer mapping. + +## Problem + +C1's VIO loop and C5's state estimator both consume the same FC IMU window every keyframe. Without a shared preintegrator: +- They drift into two slightly-different integrations of the same physical IMU stream (different sample-rejection rules, different bias-application order). +- The GTSAM `CombinedImuFactor` shape that goes into C5's iSAM2 graph diverges from the one C1 uses for its own pose update, breaking the "single source of IMU truth" invariant in `solution.md`. +- Per-deployment IMU noise covariances (which live in `CameraCalibration`) get parsed twice, with subtle unit differences. + +## Outcome + +- A single `ImuPreintegrator` is the only path through which any onboard process integrates IMU samples for a GTSAM `CombinedImuFactor`. C1 and C5 import it; nothing else does. +- The composition root binds ONE instance per writer thread; the helper's contract test confirms it does not acquire any locks (so no surprise serialisation under load). +- Sample monotonicity is enforced — non-monotonic samples raise `ImuPreintegrationError` before any state is mutated. +- Re-bias is explicit: `reset_with_bias` is called by consumers when their bias estimate changes; the helper never re-estimates bias internally. + +## Scope + +### Included + +- `ImuPreintegrator` class + factory `make_imu_preintegrator(calibration: CameraCalibration) -> ImuPreintegrator`. +- Integration entrypoints: `integrate_sample(ImuSample)`, `integrate_window(ImuWindow)`. 
+- Factor accessors: `current_preintegration() -> CombinedImuFactor`, `reset_for_new_keyframe() -> CombinedImuFactor` (destructive). +- Bias control: `reset_with_bias(ImuBias) -> None`. +- `ImuPreintegrationError` exception type re-exported alongside the helper. +- Re-export of GTSAM's `CombinedImuFactor` (or a thin alias) so consumers do not import GTSAM directly. +- Public interface contract published at `_docs/02_document/contracts/shared_helpers/imu_preintegrator.md`. + +### Excluded + +- IMU sample acquisition / FC adapter integration — C8. +- Bias estimation / re-bias logic — C1, C5. +- Multi-threaded sample feeding — out of contract; helper is single-thread by design. +- Serialising preintegrated factors to FDR records — C13. +- The ImuSample / ImuWindow / ImuBias DTOs themselves — owned by `_types/nav.py` (AZ-263). + +## Acceptance Criteria + +**AC-1: Round-trip preintegration** +Given a synthetic IMU sequence of 100 samples with strictly-monotonic `ts_ns` +When the producer calls `integrate_sample` 100 times then `current_preintegration()` +Then a `CombinedImuFactor` is returned whose `deltaTij` equals the time span and whose `delta_pose` is non-zero + +**AC-2: Strict monotonicity rejects non-monotonic samples** +Given a preintegrator with the last integrated sample at `ts_ns = T` +When `integrate_sample(sample)` is called with `sample.ts_ns <= T` +Then `ImuPreintegrationError` is raised AND the preintegrator's internal accumulators are unchanged (a subsequent valid sample integrates as if the bad one never came) + +**AC-3: `reset_for_new_keyframe` is destructive** +Given a preintegrator with N integrated samples +When `reset_for_new_keyframe()` is called +Then the returned factor reflects all N samples AND a subsequent `current_preintegration()` (with no further samples) raises `ImuPreintegrationError` + +**AC-4: Re-bias affects subsequent samples only** +Given a sequence: `reset_with_bias(bias_a)`, integrate 50 samples, `reset_with_bias(bias_b)`, 
integrate 50 more +When `current_preintegration()` is called +Then the resulting factor reflects bias_a applied to samples 1–50 and bias_b applied to samples 51–100 (not bias_b retroactively) + +**AC-5: Determinism** +Given two instances constructed from the same calibration and fed the same `(bias, samples)` sequence +When both call `current_preintegration()` +Then the outputs are deep-equal + +**AC-6: Single-threaded, lock-free** +Given the helper's source code +When inspected by the contract test (static analysis OR runtime reflection) +Then no `threading.Lock`, `RLock`, `Semaphore`, or `mutex` is acquired anywhere in the integration path + +**AC-7: No upward imports (Layer 1 invariant)** +Given the helper module +When a static-import check runs across `gps_denied_onboard.helpers.imu_preintegrator` +Then it imports ONLY from `_types`, GTSAM, numpy, and stdlib — no `gps_denied_onboard.components.*` imports anywhere + +## Non-Functional Requirements + +**Performance** +- `integrate_sample` p99 ≤ 200 µs on Tier-2 (Jetson Orin Nano Super) — overhead vs. inline GTSAM PIM ≤ 5 % (per E-CC-HELPERS hot-path NFR). +- `current_preintegration` p99 ≤ 100 µs on the same hardware. + +**Reliability** +- Pure deterministic: same inputs → byte-equal `CombinedImuFactor` outputs. +- `ImuPreintegrationError` is the ONLY exception type the public surface raises on schema/timestamp violation; GTSAM's lower-level exceptions MUST be wrapped. 
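
The "reject before mutating" behaviour from AC-2 can be sketched without GTSAM. The class name `MonotonicGuard` and its `accept` method are hypothetical; only `ImuPreintegrationError` and the `ts_ns` field are named in this spec, and putting both timestamps in the message is an assumption of this sketch.

```python
class ImuPreintegrationError(Exception):
    """Stand-in for the helper's public exception type."""


class MonotonicGuard:
    """Validates sample timestamps before any accumulator is touched."""

    def __init__(self):
        self._last_ts_ns = None
        self.accepted = 0

    def accept(self, ts_ns: int) -> None:
        # Check first; state is only updated after the sample is known good.
        if self._last_ts_ns is not None and ts_ns <= self._last_ts_ns:
            raise ImuPreintegrationError(
                f"non-monotonic IMU sample: ts_ns={ts_ns} "
                f"<= previous ts_ns={self._last_ts_ns}"
            )
        self._last_ts_ns = ts_ns
        self.accepted += 1
```

Because the check runs before any assignment, a rejected sample leaves the guard exactly as it was, so the next valid sample is accepted as if the bad one never arrived.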
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | 100 monotonic samples → `current_preintegration()` | factor `deltaTij` ≈ time span; non-zero `delta_pose` | +| AC-2 | non-monotonic sample injection | `ImuPreintegrationError`; internal state unchanged (next valid sample integrates correctly) | +| AC-3 | `reset_for_new_keyframe` then `current_preintegration` | second call raises `ImuPreintegrationError` (state cleared) | +| AC-4 | re-bias mid-window | resulting factor distinguishes bias_a vs bias_b epochs | +| AC-5 | two instances, same input | deep-equal factor outputs | +| AC-6 | static / runtime lock check | no lock acquisition on the integration path | +| AC-7 | importlinter / grep gate | no `gps_denied_onboard.components.*` imports | +| NFR-perf | microbench `integrate_sample` (10k iterations on Tier-2 fixture) | p99 ≤ 200 µs; overhead ≤ 5 % | + +## Constraints + +- Public surface frozen by `_docs/02_document/contracts/shared_helpers/imu_preintegrator.md` v1.0.0. +- Layer 1 Foundation only (per `module-layout.md` § Allowed Dependencies). NO upward imports. +- GTSAM is the single math backend — do not introduce a second IMU-preintegration library. +- No new dependency beyond what AZ-263 / E-BOOT pinned. + +## Risks & Mitigation + +**Risk 1: Concurrent calls from a misconfigured composition root silently corrupt the GTSAM PIM accumulator** +- *Risk*: Two threads call `integrate_sample` simultaneously; GTSAM's PIM is not thread-safe; numerical drift goes undiagnosed. +- *Mitigation*: Helper is single-threaded by contract; the composition root binds one instance per writer thread. The contract test (AC-6) asserts no internal locking — making it a hard error if a future change tries to "make it thread-safe" instead of fixing the composition. 
+ +**Risk 2: Sample-monotonicity-rejection silently masks an upstream FC clock skew** +- *Risk*: A real IMU stream produces a non-monotonic sample (clock jitter); the helper rejects it; the consumer never learns. +- *Mitigation*: `ImuPreintegrationError` carries the offending vs. previous timestamp in its message so the consumer's catch-and-log path can record it as an FDR `kind="imu.skew"` event. + +## Runtime Completeness + +- **Named capability**: GTSAM `CombinedImuFactor` preintegration via `PreintegrationCombinedParams` + `PreintegratedCombinedMeasurements` (architecture / E-CC-HELPERS / `01_helper_imu_preintegrator.md`). +- **Production code that must exist**: real GTSAM-backed integration; real noise-model parsing from `CameraCalibration`; real strict-monotonic guard. +- **Allowed external stubs**: none — GTSAM is the production runtime. +- **Unacceptable substitutes**: pure-numpy "approximate" preintegration that ignores GTSAM's covariance propagation; deterministic-fallback that returns a zero factor; "for now we just integrate position with Euler" placeholder. Each would silently break C5's iSAM2 covariance honesty (AC-NEW-4). + +## Contract + +This task produces the contract at `_docs/02_document/contracts/shared_helpers/imu_preintegrator.md`. +Consumers MUST read that file — not this task spec — to discover the interface. diff --git a/_docs/02_tasks/todo/AZ-277_se3_utils.md b/_docs/02_tasks/todo/AZ-277_se3_utils.md new file mode 100644 index 0000000..7238447 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-277_se3_utils.md @@ -0,0 +1,152 @@ +# SE3Utils Helper Module + +**Task**: AZ-277_se3_utils +**Name**: SE3Utils Helper +**Description**: Implement the shared `SE3Utils` helper for SE(3) ↔ 4×4-matrix conversion and Lie-algebra exp/log/adjoint, backed by GTSAM `Pose3` primitives. 
Used wherever a consumer needs a 6-vector twist, a Jacobian over an SE(3) operation, or a deterministic conversion between matrix and pose forms — i.e., C1, C2.5, C3, C3.5, C4, C5, C8. Stateless; pure functions; strict caller-orthogonalisation contract. +**Complexity**: 2 points +**Dependencies**: AZ-263_initial_structure +**Component**: shared.helpers.se3_utils (cross-cutting; epic AZ-264 / E-CC-HELPERS) +**Tracker**: AZ-277 +**Epic**: AZ-264 (E-CC-HELPERS) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_helpers/se3_utils.md` — frozen public interface this task produces. +- `_docs/02_document/common-helpers/02_helper_se3_utils.md` — design rationale and consumer mapping. + +## Problem + +Seven components (C1, C2.5, C3, C3.5, C4, C5, C8) need to cross the matrix-vs-pose boundary: +- C4's `solvePnPRansac` returns a 4×4 matrix; C5's iSAM2 graph wants a GTSAM `Pose3`. +- C1's relative-pose update needs `log_map` for covariance recovery. +- C8 encodes pose as a 6-vector for FC adapter emission. + +Without a shared helper: +- Each component re-implements the conversion, drifting on rotation conventions, sign conventions, or near-identity edge cases. +- Subtle differences in `det(R)` validation (some silently re-orthogonalise, others reject) break the "same pose in, same pose out" invariant across components. +- Any future change (e.g., switching from GTSAM `Pose3` to `manifpy`) becomes a 7-place coordinated edit. + +## Outcome + +- A single `helpers.se3_utils` module is the only place that constructs a `Pose3` from a matrix or vice-versa across the codebase. Component imports go through the helper. +- All conversions are pure functions: same input → byte-equal numpy / GTSAM output. +- Strict orthogonal-rotation contract: `matrix_to_se3` rejects non-orthogonal or negative-determinant rotations with `Se3InvalidMatrixError` instead of silently fixing them. Callers are responsible for orthogonalisation; the rejection forces the bug back to the source. 
+- Near-identity Lie-algebra inputs (twist norm < 1e-10) are stable — `exp_map` falls back to the small-angle Taylor expansion documented in GTSAM rather than NaN-ing on `sin(θ)/θ`. + +## Scope + +### Included + +- `matrix_to_se3(T_4x4) -> SE3`, `se3_to_matrix(SE3) -> np.ndarray`. +- `exp_map(xi) -> SE3`, `log_map(SE3) -> np.ndarray`, `adjoint(SE3) -> np.ndarray`. +- `is_valid_rotation(R_3x3, *, atol)` predicate for callers to check before calling `matrix_to_se3`. +- `Se3InvalidMatrixError` exception type. +- Re-export of GTSAM `Pose3` as `SE3` so consumers do not import GTSAM directly. +- Public interface contract published at `_docs/02_document/contracts/shared_helpers/se3_utils.md`. + +### Excluded + +- Quaternion conversions — consumers convert via numpy / GTSAM directly. +- SE(2) helpers — out of scope. +- Pose interpolation / Slerp — out of scope. +- Higher-order manifold ops (parallel transport, composition Jacobians) — out of scope. + +## Acceptance Criteria + +**AC-1: 4×4 ↔ SE3 round-trip** +Given a randomly-sampled valid `T_4x4` (orthogonal rotation, positive determinant, identity bottom row) +When `matrix_to_se3` then `se3_to_matrix` runs +Then the recovered matrix matches the input via `np.allclose(..., atol=1e-9)` + +**AC-2: Lie-algebra round-trip** +Given a random twist `xi` of shape `(6,)` and norm ≈ 1.0 +When `exp_map(xi)` then `log_map(...)` runs +Then the recovered twist matches `xi` via `np.allclose(..., atol=1e-9)` + +**AC-3: Near-identity Lie stability** +Given `xi = [1e-12, 1e-12, 1e-12, 1e-12, 1e-12, 1e-12]` +When `exp_map(xi)` runs +Then the result is the identity pose within `atol=1e-9`; no exception, no NaN + +**AC-4: Strict orthogonality rejection** +Given `T_4x4` whose `R` has `||R^T R - I||_F = 1e-3` +When `matrix_to_se3(T)` runs +Then `Se3InvalidMatrixError` is raised AND the helper does NOT silently re-orthogonalise (the message names the deviation magnitude) + +**AC-5: Mirror rejection** +Given `T_4x4` with `det(R) ≈ -1` +When 
`matrix_to_se3(T)` runs +Then `Se3InvalidMatrixError` is raised mentioning the negative determinant + +**AC-6: Block-layout guard** +Given `T_4x4` with bottom row `[0, 0, 0, 2]` (or any deviation from `[0, 0, 0, 1]`) +When `matrix_to_se3(T)` runs +Then `Se3InvalidMatrixError` is raised mentioning the bottom row + +**AC-7: dtype contract** +Given `T_4x4` with `dtype=float32` +When `matrix_to_se3(T)` runs +Then `Se3InvalidMatrixError` is raised mentioning dtype (helpers operate strictly on `float64`) + +**AC-8: Determinism** +Given the same `T_4x4` (or `xi`) +When converted twice through any helper function +Then both outputs are byte-equal + +**AC-9: No upward imports (Layer 1 invariant)** +Given the helper module +When a static-import check runs +Then it imports ONLY from `_types`, GTSAM, numpy, and stdlib — no `gps_denied_onboard.components.*` imports anywhere + +## Non-Functional Requirements + +**Performance** +- Each helper function p99 ≤ 50 µs on Tier-2 — overhead vs. inline GTSAM ≤ 5 % (per E-CC-HELPERS hot-path NFR). + +**Reliability** +- Pure deterministic; same input → byte-equal output. +- `Se3InvalidMatrixError` is the ONLY exception type the public surface raises on shape / orthogonality / dtype violations. 
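
The rejection rules in AC-4 through AC-7 can be sketched as a numpy-only validator. This is an illustration of the checks, not the helper itself: `validate_transform` is a hypothetical name, the default `atol` is an assumed value, and the GTSAM `Pose3` construction that would follow a successful check is omitted.

```python
import numpy as np


class Se3InvalidMatrixError(Exception):
    """Stand-in for the helper's public exception type."""


def validate_transform(T, *, atol=1e-9):
    """Raise Se3InvalidMatrixError unless T is a well-formed float64 SE(3) matrix."""
    if not isinstance(T, np.ndarray) or T.dtype != np.float64:
        raise Se3InvalidMatrixError(
            f"dtype must be float64, got {getattr(T, 'dtype', type(T))}")
    if T.shape != (4, 4):
        raise Se3InvalidMatrixError(f"shape must be (4, 4), got {T.shape}")
    if not np.array_equal(T[3], np.array([0.0, 0.0, 0.0, 1.0])):
        raise Se3InvalidMatrixError(f"bottom row must be [0, 0, 0, 1], got {T[3]}")
    R = T[:3, :3]
    # Frobenius norm of R^T R - I measures how far R is from orthogonal.
    deviation = np.linalg.norm(R.T @ R - np.eye(3), ord="fro")
    if deviation > atol:
        raise Se3InvalidMatrixError(
            f"rotation is not orthogonal: deviation {deviation:.3e} > atol {atol:.1e}")
    if np.linalg.det(R) < 0:
        raise Se3InvalidMatrixError("rotation has negative determinant (mirror)")
```

The function only raises or returns; it never repairs the input, which matches the strict caller-orthogonalisation stance of this task.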
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | `np.allclose(se3_to_matrix(matrix_to_se3(T)), T)` for 100 random valid `T` | all pass within `atol=1e-9` | +| AC-2 | `np.allclose(log_map(exp_map(xi)), xi)` for 100 random `xi` (norm ≈ 1.0) | all pass within `atol=1e-9` | +| AC-3 | `exp_map([1e-12]*6)` | identity pose within `atol=1e-9`; no NaN | +| AC-4 | non-orthogonal `T` | `Se3InvalidMatrixError`; message names deviation | +| AC-5 | `det(R) = -1` `T` | `Se3InvalidMatrixError`; mentions determinant | +| AC-6 | bottom row `[0, 0, 0, 2]` | `Se3InvalidMatrixError`; mentions bottom row | +| AC-7 | `float32` dtype | `Se3InvalidMatrixError`; mentions dtype | +| AC-8 | call any helper twice with same input | byte-equal outputs | +| AC-9 | static import scan | only `_types`, GTSAM, numpy, stdlib | +| NFR-perf | microbench each helper (10k iterations on Tier-2 fixture) | p99 ≤ 50 µs each | + +## Constraints + +- Public surface frozen by `_docs/02_document/contracts/shared_helpers/se3_utils.md` v1.0.0. +- Layer 1 Foundation only. +- GTSAM is the single math backend; numpy fallback only when GTSAM does not expose the primitive. +- No new dependency beyond what AZ-263 / E-BOOT pinned. + +## Risks & Mitigation + +**Risk 1: Silent re-orthogonalisation hides upstream rotation drift** +- *Risk*: A future change "softens" `matrix_to_se3` to silently re-orthogonalise inputs; consumers no longer learn that their rotation source is producing non-orthogonal matrices. +- *Mitigation*: AC-4 makes strict rejection part of the contract. The contract test enforces that `Se3InvalidMatrixError` is raised, not absorbed. + +**Risk 2: GTSAM API drift between minor versions** +- *Risk*: `Pose3.expmap` signature changes; this helper breaks on a GTSAM upgrade. +- *Mitigation*: GTSAM is pinned in `pyproject.toml` at AZ-263 / E-BOOT; this helper's tests are the canary that detects drift before consumers do. 
+ +## Runtime Completeness + +- **Named capability**: SE(3) ↔ matrix conversion + Lie-algebra exp/log/adjoint via GTSAM `Pose3` primitives (architecture / E-CC-HELPERS / `02_helper_se3_utils.md`). +- **Production code that must exist**: real GTSAM-backed conversions; real strict-orthogonality guard; real small-angle Taylor fallback for near-identity exp. +- **Allowed external stubs**: numpy fallback only where GTSAM does not expose the primitive (e.g., adjoint matrix construction). +- **Unacceptable substitutes**: silent re-orthogonalisation; "for now we just call `np.linalg.logm`" (numerically inferior, no Jacobian); skipping near-identity small-angle handling (NaN risk). + +## Contract + +This task produces the contract at `_docs/02_document/contracts/shared_helpers/se3_utils.md`. +Consumers MUST read that file — not this task spec — to discover the interface. diff --git a/_docs/02_tasks/todo/AZ-278_lightglue_runtime.md b/_docs/02_tasks/todo/AZ-278_lightglue_runtime.md new file mode 100644 index 0000000..bdf17b1 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-278_lightglue_runtime.md @@ -0,0 +1,146 @@ +# LightGlueRuntime Helper Module (R14 fix) + +**Task**: AZ-278_lightglue_runtime +**Name**: LightGlueRuntime Helper +**Description**: Implement the shared `LightGlueRuntime` helper that owns the LightGlue inference engine handle for both C2.5 (single-pair inlier counting) and C3 (heavier matching pass). This is the structural fix for R14 (the original C2.5 ↔ C3 import cycle): the runtime sits at Layer 1 with no `components.*` imports, so the cycle becomes impossible to express. Single CUDA stream; concurrent access forbidden by contract; composition root binds to the single F3 hot-path thread. 
+**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure +**Component**: shared.helpers.lightglue_runtime (cross-cutting; epic AZ-264 / E-CC-HELPERS) +**Tracker**: AZ-278 +**Epic**: AZ-264 (E-CC-HELPERS) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_helpers/lightglue_runtime.md` — frozen public interface this task produces. +- `_docs/02_document/common-helpers/03_helper_lightglue_runtime.md` — design rationale and R14 context. + +## Problem + +C2.5 (Re-rank) and C3 (CrossDomainMatcher) both call LightGlue. In cycle 1 of `_docs/02_document/epics.md`, LightGlue ownership was ambiguous and produced R14: a circular import / runtime dependency between C2.5 and C3 (the "K=10 → N=3 funnel" both wanted to own the engine). Without a shared runtime: +- The engine is built / loaded twice, doubling GPU memory at takeoff (Tier-2 has only 8 GB). +- C2.5 and C3 drift on engine version pinning, producing inconsistent matches. +- Their import cycle is a recurring footgun: any future refactor will tempt one to import from the other. + +## Outcome + +- A single `LightGlueRuntime` instance is constructed once at takeoff by the composition root from C7's `deserialize_engine(LIGHTGLUE_ENGINE_CACHE_ENTRY)` and is constructor-injected into BOTH C2.5 and C3. +- The C2.5 ↔ C3 import cycle is structurally impossible: the runtime lives at Layer 1 (`helpers/`) and imports zero `components.*` modules. Both consumers depend on the helper; neither depends on the other. +- Concurrent access is rejected at runtime by an explicit guard (`LightGlueConcurrentAccessError`), preserving the single-CUDA-stream invariant. The composition root binds the runtime to the single F3 hot-path thread; AC-4 of the contract is the canary that catches future composition-root mistakes. +- The helper exposes no `set_*` / `update_*` methods — once constructed, the runtime's behaviour is fixed. 
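
The explicit guard mentioned in the outcome above can be sketched with a non-blocking lock used as a busy flag, so an overlapping entry is rejected instead of being queued. `SingleEntryGuard` is a hypothetical name; only `LightGlueConcurrentAccessError` comes from this spec.

```python
import threading


class LightGlueConcurrentAccessError(RuntimeError):
    """Stand-in for the helper's public exception type."""


class SingleEntryGuard:
    """Context manager that rejects overlapping entries instead of serialising them."""

    def __init__(self):
        self._busy = threading.Lock()

    def __enter__(self):
        # Non-blocking acquire: fail fast if another call is already inside.
        if not self._busy.acquire(blocking=False):
            raise LightGlueConcurrentAccessError(
                "LightGlueRuntime entered while another match is in flight")
        return self

    def __exit__(self, exc_type, exc, tb):
        self._busy.release()
        return False  # never suppress exceptions raised inside the guarded body
```

In this sketch `match` and `match_batch` would each wrap their body in `with self._guard:`. Note that the guard also rejects re-entry from the same thread, which is stricter than a cross-thread check.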
+ +## Scope + +### Included + +- `LightGlueRuntime(engine_handle: EngineHandle)` constructor. +- `match(features_a: KeypointSet, features_b: KeypointSet) -> CorrespondenceSet` — single-pair path used by C2.5. +- `match_batch(features_a_list, features_b_list) -> list[CorrespondenceSet]` — batch path used by C3. +- `descriptor_dim() -> int` accessor for shape validation upstream of `match`. +- Concurrent-access guard that raises `LightGlueConcurrentAccessError` on overlapping `match` / `match_batch` entries. +- `LightGlueRuntimeError` (construction / dim mismatch) and `LightGlueConcurrentAccessError` (concurrent entry) exception types. +- Public interface contract published at `_docs/02_document/contracts/shared_helpers/lightglue_runtime.md`. + +### Excluded + +- Engine compilation / serialisation — C7. +- Engine filename schema — `helpers.engine_filename_schema` (separate task in this epic). +- Engine cache management / takeoff load — C10. +- Backbone-specific feature extraction (DISK / ALIKED / XFeat) — C3 / C7. +- Multi-GPU / multi-stream / mixed-backbone — out of scope for v1.0.0. +- The `EngineHandle` Protocol itself — owned by `_types/manifests.py` (AZ-263) so Layer 1 can reference it without depending on C7. 
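A minimal sketch of the public surface listed under Included, assuming hypothetical stand-ins for `KeypointSet` and the `EngineHandle` Protocol (the real types live in `_types/` and are not shown in this spec). It illustrates only the construction-time guard, the descriptor-dim check, and the wrapping of engine-internal exceptions; the stub engine and its `forward` return shape are invented for illustration.

```python
from dataclasses import dataclass


class LightGlueRuntimeError(Exception):
    """Construction failure, descriptor-dim mismatch, or wrapped engine error."""


@dataclass(frozen=True)
class KeypointSet:
    # Hypothetical stand-in: the real DTO is owned by `_types/`.
    descriptor_dim: int
    num_keypoints: int


class StubEngine:
    # Hypothetical deterministic engine handle, for tests only.
    descriptor_dim = 128

    def forward(self, a, b):
        n = min(a.num_keypoints, b.num_keypoints)
        return [(i, i) for i in range(n)]


class LightGlueRuntimeSketch:
    def __init__(self, engine_handle):
        if engine_handle is None:
            raise LightGlueRuntimeError("engine_handle must not be None")
        self._engine = engine_handle

    def descriptor_dim(self):
        return self._engine.descriptor_dim

    def match(self, features_a, features_b):
        expected = self.descriptor_dim()
        for name, feats in (("features_a", features_a), ("features_b", features_b)):
            if feats.descriptor_dim != expected:
                raise LightGlueRuntimeError(
                    f"{name}: expected descriptor dim {expected}, "
                    f"got {feats.descriptor_dim}"
                )
        try:
            return self._engine.forward(features_a, features_b)
        except Exception as exc:
            # Engine-internal exceptions never escape the public surface.
            raise LightGlueRuntimeError(f"engine call failed: {exc}") from exc
```

The sketch omits `match_batch` and the concurrent-access guard; both would sit on top of the same validation path.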
+ +## Acceptance Criteria + +**AC-1: Single-pair match (C2.5 path)** +Given a pair of `KeypointSet`s with matching descriptor dim and a synthetic-overlap fixture +When `match(features_a, features_b)` runs +Then a `CorrespondenceSet` is returned with `len > 0` and the inlier-count helper used by C2.5 finds the expected count + +**AC-2: Batch match (C3 path)** +Given three pairs of `KeypointSet`s +When `match_batch([a1, a2, a3], [b1, b2, b3])` runs +Then three `CorrespondenceSet`s are returned in input order; per-pair invariants match the single-pair path + +**AC-3: Descriptor-dim mismatch rejected** +Given features whose `descriptor_dim` does not match the engine's expected dim +When `match` runs +Then `LightGlueRuntimeError` is raised with a message naming both the expected and actual dims + +**AC-4: Concurrent access rejected** +Given two threads call `match` simultaneously on the same `LightGlueRuntime` instance +When the second call enters +Then `LightGlueConcurrentAccessError` is raised in the second thread; the first thread completes normally + +**AC-5: Construction-time guard** +Given `LightGlueRuntime(engine_handle=None)` +When construction runs +Then `LightGlueRuntimeError` is raised mentioning `engine_handle` + +**AC-6: No upward imports — R14 structural fix** +Given the helper module +When a static-import check runs across `gps_denied_onboard.helpers.lightglue_runtime` +Then it imports ONLY from `_types`, numpy, and stdlib — NO imports from `gps_denied_onboard.components.*` (verified by importlinter or grep gate in CI) + +**AC-7: Determinism downstream of the engine** +Given the same `(features_a, features_b)` pair matched twice with the same `engine_handle` +When `match` runs both times +Then both `CorrespondenceSet` outputs are byte-equal (engine determinism is a C7 concern; this AC asserts the helper itself adds no non-determinism) + +## Non-Functional Requirements + +**Performance** +- `match` p99 ≤ 30 ms on Tier-2 with the production DISK+LightGlue 
engine on a typical K=10 candidate pair (matches the per-frame budget for C2.5's K=10 → N=3 funnel). +- Helper-level overhead (excluding the engine call itself) ≤ 100 µs — verified via a benchmark that swaps in a stub engine handle. + +**Reliability** +- `LightGlueRuntimeError` and `LightGlueConcurrentAccessError` are the ONLY exception types the public surface raises. Engine-internal exceptions MUST be wrapped. +- Pure-deterministic given a deterministic engine; the helper itself adds no random state. + +**Concurrency** +- Single-thread by contract. The concurrent-access guard is the runtime invariant detector — any composition-root regression that wires the runtime into multiple threads is caught immediately rather than producing GPU memory corruption. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | single-pair match on synthetic-overlap fixture | non-empty `CorrespondenceSet` | +| AC-2 | batch of 3 pairs | three results in input order; per-pair invariants match AC-1 | +| AC-3 | dim-mismatched features | `LightGlueRuntimeError`; message names expected & actual dims | +| AC-4 | two threads call `match` simultaneously | one succeeds; the second raises `LightGlueConcurrentAccessError` | +| AC-5 | construct with `engine_handle=None` | `LightGlueRuntimeError` | +| AC-6 | importlinter / grep gate over `helpers/lightglue_runtime.py` | no `components.*` imports | +| AC-7 | same pair matched twice | byte-equal outputs (with deterministic stub engine) | +| NFR-perf | microbench `match` overhead with stub engine (10k iterations on Tier-2 fixture) | helper overhead ≤ 100 µs | + +## Constraints + +- Public surface frozen by `_docs/02_document/contracts/shared_helpers/lightglue_runtime.md` v1.0.0. +- Layer 1 Foundation only. NO upward imports — this is the load-bearing constraint for the R14 fix. 
+- The `EngineHandle` Protocol must be defined in `_types/manifests.py` (AZ-263 / E-BOOT) so this helper can reference it without importing C7. If `_types/manifests.py` does not yet define the Protocol surface (`forward(...)`, `descriptor_dim`), this task adds it — that is the only `_types` edit allowed by this task. +- No new dependency beyond what AZ-263 / E-BOOT pinned. + +## Risks & Mitigation + +**Risk 1: Composition root accidentally creates two runtimes (one for C2.5, one for C3)** +- *Risk*: Future composition-root refactor instantiates `LightGlueRuntime` twice; engine memory doubles, behaviour drifts. +- *Mitigation*: The composition-root contract test (E-CC-CONF / AZ-246, AZ-269/AZ-270 in scope) already verifies cardinality of cross-cutting helpers. This task's contract documents that EXACTLY ONE instance is expected; the composition-root validator is the enforcement point. + +**Risk 2: Concurrent-access guard introduces hot-path overhead** +- *Risk*: A naive `threading.Lock` on every `match` call adds 100s of µs. +- *Mitigation*: The guard uses a non-blocking `threading.local()`-style check or a `Lock().acquire(blocking=False)` pattern that simply RAISES on contention rather than serialising callers; the contract is "concurrent calls are a bug", not "serialise concurrent callers". NFR-perf microbench validates the overhead budget. + +**Risk 3: A future backbone needs a different match shape** +- *Risk*: A new feature backbone produces 5-tuple correspondences instead of the current 4-tuple (e.g., adds confidence per match). +- *Mitigation*: The contract version bump path is documented (`Versioning Rules` section). Adding a field is non-breaking IF consumers tolerate the extra field; otherwise it is a major-version contract change with a deprecation pass. 
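The raise-on-contention guard described in Risk 2 could look roughly like the sketch below. It assumes a plain `threading.Lock` acquired with `blocking=False`; the `SingleEntryGuard` class name and its `entered()` context manager are hypothetical and are not part of the frozen contract.

```python
import threading
from contextlib import contextmanager


class LightGlueConcurrentAccessError(RuntimeError):
    """Raised when a second caller enters while a match is in flight."""


class SingleEntryGuard:
    # Hypothetical helper: wraps one non-reentrant lock.
    def __init__(self):
        self._lock = threading.Lock()

    @contextmanager
    def entered(self):
        # Non-blocking acquire: contention is reported, never waited on.
        if not self._lock.acquire(blocking=False):
            raise LightGlueConcurrentAccessError(
                "LightGlueRuntime entered while another call is in flight"
            )
        try:
            yield
        finally:
            self._lock.release()
```

A `match` implementation would wrap its body in `with self._guard.entered():`. Because `threading.Lock` is not reentrant, a nested call from the same thread is rejected as well, which is consistent with treating any overlapping entry as a bug.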
+ +## Runtime Completeness + +- **Named capability**: shared LightGlue inference runtime with single-CUDA-stream guarantee + R14 structural cycle fix (architecture / E-CC-HELPERS / `03_helper_lightglue_runtime.md`). +- **Production code that must exist**: real `EngineHandle`-backed match dispatch; real concurrent-access guard; real descriptor-dim validation. +- **Allowed external stubs**: a deterministic stub `EngineHandle` is allowed in tests (and recommended for AC-7 determinism) but production paths use C7's real engine. +- **Unacceptable substitutes**: bypassing the concurrent-access guard with `threading.Lock` (silently serialising callers); allowing each consumer to construct its own runtime; reintroducing a C2.5 → C3 (or C3 → C2.5) import to "share state". Any of those reintroduces R14. + +## Contract + +This task produces the contract at `_docs/02_document/contracts/shared_helpers/lightglue_runtime.md`. +Consumers MUST read that file — not this task spec — to discover the interface. diff --git a/_docs/02_tasks/todo/AZ-279_wgs_converter.md b/_docs/02_tasks/todo/AZ-279_wgs_converter.md new file mode 100644 index 0000000..e25073d --- /dev/null +++ b/_docs/02_tasks/todo/AZ-279_wgs_converter.md @@ -0,0 +1,153 @@ +# WgsConverter Helper Module + +**Task**: AZ-279_wgs_converter +**Name**: WgsConverter Helper +**Description**: Implement the shared `WgsConverter` helper for WGS84 ↔ local-tangent-plane (ENU) ↔ tile-pixel coordinate conversions, backed by `pyproj`. Used by C4, C5, C6, C8, C10, C11, and C12 — every component that crosses the geographic-vs-local-frame boundary. Stateless static-only design (per `coderule.mdc`); slippy-map tile convention matches `satellite-provider`'s on-disk layout. 
+**Complexity**: 2 points +**Dependencies**: AZ-263_initial_structure +**Component**: shared.helpers.wgs_converter (cross-cutting; epic AZ-264 / E-CC-HELPERS) +**Tracker**: AZ-279 +**Epic**: AZ-264 (E-CC-HELPERS) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_helpers/wgs_converter.md` — frozen public interface this task produces. +- `_docs/02_document/common-helpers/04_helper_wgs_converter.md` — design rationale and consumer mapping. + +## Problem + +Seven components (C4, C5, C6, C8, C10, C11, C12) need to cross the geographic-vs-local-frame boundary: +- C4 compares pose-in-WGS to pose-in-ENU; C5 initialises iSAM2 from a WGS origin. +- C6's tile bbox queries map between lat/lon and tile-pixel `(zoom, x, y)`. +- C8 encodes pose for FC emission; C10 / C11 resolve operator-entered bboxes to tile lists; C12 takes the operator's bbox input. + +Without a shared helper: +- Each component re-derives the WGS84 → ECEF → ENU pipeline; sign conventions (ENU vs NED) drift; altitude treatment (ellipsoidal vs orthometric) diverges. +- Tile-xy conversions go through OSM-style math in some places and Mercator-projection in others, breaking on-disk compatibility with `satellite-provider`'s `{zoom}/{x}/{y}.jpg` layout. +- A future datum or geoid change becomes a 7-place coordinated edit instead of a single helper update. + +## Outcome + +- A single `helpers.wgs_converter` module is the only place that performs WGS84 / ECEF / ENU / tile-xy conversions across the codebase. Component imports go through the helper. +- All conversions are pure static functions: same input → byte-equal output (deep-equal numpy / `LatLonAlt`). +- ENU sign convention is locked to `(east, north, up)` and documented; consumers cannot drift to NED accidentally. +- Slippy-map tile convention matches `satellite-provider`'s on-disk layout — the contract test pins the `(zoom=18, lat=50.45, lon=30.52) → (x, y)` round-trip against a known-good fixture. 
+- Out-of-range inputs (zoom > 22, lat outside Web-Mercator-valid range, ECEF shape mismatch, tile-xy out of `[0, 2^zoom)`) raise `WgsConversionError` rather than silently producing garbage. + +## Scope + +### Included + +- Static methods on `WgsConverter`: `latlonalt_to_ecef`, `ecef_to_latlonalt`, `latlonalt_to_local_enu`, `local_enu_to_latlonalt`, `latlon_to_tile_xy`, `tile_xy_to_latlon_bounds`. +- `WgsConversionError` exception type. +- Public interface contract published at `_docs/02_document/contracts/shared_helpers/wgs_converter.md`. + +### Excluded + +- Datum-shift logic / non-WGS84 datums — out of scope for v1.0.0. +- UTM / MGRS conversions — out of scope. +- Geoid-height corrections (orthometric vs. ellipsoidal altitude) — out of scope; the contract documents that altitude is ellipsoidal. +- Vincenty / great-circle distance helpers — out of scope. +- Body-frame ↔ ECEF rotation transforms — `helpers.se3_utils` + per-deployment `CameraCalibration`. +- The `LatLonAlt` / `BoundingBox` DTOs themselves — owned by `_types/` (AZ-263). 
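For orientation, the slippy-map part of the conversion can be sketched with the standard OSM tile formulas and the range guards named above. This is a stdlib-only illustration, not the `pyproj`-backed helper: the free-function layout, the `(south, west, north, east)` return tuple, and the longitude guard are assumptions, and the real `BoundingBox` DTO is owned by `_types/`.

```python
import math

MAX_ZOOM = 22
MAX_MERCATOR_LAT = 85.0511  # Web-Mercator-valid latitude limit, as quoted in the spec


class WgsConversionError(ValueError):
    pass


def _check_zoom(zoom):
    if not 0 <= zoom <= MAX_ZOOM:
        raise WgsConversionError(f"zoom {zoom} outside supported range [0, {MAX_ZOOM}]")


def latlon_to_tile_xy(zoom, lat, lon):
    _check_zoom(zoom)
    if not -MAX_MERCATOR_LAT <= lat <= MAX_MERCATOR_LAT:
        raise WgsConversionError(
            f"lat {lat} outside Web-Mercator range "
            f"[-{MAX_MERCATOR_LAT}, {MAX_MERCATOR_LAT}]"
        )
    if not -180.0 <= lon <= 180.0:  # assumed guard, not listed in the spec
        raise WgsConversionError(f"lon {lon} outside [-180, 180]")
    n = 1 << zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that lon == 180 and the southern limit stay inside [0, n).
    return min(x, n - 1), min(max(y, 0), n - 1)


def _tile_edge_lat(zoom, y):
    n = 1 << zoom
    return math.degrees(math.atan(math.sinh(math.pi * (1.0 - 2.0 * y / n))))


def tile_xy_to_latlon_bounds(zoom, x, y):
    _check_zoom(zoom)
    n = 1 << zoom
    if not (0 <= x < n and 0 <= y < n):
        raise WgsConversionError(f"tile ({x}, {y}) outside [0, 2^{zoom}) = [0, {n})")
    west = x / n * 360.0 - 180.0
    east = (x + 1) / n * 360.0 - 180.0
    # Tile y grows southwards, so row y is the north edge and y + 1 the south edge.
    return _tile_edge_lat(zoom, y + 1), west, _tile_edge_lat(zoom, y), east
```

Tile row indices increase from north to south in this convention, which is the usual source of sign mistakes when mixing it with ENU coordinates.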
+ +## Acceptance Criteria + +**AC-1: ECEF round-trip** +Given `p = LatLonAlt(50.0, 30.0, 100.0)` +When `ecef_to_latlonalt(latlonalt_to_ecef(p))` runs +Then the returned `LatLonAlt` matches `p` within `atol=1e-9` deg lat/lon and `1e-6` m altitude + +**AC-2: ENU round-trip within 10 km** +Given an `origin` and a `p` ~10 km away +When `local_enu_to_latlonalt(origin, latlonalt_to_local_enu(origin, p))` runs +Then the returned `LatLonAlt` matches `p` within 1 m horizontal + 1 cm vertical + +**AC-3: Slippy-map tile round-trip at z18** +Given `(zoom=18, lat=50.45, lon=30.52)` +When `tile_xy_to_latlon_bounds(zoom, *latlon_to_tile_xy(zoom, lat, lon))` runs +Then the returned bounding box contains the input lat/lon AND the `(x, y)` matches the OSM-pinned fixture for the same coordinates + +**AC-4: Web-Mercator latitude range guard** +Given `lat = 95.0` passed to `latlon_to_tile_xy` +When the call runs +Then `WgsConversionError` is raised mentioning the Web-Mercator-valid range `[-85.0511, 85.0511]` + +**AC-5: Zoom range guard** +Given `zoom = 25` +When `latlon_to_tile_xy` or `tile_xy_to_latlon_bounds` runs +Then `WgsConversionError` is raised mentioning the supported zoom range `[0, 22]` + +**AC-6: Tile-xy range guard** +Given `(zoom=18, x=2^18, y=0)` +When `tile_xy_to_latlon_bounds` runs +Then `WgsConversionError` is raised mentioning the valid `(x, y)` range `[0, 2^zoom)` + +**AC-7: ECEF shape contract** +Given an array of shape `(2,)` passed to `ecef_to_latlonalt` +When the call runs +Then `WgsConversionError` is raised mentioning the expected shape `(3,)` + +**AC-8: Determinism** +Given the same input +When any helper function is called twice +Then both outputs are byte-equal + +**AC-9: No upward imports (Layer 1 invariant)** +Given the helper module +When a static-import check runs +Then it imports ONLY from `_types`, `pyproj`, numpy, and stdlib — no `gps_denied_onboard.components.*` imports anywhere + +## Non-Functional Requirements + +**Performance** +- No specific 
latency budget per `_docs/02_document/common-helpers/04_helper_wgs_converter.md` (consumers are pre-flight / post-landing). Each function p99 ≤ 200 µs on Tier-2 as a sanity bound. + +**Reliability** +- Pure deterministic; same input → byte-equal output. +- `WgsConversionError` is the ONLY exception type the public surface raises on shape / range violations. `pyproj`'s lower-level exceptions MUST be wrapped. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | ECEF round-trip on 100 random valid `LatLonAlt`s | all match within `atol=1e-9` deg + `1e-6` m | +| AC-2 | ENU round-trip on 100 origin/point pairs within 10 km | all match within 1 m + 1 cm | +| AC-3 | Slippy-map round-trip at z18 with OSM-pinned fixture | `(x, y)` matches fixture; bounds contain input | +| AC-4 | `latlon_to_tile_xy(18, 95.0, 0.0)` | `WgsConversionError`; mentions Web-Mercator range | +| AC-5 | `latlon_to_tile_xy(25, 0, 0)` | `WgsConversionError`; mentions zoom range | +| AC-6 | `tile_xy_to_latlon_bounds(18, 2**18, 0)` | `WgsConversionError`; mentions tile-xy range | +| AC-7 | `ecef_to_latlonalt(np.zeros(2))` | `WgsConversionError`; mentions shape `(3,)` | +| AC-8 | each helper called twice with same input | byte-equal outputs | +| AC-9 | importlinter / grep gate | no `components.*` imports | +| NFR-perf | microbench each helper (10k iterations on Tier-2 fixture) | p99 ≤ 200 µs each | + +## Constraints + +- Public surface frozen by `_docs/02_document/contracts/shared_helpers/wgs_converter.md` v1.0.0. +- Layer 1 Foundation only. +- `pyproj` is the single geodesy backend; pinned in `pyproject.toml` at AZ-263 / E-BOOT. +- Static-only design satisfies `coderule.mdc` ("only use static methods for pure self-contained computations") — every operation is a pure mathematical function of its arguments. +- No new dependency beyond what AZ-263 / E-BOOT pinned. 
+ +## Risks & Mitigation + +**Risk 1: Tangent-plane approximation degrades silently beyond 100 km** +- *Risk*: A consumer (e.g., C12 operator tooling with a continent-scale bbox) calls `latlonalt_to_local_enu` on a point 500 km from origin; the helper returns a result with O(1 km) error; consumer uses it as ground truth. +- *Mitigation*: The contract `Invariants` section documents the 100 km validity range. Consumers that need wider range explicitly chain ECEF↔ENU through a closer origin. + +**Risk 2: Datum drift if `pyproj` upgrades silently change WGS84 parameters** +- *Risk*: A future `pyproj` minor version changes the WGS84 ellipsoid parameters; all conversions shift by sub-metre amounts, breaking the round-trip ACs. +- *Mitigation*: `pyproj` is pinned at AZ-263; round-trip ACs are the canary that detects drift on dependency upgrade. + +## Runtime Completeness + +- **Named capability**: WGS84 ↔ ECEF ↔ ENU ↔ tile-xy conversions via `pyproj` (architecture / E-CC-HELPERS / `04_helper_wgs_converter.md`). +- **Production code that must exist**: real `pyproj`-backed conversions; real slippy-map tile math matching `satellite-provider`'s on-disk layout. +- **Allowed external stubs**: none — `pyproj` is the production runtime. +- **Unacceptable substitutes**: hand-rolled flat-earth ENU approximation (silently breaks AC-2 beyond a few km); custom Mercator tile math that drifts from OSM convention (breaks `satellite-provider` compatibility); skipping out-of-range guards (silent garbage for high latitudes). + +## Contract + +This task produces the contract at `_docs/02_document/contracts/shared_helpers/wgs_converter.md`. +Consumers MUST read that file — not this task spec — to discover the interface. 
diff --git a/_docs/02_tasks/todo/AZ-280_sha256_sidecar.md b/_docs/02_tasks/todo/AZ-280_sha256_sidecar.md new file mode 100644 index 0000000..8ec2a26 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-280_sha256_sidecar.md @@ -0,0 +1,154 @@ +# Sha256Sidecar Helper Module + +**Task**: AZ-280_sha256_sidecar +**Name**: Sha256Sidecar Helper +**Description**: Implement the shared `Sha256Sidecar` helper that owns the atomic-write + SHA-256 content-hash sidecar pattern (D-C10-3). Every persistent artifact that takeoff-load (F2) must verify gets written atomically AND has a `.sha256` sidecar that the verifier can independently recompute. Used by C6 (FAISS index, descriptor sidecar), C7 (engine cache + INT8 calibration cache), C10 (Manifest), and C11 (tile artifact verification). Stateless static-only design. +**Complexity**: 2 points +**Dependencies**: AZ-263_initial_structure +**Component**: shared.helpers.sha256_sidecar (cross-cutting; epic AZ-264 / E-CC-HELPERS) +**Tracker**: AZ-280 +**Epic**: AZ-264 (E-CC-HELPERS) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — frozen public interface this task produces. +- `_docs/02_document/common-helpers/05_helper_sha256_sidecar.md` — design rationale and consumer mapping (D-C10-3). + +## Problem + +The takeoff-load gate (F2) verifies four classes of persistent artifact: FAISS index + descriptor sidecar (C6), TensorRT engine cache + INT8 calibration cache (C7), Manifest (C10), and tile artifacts (C11). Each artifact must be written atomically (no partial files) AND must have a hash sidecar the verifier can independently recompute. + +Without a shared helper: +- C6 / C7 / C10 / C11 each grow their own atomic-write + hash implementation; subtle differences in temp-file naming, rename ordering, or sidecar format break the cross-component verifier the moment one drifts. 
+- The Manifest aggregate hash (which covers many files) goes through path-ordering logic that is implemented in only one place; if that ordering ever differs across a writer and a verifier, the entire cache root looks corrupt. +- An attacker (or accidental `rsync`) replaces `engine.engine` after `engine.engine.sha256` was written; without independent verification, takeoff-load accepts the swapped file. + +## Outcome + +- A single `helpers.sha256_sidecar` module is the only path through which any onboard process writes hash-verified artifacts. +- Atomic write is a hard contract: the temp-file → rename pattern guarantees no partial file ever appears at the target path. A fault between the bytes-flushed point and the rename leaves either the previous version or no file at all — never a half-written one. +- `verify(path)` recomputes the digest from the file's bytes; it does NOT trust the sidecar's value alone. A swapped artifact with a stale sidecar is detected. +- `aggregate_hash` is order-deterministic (sorts paths first), so the Manifest aggregate is reproducible across writer and verifier. +- The sidecar format is intentionally trivial (lowercase hex digest, no JSON wrapper, no trailing newline) so any small script can verify a single artifact without pulling in the helper. + +## Scope + +### Included + +- `Sha256Sidecar` static methods: `write_atomic`, `write_atomic_and_sidecar`, `verify`, `aggregate_hash`. +- `Sha256SidecarError` exception type wrapping underlying `OSError` and capturing missing/malformed sidecar conditions. +- Public interface contract published at `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md`. + +### Excluded + +- Cryptographic signing — this helper is corruption + accidental-replacement defense only; signing is out of scope (mid-flight tile gen has its own per-flight signing key path elsewhere). +- Streaming hashing for payloads larger than RAM — out of scope; the helper's API is `payload: bytes`. 
+- Compression / on-disk encoding — payloads are written verbatim. +- Sidecar versioning — there is no version byte. +- Filesystem-type detection (warning when run on NFS / overlayfs) — documented in contract Caveats; not enforced at runtime. + +## Acceptance Criteria + +**AC-1: Round-trip write + verify** +Given a 1 MiB random payload +When `write_atomic_and_sidecar(path, payload)` runs followed by `verify(path)` +Then `verify` returns True AND the sidecar at `path.sha256` contains a 64-char lowercase hex digest matching `hashlib.sha256(payload).hexdigest()` + +**AC-2: Atomicity — no partial file on fault** +Given a fault is injected between the temp-file flush and the rename (e.g., monkey-patch `os.replace` to raise `OSError`) +When `write_atomic(path, payload)` runs and raises +Then `path` does NOT exist (or, if it pre-existed, its bytes are unchanged); no `*.tmp` or partial file remains at the target name + +**AC-3: Independent verification rejects swapped payloads** +Given an artifact is written via `write_atomic_and_sidecar`, then the file at `path` is overwritten out-of-band with different bytes +When `verify(path)` runs +Then it returns False (NOT True; it must NOT trust the sidecar value alone) + +**AC-4: Missing sidecar is an error, not False** +Given an artifact exists at `path` but `path.sha256` was deleted +When `verify(path)` runs +Then `Sha256SidecarError` is raised with a message naming the missing sidecar (the helper does NOT silently return False — that would conflate "corrupt artifact" with "missing sidecar") + +**AC-5: Malformed sidecar is rejected** +Given a sidecar containing `not a hex digest` or a digest of wrong length +When `verify(path)` runs +Then `Sha256SidecarError` is raised mentioning malformed sidecar content + +**AC-6: Aggregate hash is order-deterministic** +Given three files `a`, `b`, `c` and their hashes +When `aggregate_hash([a, b, c])` and `aggregate_hash([c, a, b])` run +Then both calls return the same hex digest (the 
implementation sorts paths internally) + +**AC-7: Aggregate hash rejects missing files** +Given a list including a non-existent path +When `aggregate_hash` runs +Then `Sha256SidecarError` is raised mentioning the missing path + +**AC-8: Sidecar format strictness** +Given the sidecar written by `write_atomic_and_sidecar` +When the file's bytes are read +Then the bytes are EXACTLY the 64-char lowercase hex digest — no JSON wrapper, no trailing newline, no whitespace + +**AC-9: No upward imports (Layer 1 invariant)** +Given the helper module +When a static-import check runs +Then it imports ONLY from `_types`, `atomicwrites`, `hashlib`, `pathlib`, and stdlib — no `gps_denied_onboard.components.*` imports anywhere + +## Non-Functional Requirements + +**Performance** +- No specific latency budget per `_docs/02_document/common-helpers/05_helper_sha256_sidecar.md` (consumers are pre-flight / post-landing). Sanity bound: `write_atomic_and_sidecar` of a 1 MiB payload ≤ 50 ms on Tier-2. + +**Reliability** +- `Sha256SidecarError` is the ONLY exception type the public surface raises on filesystem / sidecar errors. `OSError` MUST be wrapped so callers do not have to handle two error hierarchies. +- Pure deterministic: same payload always produces the same digest. 
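The write and verify behaviour pinned by AC-1 through AC-5 and AC-8 can be illustrated with a stdlib-only sketch. It deliberately swaps the `atomicwrites` backend named in this spec for `tempfile.mkstemp` plus `os.replace`, so it shows the pattern rather than the production implementation; the free-function layout and the `_sidecar_path` helper are hypothetical.

```python
import contextlib
import hashlib
import os
import re
import tempfile
from pathlib import Path


class Sha256SidecarError(Exception):
    pass


_HEX64 = re.compile(r"[0-9a-f]{64}")


def _sidecar_path(path):
    # `engine.engine` -> `engine.engine.sha256`
    return path.with_name(path.name + ".sha256")


def write_atomic(path, payload):
    tmp = None
    try:
        # The temp file lives in the target directory so the rename stays
        # on one filesystem.
        fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)
    except OSError as exc:
        if tmp is not None:
            with contextlib.suppress(OSError):
                os.unlink(tmp)
        raise Sha256SidecarError(f"atomic write to {path} failed: {exc}") from exc


def write_atomic_and_sidecar(path, payload):
    write_atomic(path, payload)
    # Sidecar is exactly 64 lowercase hex characters, no newline.
    write_atomic(_sidecar_path(path), hashlib.sha256(payload).hexdigest().encode("ascii"))


def verify(path):
    sidecar = _sidecar_path(path)
    if not sidecar.exists():
        raise Sha256SidecarError(f"missing sidecar: {sidecar}")
    try:
        recorded = sidecar.read_bytes().decode("ascii", errors="replace")
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
    except OSError as exc:
        raise Sha256SidecarError(f"cannot read {path} or its sidecar: {exc}") from exc
    if not _HEX64.fullmatch(recorded):
        raise Sha256SidecarError(f"malformed sidecar content in {sidecar}")
    # The digest is always recomputed from the artifact's bytes.
    return actual == recorded
```

Returning `False` only for a digest mismatch, and raising for a missing or malformed sidecar, keeps "corrupt artifact" distinguishable from "broken bookkeeping" at the call site.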
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Round-trip write + verify on 1 MiB random payload | sidecar matches `hashlib.sha256(payload).hexdigest()`; `verify` True | +| AC-2 | Inject `OSError` between flush and rename | no partial file remains at target name | +| AC-3 | Overwrite payload after sidecar is written | `verify` returns False | +| AC-4 | Delete sidecar; call `verify` | `Sha256SidecarError`; mentions missing sidecar | +| AC-5 | Malformed sidecar content | `Sha256SidecarError`; mentions malformed sidecar | +| AC-6 | `aggregate_hash` with two different orderings | byte-equal digests | +| AC-7 | `aggregate_hash` with a missing path | `Sha256SidecarError`; mentions missing path | +| AC-8 | Read sidecar bytes after `write_atomic_and_sidecar` | exactly 64 hex chars; no newline / whitespace / JSON | +| AC-9 | importlinter / grep gate | no `components.*` imports | +| NFR-perf | Microbench `write_atomic_and_sidecar` of 1 MiB payload | ≤ 50 ms on Tier-2 | + +## Constraints + +- Public surface frozen by `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` v1.0.0. +- Layer 1 Foundation only. +- `atomicwrites` is the single atomic-rename backend; pinned in `pyproject.toml` at AZ-263 / E-BOOT. +- Static-only design satisfies `coderule.mdc`. +- No new dependency beyond what AZ-263 / E-BOOT pinned. +- Production cache root MUST live on a local POSIX filesystem (NFS / SMB / overlayfs are unsupported per the helper's atomic-rename invariant). Documented in deployment artifacts; not enforced at runtime. + +## Risks & Mitigation + +**Risk 1: A future helper change relaxes atomicity to "best-effort"** +- *Risk*: Someone replaces the temp-file → rename pattern with a direct write under the rationale "rename is slow on certain filesystems"; takeoff-load occasionally sees partial files. +- *Mitigation*: AC-2 makes atomicity a hard test. 
Any regression that loses the rename is caught immediately. + +**Risk 2: `aggregate_hash` ordering drifts between writer and verifier** +- *Risk*: A future change adds case-insensitive sorting or strips path prefixes; writer and verifier disagree; cache root looks corrupt. +- *Mitigation*: AC-6 pins the deterministic-ordering invariant; the contract spells out the exact format (`\0\n` lines, lexicographically sorted by full path). + +**Risk 3: Sidecar format ambiguity (someone wraps the digest in JSON)** +- *Risk*: A future contributor "improves" the sidecar to be JSON for "extensibility"; verification scripts that expect raw hex break. +- *Mitigation*: AC-8 pins the exact byte-level format. Versioning rules force a major bump for any format change. + +## Runtime Completeness + +- **Named capability**: atomic-write + SHA-256 content-hash sidecar (D-C10-3 / `05_helper_sha256_sidecar.md`). +- **Production code that must exist**: real `atomicwrites`-backed atomic rename; real `hashlib.sha256` digesting; real independent verify. +- **Allowed external stubs**: none — `atomicwrites` and `hashlib` are stdlib + production deps. +- **Unacceptable substitutes**: direct write (loses atomicity); trusting the sidecar value without recomputing the file's hash; JSON-wrapped sidecar; case-insensitive aggregate ordering. + +## Contract + +This task produces the contract at `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md`. +Consumers MUST read that file — not this task spec — to discover the interface. 
diff --git a/_docs/02_tasks/todo/AZ-281_engine_filename_schema.md b/_docs/02_tasks/todo/AZ-281_engine_filename_schema.md new file mode 100644 index 0000000..6a1d4da --- /dev/null +++ b/_docs/02_tasks/todo/AZ-281_engine_filename_schema.md @@ -0,0 +1,158 @@ +# EngineFilenameSchema Helper Module + +**Task**: AZ-281_engine_filename_schema +**Name**: EngineFilenameSchema Helper +**Description**: Implement the shared `EngineFilenameSchema` helper for the self-describing `.engine` filename schema (D-C10-7). TensorRT engines are NOT portable across `(SM, JetPack, TRT, precision)` tuples; encoding the tuple in the filename makes mismatch instantly visible at takeoff load (F2). Used by C7 (writes engines on compile, reads on `deserialize_engine`) and C10 (compiles engines via C7 and writes them to the cache root). Stateless static-only design. +**Complexity**: 2 points +**Dependencies**: AZ-263_initial_structure +**Component**: shared.helpers.engine_filename_schema (cross-cutting; epic AZ-264 / E-CC-HELPERS) +**Tracker**: AZ-281 +**Epic**: AZ-264 (E-CC-HELPERS) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — frozen public interface this task produces. +- `_docs/02_document/common-helpers/06_helper_engine_filename_schema.md` — design rationale (D-C10-7). + +## Problem + +TensorRT engines are not portable. An engine compiled for SM 87 / JetPack 6.2 / TRT 10.3 / FP16 will fail to deserialize — or, worse, deserialize and silently produce wrong output — on a host with a different `(sm, jp, trt, precision)` tuple. Without a self-describing filename: +- C7's `deserialize_engine` cannot tell whether an engine in the cache root matches the host capabilities until it tries to load it (an expensive, partially side-effecting operation). +- C10 has to maintain an out-of-band sidecar mapping filenames to tuples; that sidecar drifts. 
+- An operator who copies an engine from a different deployment by mistake gets opaque "deserialize failed" errors at takeoff instead of a clear "engine was built for sm87, host is sm72". + +## Outcome + +- A single `helpers.engine_filename_schema` module is the only path through which any onboard process composes or parses `.engine` filenames. +- The schema makes `(model_name, sm, jetpack, trt, precision)` part of the filename: `{model}__sm{SM}_jp{JP}_trt{TRT}_{precision}.engine`. F2 takeoff load uses `matches_host` to decide which engines to deserialize and which to refuse before paying the deserialise cost. +- The schema is strict — invalid model names, non-dotted version strings, unknown precisions are rejected at `build` time; malformed filenames are rejected at `parse` time. Both raise `EngineFilenameSchemaError` with messages that name the offending field. +- Round-trip identity: `parse(build(*args)) == EngineCacheKey(*args)` for any valid args. Round-trip is the contract test that catches any future format drift. + +## Scope + +### Included + +- `EngineFilenameSchema` static methods: `build`, `parse`, `matches_host`. +- `EngineFilenameSchemaError` exception type. +- Public interface contract published at `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md`. + +### Excluded + +- Schema versioning (no `schema_version` field) — adding a new tuple dimension is a Plan-phase carryforward. +- Engine compilation / compatibility resolution — C7. +- Hot-loading / lazy materialisation — C7. +- Filename collision detection across cache roots — C10's Manifest. +- The `EngineCacheKey` / `HostCapabilities` types themselves — owned by `_types/manifests.py` (AZ-263). 
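A rough sketch of `build`, `parse`, and `matches_host` for the `{model}__sm{SM}_jp{JP}_trt{TRT}_{precision}.engine` schema follows. The `EngineCacheKey` and `HostCapabilities` stand-ins and their field names are hypothetical (the real types live in `_types/manifests.py`), the functions are shown as module-level rather than static methods, and the two-segment version pattern is inferred from the `6.2` / `10.3` examples in this spec.

```python
import re
from typing import NamedTuple


class EngineFilenameSchemaError(ValueError):
    pass


class EngineCacheKey(NamedTuple):  # hypothetical stand-in
    model_name: str
    sm: int
    jetpack: str
    trt: str
    precision: str


class HostCapabilities(NamedTuple):  # hypothetical stand-in
    sm: int
    jetpack: str
    trt: str


_PRECISIONS = ("fp16", "int8", "mixed")
_MODEL_RE = re.compile(r"[a-z0-9_]+")
_VERSION_RE = re.compile(r"\d+\.\d+")
_FILENAME_RE = re.compile(
    r"(?P<model>[a-z0-9_]+?)__sm(?P<sm>\d+)_jp(?P<jp>\d+\.\d+)"
    r"_trt(?P<trt>\d+\.\d+)_(?P<precision>[a-z0-9]+)\.engine"
)


def _validate(key):
    if not _MODEL_RE.fullmatch(key.model_name):
        raise EngineFilenameSchemaError(f"model_name {key.model_name!r} must match [a-z0-9_]+")
    if "__" in key.model_name:
        raise EngineFilenameSchemaError("model_name must not contain the reserved '__' separator")
    if isinstance(key.sm, bool) or not isinstance(key.sm, int) or key.sm <= 0:
        raise EngineFilenameSchemaError(f"sm {key.sm!r} must be a positive integer")
    for field, value in (("jetpack", key.jetpack), ("trt", key.trt)):
        if not _VERSION_RE.fullmatch(value):
            raise EngineFilenameSchemaError(f"{field} {value!r} must be a dotted two-segment version")
    if key.precision not in _PRECISIONS:
        raise EngineFilenameSchemaError(f"precision {key.precision!r} not in {set(_PRECISIONS)}")


def build(model_name, sm, jetpack, trt, precision):
    key = EngineCacheKey(model_name, sm, jetpack, trt, precision)
    _validate(key)
    return f"{model_name}__sm{sm}_jp{jetpack}_trt{trt}_{precision}.engine"


def parse(filename):
    if not filename.endswith(".engine"):
        raise EngineFilenameSchemaError(f"{filename!r} lacks the required '.engine' suffix")
    m = _FILENAME_RE.fullmatch(filename)
    if m is None:
        raise EngineFilenameSchemaError(f"{filename!r} does not match the engine filename schema")
    key = EngineCacheKey(m["model"], int(m["sm"]), m["jp"], m["trt"], m["precision"])
    _validate(key)
    return key


def matches_host(filename, host):
    key = parse(filename)
    # Exact equality on every host-relevant element; a mismatch is a normal False.
    return (key.sm, key.jetpack, key.trt) == (host.sm, host.jetpack, host.trt)
```

Running validation inside both `build` and `parse` keeps the round-trip property in one place: any filename that parses is one that `build` could have produced.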
+ +## Acceptance Criteria + +**AC-1: Reference example builds correctly** +Given `("ultravpr", 87, "6.2", "10.3", "fp16")` +When `build` runs +Then the result is exactly `"ultravpr__sm87_jp6.2_trt10.3_fp16.engine"` + +**AC-2: Round-trip identity** +Given 10 random valid tuples +When each round-trips through `parse(build(*args))` +Then each produces deep-equal `EngineCacheKey` outputs + +**AC-3: Host-match exact** +Given a filename built for `(sm=87, jp=6.2, trt=10.3)` and a `HostCapabilities(sm=87, jp=6.2, trt=10.3)` +When `matches_host` runs +Then the result is True + +**AC-4: Host-mismatch on any tuple element returns False (no exception)** +Given a filename built for `(sm=87, jp=6.2, trt=10.3)` and a host with `sm=72` +When `matches_host` runs +Then the result is False (NOT an exception — tuple mismatch is the expected "not a match" path) + +**AC-5: Precision enum strictness** +Given `build(..., precision="bf16")` +When the call runs +Then `EngineFilenameSchemaError` is raised mentioning the allowed enum `{fp16, int8, mixed}` + +**AC-6: Model-name character set** +Given `build("UltraVPR", ...)` (uppercase letters) +When the call runs +Then `EngineFilenameSchemaError` is raised mentioning the allowed `[a-z0-9_]` set + +**AC-7: Reserved separator collision** +Given `build("ultra__vpr", ...)` (double underscore in model name) +When the call runs +Then `EngineFilenameSchemaError` is raised mentioning the reserved `__` separator + +**AC-8: Version format strictness** +Given `build(..., jetpack="6.2.1", ...)` (three-segment version) +When the call runs +Then `EngineFilenameSchemaError` is raised mentioning the dotted `.` format + +**AC-9: Parse rejects malformed filenames** +Given `parse("not_an_engine_file.bin")` +When the call runs +Then `EngineFilenameSchemaError` is raised + +**AC-10: Parse requires `.engine` suffix** +Given `parse("ultravpr__sm87_jp6.2_trt10.3_fp16")` (missing `.engine`) +When the call runs +Then `EngineFilenameSchemaError` is raised mentioning 
the required suffix + +**AC-11: No upward imports (Layer 1 invariant)** +Given the helper module +When a static-import check runs +Then it imports ONLY from `_types`, `re`, and stdlib — no `gps_denied_onboard.components.*` imports anywhere + +## Non-Functional Requirements + +**Performance** +- No specific latency budget per `_docs/02_document/common-helpers/06_helper_engine_filename_schema.md` (consumers are pre-flight / takeoff-load). Sanity bound: each helper call ≤ 50 µs on Tier-2. + +**Reliability** +- Pure deterministic; same input → byte-equal output. +- `EngineFilenameSchemaError` is the ONLY exception type the public surface raises on validation / parse errors. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | reference example | exact filename match | +| AC-2 | round-trip 10 random valid tuples | deep-equal `EngineCacheKey` outputs | +| AC-3 | matching host | True | +| AC-4 | mismatched `sm` | False; no exception | +| AC-5 | `precision="bf16"` | `EngineFilenameSchemaError`; mentions enum | +| AC-6 | uppercase model name | `EngineFilenameSchemaError`; mentions `[a-z0-9_]` | +| AC-7 | double-underscore model name | `EngineFilenameSchemaError`; mentions reserved separator | +| AC-8 | three-segment version | `EngineFilenameSchemaError`; mentions dotted format | +| AC-9 | malformed filename | `EngineFilenameSchemaError` | +| AC-10 | missing `.engine` suffix | `EngineFilenameSchemaError`; mentions suffix | +| AC-11 | importlinter / grep gate | no `components.*` imports | +| NFR-perf | microbench each helper (10k iterations on Tier-2 fixture) | p99 ≤ 50 µs each | + +## Constraints + +- Public surface frozen by `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` v1.0.0. +- Layer 1 Foundation only. +- Static-only design satisfies `coderule.mdc`. +- No new dependency beyond what AZ-263 / E-BOOT pinned (only `re` and stdlib are needed). 
+- The `EngineCacheKey` / `HostCapabilities` types live in `_types/manifests.py` (AZ-263 responsibility). + +## Risks & Mitigation + +**Risk 1: A future format change breaks existing cache roots** +- *Risk*: Adding a tuple dimension (e.g., `BUILD_*` flag combination) requires re-writing every existing `.engine` filename; deployments with stale cache roots fail silently. +- *Mitigation*: The contract `Versioning Rules` mandate a major-version bump for any format change. C7's `deserialize_engine` should also reject unrecognised filename patterns rather than guess; that is C7's responsibility to wire on top of this helper's `parse`. + +**Risk 2: `matches_host` returns False without explanation** +- *Risk*: An operator copies an engine from a different deployment; takeoff-load skips it; the operator sees "no engine matches host" without knowing which tuple element mismatched. +- *Mitigation*: This helper is just the predicate. The error-surfacing UX is C7's / C10's responsibility — they call `parse` to extract the engine's tuple AND read `host_capabilities`, then format an actionable error. The contract documents the predicate's "True iff all tuple elements match" semantics so consumers can produce that message themselves. + +## Runtime Completeness + +- **Named capability**: self-describing engine filename schema (D-C10-7 / `06_helper_engine_filename_schema.md`). +- **Production code that must exist**: real format builder + parser + host-match predicate; real strict validation for all five tuple elements. +- **Allowed external stubs**: none — pure string parsing on stdlib. +- **Unacceptable substitutes**: `f"{model}_{sm}_{jp}_{trt}_{precision}.engine"` (single underscore separators ambiguate `model` from `sm`); silently truncating `jetpack="6.2.1"` to `6.2`; matching host with substring instead of exact-equality. + +## Contract + +This task produces the contract at `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md`. 
+Consumers MUST read that file — not this task spec — to discover the interface. diff --git a/_docs/02_tasks/todo/AZ-282_ransac_filter.md b/_docs/02_tasks/todo/AZ-282_ransac_filter.md new file mode 100644 index 0000000..a7e7264 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-282_ransac_filter.md @@ -0,0 +1,162 @@ +# RansacFilter Helper Module + +**Task**: AZ-282_ransac_filter +**Name**: RansacFilter Helper +**Description**: Implement the shared `RansacFilter` helper that wraps OpenCV's RANSAC inlier filtering and reprojection-residual computation. Used by C2.5 (single-pair LightGlue inlier counting), C3 (2D-2D RANSAC over cross-domain correspondences), C3.5 (residual recompute after AdHoP refinement), and C4 (per-frame final reprojection residual for FDR provenance). Stateless static-only design; deterministic by setting OpenCV's RNG seed. +**Complexity**: 2 points +**Dependencies**: AZ-263_initial_structure, AZ-277_se3_utils +**Component**: shared.helpers.ransac_filter (cross-cutting; epic AZ-264 / E-CC-HELPERS) +**Tracker**: AZ-282 +**Epic**: AZ-264 (E-CC-HELPERS) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_helpers/ransac_filter.md` — frozen public interface this task produces. +- `_docs/02_document/contracts/shared_helpers/se3_utils.md` — the `SE3` (re-exported GTSAM `Pose3`) type used in `compute_reprojection_residual`. +- `_docs/02_document/common-helpers/07_helper_ransac_filter.md` — design rationale. + +## Problem + +Four components run RANSAC over 2D-2D correspondences (C2.5, C3, C3.5) or compute reprojection residuals after PnP (C4). Without a shared helper: +- Each component grows its own `cv2.findHomography(..., cv2.RANSAC)` wrapper; some set the RNG seed, some don't, producing non-determinism in tests. +- The "what counts as the residual" definition drifts — some compute mean, some median, some mean-of-squares — so the FDR `mre_px` field means different things in different components. 
+- C4's "final per-frame residual" used by FDR provenance ends up with a slightly different formula than C3.5's "did refinement help?" residual; cross-component comparisons in post-flight analysis become apples-to-oranges. + +## Outcome + +- A single `helpers.ransac_filter` module is the only path through which any onboard process runs 2D-2D RANSAC or computes a reprojection residual. +- RANSAC is deterministic given a fixed seed: `cv2.setRNGSeed(0)` (or the explicit `seed` kwarg where supported) is set inside `filter_correspondences` so the same input correspondences always produce the same `RansacResult`. +- Residual statistic is fixed to the MEDIAN (in pixels), not mean — outliers in the 2D residual distribution do not bias the consumer's quality signal. Documented in the contract. +- C4's `solvePnPRansac` continues to use OpenCV's internal RANSAC (out of contract); this helper covers the standalone 2D-2D case + the standalone reprojection-residual computation that lives outside the PnP call. + +## Scope + +### Included + +- `RansacFilter` static methods: `filter_correspondences`, `compute_reprojection_residual`. +- `RansacResult` frozen dataclass with `inlier_correspondences`, `inlier_count`, `outlier_count`, `median_residual_px`. +- `RansacFilterError` exception type. +- Public interface contract published at `_docs/02_document/contracts/shared_helpers/ransac_filter.md`. + +### Excluded + +- 2D-3D RANSAC inside `solvePnPRansac` — OpenCV does it internally; this helper does not wrap it. +- Per-component RANSAC threshold defaults — they are documented in C2.5, C3, C3.5, C4 specs. This helper takes the threshold as a parameter. +- Adaptive RANSAC (PROSAC, USAC) — out of scope for v1.0.0. +- GPU-accelerated RANSAC — out of scope for v1.0.0. +- Confidence / iteration-count tuning of `cv2.findHomography` — exposed only via `ransac_threshold_px` for v1.0.0. 
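To illustrate why the Outcome pins the residual statistic to the median, here is a small stdlib-only sketch. It is not the helper itself: it assumes each correspondence is a `(x_src, y_src, x_dst, y_dst)` tuple, takes a known 3×3 homography `H` as nested lists, and skips RANSAC entirely (the production path is OpenCV-backed). The function names are hypothetical.

```python
import math
import statistics


def transfer_residuals_px(correspondences, H):
    """Per-correspondence distance between H applied to src and the observed dst."""
    residuals = []
    for x_src, y_src, x_dst, y_dst in correspondences:
        u = H[0][0] * x_src + H[0][1] * y_src + H[0][2]
        v = H[1][0] * x_src + H[1][1] * y_src + H[1][2]
        w = H[2][0] * x_src + H[2][1] * y_src + H[2][2]
        residuals.append(math.hypot(u / w - x_dst, v / w - y_dst))
    return residuals


def median_residual_px(correspondences, H):
    """Median residual in pixels; NaN (not an exception) for an empty input."""
    residuals = transfer_residuals_px(correspondences, H)
    return statistics.median(residuals) if residuals else float("nan")


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Four perfect matches plus one gross outlier displaced by (30, 40) px.
SAMPLE = [
    (0.0, 0.0, 0.0, 0.0),
    (1.0, 2.0, 1.0, 2.0),
    (3.0, 4.0, 3.0, 4.0),
    (5.0, 6.0, 5.0, 6.0),
    (10.0, 10.0, 40.0, 50.0),
]

median_px = median_residual_px(SAMPLE, IDENTITY)
mean_px = statistics.fmean(transfer_residuals_px(SAMPLE, IDENTITY))
```

On this sample the single outlier leaves the median untouched while dragging the mean upward, which is the bias the spec wants to keep out of the consumer-visible quality signal.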
+ +## Acceptance Criteria + +**AC-1: Clean correspondences yield 100 % inliers + zero residual** +Given 100 correspondences generated from a known homography (no outliers, no noise) +When `filter_correspondences` runs with `ransac_threshold_px=1.5` +Then `RansacResult` has `inlier_count == 100`, `outlier_count == 0`, `median_residual_px ≈ 0.0` within `atol=1e-6` + +**AC-2: Mixed correspondences produce expected inlier band** +Given 80 inliers (perfect homography) + 20 outliers (random noise) +When `filter_correspondences` runs with `ransac_threshold_px=1.5` +Then `inlier_count` ∈ `[78, 82]` (RANSAC noise tolerance) AND `outlier_count == 100 - inlier_count` + +**AC-3: Determinism — fixed seed** +Given the same input correspondences +When `filter_correspondences` runs twice +Then both `RansacResult` outputs are byte-equal (numpy arrays equal-by-value, all fields match) + +**AC-4: Reprojection residual ≈ 0 on clean inliers + known pose** +Given 4+ correspondences generated from a known camera intrinsics + pose pair (no noise) +When `compute_reprojection_residual` runs +Then the result is ≈ 0.0 within `atol=1e-6` + +**AC-5: Empty inlier array → NaN, no exception** +Given an empty inlier array (shape `(0, 4)`) +When `compute_reprojection_residual` runs +Then it returns `NaN` (no exception) + +**AC-6: Shape contract on correspondences** +Given correspondences with shape `(N, 3)` (missing one coordinate column) +When `filter_correspondences` runs +Then `RansacFilterError` is raised mentioning the expected shape `(N, 4)` + +**AC-7: Threshold guard** +Given `ransac_threshold_px = -1.0` +When `filter_correspondences` runs +Then `RansacFilterError` is raised mentioning the positive-threshold requirement + +**AC-8: Minimum point count** +Given correspondences with shape `(3, 4)` (fewer than 4 points) +When `filter_correspondences` runs +Then `RansacFilterError` is raised mentioning the 4-point minimum for homography RANSAC + +**AC-9: K shape contract in residual call** +Given 
`K` of shape `(4, 4)` +When `compute_reprojection_residual` runs +Then `RansacFilterError` is raised mentioning the expected `(3, 3)` shape + +**AC-10: No upward imports (Layer 1 invariant)** +Given the helper module +When a static-import check runs +Then it imports ONLY from `_types`, `helpers.se3_utils`, `cv2`, `numpy`, and stdlib — no `gps_denied_onboard.components.*` imports anywhere + +## Non-Functional Requirements + +**Performance** +- `filter_correspondences` p99 ≤ 5 ms on Tier-2 for `N=200` correspondences (matches the C3 cross-domain matcher's per-candidate budget). +- `compute_reprojection_residual` p99 ≤ 1 ms on Tier-2 for `I=100` inliers. +- Helper-level overhead vs. inline OpenCV ≤ 5 % (per E-CC-HELPERS hot-path NFR). + +**Reliability** +- `RansacFilterError` is the ONLY exception type the public surface raises on shape / dtype / threshold violations. OpenCV's lower-level exceptions MUST be wrapped. +- Determinism: same input → byte-equal `RansacResult` outputs (RNG seed is set inside the helper). 
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | 100 clean correspondences, threshold 1.5 px | `inlier_count == 100`; `median_residual_px ≈ 0.0` | +| AC-2 | 80 inliers + 20 outliers | `inlier_count ∈ [78, 82]` | +| AC-3 | same input run twice | byte-equal `RansacResult` outputs | +| AC-4 | clean inliers + known pose | residual ≈ 0.0 | +| AC-5 | empty inlier array | returns `NaN`; no exception | +| AC-6 | shape `(N, 3)` | `RansacFilterError`; mentions `(N, 4)` | +| AC-7 | `ransac_threshold_px = -1.0` | `RansacFilterError`; mentions positive threshold | +| AC-8 | shape `(3, 4)` | `RansacFilterError`; mentions 4-point minimum | +| AC-9 | `K.shape = (4, 4)` | `RansacFilterError`; mentions `(3, 3)` | +| AC-10 | importlinter / grep gate | no `components.*` imports | +| NFR-perf | microbench `filter_correspondences` (10k iters on Tier-2 fixture, N=200) | p99 ≤ 5 ms | +| NFR-perf-residual | microbench `compute_reprojection_residual` (10k iters on Tier-2 fixture, I=100) | p99 ≤ 1 ms | + +## Constraints + +- Public surface frozen by `_docs/02_document/contracts/shared_helpers/ransac_filter.md` v1.0.0. +- Layer 1 Foundation only. +- `cv2` is the single RANSAC backend; pinned in `pyproject.toml` at AZ-263 / E-BOOT (OpenCV ≥ 4.12.0 per D-CROSS-CVE-1). +- Static-only design satisfies `coderule.mdc`. +- Helper depends on `helpers.se3_utils` for the `SE3` type alias (allowed — same Layer 1). +- No new dependency beyond what AZ-263 / E-BOOT pinned. + +## Risks & Mitigation + +**Risk 1: Determinism regression on OpenCV upgrade** +- *Risk*: A future OpenCV version changes the RNG seeding API; tests start failing intermittently because the fixed-seed contract no longer holds. +- *Mitigation*: AC-3 is the canary. The helper's own test suite catches the regression on dependency upgrade before consumers do. 
The contract test should also pin a known-good `RansacResult` fixture for a reference input so any drift surfaces immediately. + +**Risk 2: Median vs mean residual confusion** +- *Risk*: A future contributor "improves" the helper to return mean residual; consumers' quality thresholds (which were tuned against median) become looser; FDR data is inconsistent across cycles. +- *Mitigation*: The contract `Invariants` section pins median. AC-1's `≈ 0.0` clean residual is the same for both, but AC-2's mixed-quality residual would shift on a median→mean change; the test fixture pins enough digits that the regression is caught. + +**Risk 3: 4-point minimum too low for noisy inputs** +- *Risk*: With exactly 4 points, RANSAC has zero redundancy; one bad point makes the homography unstable. +- *Mitigation*: The 4-point minimum is the OpenCV minimum — consumers that need higher minimums set them via `min_inliers` and check `RansacResult.inlier_count` themselves. The contract documents that `min_inliers` is informational; the consumer is the gatekeeper. + +## Runtime Completeness + +- **Named capability**: 2D-2D RANSAC inlier filtering + median reprojection residual via OpenCV (architecture / E-CC-HELPERS / `07_helper_ransac_filter.md`). +- **Production code that must exist**: real `cv2.findHomography(..., cv2.RANSAC)`-backed filtering; real `cv2.projectPoints`-backed residual computation; real fixed-seed determinism wiring. +- **Allowed external stubs**: none — OpenCV is the production runtime. +- **Unacceptable substitutes**: hand-rolled RANSAC (numerical instability, no OpenCV-tested edge-case handling); skipping RNG seed (silent intermittent test failures); switching residual statistic from median to mean (changes consumer-visible quality signal). + +## Contract + +This task produces the contract at `_docs/02_document/contracts/shared_helpers/ransac_filter.md`. +Consumers MUST read that file — not this task spec — to discover the interface.
diff --git a/_docs/02_tasks/todo/AZ-283_descriptor_normaliser.md b/_docs/02_tasks/todo/AZ-283_descriptor_normaliser.md new file mode 100644 index 0000000..5779602 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-283_descriptor_normaliser.md @@ -0,0 +1,175 @@ +# DescriptorNormaliser Helper Module + +**Task**: AZ-283_descriptor_normaliser +**Name**: DescriptorNormaliser Helper +**Description**: Implement the shared `DescriptorNormaliser` helper that L2-normalises descriptors so cosine similarity aligns with FAISS HNSW's inner-product metric. Used by C2 (query-side per-frame embedding before FAISS lookup), C2.5 (descriptor pre-processing for re-rank), C3 (descriptor pre-processing for cross-domain matching), and C10 (corpus-side per-tile embedding before FAISS index population). The same helper on both sides is what guarantees the index returns useful neighbours rather than garbage. Stateless static-only; dtype-preserving (`float16`/`float32` in → same out). +**Complexity**: 2 points +**Dependencies**: AZ-263_initial_structure +**Component**: shared.helpers.descriptor_normaliser (cross-cutting; epic AZ-264 / E-CC-HELPERS) +**Tracker**: AZ-283 +**Epic**: AZ-264 (E-CC-HELPERS) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md` — frozen public interface this task produces. +- `_docs/02_document/common-helpers/08_helper_descriptor_normaliser.md` — design rationale (FAISS metric alignment, corpus-vs-query coupling). + +## Problem + +FAISS HNSW operates on Euclidean / inner-product spaces, but the upstream backbones (UltraVPR, MegaLoc, MixVPR, SelaVPR, EigenPlaces, NetVLAD, SALAD) emit raw cosine-similar embeddings. The standard FAISS-idiomatic recipe is "L2-normalise both sides + use inner-product metric" — but this is fragile: +- If C10 (corpus side, pre-flight) and C2 (query side, runtime) drift on whether to normalise, or how to handle zero-norm vectors, the FAISS index returns garbage. 
+- If one side silently up-casts `float16` to `float32`, the index gets built with a different precision than the queries, producing wrong neighbours. +- A future contributor "improves" one side's normalisation (e.g., adds whitening) without the other; recall drops silently. + +## Outcome + +- A single `helpers.descriptor_normaliser` module is the only path through which any onboard process L2-normalises descriptors. +- The metric ("inner_product") is exposed via `descriptor_metric()` so C6's `DescriptorIndex.search_topk` and C10's index-build code consult the same source — no hard-coded `"l2"` or `"cosine"` strings anywhere. +- dtype is preserved: `float16` in → `float16` out (preserves the precision the backbone chose); `float32` in → `float32` out. No silent up-cast. +- Zero-norm input vectors are returned as the zero vector (no division-by-zero); callers filter or accept that such descriptors will match nothing. +- L2 normalisation is idempotent (byte-equal for `float32`, near-byte-equal for `float16` due to half-precision rounding) so accidentally normalising twice is harmless. + +## Scope + +### Included + +- `DescriptorNormaliser` static methods: `l2_normalise(descriptor) -> ndarray`, `l2_normalise_batch(descriptors) -> ndarray`, `descriptor_metric() -> str`. +- `DescriptorNormaliserError` exception type. +- Public interface contract published at `_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md`. + +### Excluded + +- Whitening / mean-subtraction — out of scope; consumers that need it apply it before / after this helper. +- PCA / dimensionality reduction — out of scope. +- GPU-accelerated normalisation — out of scope for v1.0.0. +- Quantisation (PQ, IVF) — owned by C6 / C10 around the FAISS index. +- Auto-detection of descriptor dim — helper is shape-agnostic for any `D >= 1`. 
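The dtype-preserving, zero-norm-safe behaviour listed in the Outcome can be sketched with NumPy as below. This is one possible shape for the helper, not the contract: computing the norm in a `float32` working copy before casting back is an assumption of this sketch (chosen to avoid `float16` overflow), and byte-exact idempotence is not claimed for it.

```python
import numpy as np

_ALLOWED_DTYPES = (np.dtype(np.float16), np.dtype(np.float32))


class DescriptorNormaliserError(ValueError):
    """Single exception type for shape and dtype violations."""


def descriptor_metric():
    # One source of truth for the FAISS metric name.
    return "inner_product"


def l2_normalise_batch(descriptors):
    if not isinstance(descriptors, np.ndarray) or descriptors.ndim != 2:
        raise DescriptorNormaliserError("expected a 2-D array of shape (N, D)")
    if descriptors.dtype not in _ALLOWED_DTYPES:
        raise DescriptorNormaliserError("dtype must be float16 or float32")
    # Working copy in float32 (sketch assumption); the caller's array is never mutated.
    work = descriptors.astype(np.float32)
    norms = np.linalg.norm(work, axis=1, keepdims=True)
    # Zero-norm rows divide by 1 instead of 0, so they stay all-zero with no NaN.
    safe_norms = np.where(norms > 0.0, norms, 1.0)
    return (work / safe_norms).astype(descriptors.dtype)


def l2_normalise(descriptor):
    if not isinstance(descriptor, np.ndarray) or descriptor.ndim != 1:
        raise DescriptorNormaliserError("expected a 1-D array of shape (D,)")
    return l2_normalise_batch(descriptor[np.newaxis, :])[0]
```

Routing the single-vector call through the batch path keeps the zero-norm and dtype rules in one place, so corpus-side and query-side callers cannot drift apart.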
+ +## Acceptance Criteria + +**AC-1: Unit-vector example** +Given `np.array([3.0, 4.0], dtype=float32)` +When `l2_normalise` runs +Then the result equals `np.array([0.6, 0.8], dtype=float32)` within `atol=1e-6`; norm ≈ 1.0 + +**AC-2: Batch normalisation** +Given `np.array([[3.0, 4.0], [1.0, 0.0]], dtype=float32)` +When `l2_normalise_batch` runs +Then the rows are `[0.6, 0.8]` and `[1.0, 0.0]`; each row's norm is ≈ 1.0 + +**AC-3: dtype preservation — float16** +Given a random `float16` descriptor of dim 512 +When `l2_normalise` runs +Then `result.dtype == float16` AND `np.linalg.norm(result.astype(float32))` ≈ 1.0 within `atol=1e-3` + +**AC-4: dtype preservation — float32** +Given a random `float32` descriptor of dim 512 +When `l2_normalise` runs +Then `result.dtype == float32` AND `np.linalg.norm(result)` ≈ 1.0 within `atol=1e-6` + +**AC-5: Zero-vector handling** +Given `np.zeros(128, dtype=float32)` +When `l2_normalise` runs +Then the result is `np.zeros(128, dtype=float32)` (no exception, no NaN) + +**AC-6: Idempotence — float32** +Given a random `float32` descriptor `x` +When `l2_normalise(l2_normalise(x))` runs +Then it is byte-equal to `l2_normalise(x)` + +**AC-7: Idempotence — float16** +Given a random `float16` descriptor `x` +When `l2_normalise(l2_normalise(x))` runs +Then it matches `l2_normalise(x)` within `atol=1e-3` (half-precision rounding) + +**AC-8: No in-place mutation** +Given `x` is a `float32` descriptor +When `l2_normalise(x)` runs +Then `x` is bit-identical to its original value + +**AC-9: Metric source of truth** +Given a call to `descriptor_metric()` +When it runs +Then it returns the string `"inner_product"` + +**AC-10: dtype contract — float64 rejected** +Given a `float64` array +When `l2_normalise` runs +Then `DescriptorNormaliserError` is raised mentioning `float16` / `float32` only + +**AC-11: Shape contract — 1-D for single, 2-D for batch** +Given a 2-D array passed to `l2_normalise` (single) +When the call runs +Then 
`DescriptorNormaliserError` is raised mentioning the 1-D shape requirement +And given a 1-D array passed to `l2_normalise_batch` +When the call runs +Then `DescriptorNormaliserError` is raised mentioning the 2-D shape requirement + +**AC-12: No upward imports (Layer 1 invariant)** +Given the helper module +When a static-import check runs +Then it imports ONLY from `_types`, numpy, and stdlib — no `gps_denied_onboard.components.*` imports anywhere + +## Non-Functional Requirements + +**Performance** +- `l2_normalise` p99 ≤ 50 µs on Tier-2 for `D=512` (matches the per-frame VPR query budget). +- `l2_normalise_batch` p99 ≤ 5 ms on Tier-2 for `(N=1000, D=512)` (matches the C10 batch index-build chunk size). +- Helper-level overhead vs. inline `x / np.linalg.norm(x)` ≤ 5 % (per E-CC-HELPERS hot-path NFR). + +**Reliability** +- `DescriptorNormaliserError` is the ONLY exception type the public surface raises on shape / dtype violations. +- Pure deterministic; same input → byte-equal output (within `float16` rounding). 
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | `[3.0, 4.0]` fp32 | result `[0.6, 0.8]` within `atol=1e-6` | +| AC-2 | batch `[[3, 4], [1, 0]]` | rows `[0.6, 0.8]` and `[1.0, 0.0]` | +| AC-3 | random fp16 dim-512 | result.dtype fp16; norm ≈ 1.0 within `atol=1e-3` | +| AC-4 | random fp32 dim-512 | result.dtype fp32; norm ≈ 1.0 within `atol=1e-6` | +| AC-5 | zero vector | returned as zero vector; no exception | +| AC-6 | double-normalise fp32 | byte-equal to single-normalise | +| AC-7 | double-normalise fp16 | matches single-normalise within `atol=1e-3` | +| AC-8 | mutation check | input unchanged after call | +| AC-9 | `descriptor_metric()` | exact string `"inner_product"` | +| AC-10 | fp64 input | `DescriptorNormaliserError`; mentions fp16/fp32 | +| AC-11 | 2-D into single, 1-D into batch | `DescriptorNormaliserError` for each | +| AC-12 | importlinter / grep gate | no `components.*` imports | +| NFR-perf | microbench `l2_normalise` (D=512, 10k iterations on Tier-2 fixture) | p99 ≤ 50 µs; overhead ≤ 5 % | +| NFR-perf-batch | microbench `l2_normalise_batch` (N=1000, D=512, 10k iterations) | p99 ≤ 5 ms | + +## Constraints + +- Public surface frozen by `_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md` v1.0.0. +- Layer 1 Foundation only. +- Numpy is the single backend; numpy-CUDA may be used opportunistically but the contract surface is dtype-only (no GPU array types leak through). +- Static-only design satisfies `coderule.mdc`. +- No new dependency beyond what AZ-263 / E-BOOT pinned. + +## Risks & Mitigation + +**Risk 1: Silent dtype up-cast hides corpus-vs-query precision drift** +- *Risk*: A future change up-casts `float16` → `float32` "for numerical stability"; C10's corpus is built with fp32 normalisations while C2's queries are still fp16 raw embeddings; recall silently drops. +- *Mitigation*: AC-3 / AC-4 pin dtype preservation. The contract test is the canary. 
+ +**Risk 2: Whitening creep** +- *Risk*: A contributor adds optional whitening "for `MixVPR` only" inside this helper; one consumer calls it with `whiten=True`, the other doesn't; index becomes inconsistent. +- *Mitigation*: The contract `Non-Goals` explicitly excludes whitening. Whitening lives elsewhere (or not at all in v1.0.0). A whitening-related contract change is a major version with a forced index rebuild. + +**Risk 3: Zero-norm vectors crash the FAISS index build** +- *Risk*: Zero-norm input → `0 / 0` evaluates to NaN, which propagates into the FAISS index; the index becomes corrupt. +- *Mitigation*: AC-5 pins the zero-vector handling: zero in → zero out, no NaN. C10 / C2 are responsible for filtering zero descriptors before / after this helper if they don't want them in the index. + +## Runtime Completeness + +- **Named capability**: L2 normalisation aligning cosine-similar embeddings to FAISS inner-product metric (architecture / E-CC-HELPERS / `08_helper_descriptor_normaliser.md`). +- **Production code that must exist**: real numpy-backed L2 normalisation; real dtype-preserving path; real zero-norm-safe handling. +- **Allowed external stubs**: none — numpy is stdlib-tier production dep. +- **Unacceptable substitutes**: silent dtype up-cast; `np.divide(x, np.linalg.norm(x))` without zero-norm guard (NaN propagation); hard-coded metric string in C6 / C10 instead of consulting `descriptor_metric()`. + +## Contract + +This task produces the contract at `_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md`. +Consumers MUST read that file — not this task spec — to discover the interface.
diff --git a/_docs/02_tasks/todo/AZ-291_c13_writer_thread.md b/_docs/02_tasks/todo/AZ-291_c13_writer_thread.md new file mode 100644 index 0000000..e519cd5 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-291_c13_writer_thread.md @@ -0,0 +1,171 @@ +# C13 Writer Thread + Segment File Lifecycle + +**Task**: AZ-291_c13_writer_thread +**Name**: C13 Writer Thread +**Description**: Implement the single-writer thread that drains every onboard producer's `FdrClient` SPSC ring buffer and persists records to per-flight segment files on the companion's NVM. Owns segment file open/append/close, atomic per-segment rotation when the configured per-segment size cap is reached, and the cross-process FDR-root `filelock` so the operator-side post-flight reader cannot collide with an in-flight writer. This task is the foundation every other E-C13 task (header/footer accounting, 64 GB cap policy, mid-flight tile snapshot, thumbnail rate cap, takeoff abort) builds on. +**Complexity**: 5 points +**Dependencies**: AZ-263_initial_structure, AZ-272_fdr_record_schema, AZ-273_fdr_client_ringbuf, AZ-266_log_module, AZ-269_config_loader +**Component**: c13_fdr (epic AZ-248 / E-C13) +**Tracker**: AZ-291 +**Epic**: AZ-248 (E-C13) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — wire format for every record this thread serialises to the segment file. +- `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md` — defines `pop_one()` / `drain()` consumer-side surface this thread invokes per registered producer. +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — operational log shape this thread uses for ERROR/WARN/INFO messages (segment open/rotate/write failure). +- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config object that carries `flight_root`, segment-size cap, and registered producer set. 
+ +## Problem + +Every onboard component publishes FDR records via its `FdrClient` SPSC ring buffer (AZ-273), but those buffers are write-only from the producer side. Without a single, contract-frozen writer thread: + +- Buffers fill up and overruns dominate within seconds — the AC-NEW-3 "no silent drops" guarantee is unenforceable because nothing drains them. +- No segment file ever lands on disk — post-flight retrieval has nothing to read. +- Multiple ad-hoc writers would race on segment rotation, corrupting partially-written records. +- Operator workstation reads (post-flight via E-C12) and a misbehaving "still flying" writer process would race on the FDR root without `filelock` enforcement. + +This task delivers exactly one thread that owns the entire write side of the FDR. + +## Outcome + +- A single `FileFdrWriter` instance, constructed once per flight by the composition root, runs one background thread that consumes records from every registered producer's `FdrClient` and appends them to the current open segment file in the per-flight directory under `flight_root`. +- Segment files roll over atomically when the configured per-segment size cap is reached: the current segment is closed and `fsync`ed, the next segment is opened via `atomicwrites`, and the writer continues without dropping records or losing wire-format alignment. +- The FDR root holds a `filelock` for the entire flight; the operator-side reader (future E-C12 retrieval task) MUST acquire the same lock before reading. Two airborne writer processes against the same `flight_root` is a constructor-time `FdrConcurrentWriterError`. +- A mid-flight filesystem write failure (ENOSPC, EIO) is logged via the shared logger at ERROR + a STATUSTEXT alert is requested through the C8 GCS adapter; the writer transitions to a degraded "drop-and-log" mode so the rest of the system keeps emitting external positions, but operators are alerted. 
+ +## Scope + +### Included + +- `FileFdrWriter(flight_root: Path, config: FdrWriterConfig, fdr_clients: Sequence[FdrClient], gcs_alert: Callable[[str], None])` constructor. +- `start()` method that opens segment 0 under `flight_root/<flight_id>/segment-0000.fdr`, acquires the FDR-root `filelock`, and starts the background thread. +- `stop()` method that signals the thread to drain remaining records, closes the current segment with `fsync`, releases the `filelock`, and joins. +- Background thread loop: per registered producer, call `drain(max_records=batch_size)` (batch size from config), serialise each `FdrRecord` via `fdr_record_schema.serialise`, append to the current segment with a length-prefixed framing identical to what `parse` reads, and rotate when the segment exceeds the per-segment size cap. +- Atomic per-segment rotation using `atomicwrites`: open the next segment under a temp path, swap to the canonical name only after the previous segment is closed + `fsync`ed. +- Cross-process `filelock` on `flight_root/.fdr.lock` held for the entire flight; constructor-time `FdrConcurrentWriterError` if the lock is already held. +- Mid-flight write failure handling: catch `OSError` around segment append/rotate, log ERROR via the shared logger (`kind="fdr.write_failure"`), invoke `gcs_alert(message)`, set internal `is_degraded = True`. Subsequent `drain` calls continue to dequeue records (so producer buffers don't grow unboundedly) but discard them with a per-second-rate-capped ERROR log; recovery is out of scope (operator must land + retry). +- Public read-only introspection: `current_segment_path() -> Path`, `current_segment_bytes() -> int`, `segments_written() -> int`, `is_rolling() -> bool` (true while a rotation is in progress). +- Diagnostic INFO log on `start()` and on each successful segment rotation; DEBUG log per record only when explicitly enabled in config (defaults off — DEBUG-per-record would flood at 100 Hz aggregate).
+- Filesystem layout: `flight_root/<flight_id>/segment-NNNN.fdr` (4-digit zero-padded segment number, `.fdr` suffix). The `<flight_id>` directory is created on `start()` from `FlightHeader.flight_id` (header content is owned by AZ-248-2 / task #2; this task accepts the flight_id as a constructor argument or via an open-time setter). + +### Excluded + +- `FlightHeader` / `FlightFooter` records and `records_written` / `records_dropped_overrun` accounting — owned by task #2 of this epic. +- 64 GB total-flight cap + oldest-segment-dropped policy + `kind="segment_rollover"` record emission — owned by task #3 of this epic. (This task implements per-segment-size rotation only; per-flight-cap enforcement is a higher policy layer that observes segments rolled by this task.) +- Mid-flight tile snapshot path + `kind="mid_flight_tile_snapshot"` payload handling — owned by task #4. +- Failed-tile thumbnail rate limiter + AC-8.5 `RawFrameWriteForbiddenError` enforcement — owned by task #5. +- Takeoff abort wiring on `FdrOpenError` — owned by task #6. +- Producer-side `FdrClient` ring buffer + `on_overrun` policy — owned by AZ-273 + AZ-274. +- Post-flight segment file reader — out of scope this cycle (future E-C12 task). +- `FdrRecord` schema and `serialise` / `parse` implementations — owned by AZ-272.
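The length-prefixed framing and per-segment rotation listed under Included can be sketched with the standard library alone. Several things here are assumptions for illustration: the 4-byte big-endian length prefix (the wire format is owned by AZ-272), the `SegmentAppender` and `read_frames` names, and the rotate-before-overflow policy. The `atomicwrites` temp-path swap, the `filelock`, and the background thread are deliberately left out.

```python
import os
import struct
import tempfile
from pathlib import Path

_LEN = struct.Struct(">I")  # assumed prefix width; the real framing is owned by AZ-272


class SegmentAppender:
    """Appends length-prefixed payloads and rotates at a per-segment size cap."""

    def __init__(self, flight_dir: Path, segment_size_bytes: int):
        self._dir = flight_dir
        self._cap = segment_size_bytes
        self._index = 0
        self._bytes = 0
        self._dir.mkdir(parents=True, exist_ok=True)
        self._fh = open(self._path(0), "ab")

    def _path(self, index: int) -> Path:
        return self._dir / f"segment-{index:04d}.fdr"

    def append(self, payload: bytes) -> None:
        frame = _LEN.pack(len(payload)) + payload
        # Rotate before the write would cross the cap (never on an empty segment).
        if self._bytes > 0 and self._bytes + len(frame) > self._cap:
            self._close_current()
            self._index += 1
            self._bytes = 0
            self._fh = open(self._path(self._index), "ab")
        self._fh.write(frame)
        self._bytes += len(frame)

    def _close_current(self) -> None:
        # Flush and fsync so the closed segment is durable before moving on.
        self._fh.flush()
        os.fsync(self._fh.fileno())
        self._fh.close()

    def close(self) -> None:
        self._close_current()


def read_frames(path: Path) -> list:
    """Parses one segment as a stream of length-prefixed payloads."""
    data, offset, frames = path.read_bytes(), 0, []
    while offset < len(data):
        (size,) = _LEN.unpack_from(data, offset)
        offset += _LEN.size
        frames.append(data[offset:offset + size])
        offset += size
    return frames


def demo_round_trip(records, segment_size_bytes):
    """Writes records into a temp flight directory and reads every segment back."""
    with tempfile.TemporaryDirectory() as tmp:
        flight_dir = Path(tmp) / "flight-0001"  # hypothetical flight_id
        writer = SegmentAppender(flight_dir, segment_size_bytes)
        for record in records:
            writer.append(record)
        writer.close()
        paths = sorted(flight_dir.iterdir())
        names = [p.name for p in paths]
        frames = [frame for p in paths for frame in read_frames(p)]
    return names, frames
```

Because every frame is written whole into exactly one segment, each segment file stays independently parseable and concatenating the segments in name order reproduces the original record sequence.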
+ +## Acceptance Criteria + +**AC-1: Single writer thread drains every registered producer** +Given 3 `FdrClient` instances each with 100 records buffered +When `FileFdrWriter.start()` is called and the test waits 1 s +Then segment 0 on disk contains all 300 records (parsed via `fdr_record_schema.parse` in deterministic order per-producer, interleaving allowed across producers) + +**AC-2: Per-segment rotation at configured size cap** +Given `FdrWriterConfig.segment_size_bytes = 4096` and a producer enqueuing fixed-size records that cross 4096 bytes after N writes +When the writer runs +Then segment 0 on disk is ≤ 4096 bytes (within one record's worth of overshoot), segment 1 is opened atomically, and `parse(segment_0_bytes ++ segment_1_bytes)` yields all records in order with no truncation, no overlap, and no corruption at the rotation boundary + +**AC-3: Atomic rotation does not lose records under crash** +Given a writer that has just appended a record to segment N and is mid-rotation to segment N+1 +When the test simulates a crash (kill before `atomicwrites` finalises N+1) +Then on restart segment N is intact and parseable to the last record before rotation; segment N+1 either does not exist or is intact and parseable from offset 0 — there is no half-written intermediate file at the canonical segment N+1 path + +**AC-4: Cross-process filelock prevents concurrent writers** +Given `FileFdrWriter` is running and holds the lock at `flight_root/.fdr.lock` +When a second `FileFdrWriter` constructor is called against the same `flight_root` +Then the second constructor raises `FdrConcurrentWriterError` and does NOT create a second writer thread or touch any segment file + +**AC-5: Mid-flight ENOSPC degrades gracefully + alerts via GCS** +Given the writer is running and the underlying filesystem returns `OSError(ENOSPC)` on the next segment append +When the writer encounters the failure +Then (a) one ERROR log record is emitted with `kind="fdr.write_failure"` carrying 
`errno=ENOSPC`, (b) `gcs_alert(message)` is invoked exactly once with a message identifying the failure, (c) `is_degraded` becomes True, (d) subsequent `drain` calls still dequeue from the producer buffers (no unbounded growth on the producer side), (e) the per-second ERROR-log cap kicks in if the failure repeats (≤ 1 ERROR/sec related to write failures) + +**AC-6: stop() drains, fsyncs, releases lock** +Given a running writer with N records buffered across all producers +When `stop()` is called +Then (a) all N records are appended and `fsync`ed before the method returns, (b) the FDR-root `filelock` is released (a subsequent constructor against the same `flight_root` succeeds), (c) the current segment file is closed and not held open by any descriptor + +**AC-7: Segment file layout is exactly `/segment-NNNN.fdr`** +Given `flight_id="abc123-def4-..."` and 3 segment rotations during the flight +When `stop()` returns +Then `flight_root/abc123-def4-.../` contains exactly `segment-0000.fdr`, `segment-0001.fdr`, `segment-0002.fdr`, `segment-0003.fdr` (and nothing else from this writer); each is independently parseable as a stream of length-prefixed `FdrRecord`s + +**AC-8: Steady-state writer thread does not block any producer** +Given a producer enqueuing at 200 Hz steady-state and a writer-thread that takes 4 ms to serialise + append a record (well under the per-record budget) +When the test runs for 60 s +Then the producer's `FdrClient` reports zero `EnqueueResult.OVERRUN` results from this scenario (the writer keeps up with steady state; overrun under burst is a separate concern owned by AZ-273 + AZ-274) + +## Non-Functional Requirements + +**Performance** +- Aggregate writer throughput ≥ 200 Hz sustained on Tier-2 (Jetson Orin Nano Super) under the workload defined by C13-PT-01 (~100 Hz combined producer rate). Headroom of 2× is the design margin. +- Per-record serialise + append p95 ≤ 5 ms (matches C13-PT-01 budget). 
+- Segment rotation completes in ≤ 50 ms p99 (so a rotation does not stall the writer past one record's worth of producer buffer headroom). +- `start()` returns within 100 ms after segment 0 is open and the thread is running (not blocking takeoff readiness). + +**Reliability** +- The writer thread NEVER raises into the constructor's caller after `start()` returns. All runtime errors are caught and either (a) logged + degraded, or (b) coerced into a `stop()`-and-rethrow path that the composition root observes via a documented exit hook. +- Segment files are append-only between rotations: the writer NEVER seeks backward, NEVER overwrites a closed segment, NEVER truncates the current segment. +- `fsync` is called after every segment rotation (so a power loss preserves all closed segments). Per-record `fsync` is NOT required; the per-segment cap is the durability boundary. + +**Concurrency** +- The writer thread is the ONLY consumer of every registered producer's `FdrClient` (matches AZ-273's SPSC contract — each `FdrClient` has exactly one consumer thread; this is it). +- The `start()` / `stop()` methods are NOT thread-safe with respect to each other; the composition root calls each exactly once per `FileFdrWriter` lifetime.
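The seal-then-swap ordering required of a rotation can be sketched with the standard library alone. This is an illustration under stated assumptions, not the task's implementation: it uses `os.replace` in place of `atomicwrites` so the block stays self-contained, `segment_path` and `rotate_segment` are hypothetical helpers, the `.tmp` suffix is invented, and the directory `fsync` assumes a POSIX filesystem.

```python
import os
from pathlib import Path
from typing import BinaryIO


def segment_path(flight_dir: Path, index: int) -> Path:
    """Canonical segment name: 4-digit zero-padded index, .fdr suffix."""
    return flight_dir / f"segment-{index:04d}.fdr"


def rotate_segment(current: BinaryIO, flight_dir: Path, next_index: int) -> BinaryIO:
    """Seal the current segment, then publish the next one atomically."""
    # 1. Seal the previous segment: flush, fsync, close.
    current.flush()
    os.fsync(current.fileno())
    current.close()

    # 2. Create the next segment under a temporary name (hypothetical suffix).
    final = segment_path(flight_dir, next_index)
    tmp = final.with_name(final.name + ".tmp")
    nxt = open(tmp, "ab")

    # 3. Swap to the canonical name only after step 1 has completed.
    os.replace(tmp, final)

    # 4. fsync the directory so the rename itself survives a power loss.
    dir_fd = os.open(flight_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return nxt
```

On POSIX the open handle stays valid across the rename, so appends continue against the canonical path. A crash before step 3 leaves only the temporary file behind, never a half-created file at the canonical name.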
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | 3 FdrClients × 100 buffered records → start writer, wait, parse segment 0 | All 300 records present, in-order per-producer | +| AC-2 | segment_size_bytes=4096; emit fixed-size records across the cap | Segment 0 ≤ 4096 + 1 record overshoot; segment 1 contains the rest; concatenated parse yields all records in order | +| AC-3 | Kill writer mid-rotation (after segment N close, before segment N+1 finalise) | On restart, segment N parses cleanly; segment N+1 is either absent or parseable from offset 0 | +| AC-4 | Two FileFdrWriter constructors against the same flight_root | Second raises `FdrConcurrentWriterError`; first remains untouched | +| AC-5 | Inject `OSError(ENOSPC)` on segment append | One ERROR log; gcs_alert called once; is_degraded=True; producers still drained; subsequent failures log-rate-capped | +| AC-6 | stop() with N records buffered | All N records on disk; fsync called; filelock released | +| AC-7 | Run a 3-rotation flight, inspect filesystem | Exactly 4 files: `segment-0000.fdr` through `segment-0003.fdr` | +| AC-8 | 200 Hz producer, 60 s, writer running | Zero overrun results from steady-state load | +| NFR-perf-throughput | C13-PT-01 microbench | ≥ 200 Hz sustained on Tier-2 | +| NFR-perf-rotation | Microbench rotation step | p99 ≤ 50 ms | +| NFR-reliability-fsync | Track fsync calls during a 5-segment flight | fsync called once per segment close | +| NFR-reliability-no-seek | Open the segment file with a tracing layer; assert no `lseek` backward | No backward seeks observed | + +## Constraints + +- One concrete writer per project (`FileFdrWriter`); no `FdrWriter` Protocol abstraction unless and until a second writer is needed (per architecture description.md "single concrete `FileFdrWriter` behind a `FdrWriter` interface" — the interface is the boundary the composition root injects against, but only one implementation exists this 
cycle). +- Segment files use the same wire format as `serialise` / `parse` from AZ-272 (fdr_record_schema). The framing on disk is length-prefixed records back-to-back (length is a `uint32` little-endian header before each `serialise`d byte string); the framing is documented in the implementation report and is internal to C13 — no separate contract file this cycle. +- Dependencies pinned at AZ-263 / E-BOOT only: `atomicwrites`, `filelock`. No new project dependency is introduced by this task. +- The per-segment size cap and batch size for `drain()` are config-driven via `FdrWriterConfig` from `composition_root_protocol`; defaults are documented in the implementation report and chosen so steady-state Tier-2 throughput passes C13-PT-01. +- The writer thread runs at NORMAL priority. No real-time scheduling. The "writer must keep up at 200 Hz" budget is met by serialisation efficiency, not by priority elevation. +- Cross-process safety is `flight_root`-scoped, not segment-scoped. The lock is acquired ONCE on `start()` and released ONCE on `stop()`. + +## Risks & Mitigation + +**Risk 1: `atomicwrites` fsyncs the directory on Linux but the underlying filesystem doesn't honour it** +- *Risk*: The Tier-2 filesystem (likely ext4 on the Jetson NVM) honours `fsync` but in degraded conditions (e.g. overlayfs, tmpfs for fixtures) the rotation atomicity guarantee weakens. +- *Mitigation*: AC-3 explicitly tests under a real ext4 mount (or `tmpfs` with documented caveat); the implementation report documents the supported filesystem set. + +**Risk 2: Single writer thread becomes a bottleneck when a producer suddenly bursts** +- *Risk*: The writer thread serves N producers serially within a `drain` loop; one slow producer's records starve others. +- *Mitigation*: `drain(max_records=batch_size)` enforces fair round-robin across producers — each producer's batch is bounded so no single producer monopolises a tick. 
AC-8 measures steady-state behaviour; burst-handling lives in producer-side overrun policy (AZ-274). + +**Risk 3: `filelock` held across an unclean exit leaves the flight_root locked** +- *Risk*: Companion process killed (e.g. brownout) without `stop()` running; next boot finds the lock file present and refuses to construct a new writer. +- *Mitigation*: On Unix, `filelock` takes an advisory lock through the `fcntl` module (`fcntl.flock`); the kernel releases it automatically when the process dies. The lock file itself may linger but the lock state does not. Documented in the implementation report; AC-4 verifies the live-process case. + +**Risk 4: ENOSPC degraded mode produces unbounded log records** +- *Risk*: A persistent ENOSPC under sustained load could emit 200 ERROR records per second. +- *Mitigation*: Per-second rate cap on `kind="fdr.write_failure"` ERROR records (AC-5e). The first failure is always emitted; subsequent failures within the same second are coalesced. + +## Runtime Completeness + +- **Named capability**: single-writer thread + segment file lifecycle (architecture / E-C13 / AC-NEW-3 every-payload-class-from-t=0; no silent drops). +- **Production code that must exist**: real background thread, real `drain` loop across registered FdrClients, real segment file open/append/close with `atomicwrites`, real `filelock` acquire/release on `flight_root`, real ENOSPC handler with shared-logger ERROR + GCS alert. +- **Allowed external stubs**: tests MAY substitute a `FakeGcsAlert` (collects messages); production wiring uses the real C8 GCS adapter via the composition root. +- **Unacceptable substitutes**: `time.sleep`-driven polling without a real producer-buffer drain, in-memory list "for now" instead of segment files on disk, `pickle` or any non-`fdr_record_schema` serialiser, omitting `fsync` ("we'll add durability later"), or omitting `filelock` ("companion is single-process anyway").
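The per-second cap on write-failure ERROR records (AC-5e, Risk 4) could look roughly like the sketch below. `PerSecondLogCap` is a hypothetical name, and the block assumes one particular reading of "per second": an ERROR is emitted only when at least one second has passed since the last emitted one. A fixed wall-clock-second window would satisfy the same ≤ 1 ERROR/sec bound.

```python
import time
from typing import Callable, Optional


class PerSecondLogCap:
    """Allow at most one emission per second; count what was coalesced."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so tests need not sleep
        self._last_emit: Optional[float] = None
        self.suppressed = 0

    def should_emit(self) -> bool:
        now = self._clock()
        # The first failure is always emitted; later ones only after >= 1 s.
        if self._last_emit is None or now - self._last_emit >= 1.0:
            self._last_emit = now
            return True
        self.suppressed += 1
        return False
```

The writer's degraded path would call `should_emit()` before logging each discarded batch. The `suppressed` counter is an illustrative extra that lets the next emitted ERROR report how many failures were coalesced.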
diff --git a/_docs/02_tasks/todo/AZ-292_c13_flight_header_footer.md b/_docs/02_tasks/todo/AZ-292_c13_flight_header_footer.md new file mode 100644 index 0000000..9e981f6 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-292_c13_flight_header_footer.md @@ -0,0 +1,150 @@ +# C13 FlightHeader / FlightFooter + Accounting + +**Task**: AZ-292_c13_flight_header_footer +**Name**: C13 Flight Header/Footer + Accounting +**Description**: Wire the writer thread's flight-lifetime contract: an `open_flight(header: FlightHeader)` method that emits a single `kind="flight_header"` record as the first record of segment 0, a `close_flight() -> FlightFooter` method that emits a single `kind="flight_footer"` record as the last record before drain + stop, and the cross-flight running counters (`records_written`, `records_dropped_overrun`, `bytes_written`, `rollover_count`) that the footer reports. This is what makes a flight directory self-describing — without it, post-flight tooling cannot verify completeness or attribute drops to producers. +**Complexity**: 3 points +**Dependencies**: AZ-291_c13_writer_thread, AZ-272_fdr_record_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module +**Component**: c13_fdr (epic AZ-248 / E-C13) +**Tracker**: AZ-292 +**Epic**: AZ-248 (E-C13) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — defines the canonical shape of `kind="flight_header"` and `kind="flight_footer"` payloads (consumed: every required field on each kind). +- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config snapshot + signing-key-rotation-event + manifest-content-hashes the composition root passes into the FlightHeader. 
+ +## Problem + +The writer thread from AZ-291 drains and persists FdrRecords, but at flight-time there is currently no canonical first record (which would identify the flight + carry the build/config snapshot the operator needs to reproduce post-flight) and no canonical last record (which would close the flight + report what was actually written vs. dropped). Without: + +- A `flight_header` record written FIRST, the operator post-flight has no flight_id, no build manifest hash, no config snapshot — so the FDR cannot be uniquely attributed and provenance is broken. +- A `flight_footer` record written LAST, post-flight tooling cannot distinguish a clean shutdown from a power-loss truncation, and AC-NEW-3 traceability ("how many records were dropped per producer this flight") has no canonical answer. +- Flight-wide running counters fed into the footer, the AC-NEW-3 "every drop visible" guarantee degrades into "every drop visible only inside individual records" — there is no single number the operator can audit at landing time. + +## Outcome + +- The writer's `open_flight(header)` method opens segment 0 (the path is created by AZ-291's `start()`) and writes a `kind="flight_header"` record as the first record on disk; `open_flight` returning successfully is the precondition every other onboard component uses to consider the FDR "ready" (this is the AC-NEW-3 every-payload-class-from-t=0 readiness gate the takeoff path checks — task #6 wires the gate, this task makes it observable). +- The writer maintains four monotonic counters across the entire flight: `records_written` (incremented on every successful append), `records_dropped_overrun` (incremented when the writer observes a `kind="overrun"` record from any producer — `payload.dropped_count` is added), `bytes_written` (cumulative serialised bytes), `rollover_count` (incremented on each per-segment rotation from AZ-291).
+- The writer's `close_flight()` method writes a single `kind="flight_footer"` record carrying those four counters + flight-end timestamp + flight_id, drains remaining records (per AZ-291's `stop()` contract), `fsync`s, releases the filelock, and returns the same FlightFooter to the caller. +- The `FlightFooter` is the canonical authoritative summary: post-flight tooling that finds a footer record with mismatched counts vs. the actual segment file contents reports a corruption finding; tooling that does NOT find a footer record marks the flight as truncated. + +## Scope + +### Included + +- `FlightHeader` dataclass: `flight_id: UUID`, `flight_started_at_iso: str`, `flight_started_at_monotonic_ns: int`, `config_snapshot: dict`, `signing_key_rotation_event: dict`, `manifest_content_hashes: dict[str, str]`, `build_info: dict` (commit hash, build date, BUILD_* flag set per ADR-002). +- `FlightFooter` dataclass: `flight_id: UUID`, `flight_ended_at_iso: str`, `flight_ended_at_monotonic_ns: int`, `records_written: int`, `records_dropped_overrun: int`, `bytes_written: int`, `rollover_count: int`, `clean_shutdown: bool`. +- `FileFdrWriter.open_flight(header: FlightHeader) -> None` (extends AZ-291's writer): validates `header.flight_id` matches the `flight_id` `start()` was constructed with; serialises `header` into a `kind="flight_header"` `FdrRecord` (envelope `producer_id="shared.fdr_client"`); appends as the first record of segment 0; raises `FdrOpenError` on failure (the actual takeoff-abort wiring is task #6, this task only raises the right exception type). +- `FileFdrWriter.close_flight() -> FlightFooter` (extends AZ-291's writer): synthesises the `FlightFooter` from the running counters; serialises into a `kind="flight_footer"` `FdrRecord`; appends as the last record before drain-and-stop; returns the FlightFooter to the caller. 
+- Counter integration with AZ-291's writer loop: `records_written` increments on each successful `serialise + append`; `bytes_written` increments by `len(serialised)`; `rollover_count` increments per AZ-291's rotation event; `records_dropped_overrun` is updated by inspecting incoming `kind="overrun"` records and adding `payload.dropped_count`. +- `current_size_bytes() -> int` and `is_rolling() -> bool` exposed on the writer (interface methods promised by `_docs/02_document/components/14_c13_fdr/description.md` § 2). `current_size_bytes` returns the cumulative `bytes_written`; `is_rolling` is task #1's per-segment-rotation flag re-exposed here for completeness of the public surface. +- A diagnostic INFO log on `open_flight` (one record: `kind="fdr.flight_open"; flight_id`) and `close_flight` (one record: `kind="fdr.flight_close"; records_written; records_dropped_overrun; bytes_written; rollover_count; clean_shutdown`). +- A `clean_shutdown=True` set by `close_flight`; `False` if the writer detects it is being torn down without `close_flight` ever called (e.g. via a process-exit hook the composition root installs — wiring of the hook is owned by the composition root, this task only writes the path that decides the flag value). + +### Excluded + +- Background writer thread + segment file lifecycle — owned by AZ-291. +- 64 GB total-flight cap + oldest-segment-dropped + `kind="segment_rollover"` record emission — owned by task #3 (the `rollover_count` this task maintains is incremented PER SEGMENT regardless of whether the cap-policy task is online; once task #3 ships, segment_rollover records are emitted on top of the existing per-segment rotations from task #1). +- Mid-flight tile snapshot path / failed-tile thumbnail rate cap — tasks #4 and #5. 
+- `FdrOpenError`-driven takeoff abort wiring in the composition root — owned by task #6 (this task only raises the right exception type from `open_flight`; the abort path that translates the exception into "do NOT open the FC adapter" is the next task). +- Composing the `FlightHeader` content (config snapshot, signing key state, manifest hashes) — that is the composition root's responsibility; this task accepts the constructed header. +- Process-exit hook installation — owned by the composition root; this task only sets the `clean_shutdown` flag based on whether `close_flight` was reached. + +## Acceptance Criteria + +**AC-1: flight_header is the first record of segment 0** +Given a valid `FlightHeader` and a constructed-but-not-yet-started writer +When `start()` followed by `open_flight(header)` runs +Then segment 0's first parsed record is `FdrRecord(kind="flight_header", payload=)` with `payload.flight_id == header.flight_id` and the record sits at byte offset 0 (no other record precedes it) + +**AC-2: flight_footer is the last record before clean stop** +Given a writer with N producer records appended and `clean_shutdown` reachable +When `close_flight()` is called +Then the last parsed record across all segments is `FdrRecord(kind="flight_footer", payload=)` with `clean_shutdown=True`; the returned FlightFooter equals the on-disk footer payload deep-equal + +**AC-3: counters reflect actual on-disk reality** +Given a flight with R producer records, D overrun-record drops, S segment rotations +When `close_flight()` runs and the test parses the footer +Then `records_written == R + 2` (the +2 is the header + footer themselves), `records_dropped_overrun == D`, `bytes_written == sum(len(serialised(r)) for r in [header, *records, footer])`, `rollover_count == S` + +**AC-4: open_flight raises FdrOpenError on disk failure** +Given a `flight_root` whose segment 0 path cannot be opened (e.g. 
read-only mount) +When `open_flight(header)` runs +Then `FdrOpenError` is raised; no `flight_header` record lands on disk; the writer is in the `start()`-failed state with the filelock released + +**AC-5: open_flight rejects flight_id mismatch** +Given a writer constructed with `flight_id=A` and an `open_flight(header)` where `header.flight_id=B` +When `open_flight` runs +Then `FdrOpenError` is raised with a message naming the mismatch; no `flight_header` record lands on disk + +**AC-6: close_flight without open_flight raises** +Given a writer where `start()` ran but `open_flight()` was never called +When `close_flight()` is called +Then `FdrCloseWithoutOpenError` is raised; no `flight_footer` is appended; the writer transitions to stopped (filelock released, segment closed if any data was written) + +**AC-7: clean_shutdown=False on unclean teardown** +Given a writer on which `start()` + `open_flight()` ran and which was then torn down via the composition-root process-exit hook (without `close_flight()` having been called) +When the test parses the resulting FDR directory +Then either (a) no `flight_footer` exists (truncated flight detected), OR (b) a `flight_footer` exists with `clean_shutdown=False` — implementation chooses; the contract is that `clean_shutdown=True` MUST NOT appear when `close_flight` was not called, but writing a partial footer is allowed + +**AC-8: records_dropped_overrun aggregates payload.dropped_count** +Given the writer observes 5 `kind="overrun"` records with `payload.dropped_count` values [3, 7, 2, 11, 4] +When `close_flight()` runs +Then `records_dropped_overrun == 27` (sum of all dropped_count values, NOT the count of overrun records — the count is observable from the records themselves) + +## Non-Functional Requirements + +**Performance** +- `open_flight` returns within 50 ms p99 (it serialises one record + appends it; no network or compute beyond `serialise`).
+- `close_flight` returns within 200 ms p99 for typical flights (it triggers the writer's drain-and-stop sequence, but the per-record cost is dominated by `fsync` and the typical residual buffer is small). +- Counter updates on the steady-state path add ≤ 0.5 µs per record (atomic increments; no locking — the writer thread is the sole mutator). + +**Reliability** +- The four counters are write-once-per-record from the writer thread (the writer is the sole mutator); reads from outside the thread (e.g. `current_size_bytes()`) MUST be atomic snapshots — Python's GIL covers this for `int`, but the implementation MUST NOT introduce any non-atomic compound update. +- `close_flight()` is idempotent on success: a second call returns the same FlightFooter without writing again, OR raises `FdrAlreadyClosedError` — implementation chooses; the contract test covers either outcome and asserts no double-write of the footer. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | start + open_flight + parse segment 0 | Record at offset 0 is `flight_header` with matching flight_id | +| AC-2 | open_flight + N producer records + close_flight | Last record across segments is `flight_footer`; returned footer == on-disk footer deep-equal; clean_shutdown=True | +| AC-3 | Run a flight with known R, D, S; parse footer counters | counters match (records_written, records_dropped_overrun, bytes_written, rollover_count) | +| AC-4 | open_flight against read-only flight_root | `FdrOpenError`; no header on disk; filelock released | +| AC-5 | open_flight with mismatched flight_id | `FdrOpenError`; message names the mismatch | +| AC-6 | close_flight without open_flight | `FdrCloseWithoutOpenError`; no footer written | +| AC-7 | start + open_flight + tear down without close_flight | No flight_footer OR flight_footer with clean_shutdown=False | +| AC-8 | Inject 5 overrun records with known dropped_counts | records_dropped_overrun == sum 
of dropped_count | +| NFR-perf-open | Microbench open_flight | p99 ≤ 50 ms | +| NFR-perf-close | Microbench close_flight | p99 ≤ 200 ms | +| NFR-perf-counters | Microbench writer loop with counter updates vs. without | overhead ≤ 0.5 µs per record | +| NFR-reliability-idempotent-close | call close_flight twice | second returns same footer OR raises FdrAlreadyClosedError; no double-write | + +## Constraints + +- `FlightHeader.config_snapshot` MUST be JSON-safe (no Python objects); the composition root is responsible for serialising the typed Config dataclass into a plain dict before constructing the header. +- `FlightHeader.manifest_content_hashes` MUST be a `dict[str, str]` of `{relative_path: sha256_hex}`; relative-path keys are repository-rooted (matches the helper from AZ-280 sha256_sidecar's invariants). +- The footer's `clean_shutdown` flag is the ONLY way to distinguish a graceful landing from a crash; do NOT add a separate "fault" record kind for this purpose. +- This task does NOT add new Python dependencies — `uuid`, `datetime`, and `time.monotonic_ns` are stdlib. + +## Risks & Mitigation + +**Risk 1: FlightHeader carries secrets via config_snapshot** +- *Risk*: A composition-root config block contains an API key (e.g. satellite-provider) and ends up in the FDR — operator workstations now hold credentials in plain JSON. +- *Mitigation*: The composition root scrubs known-secret fields (per the redacted-config helper from AZ-269) before constructing the header. AC validation here checks the dict is JSON-safe; the secret-scrub is owned by the composition root and is out of scope for this task. Documented in the constraints. + +**Risk 2: Counters drift under writer-thread crash** +- *Risk*: A crash mid-flight leaves the in-memory counters un-flushed; the post-flight reader infers different counts from segment-walking than the (absent) footer. 
+- *Mitigation*: The footer is the authoritative summary on clean shutdown; on crash the operator MUST re-derive counters from segment scan and treat the absence of a footer as a known signal. AC-7 covers this. + +**Risk 3: open_flight side-effects on failure** +- *Risk*: `open_flight` opens segment 0, writes a partial header, then fails — leaving a half-written first record on disk. +- *Mitigation*: `open_flight` writes the header via `serialise(header_record)` first, computes the byte string, then performs a single `write()` + `fsync()`; on failure the segment file is closed and unlinked (since segment 0 is empty by construction at this point, deletion is safe). AC-4 covers this. + +## Runtime Completeness + +- **Named capability**: per-flight self-describing FDR (architecture / E-C13 / AC-NEW-3 every-payload-class-from-t=0; AC-NEW-3 audit trail). +- **Production code that must exist**: real `FlightHeader` and `FlightFooter` dataclasses, real header/footer record append paths, real four-counter accounting in the writer-thread loop, real `clean_shutdown` flag. +- **Allowed external stubs**: none — the header/footer + counters are the production runtime audit capability. +- **Unacceptable substitutes**: header-or-footer-only emission ("we'll add the other one later"), counter values stored only in logs ("the log file is the audit trail"), or counters that DON'T include header/footer in `records_written` ("only producer records count") — the latter would force operators to do special-case math at audit time and is exactly the kind of off-by-N bug AC-NEW-3 traceability is meant to prevent. 
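The four-counter accounting this task describes can be sketched as a small dataclass updated from the writer loop. This is a hedged illustration: `FlightCounters`, `on_append` and `on_rotation` are hypothetical names, and the record is reduced to its `kind`, its payload mapping and its serialised length rather than a real `FdrRecord`.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass
class FlightCounters:
    records_written: int = 0
    records_dropped_overrun: int = 0
    bytes_written: int = 0
    rollover_count: int = 0

    def on_append(self, kind: str, payload: Mapping[str, Any], serialised_len: int) -> None:
        """Call once per successful serialise + append (header and footer included)."""
        self.records_written += 1
        self.bytes_written += serialised_len
        if kind == "overrun":
            # Sum the dropped_count values, not the number of overrun records.
            self.records_dropped_overrun += int(payload["dropped_count"])

    def on_rotation(self) -> None:
        """Call once per per-segment rotation."""
        self.rollover_count += 1
```

With the AC-8 inputs (five overrun records carrying 3, 7, 2, 11 and 4) the sketch reports `records_dropped_overrun == 27`. Because AC-3 counts the footer in `records_written` and `bytes_written`, an implementation would presumably have to account for the footer before serialising it; that ordering is not shown here.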
diff --git a/_docs/02_tasks/todo/AZ-293_c13_capacity_cap_policy.md b/_docs/02_tasks/todo/AZ-293_c13_capacity_cap_policy.md new file mode 100644 index 0000000..2e7b373 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-293_c13_capacity_cap_policy.md @@ -0,0 +1,158 @@ +# C13 64 GB Capacity Cap + Oldest-Segment-Dropped Policy + +**Task**: AZ-293_c13_capacity_cap_policy +**Name**: C13 Capacity Cap Policy +**Description**: Enforce the per-flight ≤ 64 GB cap from AC-NEW-3 by observing the segment files written by the writer thread (AZ-291), deleting the oldest CLOSED segment when the cumulative on-disk size of the flight directory crosses the configured cap (default 64 GB; configurable down for tests), and emitting a `kind="segment_rollover"` `FdrRecord` carrying the dropped segment number, byte count freed, and total bytes after the drop. The drop is ALWAYS recorded — there is no config flag that silences `segment_rollover` records (per AC-NEW-3 + ADR-008 + C13-ST-01). The currently-open segment is NEVER dropped; only sealed segments older than the current one are eligible. +**Complexity**: 5 points +**Dependencies**: AZ-291_c13_writer_thread, AZ-292_c13_flight_header_footer, AZ-272_fdr_record_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module +**Component**: c13_fdr (epic AZ-248 / E-C13) +**Tracker**: AZ-293 +**Epic**: AZ-248 (E-C13) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — defines the canonical shape of `kind="segment_rollover"` payloads (consumed: `old_segment`, `new_segment`, `total_bytes_after`). +- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config block carrying `flight_cap_bytes` (default 64 GB; lowered in tests). + +## Problem + +The writer thread from AZ-291 rotates per-segment when the per-segment size cap is reached, but does NOT enforce the per-flight 64 GB cap from AC-NEW-3. 
Without: + +- Drop policy when the flight directory crosses 64 GB, the writer would either run out of disk (likely on the Jetson NVM where other binaries live) or fail with `ENOSPC` and degrade per AC-5 of AZ-291. AC-NEW-3 requires the cap to be ENFORCED, not detected. +- Oldest-segment-dropped semantics, the cap could be enforced by truncating the current segment — which would corrupt records mid-write and break the wire-format invariant from AZ-272. +- A `kind="segment_rollover"` record per drop, the drop is silent — directly violating AC-NEW-3 ("no silent drops") and the C13-ST-01 security test ("no config flag silences these record kinds"). The drop record is ALSO the post-flight tooling's only way to learn that the flight USED to have records the file directory no longer contains. + +## Outcome + +- After every per-segment rotation the writer performs (AZ-291), this task checks whether the flight directory's cumulative on-disk byte size exceeds `flight_cap_bytes`. If yes, it deletes the oldest CLOSED segment (segment 0 first, then segment 1, etc., never the currently-open segment) and repeats until the directory size is back under cap. +- For each drop, a `kind="segment_rollover"` `FdrRecord` is enqueued via the shared `FdrClient` for `producer_id="shared.fdr_client"`. The record carries `payload.old_segment` (the segment number that was deleted), `payload.new_segment` (the writer's currently-open segment number), and `payload.total_bytes_after` (the post-drop on-disk byte count). +- The cap is configurable via `composition_root_protocol`'s `flight_cap_bytes` field (default 64 GB; tests use 4 KiB or similar to exercise the policy without filling real disks). 
+- The cap policy NEVER drops the currently-open segment (would interrupt mid-record); NEVER drops `segment-0000.fdr` if it contains the `flight_header` UNLESS the directory is so over-cap that no other segment exists to drop (in that case the operator's flight has exceeded what the cap can absorb and a hard ERROR + GCS alert path is triggered, distinct from the normal drop path). +- The post-flight reader uses the sequence of `segment_rollover` records to reconstruct what was dropped vs. what was retained, and the `FlightFooter`'s `rollover_count` (from AZ-292) reports the total number of cap-driven drops. + +## Scope + +### Included + +- A `CapacityCapPolicy(writer: FileFdrWriter, cap_bytes: int, fdr_client: FdrClient)` class wired into `FileFdrWriter` via a documented post-rotation hook. +- The hook is invoked AFTER every successful per-segment rotation (AZ-291's rotation completion path); it walks `flight_root//`, sums on-disk byte sizes of all `segment-NNNN.fdr` files (excluding the currently-open segment whose byte count comes from the writer's running `bytes_written` counter), and decides whether to drop. +- Drop ordering: oldest segment first. Segment numbers are monotonic (per AZ-291's filesystem layout `segment-NNNN.fdr`), so "oldest" = lowest segment number among CLOSED segments. +- Drop mechanics: `os.unlink` the segment file, increment the writer's `rollover_count` (the counter from AZ-292), enqueue the `kind="segment_rollover"` record via the shared `FdrClient`. The record's `payload.old_segment` is the deleted segment number; `payload.new_segment` is the writer's currently-open segment; `payload.total_bytes_after` is recomputed after the unlink. +- Loop until under cap: if a single drop does not bring the directory under cap (e.g. very large segments + long flight), drop the next-oldest segment and emit another `segment_rollover` record. AC-3 covers loop termination. 
+- Special-case "only segment 0 with header remains, AND it is over cap by itself": this is the operator-error case (cap configured smaller than a single segment + header). Hard-fail: log ERROR `kind="fdr.cap_misconfigured"`, invoke the GCS alert (the same one AZ-291 wires for ENOSPC), and refuse to drop `segment-0000.fdr`. The flight continues in degraded mode — segments accumulate on disk past the cap until either a normal drop becomes possible or the operator lands. +- A diagnostic INFO log per drop (`kind="fdr.cap_drop"; old_segment; new_segment; total_bytes_after`) — distinct from the FDR record itself; the log line is for operator debugging, the FDR record is the canonical audit trail. +- Configuration: `flight_cap_bytes` is a single integer field on the `FdrWriterConfig` consumed via `composition_root_protocol`; the default is `64 * 1024**3` (64 GiB exactly per AC-NEW-3); valid range is `1024 .. 2**40` (1 KiB minimum for tests, 1 TiB maximum sanity bound). +- The cap policy does NOT have a config flag to disable it. The implementation MUST NOT expose a "disable cap" boolean on any Config block — verified by C13-ST-01 (that test scans the config schema for any flag that could disable rollover-drop emission). + +### Excluded + +- Per-segment file rotation itself — owned by AZ-291. +- `FlightHeader` / `FlightFooter` accounting and `rollover_count` storage — owned by AZ-292 (this task increments the counter; the counter itself lives in the writer). +- The `kind="segment_rollover"` payload schema — owned by AZ-272 (this task constructs records that conform to that schema). +- Mid-flight tile snapshot path and failed-tile thumbnail rate cap — tasks #4 and #5. +- ENOSPC degraded-mode handling — owned by AZ-291 (this task uses the same GCS alert callable for the cap-misconfigured edge case). +- Post-flight reader logic that reconstructs dropped data from the rollover records — out of scope this cycle. 
+- Cross-flight retention (deleting OLD flight directories to free disk) — out of scope; the cap is per-flight, the operator manages cross-flight cleanup. + +## Acceptance Criteria + +**AC-1: Drop oldest closed segment when directory exceeds cap** +Given a flight directory with segments 0..3 each sized 100 KiB, currently-open segment 4 at 50 KiB, and `flight_cap_bytes = 350 KiB` +When the writer rotates to segment 5 (segment 4 is now closed at 50 KiB; total = 450 KiB > 350 KiB cap) +Then segment 0 is unlinked from disk; the writer's `rollover_count` increments by 1; a `kind="segment_rollover"` record lands on the FDR with `payload.old_segment=0`, `payload.new_segment=5`, `payload.total_bytes_after == sum(file_sizes(segment-0001..segment-0005))` (350 KiB, which no longer exceeds the cap, so exactly one drop occurs) + +**AC-2: Loop until under cap** +Given a flight directory with segments 0..9 each 100 KiB and `flight_cap_bytes = 350 KiB`, currently-open segment 10 +When the post-rotation hook runs +Then segments 0, 1, 2, 3, 4, 5, 6 are deleted (in order); 7 `kind="segment_rollover"` records land on the FDR (one per drop); the directory total falls to ≤ 350 KiB + +**AC-3: Loop terminates even when bytes_after never reaches cap (degenerate case)** +Given a contrived test where `flight_cap_bytes` is 100 KiB but the currently-open segment alone is already 200 KiB, AND only segment 0 (containing the flight_header) has been closed +When the post-rotation hook runs +Then segment 0 is NOT dropped (it contains the header); ONE ERROR log (`kind="fdr.cap_misconfigured"`) is emitted; ONE GCS alert is invoked; the loop terminates within bounded time (≤ 100 ms p99); the flight continues in degraded mode + +**AC-4: Currently-open segment is NEVER dropped** +Given a flight directory with segments 0..2 closed and segment 3 currently open +When the post-rotation hook runs (after rotating to segment 4) AND the cap is exceeded by the currently-open segment alone +Then segment 4 (the new currently-open segment) is NOT dropped; older segments (0, 1, 2, 3) are
dropped first per the oldest-first rule + +**AC-5: segment_rollover record contains canonical fields** +Given any cap-driven drop event +When the test parses the resulting `segment_rollover` record +Then `payload` has exactly `old_segment` (int), `new_segment` (int), `total_bytes_after` (int >= 0); the OUTER envelope's `producer_id == "shared.fdr_client"` (per the schema contract); the record's `ts` is within 100 ms of the `os.unlink` call + +**AC-6: No config flag disables segment_rollover emission** +Given the project's full Config schema and every documented config preset +When the test scans config classes for a field that could suppress `kind="segment_rollover"` records (per C13-ST-01) +Then no such field exists; injecting a synthetic preset that attempts to suppress the record fails type-check or runtime validation + +**AC-7: Default cap is exactly 64 GiB** +Given a default `FdrWriterConfig` constructed with no overrides +When the test reads `flight_cap_bytes` +Then `flight_cap_bytes == 64 * 1024**3` (exactly 64 GiB) + +**AC-8: rollover_count from FlightFooter matches segment_rollover record count** +Given a flight that triggered N cap-driven drops over its lifetime +When `close_flight()` runs and the test parses the footer +Then `footer.rollover_count == N + per_segment_rotations` (the AZ-292 counter increments on EVERY rotation; cap-driven drops add to it; the segment_rollover record count provides cross-validation against the cap-driven subset) + +## Non-Functional Requirements + +**Performance** +- Post-rotation hook execution time p99 ≤ 50 ms per rotation under steady-state (one drop per rotation at most, typical case). Per AC-2 worst case, multiple drops may extend the hook; the implementation MUST NOT block the writer thread's drain loop for more than 100 ms total even under worst-case multi-drop bursts (cap configured very low for tests).
+- `os.unlink` on the per-flight NVM (typical Jetson Orin Nano Super filesystem) takes < 5 ms p99 for files up to 256 MiB; the implementation relies on this and does not use an async unlink. +- Directory scan for byte counting uses a per-flight sorted-segment-list cached by the policy class (refreshed from disk once per hook entry, i.e. once per rotation), NOT a fresh `os.scandir` per drop-loop iteration — `os.scandir` cost grows with segment count and would dominate for long flights. + +**Reliability** +- The cap policy MUST NOT delete a segment whose deletion is in progress (idempotency: a re-entry into the hook before `os.unlink` returns is impossible because the writer thread is the sole invoker, but the policy MUST handle the case where a previous unlink left a stale entry in the cached segment list — refresh the list from disk on every hook entry). +- A failed `os.unlink` (e.g. read-only filesystem, ENOENT for an already-deleted segment due to operator manual intervention) is logged at WARN with `kind="fdr.cap_unlink_failed"` and the policy continues to the next-oldest segment; it does NOT halt the writer. +- The `segment_rollover` record is enqueued via the shared `FdrClient` (which has its own overrun policy from AZ-274); if the FdrClient's buffer is full at the moment of drop, the record itself may overrun — that is fine, AZ-274's overrun policy emits a `kind="overrun"` record with `producer_id="shared.fdr_client"` and the drop is still observable through AZ-292's `records_dropped_overrun` counter.
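The drop loop described in this task can be sketched roughly as follows. This is a minimal illustration, not the task's implementation: the function name `enforce_cap`, its parameters, and the two callbacks (`emit` for FDR records, `on_misconfigured` for the ERROR log plus GCS alert) are hypothetical stand-ins, and records are modelled as plain dicts rather than real `FdrRecord` objects. It assumes the `segment-NNNN.fdr` naming from AZ-291 and reads the header special case as "segment 0 is the only closed segment left".

```python
import os
from pathlib import Path
from typing import Callable, Dict


def enforce_cap(flight_dir: Path, open_segment: int, open_bytes: int,
                cap_bytes: int, emit: Callable[[Dict], None],
                on_misconfigured: Callable[[], None]) -> int:
    """Drop oldest closed segments until the total no longer exceeds cap_bytes.

    Returns the number of segments dropped.
    """
    # Refresh the closed-segment list from disk once per hook entry.
    closed: Dict[int, int] = {}
    for entry in os.scandir(flight_dir):
        name = entry.name
        if name.startswith("segment-") and name.endswith(".fdr"):
            number = int(name[len("segment-"):-len(".fdr")])
            if number != open_segment:
                closed[number] = entry.stat().st_size

    # The open segment's size comes from the writer's running counter.
    total = sum(closed.values()) + open_bytes
    dropped = 0
    for number in sorted(closed):  # lowest number = oldest
        if total <= cap_bytes:
            break
        if number == 0 and len(closed) == 1:
            # Only the header segment is left to drop: treat as misconfigured cap.
            on_misconfigured()
            break
        try:
            os.unlink(flight_dir / f"segment-{number:04d}.fdr")
        except OSError:
            # Unlink failed (e.g. read-only filesystem): try the next-oldest.
            continue
        total -= closed[number]
        dropped += 1
        emit({"kind": "segment_rollover",
              "payload": {"old_segment": number,
                          "new_segment": open_segment,
                          "total_bytes_after": total}})
    return dropped
```

Because the loop walks a finite, pre-sorted list and never revisits an entry, it terminates even when the cap can never be satisfied.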
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | 4 closed segments × 100 KiB + 50 KiB open + cap=350 KiB; trigger rotation to segment 5 | Segment 0 deleted; one segment_rollover record with correct payload fields | +| AC-2 | 10 closed segments × 100 KiB + cap=350 KiB; one rotation | 7 oldest segments deleted (in order); 7 segment_rollover records; final dir total ≤ 350 KiB | +| AC-3 | Cap=100 KiB, segment 1 currently open at 200 KiB, only segment 0 (header) closed | Segment 0 NOT deleted; one ERROR log "fdr.cap_misconfigured"; one GCS alert; hook terminates ≤ 100 ms | +| AC-4 | Currently-open segment exceeds cap by itself; older segments exist | Older segments drop first; currently-open never dropped | +| AC-5 | Trigger any drop; parse the resulting segment_rollover record | payload has exactly old_segment / new_segment / total_bytes_after; outer producer_id == "shared.fdr_client"; ts within 100 ms of unlink | +| AC-6 | Scan Config class hierarchy for "disable_segment_rollover" / "suppress_*" / "no_rollover" fields | None found; synthetic config trying to disable fails validation | +| AC-7 | Default `FdrWriterConfig()` | `flight_cap_bytes == 64 * 1024**3` | +| AC-8 | Run a flight with N cap-driven drops + M per-segment rotations; parse footer + segment_rollover records | `footer.rollover_count == N + M`; segment_rollover record count == N | +| NFR-perf-hook | Microbench post-rotation hook with 1 drop | p99 ≤ 50 ms | +| NFR-perf-multi-drop | Microbench worst-case multi-drop burst | total ≤ 100 ms | +| NFR-reliability-stale-list | Manually delete a segment file under the policy; trigger hook | WARN log "fdr.cap_unlink_failed"; policy continues | + +## Constraints + +- The cap is enforced ONLY via oldest-segment-dropped. The implementation MUST NOT truncate any segment file, MUST NOT modify any record once written, MUST NOT seek into closed segments.
AZ-291's "append-only between rotations" invariant extends to "no in-place modification across the entire flight". +- The cap is applied to the SUM of all on-disk segment file sizes (closed + currently-open). Sidecar files outside the segment files (e.g. mid-flight tile snapshots from task #4 (AZ-294) — those land in a separate path under `flight_root//tiles/`) are NOT counted toward the cap; their cap is owned by task #4. This task's cap is segment-file-only. +- The cap policy hook is wired by the composition root, NOT by AZ-291's writer constructor (so AZ-291 stays focused on per-segment lifecycle without knowing about per-flight cap policy). The composition root injects the policy as a callback the writer invokes after each rotation. +- The configuration field name is `flight_cap_bytes`; renaming is a breaking change requiring a major bump on `composition_root_protocol`. +- The `kind="segment_rollover"` record is mandatory per AC-NEW-3 + ADR-008 + C13-ST-01. No future PBI may add an opt-out flag; this is enforced by a contract test, not left to code-review preference. + +## Risks & Mitigation + +**Risk 1: Filesystem reports cached size, drop appears not to free space** +- *Risk*: Some filesystems lazily release `unlink`ed inodes; `os.statvfs` immediately after `unlink` shows the bytes still allocated; the cap policy thinks it needs to drop more. +- *Mitigation*: The policy uses `os.path.getsize` summed across actual segment files, NOT `statvfs` of the mount. Once the segment file is `unlink`ed, it no longer appears in `os.scandir` and is not summed. This is correct independent of inode-release timing. + +**Risk 2: Operator manually deletes a segment mid-flight** +- *Risk*: An operator with shell access to the companion deletes `segment-0001.fdr`; the policy's cached segment list is now stale. +- *Mitigation*: NFR-reliability-stale-list — the policy refreshes from `os.scandir` on every hook entry, logs WARN if a previously-tracked segment is missing, and continues.
Treat operator interference as out-of-band noise, not a failure mode. + +**Risk 3: Cap policy and AZ-291's per-segment rotation race** +- *Risk*: The policy reads the segment list while AZ-291 is opening a new segment; the new segment file may exist but be empty. +- *Mitigation*: The hook is invoked SYNCHRONOUSLY by AZ-291's rotation completion path (not by a separate thread or timer). The writer thread is the sole mutator; there is no concurrent rotation. AC-1 verifies this end-to-end. + +**Risk 4: GCS alert flooded by cap-misconfigured edge case** +- *Risk*: AC-3 path triggers GCS alerts on every rotation; alerts overwhelm the GCS link. +- *Mitigation*: Per-flight rate cap on `kind="fdr.cap_misconfigured"` GCS alerts — at most one per flight, since the misconfig is a flight-level constant. After the first alert, subsequent occurrences are logged at ERROR but NOT alerted. + +## Runtime Completeness + +- **Named capability**: per-flight 64 GB cap enforcement with oldest-segment-dropped + canonical drop-record emission (architecture / E-C13 / AC-NEW-3, ADR-008). +- **Production code that must exist**: real `os.unlink` on segment files, real `FdrRecord(kind="segment_rollover")` enqueue via the shared FdrClient, real config-driven cap reading, real loop-until-under-cap with degenerate-case handling. +- **Allowed external stubs**: tests MAY stub the `FdrClient` (use FakeFdrSink from AZ-275) and the GCS alert callable; production wiring uses the real instances via the composition root. +- **Unacceptable substitutes**: cap detection without enforcement ("we just log a warning when we exceed cap"), per-record drop instead of per-segment drop ("simpler to drop the oldest record"), in-place segment truncation ("avoid the unlink overhead"), suppressing the segment_rollover record under any config preset ("debug builds don't need the audit trail"), or replacing the cap policy with cross-flight cleanup ("we'll delete old FLIGHTS to make room"). 
All of those break AC-NEW-3 + ADR-008. diff --git a/_docs/02_tasks/todo/AZ-294_c13_mid_flight_tile_snapshot.md b/_docs/02_tasks/todo/AZ-294_c13_mid_flight_tile_snapshot.md new file mode 100644 index 0000000..7e0946a --- /dev/null +++ b/_docs/02_tasks/todo/AZ-294_c13_mid_flight_tile_snapshot.md @@ -0,0 +1,165 @@ +# C13 Mid-Flight Tile Snapshot Path + Filesystem Layout + +**Task**: AZ-294_c13_mid_flight_tile_snapshot +**Name**: C13 Mid-Flight Tile Snapshot Path +**Description**: Implement the sidecar-file path that persists mid-flight orthorectified tile snapshots produced by C6 / C11 (per AC-8.4 / F4) onto the per-flight FDR tree, and emit the corresponding `kind="mid_flight_tile_snapshot"` `FdrRecord` carrying a pointer (`snapshot_path` + `captured_at`) — NOT the JPEG bytes — so the FdrRecord schema's "embedded binary blobs ≤ 4 KiB" invariant is preserved. The sidecar files live under `flight_root//tiles/.jpg`. This task does NOT generate the tiles (C6 / C11 own that); it provides the FDR-side storage layout, the sidecar write helper, and the pointer-record emission path. +**Complexity**: 3 points +**Dependencies**: AZ-291_c13_writer_thread, AZ-272_fdr_record_schema, AZ-263_initial_structure, AZ-269_config_loader +**Component**: c13_fdr (epic AZ-248 / E-C13) +**Tracker**: AZ-294 +**Epic**: AZ-248 (E-C13) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — defines the `kind="mid_flight_tile_snapshot"` payload shape (`snapshot_path`, `captured_at`) AND the ≤ 4 KiB inline-blob invariant this task respects by emitting a pointer instead of bytes. +- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config block carrying the per-flight tile cap byte budget (`tile_snapshot_cap_bytes`, default 64 MiB, sized above the ~50 MB `description.md` storage estimate). + +## Problem + +Mid-flight tile snapshots are generated by C6 / C11 (per F4 mid-flight tile gen) at sizes 50–200 KiB each, up to ~50 MB per 8 h flight.
They cannot be inlined into FdrRecords (the schema invariant caps inline blobs at 4 KiB) and they cannot live in the segment file (segment files are append-only streams of FdrRecords; appending arbitrary JPEG bytes would break the record framing AZ-291 + AZ-272 jointly establish). + +Without a sidecar path: +- Producers (C6 / C11) have no canonical filesystem location to write the JPEGs. Each component would invent its own, drifting on naming and breaking post-flight retrieval. +- The FdrRecord that ties the JPEG to a frame_id / tile_id / timestamp would either go missing (no record at all) or violate the schema invariant (inlining the JPEG bytes), poisoning the whole FDR. +- The per-flight tile cap (~50 MB per `description.md`) has no enforcement layer; a runaway tile producer could exhaust the same NVM the segment files compete for. + +## Outcome + +- A `MidFlightTileSnapshotSink(flight_root: Path, flight_id: UUID, fdr_client: FdrClient, config: TileSnapshotConfig)` class is the single sidecar write path. C6 / C11 producers call its `write_snapshot(tile_id: str, jpeg_bytes: bytes, captured_at: datetime, frame_id: int | None) -> Path` method; this task does NOT produce the JPEG itself. +- Sidecar files land at `flight_root//tiles/.jpg`; the directory `tiles/` is created on first write (lazy creation — empty flights leave no `tiles/` directory). +- Per call, ONE `kind="mid_flight_tile_snapshot"` FdrRecord is enqueued via the shared FdrClient with `payload.snapshot_path = "tiles/.jpg"` (relative to `flight_root//` so the FDR is portable) and `payload.captured_at = `. The JPEG bytes are NEVER inlined. 
+- The per-flight tile cap (`tile_snapshot_cap_bytes`, default 64 MiB to comfortably fit the ~50 MB worst case from `description.md`) is enforced via oldest-tile-dropped policy, mirroring the segment cap policy from AZ-293 but scoped to the `tiles/` subdirectory and emitted as a `kind="overrun"` record (NOT `segment_rollover` — that kind is reserved for segment-file drops). Each tile drop emits a record with `payload.producer_id="shared.fdr_client"` and `payload.dropped_count=1`. +- The sink is thread-safe for many producers (C6, C11 may call concurrently from different threads); the file write itself uses `atomicwrites` to avoid partial JPEGs on crash. + +## Scope + +### Included + +- `MidFlightTileSnapshotSink` class as defined above. +- `write_snapshot(tile_id: str, jpeg_bytes: bytes, captured_at: datetime, frame_id: int | None = None) -> Path`: + 1. Validate `len(jpeg_bytes) <= jpeg_max_bytes` (default 256 KiB; rejects with `TileSnapshotTooLargeError` — not infinite-trust on producers). + 2. Validate `tile_id` matches `[a-zA-Z0-9_-]{1,128}` (rejects with `TileSnapshotInvalidIdError`). + 3. Compute the absolute sidecar path; create `flight_root//tiles/` if missing (`os.makedirs(exist_ok=True)`). + 4. Write the JPEG via `atomicwrites` (temp file + `os.rename` after `fsync`). + 5. Enqueue the `kind="mid_flight_tile_snapshot"` FdrRecord with relative path + ISO timestamp + optional `frame_id`. + 6. Check the cap (sum of bytes under `tiles/`). If over cap, drop the oldest `tile_id`-by-`captured_at` and emit an overrun record. + 7. Return the absolute sidecar path to the caller (so the producer can log it if needed). +- `tile_snapshot_cap_bytes` config field (`composition_root_protocol`); default `64 * 1024**2` (64 MiB). +- `jpeg_max_bytes` config field; default `256 * 1024` (256 KiB; per `description.md` "50–200 KB each", 256 KiB gives a small safety margin while bounding adversarial growth). 
+- Thread-safe API: a single `threading.Lock` around the cap-check + drop sequence (the file write itself is `atomicwrites` so it is independently safe). The lock is held for ≤ 5 ms p99 when no drop is needed and ≤ 50 ms p99 when one drop is needed (matching the cap-check NFR); `write_snapshot` is NOT a hot path (tiles are sparse — ~0.01–0.1 Hz typical). +- A diagnostic INFO log on each successful write (`kind="fdr.tile_snapshot_written"; tile_id; size_bytes`) and WARN on each cap-driven drop (`kind="fdr.tile_snapshot_dropped"; tile_id; size_bytes_freed; cap_bytes_after`). +- Recovery on existing `tiles/` directory: on construction, the sink scans `flight_root//tiles/` for any pre-existing tiles (e.g. from a crashed and resumed flight via the same flight_id); the cap policy treats them as in-cap unless they push the directory over cap. No tiles are auto-deleted on construction; only on overflow. +- The sink does NOT interact with AZ-293's segment cap policy directly. The `tiles/` subdirectory is excluded from segment-cap accounting (per AZ-293 constraint "sidecar files outside the segment files are NOT counted toward the cap"); the tile cap is independent. + +### Excluded + +- Generating tile JPEGs (orthorectification, downsampling, encoding) — owned by F4 / C6 / C11 producers. +- The `kind="mid_flight_tile_snapshot"` payload schema — owned by AZ-272. +- Post-flight retrieval / upload of tile sidecars — owned by C12 post-landing upload trigger (out of scope this cycle). +- Failed-tile thumbnail rate limiter — owned by task #5 (this task is for SUCCESS-path tile snapshots from F4; failed-tile thumbnails are a separate, AC-8.5-governed forensic category). +- Per-segment file rotation, 64 GB cap on segments, header/footer accounting — owned by AZ-291 / AZ-292 / AZ-293. +- Compression of the JPEG (already JPEG; no further compression). +- Encryption / signing of the JPEG — out of scope this cycle.
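Steps 1 to 5 and 7 of `write_snapshot` can be sketched as below. This is a hedged illustration under several assumptions: the function takes the per-flight directory directly (a hypothetical `flight_dir` parameter), `enqueue` is a hypothetical callable standing in for the shared `FdrClient`, records are plain dicts, and the standard-library `tempfile` + `os.fsync` + `os.replace` sequence is swapped in for `atomicwrites` so the example has no third-party dependency. The cap check (step 6) is omitted.

```python
import os
import re
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Dict, Optional

TILE_ID_RE = re.compile(r"^[a-zA-Z0-9_-]{1,128}$")


class TileSnapshotTooLargeError(ValueError):
    pass


class TileSnapshotInvalidIdError(ValueError):
    pass


def write_snapshot(flight_dir: Path, tile_id: str, jpeg_bytes: bytes,
                   captured_at: datetime, enqueue: Callable[[Dict], None],
                   frame_id: Optional[int] = None,
                   jpeg_max_bytes: int = 256 * 1024) -> Path:
    # Steps 1-2: validate before touching the filesystem.
    if len(jpeg_bytes) > jpeg_max_bytes:
        raise TileSnapshotTooLargeError(f"{len(jpeg_bytes)} > {jpeg_max_bytes}")
    if not TILE_ID_RE.fullmatch(tile_id):
        raise TileSnapshotInvalidIdError(repr(tile_id))

    # Step 3: lazily create tiles/ on first write.
    tiles_dir = flight_dir / "tiles"
    os.makedirs(tiles_dir, exist_ok=True)
    target = tiles_dir / f"{tile_id}.jpg"

    # Step 4: temp file, fsync, then atomic rename onto the canonical name.
    fd, tmp = tempfile.mkstemp(dir=tiles_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(jpeg_bytes)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, target)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise

    # Step 5: enqueue the pointer record only after the sidecar is durable.
    payload = {"snapshot_path": f"tiles/{tile_id}.jpg",
               "captured_at": captured_at.isoformat()}
    if frame_id is not None:
        payload["frame_id"] = frame_id
    enqueue({"kind": "mid_flight_tile_snapshot", "payload": payload})

    # Step 7: hand the absolute sidecar path back to the producer.
    return target
```

Validating `tile_id` against the regex before building any path is what blocks traversal inputs such as `../../../etc/passwd`; the path is never constructed from an unvalidated id.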
+ +## Acceptance Criteria + +**AC-1: write_snapshot persists JPEG to canonical sidecar path** +Given a sink constructed for `flight_root=/tmp/fdr` and `flight_id=abc-123`, and a JPEG byte string of size 100 KiB +When `write_snapshot(tile_id="t_001", jpeg_bytes=<100 KiB>, captured_at=now)` is called +Then `/tmp/fdr/abc-123/tiles/t_001.jpg` exists on disk; its bytes equal the input JPEG byte-for-byte; the file is fully written (no temp file artifacts left) + +**AC-2: Pointer record is enqueued, not inline bytes** +Given the same call as AC-1 +When the consumer drains the FdrClient and parses the record +Then ONE record with `kind="mid_flight_tile_snapshot"` is observed; `payload.snapshot_path == "tiles/t_001.jpg"` (relative to flight directory); `payload.captured_at` is the ISO 8601 string of the input timestamp; `payload.frame_id` matches the input (or is absent if input was None); NO `payload.jpeg_bytes` field exists; the serialised record is < 1 KiB + +**AC-3: Cap-driven drop emits overrun record + deletes oldest tile** +Given `tile_snapshot_cap_bytes=200 KiB`, three tiles already on disk: `t_old.jpg=100 KiB` (captured_at=t0), `t_mid.jpg=80 KiB` (t1), `t_new.jpg=20 KiB` (t2), i.e. 200 KiB in total and therefore not over cap +When `write_snapshot(tile_id="t_overflow", jpeg_bytes=<60 KiB>, captured_at=t3)` is called +Then `t_old.jpg` is deleted (oldest by `captured_at`; the 260 KiB total would exceed the cap, and the drop leaves 160 KiB); the new tile is persisted; ONE `kind="overrun"` record is enqueued with `payload.producer_id="shared.fdr_client"` and `payload.dropped_count=1` + +**AC-4: TileSnapshotTooLargeError on oversized JPEG** +Given `jpeg_max_bytes=256 KiB` and an input of 300 KiB +When `write_snapshot` is called +Then `TileSnapshotTooLargeError` is raised before any file or record is written; the `tiles/` directory is unchanged; no FdrRecord lands on the FdrClient + +**AC-5: TileSnapshotInvalidIdError on bad tile_id** +Given an input `tile_id="../../../etc/passwd"` (path traversal attempt) +When `write_snapshot` is called +Then `TileSnapshotInvalidIdError` is raised
before any file write; the `tiles/` directory is unchanged + +**AC-6: Concurrent writes are serialised correctly** +Given two threads each calling `write_snapshot` 100 times with distinct `tile_id`s under cap +When both threads run concurrently +Then all 200 sidecar files exist with byte-correct contents; 200 `mid_flight_tile_snapshot` records were enqueued (one per call); zero overruns; no partial files in `tiles/` + +**AC-7: Existing tiles preserved on sink construction** +Given `/tmp/fdr/abc-123/tiles/` already contains 3 tile files from a prior process (totaling 150 KiB; cap is 200 KiB) +When the sink is constructed for the same `flight_id` +Then the existing tiles are NOT deleted on construction; the sink's internal cap accounting includes them; a subsequent `write_snapshot` of 60 KiB triggers a drop of the oldest existing tile + +**AC-8: Atomic write — no partial JPEGs on crash** +Given a `write_snapshot` call that is interrupted (simulated kill) between the `atomicwrites` temp-file write and the rename +When the test re-inspects the `tiles/` directory +Then NO file exists at the canonical sidecar path with partial content; either the file is absent OR it is fully written and parseable + +## Non-Functional Requirements + +**Performance** +- `write_snapshot` returns within 50 ms p99 for a 200 KiB JPEG on Tier-2 NVM (dominated by `fsync` on rename; tiles are sparse so no batching needed). +- The cap-check sequence (lock acquire + scan tiles/ + drop if needed + lock release) p99 ≤ 5 ms when no drop is needed; p99 ≤ 50 ms when one drop is needed. +- Producer-perceived latency must NOT exceed 100 ms p99 in any scenario — F4 mid-flight tile generation is NOT a hot path but operators do see the result. 
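The cap-check sequence named in the performance targets above (lock acquire, account for the new tile, drop if needed, lock release) can be sketched with an in-memory ledger. This is an illustrative sketch only: `TileCapLedger`, `admit`, and `total_bytes` are hypothetical names, timestamps are plain floats, and the rule that the ledger never evicts its last remaining tile is an assumption of this example rather than something the task states. The caller would unlink each returned tile and emit the corresponding overrun record while still holding whatever ordering guarantees it needs.

```python
import bisect
import threading
from typing import List, Tuple


class TileCapLedger:
    """Tracks tile sizes by capture time and picks oldest-first drops."""

    def __init__(self, cap_bytes: int) -> None:
        self._cap = cap_bytes
        self._lock = threading.Lock()
        # Kept sorted as (captured_at, tile_id, size_bytes); oldest first.
        self._tiles: List[Tuple[float, str, int]] = []
        self._total = 0

    @property
    def total_bytes(self) -> int:
        with self._lock:
            return self._total

    def admit(self, tile_id: str, size_bytes: int, captured_at: float) -> List[str]:
        """Record a written tile; return tile ids to drop, oldest first."""
        with self._lock:
            bisect.insort(self._tiles, (captured_at, tile_id, size_bytes))
            self._total += size_bytes
            dropped: List[str] = []
            # Assumption: never evict the last remaining tile.
            while self._total > self._cap and len(self._tiles) > 1:
                _, old_id, old_size = self._tiles.pop(0)
                self._total -= old_size
                dropped.append(old_id)
            return dropped
```

Seeding the ledger from a directory scan at construction time is what lets pre-existing tiles count toward the cap without being deleted up front, as AC-7 requires.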
+ +**Reliability** +- The sink's `write_snapshot` is at-most-once per call: a successful write is exactly one sidecar + exactly one FdrRecord; a failed write is zero sidecar (atomicwrites ensures this) + zero FdrRecord (the record is enqueued only after the sidecar write completes). +- The cap-driven drop is also at-most-once per overflow event: the overrun record is enqueued exactly once even under contention (the lock covers the drop + emission sequence). + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | write_snapshot 100 KiB JPEG | sidecar file exists at canonical path with byte-correct content | +| AC-2 | Parse the enqueued record | kind=mid_flight_tile_snapshot; payload has snapshot_path + captured_at + frame_id; no jpeg_bytes; serialised < 1 KiB | +| AC-3 | Cap=200 KiB, 3 existing tiles + 60 KiB new | Oldest tile deleted; new tile present; one overrun record enqueued | +| AC-4 | 300 KiB JPEG with cap on size | TileSnapshotTooLargeError; no file or record written | +| AC-5 | tile_id with path traversal characters | TileSnapshotInvalidIdError; no file or record written | +| AC-6 | 2 threads × 100 calls each | All 200 sidecars present; 200 records enqueued; no partials | +| AC-7 | Pre-populate tiles/ then construct sink | Existing tiles untouched; cap accounting includes them | +| AC-8 | Kill mid-write | No partial file at canonical path; either complete or absent | +| NFR-perf-write | Microbench write_snapshot for 200 KiB | p99 ≤ 50 ms | +| NFR-perf-cap-check | Microbench cap-check no-drop path | p99 ≤ 5 ms | +| NFR-perf-cap-drop | Microbench cap-check drop path | p99 ≤ 50 ms | +| NFR-reliability-atomic | Inject failure between temp-write and rename | No half-written canonical file | + +## Constraints + +- The sidecar path is RELATIVE to `flight_root//` in the FdrRecord (`payload.snapshot_path = "tiles/.jpg"`). 
This makes the FDR portable: the operator can copy the entire flight directory anywhere and the records still reference the right files. +- `tile_id` validation regex `^[a-zA-Z0-9_-]{1,128}$` is the contract; producers may use any naming scheme inside that envelope. +- The cap is `tile_snapshot_cap_bytes`, distinct from segment `flight_cap_bytes` in AZ-293. The two caps are independent — exceeding one does NOT trigger drops in the other domain. +- The shared FdrClient's `producer_id` for the records emitted by this sink is `"shared.fdr_client"` (the sink itself is shared infrastructure); the originating producer (C6 / C11) is reflected ONLY in the optional `frame_id` payload field, not in the outer envelope. Rationale: F4 tiles may be produced collaboratively across multiple components and the canonical attribution is the captured_at timestamp + tile_id. +- This task does NOT introduce any new dependency: `atomicwrites` is already pinned at AZ-263 / E-BOOT. + +## Risks & Mitigation + +**Risk 1: Producer (C6 / C11) flushes tiles faster than the cap can absorb** +- *Risk*: A pathological case where 1000 small tiles per second push the cap into constant churn. +- *Mitigation*: F4 tile generation is rate-limited at the producer side per the C6 / C11 specs (typical 0.01–0.1 Hz). The cap is sized at 64 MiB to comfortably hold the per-flight worst case. The cap-driven overrun record is the canonical signal if a producer misbehaves; AC-3 covers the policy. + +**Risk 2: tile_id collisions across producers** +- *Risk*: C6 and C11 both pick `tile_id="x_42"`; the second call overwrites the first. +- *Mitigation*: `atomicwrites` uses temp files but the rename targets the canonical name — second call OVERWRITES the first. The `payload.snapshot_path` in the second record is identical to the first; the test operator sees ONE file at the path with the second JPEG and TWO records pointing to it. Documented as a limitation: producers MUST namespace their `tile_id`s (e.g. 
`c6__`); the sink does NOT enforce uniqueness. Code-review Phase 7 (Architecture) catches collisions in `tile_id` schemes across components. + +**Risk 3: A failed sidecar write leaves the FdrRecord pointing at a missing file** +- *Risk*: `atomicwrites` succeeds in the temp file but `os.rename` fails; we already enqueued the FdrRecord pointing at the canonical name. +- *Mitigation*: The order is FIRST sidecar write (must complete) THEN FdrRecord enqueue. AC-2 implicitly covers this — if the sidecar write raises, no record is enqueued. The implementation MUST NOT enqueue the record before `atomicwrites` returns. + +**Risk 4: `os.scandir` of `tiles/` becomes slow with thousands of tiles** +- *Risk*: A 100 MiB cap with tiny tiles ends up with ~10k files in `tiles/`; scanning that directory on every write becomes the bottleneck. +- *Mitigation*: The sink caches the in-memory tile list (sorted by `captured_at`) and updates it on every write; `os.scandir` runs only once on construction (AC-7). Cache invalidation on a manually-deleted tile mirrors AZ-293's stale-list refresh. + +## Runtime Completeness + +- **Named capability**: per-flight mid-flight tile snapshot sidecar storage + pointer-record emission (architecture / E-C13 / AC-8.4 quality metadata, F4 mid-flight tile gen). +- **Production code that must exist**: real `atomicwrites`-based sidecar writer, real FdrRecord pointer emission, real cap-policy + overrun record on overflow. +- **Allowed external stubs**: tests MAY use `FakeFdrSink` (AZ-275) and a tmp `flight_root`; production wiring uses the real `FdrClient` from the composition root. 
+- **Unacceptable substitutes**: inlining JPEG bytes into FdrRecords ("for now we don't have a sidecar path"), unbounded tile growth without cap enforcement ("the segment cap will catch it" — it won't, AZ-293 explicitly excludes the `tiles/` subdirectory), or skipping `atomicwrites` ("crash-tolerance is a nice-to-have") — operators ARE going to crash-resume mid-flight on Jetson hardware. diff --git a/_docs/02_tasks/todo/AZ-295_c13_thumbnail_rate_limiter.md b/_docs/02_tasks/todo/AZ-295_c13_thumbnail_rate_limiter.md new file mode 100644 index 0000000..8ee9bd6 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-295_c13_thumbnail_rate_limiter.md @@ -0,0 +1,172 @@ +# C13 Failed-Tile Thumbnail Rate Limiter + AC-8.5 Forbidden-Kind Enforcement + +**Task**: AZ-295_c13_thumbnail_rate_limiter +**Name**: C13 AC-8.5 Forbidden-Kind + Thumbnail Rate Cap +**Description**: Implement two paired record-policy gates required by AC-8.5 / C13-IT-03 / RESTRICT-UAV-4: (1) a synchronous producer-side validator that REFUSES `kind="raw_nav_frame"` (and any other AI-cam / nav-cam raw-frame kind) by raising `RawFrameWriteForbiddenError` BEFORE the record is enqueued, so the security violation is visible to the offending producer at the call site; (2) a writer-thread-side rate cap on `kind="failed_tile_thumbnail"` records (default ≤ 0.1 Hz per `description.md` § 7) that drops over-cap thumbnails with a WARN log + emits a `kind="overrun"` record carrying the dropped count, while letting in-cap thumbnails pass through to disk untouched. Together they enforce the only allowed raw-imagery-adjacent persistence path on the FDR. 
+**Complexity**: 3 points +**Dependencies**: AZ-291_c13_writer_thread, AZ-272_fdr_record_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module +**Component**: c13_fdr (epic AZ-248 / E-C13) +**Tracker**: AZ-295 +**Epic**: AZ-248 (E-C13) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — defines `kind="failed_tile_thumbnail"` payload (`{frame_id, tile_id, jpeg_bytes_b64}`) and the ≤ 4 KiB inline-blob invariant the cap respects. +- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config block carrying `forbidden_record_kinds` (default frozen set including `raw_nav_frame`, `raw_ai_cam_frame`) and `failed_tile_thumbnail_max_hz` (default 0.1). + +## Problem + +Per AC-8.5 + RESTRICT-UAV-4, the FDR is the ONLY persistence path for raw-imagery-adjacent data, and the ONLY allowed raw-imagery-adjacent kind is `failed_tile_thumbnail`, capped at ≤ 0.1 Hz. Without: + +- A synchronous validator that rejects `kind="raw_nav_frame"` (and equivalents) at the producer's call site, a careless or compromised producer could enqueue a stream of raw frames; by the time the writer thread sees them and drops them, gigabytes of raw imagery have already been serialised onto the wire format and (worst case) onto a segment file. Even an "asynchronous reject + drop" model leaks the bytes through transient memory. +- A writer-side rate cap on `failed_tile_thumbnail`, a producer (C6 / C11) bug or thumbnail spam attack could push the inline-thumbnail throughput from the documented ≤ 0.1 Hz to many Hz, blowing past the inline-blob budget and burying real diagnostic records under thumbnail noise. + +The two gates are intentionally asymmetric: forbidden-kind violation is a HARD security error visible to the caller (raw_nav_frame is never legitimate); over-cap thumbnails are a SOFT throughput control with WARN logging (over-eager producers are common; rate-limit and continue). 
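The asymmetry between the two gates can be sketched as follows. This is a minimal illustration with several assumptions: the methods take a bare `kind` string (plus a hypothetical `producer_id` and an injected `now` timestamp) instead of a full `FdrRecord`, the gate returns plain strings, and the rate cap is simplified to a fixed window of `1 / max_hz` seconds anchored at the last admitted thumbnail, which may differ from the windowing the task finally specifies.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


class RawFrameWriteForbiddenError(Exception):
    """Raised at the producer's call site for a forbidden record kind."""


@dataclass
class RecordKindPolicy:
    max_hz: float = 0.1
    forbidden: FrozenSet[str] = frozenset({"raw_nav_frame", "raw_ai_cam_frame"})
    dropped_in_window: int = 0
    _last_admitted: Optional[float] = None

    def enforce_or_raise(self, kind: str, producer_id: str) -> None:
        # HARD gate: synchronous, so the offending producer sees the error.
        if kind in self.forbidden:
            raise RawFrameWriteForbiddenError(
                f"kind={kind!r} from producer={producer_id!r} is forbidden")

    def gate_for_writer(self, kind: str, now: float) -> str:
        # SOFT gate: only thumbnails are rate-limited; everything else passes.
        if kind != "failed_tile_thumbnail":
            return "ENQUEUE"
        window = 1.0 / self.max_hz
        if self._last_admitted is None or now - self._last_admitted >= window:
            self._last_admitted = now
            self.dropped_in_window = 0
            return "ENQUEUE"
        self.dropped_in_window += 1
        return "DROP"
```

In this sketch a caller would read `dropped_in_window` before the next admitted thumbnail resets it, which is where a coalesced overrun record could be emitted.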
+ +## Outcome + +- A `RecordKindPolicy` object is the single source of truth for both gates. It exposes `enforce_or_raise(record: FdrRecord) -> None` (synchronous; raises `RawFrameWriteForbiddenError` for forbidden kinds; returns silently for everything else including `failed_tile_thumbnail`) and `gate_for_writer(record: FdrRecord) -> GateDecision` (returns `ENQUEUE` or `DROP` for thumbnail rate-cap purposes). +- Producers (C6 / C11 thumbnail emission paths; future producers) call `policy.enforce_or_raise(record)` immediately before `fdr_client.enqueue(record)`. The composition root injects the policy; producers do not construct it themselves. +- The writer thread (AZ-291) calls `policy.gate_for_writer(record)` immediately after dequeue. On `DROP`, the writer skips the append + emits a `kind="overrun"` record whose outer envelope `producer_id` is `"shared.fdr_client"`, with `payload.producer_id` set to the originating producer (see AC-10) and `payload.dropped_count` aggregated across the cap window. +- The `failed_tile_thumbnail` rate cap uses a sliding-window counter (1-second windows summed over the last 10 seconds at 0.1 Hz default) so a producer that bursts 5 thumbnails in one second still gets averaged correctly across the window — instead of a tight token-bucket that would either reject every thumbnail after the burst or let through a steady-state too-fast trickle. +- The forbidden-kind set is config-driven (`forbidden_record_kinds`) but its DEFAULT MUST include `raw_nav_frame` and `raw_ai_cam_frame`. Removing those defaults requires a major-version Config bump and is a security-critical review item. 
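
The two gates described in this Outcome can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `FdrRecord` stand-in, the `clock_ns` injection parameter, and the `take_dropped_count` helper are assumptions made for the example, and the rate cap is simplified to a minimum-interval check (one admit per `1 / max_hz` seconds) rather than the bucketed sliding-window counter the spec calls for. Overrun-record emission and WARN logging are omitted.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, FrozenSet, Optional

DEFAULT_FORBIDDEN_KINDS: FrozenSet[str] = frozenset(
    {"raw_nav_frame", "raw_ai_cam_frame"}
)


class RawFrameWriteForbiddenError(Exception):
    """Raised synchronously at the producer's call site for forbidden kinds."""


class GateDecision(Enum):
    ENQUEUE = "enqueue"
    DROP = "drop"


@dataclass(frozen=True)
class FdrRecord:
    """Hypothetical minimal stand-in for the real record type (AZ-272)."""
    kind: str
    producer_id: str
    payload: dict = field(default_factory=dict)


class RecordKindPolicy:
    def __init__(
        self,
        forbidden_kinds: FrozenSet[str] = DEFAULT_FORBIDDEN_KINDS,
        max_hz: float = 0.1,
        clock_ns: Callable[[], int] = time.monotonic_ns,  # injectable for tests
    ) -> None:
        self._forbidden = frozenset(forbidden_kinds)
        self._window_ns = round(1e9 / max_hz)  # 10 s at the 0.1 Hz default
        self._clock_ns = clock_ns
        self._last_admit_ns: Optional[int] = None
        self._dropped = 0

    def enforce_or_raise(self, record: FdrRecord) -> None:
        # Producer-side gate: a single set-membership check.
        if record.kind in self._forbidden:
            raise RawFrameWriteForbiddenError(
                f"forbidden kind {record.kind!r} from producer {record.producer_id!r}"
            )

    def gate_for_writer(self, record: FdrRecord) -> GateDecision:
        # Writer-side gate: only thumbnails are rate-capped.
        if record.kind != "failed_tile_thumbnail":
            return GateDecision.ENQUEUE
        now = self._clock_ns()
        if self._last_admit_ns is None or now - self._last_admit_ns >= self._window_ns:
            self._last_admit_ns = now
            return GateDecision.ENQUEUE
        self._dropped += 1  # O(1) bookkeeping; no timestamp list is scanned
        return GateDecision.DROP

    def take_dropped_count(self) -> int:
        # Hypothetical helper: the writer would read this when coalescing
        # drops into one overrun record, then the counter starts again.
        count, self._dropped = self._dropped, 0
        return count
```

With a fake clock, five thumbnails inside one 10 s window yield one `ENQUEUE` and four `DROP` decisions, and thumbnails at t=0, 11 s, and 22 s are all admitted, which is the behaviour AC-5 and AC-9 ask for.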
+ - The forbidden set defaults to `frozenset({"raw_nav_frame", "raw_ai_cam_frame"})` and is configurable via `forbidden_record_kinds` on the Config; runtime additions are allowed (you can ADD kinds at runtime), but the `Config` validator REJECTS any preset that REMOVES a default kind from the set (unless an explicit `unsafe_remove_default_forbidden=True` flag is set, which is a security-review-required path documented as such; the flag does NOT exist in any standard preset). +- Failed-tile thumbnail rate cap: + - `gate_for_writer` checks the kind. For non-thumbnail kinds, returns `ENQUEUE`. + - For `kind="failed_tile_thumbnail"`, applies a sliding-window rate cap at `failed_tile_thumbnail_max_hz` (default 0.1 Hz). The window is (1 / max_hz) seconds; up to one record per window passes through. + - On `DROP`, increments a running `thumbnail_dropped_count` counter and emits ONE `kind="overrun"` record per cap window with `payload.dropped_count == accumulated_count_during_window` (coalesced; matches the AZ-274 overrun-coalescing semantics so post-flight tooling sees consistent overrun records regardless of whether the source is FdrClient queue overrun or thumbnail rate cap). + - WARN log per drop window (`kind="fdr.thumbnail_rate_cap_exceeded"; producer_id; dropped_in_window`). Per-second rate cap on the WARN log itself (≤ 1 WARN/sec) so a thumbnail flood does not flood the operational log. +- Composition-root wiring: `make_record_kind_policy(config)` factory; the composition root constructs ONE policy instance and injects it into both (a) every producer's enqueue path and (b) the `FileFdrWriter`'s post-dequeue gate. +- `failed_tile_thumbnail_max_hz` config field (default 0.1; valid range > 0 .. 
10.0); `0` is REJECTED at config validation (would silence thumbnails entirely; producers must declare intent explicitly via `disable_failed_tile_thumbnails=True` on a separate flag if they truly want to silence the kind — this requires a security-review-required preset, similar to forbidden-kind removal). + +### Excluded + +- Thumbnail GENERATION (orthorectification failure detection, JPEG encoding) — owned by C6 / C11 producers; this task only validates / rate-caps RECORDS already constructed. +- Mid-flight tile snapshot SUCCESS path (sidecar storage of orthorectified tiles) — owned by AZ-294 / task #4. Failed-tile thumbnails are a DIFFERENT kind with inline (≤ 4 KiB) JPEG bytes, NOT sidecar. +- The `kind="raw_nav_frame"` / `kind="failed_tile_thumbnail"` payload schemas — owned by AZ-272. +- Per-segment / per-flight cap policies — owned by AZ-291 / AZ-293. +- Producer-side rate limiting BEFORE thumbnails are constructed (e.g. C6's decision to attempt orthorectification at most every N frames) — that is per-producer concern; the C13 cap is a defense-in-depth global ceiling. +- Cryptographic signing of records — out of scope this cycle. 
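
The config-validation rules listed under Included (default forbidden kinds cannot be removed without the explicit unsafe flag; `failed_tile_thumbnail_max_hz` must be greater than 0 and no more than 10.0) might look like the sketch below. The function name `validate_record_policy_config` and its flat keyword signature are hypothetical; the real Config loader is owned by AZ-269, and treating 10.0 as an inclusive upper bound is an assumption.

```python
from typing import FrozenSet, Iterable

DEFAULT_FORBIDDEN_KINDS: FrozenSet[str] = frozenset(
    {"raw_nav_frame", "raw_ai_cam_frame"}
)


class ConfigValidationError(ValueError):
    """Stand-in for the project's config validation error."""


def validate_record_policy_config(
    forbidden_record_kinds: Iterable[str],
    failed_tile_thumbnail_max_hz: float,
    unsafe_remove_default_forbidden: bool = False,
) -> FrozenSet[str]:
    """Return the validated forbidden set or raise ConfigValidationError."""
    kinds = frozenset(forbidden_record_kinds)

    # Additions are fine; removing a default kind needs the explicit unsafe flag.
    missing = DEFAULT_FORBIDDEN_KINDS - kinds
    if missing and not unsafe_remove_default_forbidden:
        raise ConfigValidationError(
            f"forbidden_record_kinds is missing default kinds: {sorted(missing)}"
        )

    # 0 would silence thumbnails entirely, so it is rejected outright.
    if not (0 < failed_tile_thumbnail_max_hz <= 10.0):
        raise ConfigValidationError(
            "failed_tile_thumbnail_max_hz must be > 0 and <= 10.0, "
            f"got {failed_tile_thumbnail_max_hz!r}"
        )
    return kinds
```

The cross-check against the schema's closed enum of legitimate kinds (Risk 1) is left out of this sketch.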
+ +## Acceptance Criteria + +**AC-1: enforce_or_raise rejects raw_nav_frame** +Given `RecordKindPolicy` constructed with default config, and an `FdrRecord(kind="raw_nav_frame", producer_id="c1_vio", payload={...})` +When the producer calls `enforce_or_raise(record)` +Then `RawFrameWriteForbiddenError` is raised; the message includes both `"raw_nav_frame"` and `"c1_vio"`; no record is enqueued (the producer's subsequent `fdr_client.enqueue` is never reached because the call site re-raises) + +**AC-2: enforce_or_raise rejects raw_ai_cam_frame** +Given the default-configured policy +When `enforce_or_raise` is called with `kind="raw_ai_cam_frame"` +Then `RawFrameWriteForbiddenError` is raised (same as AC-1) + +**AC-3: enforce_or_raise passes through failed_tile_thumbnail** +Given the default policy and `FdrRecord(kind="failed_tile_thumbnail", payload={frame_id: 1, tile_id: "x", jpeg_bytes_b64: "..."})` +When `enforce_or_raise` is called +Then the call returns silently; no exception is raised; the producer is free to enqueue + +**AC-4: gate_for_writer admits in-cap thumbnails** +Given `failed_tile_thumbnail_max_hz=0.1` (one per 10 s window) and the writer is starting fresh +When `gate_for_writer(record)` is called once with a `failed_tile_thumbnail` record +Then the return value is `ENQUEUE`; the record proceeds to disk + +**AC-5: gate_for_writer drops over-cap thumbnails + emits coalesced overrun record** +Given `failed_tile_thumbnail_max_hz=0.1` and 5 thumbnails arrive within a single 10-second window +When the writer calls `gate_for_writer` on each +Then the FIRST returns `ENQUEUE`; the next 4 return `DROP`; ONE `kind="overrun"` record is emitted at the end of the window with `payload.dropped_count==4` and `payload.producer_id=`; the WARN log fires at most once per second + +**AC-6: Forbidden set REJECTS removal of defaults** +Given a Config preset that attempts to set `forbidden_record_kinds = frozenset()` (empty — removing all defaults) +When the Config is 
validated +Then a `ConfigValidationError` is raised naming the missing default kinds; the policy cannot be constructed from this config + +**AC-7: Forbidden set ALLOWS additions** +Given a Config preset that sets `forbidden_record_kinds = frozenset({"raw_nav_frame", "raw_ai_cam_frame", "raw_thermal_frame"})` +When the policy is constructed +Then the policy rejects all three kinds via `enforce_or_raise`; the existing tests for the original two kinds still pass + +**AC-8: Hz=0 is rejected at config validation** +Given a Config preset with `failed_tile_thumbnail_max_hz=0` +When the Config is validated +Then a `ConfigValidationError` is raised; the policy cannot be constructed + +**AC-9: Sliding window resets — bursts spread across windows are admitted** +Given `failed_tile_thumbnail_max_hz=0.1` and one thumbnail at t=0, one at t=11s, one at t=22s +When `gate_for_writer` is called for each +Then ALL THREE return `ENQUEUE` (one per window); zero overrun records are emitted + +**AC-10: Producer slug propagates to overrun.payload.producer_id under cap-driven drops** +Given thumbnails arriving under cap-driven drop conditions, with the originating producer being `c6_tile_cache` +When the overrun record is emitted +Then `payload.producer_id == "c6_tile_cache"` (matches the producer the original thumbnails came from, NOT `"shared.fdr_client"` for the payload — the OUTER envelope's producer_id is `"shared.fdr_client"` per the schema contract) + +## Non-Functional Requirements + +**Performance** +- `enforce_or_raise` p99 ≤ 1 µs (a single set membership check; no allocation). +- `gate_for_writer` p99 ≤ 5 µs on the in-cap path; p99 ≤ 10 µs on the cap-driven drop path (sliding-window counter update + overrun-record construction). +- Both methods are allocation-free on the steady-state in-cap path. + +**Reliability** +- The forbidden-kind set is read once at policy construction and stored as a `frozenset` (immutable across the policy's lifetime). 
Runtime mutation via reflection is detected by code-review Phase 7 (architecture/security). +- The sliding-window counter is per-policy-instance, not global; a single policy serves the whole flight (one composition-root construction). Resetting between flights happens via a new policy instance at takeoff. +- The policy's WARN-log rate cap uses the same `kind="fdr.write_failure"` rate cap pattern from AZ-291 (≤ 1 WARN/sec) — implemented inside the policy, no shared rate-limit state with the writer thread. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | enforce_or_raise on `kind="raw_nav_frame"` | RawFrameWriteForbiddenError; message contains kind + producer_id | +| AC-2 | enforce_or_raise on `kind="raw_ai_cam_frame"` | RawFrameWriteForbiddenError | +| AC-3 | enforce_or_raise on `kind="failed_tile_thumbnail"` | Returns silently | +| AC-4 | gate_for_writer for first thumbnail in fresh window | Returns ENQUEUE | +| AC-5 | 5 thumbnails in one 10 s window | First ENQUEUE; next 4 DROP; one overrun record (dropped_count=4); ≤ 1 WARN | +| AC-6 | Empty `forbidden_record_kinds` config | ConfigValidationError | +| AC-7 | Adding `raw_thermal_frame` to forbidden set | All three kinds rejected; defaults still rejected | +| AC-8 | `failed_tile_thumbnail_max_hz=0` | ConfigValidationError | +| AC-9 | Thumbnails at t=0, t=11, t=22 with 10 s window | All three ENQUEUE; zero overrun records | +| AC-10 | Drop scenario with originating producer `c6_tile_cache` | overrun record's `payload.producer_id == "c6_tile_cache"` | +| NFR-perf-enforce | Microbench `enforce_or_raise` 10k iter | p99 ≤ 1 µs | +| NFR-perf-gate-allow | Microbench `gate_for_writer` in-cap | p99 ≤ 5 µs | +| NFR-perf-gate-drop | Microbench `gate_for_writer` over-cap | p99 ≤ 10 µs | +| NFR-reliability-immutable | Attempt to mutate `policy.forbidden_kinds` after construction | TypeError (frozenset) or AttributeError (no setter) | + +## Constraints + +- 
The forbidden-kind set is defense-in-depth, NOT the primary line of defense. Producers MUST NOT construct `raw_nav_frame` records in the first place (that is owned by their respective component specs); this gate catches regressions and malicious producers. +- The sliding-window counter MUST be O(1) update per call; an O(N) implementation that scans a list of timestamps is rejected at code-review Phase 7 (architecture). +- The cap and forbidden set apply globally across all producers within a flight, NOT per-producer. A single producer cannot consume the entire 0.1 Hz budget by exclusion of others — the budget is a global capacity for the FDR's inline thumbnail throughput. (Per-producer caps, if needed, are owned by individual component specs.) +- This task does NOT introduce new dependencies. Stdlib `time.monotonic_ns` + a fixed-size deque (or constant counter) suffice for the sliding window. + +## Risks & Mitigation + +**Risk 1: AC-7's "additions allowed" path is abused to add legitimate kinds (e.g. `state.tick`)** +- *Risk*: A misconfigured deployment adds `state.tick` to the forbidden set and silently breaks the entire FDR. +- *Mitigation*: Config validation cross-checks the forbidden set against the v1.0.0 schema's closed enum of legitimate kinds and REJECTS additions that are in the schema. The forbidden set is intended to be a SUBSET of "kinds that don't appear in v1.0.0 closed enum + raw-frame variants we explicitly want to ban". Documented in the Config validator + AC-6 tests. + +**Risk 2: Producer-side enforce_or_raise wrapper not actually called** +- *Risk*: A future producer forgets to call `policy.enforce_or_raise` and goes straight to `fdr_client.enqueue` — bypassing the synchronous gate. +- *Mitigation*: A code-review Phase 2 (Spec Compliance) check requires every producer calling `fdr_client.enqueue` to also call `policy.enforce_or_raise` immediately before. 
The writer-side `gate_for_writer` is the defense-in-depth catch — even if a forbidden-kind record sneaks past the producer, the writer drops it and emits an `overrun` record (the security AC is "no raw frame on disk", not "no raw frame in producer memory"). Both gates exist precisely so producer-side bypasses become observable in logs. + +**Risk 3: Sliding-window counter clock drift** +- *Risk*: If the system is suspended mid-window (Jetson power management), the behaviour of `time.monotonic_ns` across the suspend is platform-dependent, so the window may not track wall-clock time. +- *Mitigation*: On Linux, CPython's `time.monotonic_ns` reads `CLOCK_MONOTONIC`, which does NOT advance during system suspend; on resume, the counter continues from where it stopped, so the window is measured in awake time. The cap can therefore only become stricter in wall-clock terms, never more permissive. Documented; no special mitigation needed. + +**Risk 4: WARN log rate cap interferes with debugging** +- *Risk*: An operator investigating a thumbnail flood sees only one WARN per second and misses the burst pattern. +- *Mitigation*: The OVERRUN RECORD emitted into the FDR carries the per-window `dropped_count`; that is the canonical record. The WARN log is operator convenience only. Documented in the policy's docstring. + +## Runtime Completeness + +- **Named capability**: AC-8.5 forbidden-kind synchronous enforcement + failed-tile thumbnail rate cap (architecture / E-C13 / AC-8.5 / RESTRICT-UAV-4 / C13-IT-03). +- **Production code that must exist**: real `RecordKindPolicy` with both methods, real composition-root wiring into producer paths AND the writer thread, real sliding-window counter, real overrun-record emission on drop. +- **Allowed external stubs**: tests MAY use `FakeFdrSink` (AZ-275); production wiring uses the real shared FdrClient. 
+- **Unacceptable substitutes**: writer-only enforcement without producer-side `enforce_or_raise` ("the writer will catch it" — too late, the bytes already crossed the wire format), config that allows removing `raw_nav_frame` from defaults silently ("operators know what they're doing"), token-bucket without coalescing ("we'll emit one overrun per drop") — all break C13-IT-03 + AC-NEW-3 + the fundamental AC-8.5 invariant that raw frames MUST NEVER touch durable storage. diff --git a/_docs/02_tasks/todo/AZ-296_c13_open_error_takeoff_abort.md b/_docs/02_tasks/todo/AZ-296_c13_open_error_takeoff_abort.md new file mode 100644 index 0000000..27f50a9 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-296_c13_open_error_takeoff_abort.md @@ -0,0 +1,152 @@ +# C13 FdrOpenError → Takeoff Abort Path + +**Task**: AZ-296_c13_open_error_takeoff_abort +**Name**: C13 Takeoff Abort on FdrOpenError +**Description**: Wire the composition root's takeoff sequence so that `FdrOpenError` raised by `FileFdrWriter.open_flight()` (AZ-292) aborts takeoff BEFORE the C8 FC adapter is opened. This is the AC-NEW-3 every-payload-class-from-t=0 enforcement gate: if the FDR cannot persist records starting at t=0, the system MUST NOT emit external positions to the flight controller, because the audit trail proving "we made every safety-critical decision at t=0" would be missing. The abort is a HARD failure (the companion process exits with a non-zero status code so systemd / the Jetson init system surfaces it to the operator); it does NOT silently degrade. 
+**Complexity**: 2 points +**Dependencies**: AZ-291_c13_writer_thread, AZ-292_c13_flight_header_footer, AZ-263_initial_structure, AZ-266_log_module +**Component**: composition_root + c13_fdr (epic AZ-248 / E-C13) +**Tracker**: AZ-296 +**Epic**: AZ-248 (E-C13) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — defines the takeoff-sequence contract that this task amends with the FDR-first ordering invariant. +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — operational ERROR log shape this task uses for the abort message. + +## Problem + +The takeoff sequence in the composition root currently has no enforced ordering between FDR open and FC adapter open. Without this task: + +- A composition root could open C8 (FC adapter — pymavlink, MSP) BEFORE C13 (FDR), so external positions start streaming to the flight controller before `FileFdrWriter.open_flight()` has confirmed the segment file is writable. The companion would silently emit positions for which there is no audit record at t=0. +- A misconfigured `flight_root` (read-only mount, missing parent directory, full filesystem) would surface only AFTER takeoff has begun — too late for the operator to fix the configuration on the ground. +- C13-IT-06 ("refuse takeoff if `open_flight` fails") would fail because there is no take-off-abort path; the test would observe the FC adapter wired despite the FDR failing to open. + +## Outcome + +- The composition root's takeoff sequence is strictly ordered: (1) construct `FileFdrWriter`, (2) call `start()`, (3) call `open_flight(header)`, (4) ONLY IF (3) succeeded, construct + open the C8 FC adapter, (5) start every other component. 
+- If step (3) raises `FdrOpenError`, the composition root catches the exception, logs an ERROR via the shared logger (`kind="composition_root.takeoff_aborted"; reason="fdr_open_error"; underlying=`), tears down any partially-constructed components (the writer's `start()` is rolled back via its `stop()` so the filelock is released), and exits the process with a non-zero status code (specifically `2` — distinct from `1` which the project reserves for generic startup failures). +- The exit message printed to stderr names the offending `flight_root` path so the operator can immediately see "the FDR root I configured is wrong" — no log-diving required. +- The abort path is exercised end-to-end by an integration-style test that constructs a composition root with a read-only `flight_root`, runs it, and asserts (a) the FC adapter was NOT instantiated, (b) the process exits with status 2, (c) the stderr message names the path. +- C13-IT-06 (per `_docs/02_document/components/14_c13_fdr/tests.md`) is fully satisfied by this task in combination with AZ-292. + +## Scope + +### Included + +- Modification to the composition root's takeoff sequence to enforce the strict ordering above. The composition root is `src/gps_denied_onboard/runtime_root.py` per AZ-263 / module-layout.md; the change is localised to the takeoff section. +- A `try/except FdrOpenError` block around `open_flight(header)` that: + 1. Logs ONE ERROR record via the shared logger (`kind="composition_root.takeoff_aborted"`, `level="ERROR"`, `kv={"reason": "fdr_open_error", "underlying": str(exc), "flight_root": str(config.fdr_writer.flight_root)}`). + 2. Calls `writer.stop()` to release the filelock + close any open segment file (no-op if `start()` failed before any segment was opened). + 3. Prints a single line to stderr: `FATAL: cannot open FDR at : ; aborting takeoff (exit 2)`. + 4. Calls `sys.exit(2)`. 
+- The exit status is exactly `2` for FDR-open failures; the constants `EXIT_GENERIC_FAILURE=1` and `EXIT_FDR_OPEN_FAILURE=2` are documented in the composition_root_protocol contract (this task adds the new constant and the contract entry). +- An integration-style test fixture under `tests/integration/composition_root/` that constructs a composition root with a controlled `flight_root` path that fails to open (read-only directory) and asserts the documented behaviour. +- Update to `_docs/02_document/contracts/shared_config/composition_root_protocol.md` to document the strict takeoff ordering and the `EXIT_FDR_OPEN_FAILURE=2` constant. The contract update is in scope (this task touches the contract that other consumers read). +- Validation that the C8 FC adapter constructor / `open()` call sites are NOT reached on the FdrOpenError path. This is verified by the integration test (`assert fc_adapter_constructor.call_count == 0`) and by a code-review Phase 2 (Spec Compliance) check that walks the takeoff sequence statically. + +### Excluded + +- The actual implementation of `open_flight` and `FdrOpenError` — owned by AZ-292. +- The writer's `start()` / `stop()` lifecycle — owned by AZ-291. +- Recovery from `FdrOpenError` (e.g. retrying with a fallback `flight_root`) — explicitly NOT in scope. AC-NEW-3 says every payload class must be present from t=0; a fallback would violate the spirit of the AC by accepting a degraded FDR. The operator must fix the config and restart. +- Other takeoff-abort triggers (e.g. C7 inference engine load failure, C8 FC handshake failure) — those have their own composition-root abort paths owned by the respective component epics and the composition_root contract. +- GCS alert on takeoff abort — the companion is on the ground, not yet emitting to GCS; the abort surfaces via stderr + exit code, NOT GCS STATUSTEXT (which requires the FC adapter, which we are NOT opening). Documented as a constraint. 
+- Runtime FDR failure (`OSError` mid-flight) — that is owned by AZ-291's degraded-mode path with its own GCS alert. + +## Acceptance Criteria + +**AC-1: FdrOpenError raised → process exits with status 2** +Given a composition root configured with `flight_root=/read-only/path` (where `open_flight()` will raise `FdrOpenError`) +When the composition root's takeoff sequence runs +Then the process exits with status code exactly 2; no other component (especially C8 FC adapter) is constructed; the writer's filelock is released + +**AC-2: Stderr message names the flight_root path** +Given AC-1's setup +When the test captures stderr +Then stderr contains exactly one line matching `^FATAL: cannot open FDR at /read-only/path: .*; aborting takeoff \(exit 2\)$`; no other FATAL lines are printed + +**AC-3: ERROR log record includes underlying exception message** +Given AC-1's setup +When the test parses the structured log records +Then exactly one record exists with `kind="composition_root.takeoff_aborted"`, `level="ERROR"`, `kv.reason=="fdr_open_error"`, `kv.flight_root=="/read-only/path"`, `kv.underlying` containing the underlying `FdrOpenError`'s message + +**AC-4: C8 FC adapter is NOT constructed on the abort path** +Given AC-1's setup AND a test double for the C8 FC adapter that records constructor invocations +When the takeoff sequence aborts +Then the C8 FC adapter test double's constructor was called 0 times; no MAVLink / MSP socket is opened + +**AC-5: Successful open_flight proceeds to FC adapter** +Given a writable `flight_root` and a normal Config +When the takeoff sequence runs +Then `open_flight()` returns; the C8 FC adapter IS constructed AFTER `open_flight()` returns; the process does NOT exit with status 2 + +**AC-6: writer.stop() is called on the abort path** +Given AC-1's setup AND a writer test double that records `start` / `stop` calls +When the takeoff aborts +Then `writer.stop()` was called exactly once after the FdrOpenError; the filelock is released (a 
subsequent process can construct a new writer for the same `flight_root` without error) + +**AC-7: Non-FdrOpenError exceptions are NOT caught by this handler** +Given a writer that raises `RuntimeError("boom")` from `open_flight` (NOT FdrOpenError) +When the takeoff sequence runs +Then the `RuntimeError` propagates UP (it is not swallowed by the FdrOpenError handler); the process exits with status 1 (generic failure path) — NOT status 2 + +**AC-8: Strict ordering — FdrWriter constructed and started before FC adapter constructor is called** +Given a composition-root unit test that records the order of constructor calls +When the takeoff sequence runs (success path) +Then the order is: `FileFdrWriter.__init__` → `writer.start()` → `writer.open_flight(header)` → `` → ``; any other order fails the test + +## Non-Functional Requirements + +**Performance** +- The takeoff abort path completes within 500 ms of the FdrOpenError being raised (writer.stop() + log + stderr write + exit). Operators must see the abort signal immediately, not after a long teardown. + +**Reliability** +- The abort path MUST NOT itself raise into the caller. Any exception inside the abort handler (e.g. `writer.stop()` itself raising) is swallowed with a SECOND ERROR log; the process still exits with status 2 (with `os._exit(2)` if `sys.exit(2)` is intercepted somehow). 
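
The abort path specified above (log one ERROR, roll back the writer, print the FATAL line, exit 2, and let any other exception propagate) could be sketched as below. The helper name `open_flight_or_abort`, the `log_error(kind, kv)` callable, and the `composition_root.takeoff_abort_cleanup_failed` kind used for the second ERROR log are hypothetical names chosen for the example; `FdrOpenError` is redefined locally as a stand-in for the AZ-292 type. The `os._exit(2)` fallback is not shown.

```python
import sys
from typing import Any, Callable, Dict, TextIO

EXIT_GENERIC_FAILURE = 1
EXIT_FDR_OPEN_FAILURE = 2


class FdrOpenError(Exception):
    """Stand-in for the error raised by FileFdrWriter.open_flight (AZ-292)."""


def open_flight_or_abort(
    writer: Any,
    header: Any,
    flight_root: str,
    log_error: Callable[[str, Dict[str, str]], None],
    stderr: TextIO = sys.stderr,
) -> None:
    """Open the flight, or abort takeoff with exit status 2 on FdrOpenError."""
    try:
        writer.open_flight(header)
        return  # success: the caller may now construct the C8 FC adapter
    except FdrOpenError as exc:  # any other exception type propagates (AC-7)
        underlying = str(exc)

    log_error(
        "composition_root.takeoff_aborted",
        {"reason": "fdr_open_error", "underlying": underlying,
         "flight_root": str(flight_root)},
    )
    try:
        writer.stop()  # releases the filelock taken by start()
    except Exception as stop_exc:
        # The abort path must not raise; record the failure and carry on.
        log_error(
            "composition_root.takeoff_abort_cleanup_failed",  # hypothetical kind
            {"underlying": str(stop_exc)},
        )
    print(
        f"FATAL: cannot open FDR at {flight_root}: {underlying}; "
        "aborting takeoff (exit 2)",
        file=stderr,
    )
    sys.exit(EXIT_FDR_OPEN_FAILURE)
```

Because `sys.exit` raises `SystemExit`, an in-process test can catch it and inspect `code`; an end-to-end check of the real exit status is better run in a subprocess.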
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | composition_root with read-only flight_root | exit status 2; no other component constructed; filelock released | +| AC-2 | Capture stderr | Exactly one matching FATAL line naming the flight_root path | +| AC-3 | Parse log records | Exactly one ERROR record with the documented kind + kv | +| AC-4 | Mock C8 adapter; trigger abort | C8 constructor `call_count == 0` | +| AC-5 | Writable flight_root | open_flight succeeds; C8 IS constructed after; no exit 2 | +| AC-6 | Mock writer; trigger abort | writer.stop() called exactly once | +| AC-7 | Writer raises RuntimeError from open_flight | RuntimeError propagates; exit status 1, not 2 | +| AC-8 | Spy on constructor / method invocation order | Strict order: writer init → start → open_flight → C8 init → C8 open | +| NFR-perf-abort | Time abort path from FdrOpenError to exit | ≤ 500 ms | +| NFR-reliability-abort-resilience | writer.stop() raises during abort | Second ERROR logged; process still exits with 2 | + +## Constraints + +- The takeoff abort exit code is FIXED at `2`; changing it is a breaking change to the composition_root contract and operator runbooks. The constant `EXIT_FDR_OPEN_FAILURE=2` lives in the composition root and is documented in `composition_root_protocol.md`. +- The abort path uses `sys.exit(2)` first (so `atexit` handlers run and structured logs flush); only if `sys.exit` does not actually exit (e.g. caught somewhere up the stack — this should not happen but the abort handler is defensive) does it fall back to `os._exit(2)`. +- The stderr message format is fixed (matches AC-2's regex). Operator runbooks grep for this exact pattern to surface FDR misconfigs in the field. +- This task does NOT introduce new dependencies. `sys`, `os`, and the existing logger are sufficient. 
+ +## Risks & Mitigation + +**Risk 1: A future refactor moves the C8 FC adapter constructor BEFORE the FDR open** +- *Risk*: An optimization that "opens the FC adapter early to warm the link" silently breaks AC-NEW-3. +- *Mitigation*: AC-8's strict-ordering test runs in CI on every change to `runtime_root.py`. Code-review Phase 2 (Spec Compliance) explicitly checks that the FdrWriter open precedes the FC adapter constructor. + +**Risk 2: `sys.exit(2)` interferes with pytest test runners** +- *Risk*: The test asserts on exit status 2 but pytest catches the SystemExit and reports it as a test pass. +- *Mitigation*: The integration test runs the composition root in a subprocess (`subprocess.run([...])`) and asserts on `proc.returncode == 2`. Documented in the test fixture; pytest's in-process exit interception is sidestepped. + +**Risk 3: The abort handler swallows the FdrOpenError stack trace, making field debugging hard** +- *Risk*: The operator sees `FATAL: cannot open FDR at /path: ` but the underlying cause (e.g. ENOSPC vs. EACCES vs. ENOENT) is hidden. +- *Mitigation*: AC-3's `kv.underlying` field carries the full `str(exc)` from the FdrOpenError; the structured log record preserves the full causal chain. The stderr line is the operator-facing summary; the log is the debug trail. + +**Risk 4: Operators might want a "continue without FDR" override flag** +- *Risk*: Field debugging pressure leads to a `--ignore-fdr-failure` CLI flag that violates AC-NEW-3. +- *Mitigation*: This task EXPLICITLY excludes such an override (per the Excluded section). The contract update documents that no such override is permitted; adding one is a major-version bump on `composition_root_protocol` AND a security-review-required change. Documented as a constraint. + +## Runtime Completeness + +- **Named capability**: AC-NEW-3 every-payload-class-from-t=0 takeoff gate (architecture / E-C13 / AC-NEW-3 / C13-IT-06 / RESTRICT-UAV-4). 
+- **Production code that must exist**: real composition-root takeoff-sequence ordering, real `try/except FdrOpenError` handler, real `sys.exit(2)` (with `os._exit(2)` fallback), real `writer.stop()` rollback, real ERROR log + stderr message. +- **Allowed external stubs**: tests MAY use a subprocess + temp-directory `flight_root`; production wiring uses the real composition root. +- **Unacceptable substitutes**: a "warning, not abort" path ("the operator can decide"), exit code 1 ("we don't need a separate FDR-failure code"), opening the FC adapter before the FDR ("optimisation; we'll close it if FDR fails") — all break C13-IT-06 and the AC-NEW-3 invariant. diff --git a/_docs/02_tasks/todo/AZ-297_c7_runtime_protocol.md b/_docs/02_tasks/todo/AZ-297_c7_runtime_protocol.md new file mode 100644 index 0000000..c24f4f4 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-297_c7_runtime_protocol.md @@ -0,0 +1,167 @@ +# C7 InferenceRuntime Protocol + Composition-Root Selection + +**Task**: AZ-297_c7_runtime_protocol +**Name**: C7 InferenceRuntime Protocol +**Description**: Define the `InferenceRuntime` Protocol, its DTOs (`BuildConfig`, `EngineCacheEntry`, `EngineHandle`, `ThermalState`), the runtime error taxonomy, and the composition-root selection switch that wires exactly one of `TensorrtRuntime` / `OnnxTrtEpRuntime` / `PytorchFp16Runtime` at startup based on ADR-001 (config) and ADR-002 (`BUILD_*` flags). This is the foundational shared-API task for E-C7 — every other E-C7 task implements this Protocol, and five external components (C2, C2.5, C3, C3.5, C10) plus C4 (ThermalState consumer) depend on the contract this task freezes. 
+**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema +**Component**: c7_inference (epic AZ-249 / E-C7) +**Tracker**: AZ-297 +**Epic**: AZ-249 (E-C7) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — `EngineCacheEntry` carries the sha256 of the engine binary; this contract defines that representation. +- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — `EngineCacheEntry` carries the parsed `(SM, JP, TRT, precision)` tuple from the filename schema. +- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — runtime selection is a Config field; this contract defines the field and the runtime-label vocabulary. +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — error events emitted by Protocol implementations use this log shape. + +## Problem + +Five different components (C2 VPR backbone, C2.5 ReRanker, C3 CrossDomainMatcher, C3.5 AdHoP, C10 CacheProvisioner) and one consumer of the thermal-throttle telemetry feed (C4 Pose) all need a single, frozen interface to the on-Jetson inference runtime. Without it: + +- Each consumer would import a concrete TRT / ONNX-RT / PyTorch class directly, hard-coding the runtime choice and breaking ADR-001's runtime selectability. +- `BUILD_TENSORRT_RUNTIME=OFF` (Tier-0 workstation) would not compile because consumers depend on TRT-specific symbols. +- The composition root would have to know per-component which runtime is acceptable; today only ADR-001 (config) + ADR-002 (`BUILD_*` flags) decide. +- Error handling would diverge per runtime; `EngineHashMismatchError` (D-C10-3) and `EngineSchemaMismatchError` (D-C10-7) would have different shapes per implementation, making the F2 takeoff abort path fragile. +- The C4 hybrid covariance decision (D-CROSS-LATENCY-1) would have no canonical `ThermalState` shape to read. 
+ +This task delivers the typed boundary every consumer reads against and every implementation conforms to. It writes no runtime logic — the concrete TRT / ONNX-RT / PyTorch strategies are AZ-298 / AZ-299 / AZ-300. + +## Outcome + +- An `InferenceRuntime` Protocol (PEP 544 `typing.Protocol`) is exported from `src/gps_denied_onboard/components/c7_inference/interface.py` and re-exported from the component's `__init__.py`. +- The DTOs `BuildConfig`, `EngineCacheEntry`, `ThermalState` are frozen dataclasses at the same import path, alongside the opaque `EngineHandle` marker class (not frozen; implementations subclass it); field shape and invariants match the contract file. +- The runtime error taxonomy is a single hierarchy under `c7_inference.errors`: `RuntimeError` ← {`EngineBuildError`, `EngineDeserializeError`, `EngineHashMismatchError`, `EngineSchemaMismatchError`, `EngineSidecarMissingError`, `CalibrationCacheError`, `InferenceError`, `OutOfMemoryError`, `TelemetryUnavailableError`}. Every implementation raises only these; consumers catch only these. +- The composition root has a `build_inference_runtime(config: Config) -> InferenceRuntime` factory function that selects the strategy by `config.inference.runtime` (`tensorrt` | `onnx_trt_ep` | `pytorch_fp16`) and respects compile-time `BUILD_*` gating: requesting a strategy whose `BUILD_*` flag is OFF raises `RuntimeNotAvailableError` at composition time (NOT at first inference). +- Every implementation's `current_runtime_label()` returns the lowercase label matching the config value (`"tensorrt"`, `"onnx_trt_ep"`, `"pytorch_fp16"`); this is the FDR-stamped label for AC-NEW-3 audit. +- A frozen contract file at `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` carries the full shape; consumers read that file, not this task spec.
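The typed boundary listed in the AZ-297 Outcome can be pictured with a minimal single-file sketch. It is illustrative only: `InferenceRuntimeError` is a stand-in name for `c7_inference.errors.RuntimeError` (renamed so the builtin is not shadowed), only two of the nine subtypes are shown, the `ThermalState` field names and the loose `Any` signatures are assumptions, and `FakeRuntime` / `PartialRuntime` / `is_frozen` are hypothetical helpers. The real shape lives in the contract file.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Any, Mapping, Protocol, runtime_checkable


class InferenceRuntimeError(Exception):
    """Stand-in for c7_inference.errors.RuntimeError (assumed name)."""


class EngineHashMismatchError(InferenceRuntimeError):
    pass


class InferenceError(InferenceRuntimeError):
    pass


@dataclass(frozen=True)
class ThermalState:
    # Field names are assumptions, not the contract's shape.
    temperature_c: float
    throttled: bool


class EngineHandle:
    """Opaque token; each strategy subclasses it with its own state."""


@runtime_checkable
class InferenceRuntime(Protocol):
    def compile_engine(self, model_path: str, build_config: Any) -> Any: ...
    def deserialize_engine(self, entry: Any) -> EngineHandle: ...
    def infer(self, handle: EngineHandle, inputs: Mapping[str, Any]) -> Mapping[str, Any]: ...
    def release_engine(self, handle: EngineHandle) -> None: ...
    def thermal_state(self) -> ThermalState: ...
    def current_runtime_label(self) -> str: ...


class FakeRuntime:
    """Conforms structurally; it does not inherit from the Protocol."""

    def compile_engine(self, model_path, build_config): return None
    def deserialize_engine(self, entry): return EngineHandle()
    def infer(self, handle, inputs): return dict(inputs)
    def release_engine(self, handle): return None
    def thermal_state(self): return ThermalState(45.0, False)
    def current_runtime_label(self): return "pytorch_fp16"


class PartialRuntime:
    """Omits five of the six methods, so the isinstance check fails."""

    def infer(self, handle, inputs): return dict(inputs)


def is_frozen(instance: ThermalState) -> bool:
    """Return True if assigning to a field raises FrozenInstanceError."""
    try:
        instance.throttled = True  # type: ignore[misc]
    except FrozenInstanceError:
        return True
    return False
```

One caveat on this approach: `isinstance` against a `runtime_checkable` Protocol checks only that the method names exist, not their signatures. Signature parity would have to come from a static type checker or from the introspection-based contract test.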
+ +## Scope + +### Included + +- `InferenceRuntime` Protocol with the six methods from `_docs/02_document/components/09_c7_inference/description.md` § 2: `compile_engine`, `deserialize_engine`, `infer`, `release_engine`, `thermal_state`, `current_runtime_label`. +- DTO dataclasses for `BuildConfig`, `EngineCacheEntry`, `EngineHandle` (opaque marker class), `ThermalState`. All frozen except `EngineHandle` (which is opaque to consumers — implementations subclass). +- Error hierarchy under `c7_inference.errors`; every error type the Protocol promises; all are derived from a common `c7_inference.errors.RuntimeError` so consumers can catch the family. +- `build_inference_runtime(config) -> InferenceRuntime` composition-root factory in `src/gps_denied_onboard/runtime_root/inference_factory.py`. Imports the concrete strategy lazily — guarded by `if BUILD_TENSORRT_RUNTIME: from c7_inference.tensorrt_runtime import TensorrtRuntime` so an OFF flag does not force an import. +- A `RuntimeNotAvailableError` raised by the factory when the requested strategy is not built into this binary. +- A `ConfigSchemaError` extension to AZ-269's config loader for the new `config.inference.runtime` enum + the optional `config.inference.thermal_poll_hz` (default 1.0) + `config.inference.engine_cache_dir` fields. +- The contract file at `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` filled per `decompose/templates/api-contract.md` with Shape, Invariants, Non-Goals, Versioning Rules, and at least three Test Cases. +- Type-only unit tests that verify each concrete strategy module's class actually conforms to the Protocol via `runtime_checkable` + `isinstance` (catches drift at CI time, not deployment). + +### Excluded + +- `TensorrtRuntime` implementation — AZ-298. +- `OnnxTrtEpRuntime` implementation — AZ-299. +- `PytorchFp16Runtime` implementation — AZ-300. +- `EngineGate` validator — AZ-301 (this task defines the error types it raises, not the validator). 
+- Background thermal-state polling loop — AZ-302 (this task defines the `ThermalState` DTO and the `thermal_state()` Protocol method, not the polling thread). +- C4 hybrid covariance-mode consumer wiring — owned by E-C4. +- C10 CacheProvisioner consumer wiring of `compile_engine` — owned by E-C10. + +## Acceptance Criteria + +**AC-1: Protocol is conformance-checkable** +Given a class that implements all six Protocol methods with matching signatures +When `isinstance(impl, InferenceRuntime)` is evaluated under `runtime_checkable` +Then the result is `True`; for a class that omits any method, the result is `False` + +**AC-2: Frozen DTOs reject mutation** +Given a constructed `BuildConfig(precision=Fp16, ...)`, `EngineCacheEntry(...)`, or `ThermalState(...)` instance +When the test attempts `instance.precision = Int8` (or any field reassignment) +Then `dataclasses.FrozenInstanceError` is raised; the original value is preserved + +**AC-3: Error hierarchy catchable as a single family** +Given any of the nine documented error subtypes +When the consumer wraps an implementation call in `try: ... 
except c7_inference.errors.RuntimeError` +Then every documented subtype is caught; an unrelated `Exception` is NOT caught (the Protocol's error envelope does not leak into general exception handling) + +**AC-4: Composition-root factory honours config** +Given `config.inference.runtime = "tensorrt"` and `BUILD_TENSORRT_RUNTIME=ON` +When `build_inference_runtime(config)` is called +Then a `TensorrtRuntime` instance is returned and `instance.current_runtime_label() == "tensorrt"` + +**AC-5: Composition-root factory honours BUILD flag gate** +Given `config.inference.runtime = "tensorrt"` and `BUILD_TENSORRT_RUNTIME=OFF` +When `build_inference_runtime(config)` is called +Then `RuntimeNotAvailableError` is raised at composition time with a message naming `"tensorrt"`; no module-level import of TRT symbols has occurred (verifiable via `sys.modules`) + +**AC-6: Unknown runtime label rejected at config load** +Given `config.inference.runtime = "tensorflow_lite"` (not in the enum) +When the config is loaded via AZ-269's loader +Then `ConfigSchemaError` is raised at load time with a message listing the valid values; `build_inference_runtime` is never reached + +**AC-7: `current_runtime_label()` matches config value exactly** +Given any selectable runtime +When `instance.current_runtime_label()` is called +Then the returned string is one of `"tensorrt"`, `"onnx_trt_ep"`, `"pytorch_fp16"` and equals `config.inference.runtime`; AC-NEW-3 audit relies on this exact-match property + +**AC-8: Contract file matches Protocol shape** +Given the contract file at `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` +When a contract-test parses the Shape section's method/field tables and compares against the runtime Protocol via introspection +Then every method, every field, every error type is present and consistent in both + +## Non-Functional Requirements + +**Compatibility** +- The Protocol is `typing.Protocol` (PEP 544 structural typing) so existing components 
that import the concrete TRT class today (none yet — this is greenfield) can be retrofitted without inheritance changes. +- All error types subclass `Exception` (not `BaseException`) so `except Exception:` in upstream layers continues to work as expected. + +**Performance** +- The factory `build_inference_runtime` returns within 200 ms (it imports + constructs one strategy; the heavy GPU work happens inside the strategy's own `compile_engine` / `deserialize_engine` calls — not the factory). +- DTO construction (`BuildConfig`, `EngineCacheEntry`, `ThermalState`) is dataclass-frozen; per-instance overhead is the bare-cost dataclass `__init__`. + +**Reliability** +- The Protocol is the boundary of acceptable runtime errors. Implementations MUST NOT raise other types into consumers; if a third-party library (TRT, ONNX-RT, PyTorch) raises something else, the implementation catches and rewraps into the documented family. +- Versioning: any breaking change to the Protocol or its DTOs MUST bump the contract file's `Version` and notify every consumer task listed in the contract header. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | `runtime_checkable` Protocol vs. a fully-implementing fake; vs. 
a fake missing one method | `isinstance` returns True for full, False for partial | +| AC-2 | Mutation attempt on each frozen DTO | `FrozenInstanceError` raised; original value preserved | +| AC-3 | Raise each of the nine error subtypes; catch as `c7_inference.errors.RuntimeError` | All caught; an unrelated `ValueError` is NOT caught by the same handler | +| AC-4 | `build_inference_runtime` with `tensorrt` + flag ON → fake `TensorrtRuntime` | Returned instance is `TensorrtRuntime`; `current_runtime_label()` == `"tensorrt"` | +| AC-5 | `build_inference_runtime` with `tensorrt` + flag OFF | `RuntimeNotAvailableError`; `sys.modules` does NOT contain `c7_inference.tensorrt_runtime` | +| AC-6 | Config load with invalid `runtime` value | `ConfigSchemaError`; valid values listed in message | +| AC-7 | `current_runtime_label()` for each strategy | Matches the config value used to construct it | +| AC-8 | Contract introspection vs. Protocol introspection | Shape parity test passes | +| NFR-perf-factory | Microbench `build_inference_runtime` × 1000 | p99 ≤ 200 ms (dominated by lazy import on first call; subsequent calls << 1 ms) | +| NFR-reliability-error-family | All nine subtypes inherit from `c7_inference.errors.RuntimeError` | Verified via `issubclass` for each | + +## Constraints + +- The Protocol uses `typing.Protocol` from stdlib; no third-party Protocol library is introduced. +- DTO dataclasses use stdlib `dataclasses` with `frozen=True`; no `pydantic` or `attrs` dependency. +- `EngineHandle` is an opaque marker class — consumers MUST NOT introspect its fields. Each strategy subclasses with implementation-specific state. The Protocol exposes `EngineHandle` as the type but consumers treat it as a token to pass back to the same strategy. +- Lazy import of concrete strategies is mandatory. 
The factory's `if BUILD_TENSORRT_RUNTIME: from c7_inference.tensorrt_runtime import TensorrtRuntime` block is not optional — it is the mechanism by which Tier-0 workstation builds compile without TRT installed. +- The contract file at `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` is the source of truth. If the Protocol shape changes here without the contract updating, that is a Spec-Gap finding (High) per code-review skill Phase 2. +- This task does NOT add new third-party dependencies — `typing.Protocol`, `dataclasses`, `enum` are stdlib. + +## Risks & Mitigation + +**Risk 1: Protocol drift between contract and code** +- *Risk*: Implementations diverge from the contract over time; consumers cannot tell which is canonical. +- *Mitigation*: AC-8 contract-introspection test runs in CI; any drift fails the test before merge. The contract file's `## Test Cases` section names this exact test. + +**Risk 2: Lazy-import gating is bypassed by a transitively-imported module** +- *Risk*: A consumer imports `c7_inference` (the package) and the package's `__init__.py` eagerly imports a concrete strategy, triggering the TRT import even when `BUILD_TENSORRT_RUNTIME=OFF`. +- *Mitigation*: The package `__init__.py` re-exports ONLY the Protocol and DTOs and errors — it does NOT import any concrete strategy. AC-5 verifies via `sys.modules` that no strategy module is loaded during a Tier-0 factory call. + +**Risk 3: Error hierarchy widens silently** +- *Risk*: A future strategy adds a tenth error type without updating the contract or the family base class. +- *Mitigation*: The contract file lists the canonical nine. Implementations MUST raise only members of `c7_inference.errors.RuntimeError`; a strategy raising a non-family error is a Spec-Gap finding (High) at code-review time. AC-3's catch-as-family test catches the obvious case. 
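The lazy-import gating required by the Constraints and guarded in Risk 2 could be sketched roughly as follows. This is a simplification under stated assumptions: the spec's factory takes a `Config` object, whereas this sketch takes the runtime label and a flag mapping directly; `StrategySpec` and the `STRATEGIES` registry are hypothetical; the `BUILD_ONNX_TRT_EP_RUNTIME` and `BUILD_PYTORCH_FP16_RUNTIME` flag names and the `c7_inference.pytorch_runtime` module path are guesses, since only the TensorRT flag and module are named in this spec.

```python
import importlib
import sys
from dataclasses import dataclass
from typing import Any, Mapping


class RuntimeNotAvailableError(Exception):
    """The requested strategy is not built into this binary."""


@dataclass(frozen=True)
class StrategySpec:
    build_flag: str   # BUILD_* flag that gates the strategy
    module: str       # imported only after the flag check passes
    class_name: str


# Hypothetical registry; entries other than "tensorrt" use assumed names.
STRATEGIES: Mapping[str, StrategySpec] = {
    "tensorrt": StrategySpec(
        "BUILD_TENSORRT_RUNTIME", "c7_inference.tensorrt_runtime", "TensorrtRuntime"
    ),
    "onnx_trt_ep": StrategySpec(
        "BUILD_ONNX_TRT_EP_RUNTIME", "c7_inference.onnx_trt_runtime", "OnnxTrtEpRuntime"
    ),
    "pytorch_fp16": StrategySpec(
        "BUILD_PYTORCH_FP16_RUNTIME", "c7_inference.pytorch_runtime", "PytorchFp16Runtime"
    ),
}


def build_inference_runtime(
    runtime_label: str,
    build_flags: Mapping[str, bool],
    strategies: Mapping[str, StrategySpec] = STRATEGIES,
) -> Any:
    """Select one strategy by label and import it only if its flag is ON."""
    spec = strategies.get(runtime_label)
    if spec is None:
        # In the spec the config loader rejects this earlier (ConfigSchemaError).
        raise ValueError(f"unknown runtime label: {runtime_label!r}")
    if not build_flags.get(spec.build_flag, False):
        # Raised before any import, so no strategy module is loaded.
        raise RuntimeNotAvailableError(
            f"runtime {runtime_label!r} is not built into this binary "
            f"({spec.build_flag}=OFF)"
        )
    module = importlib.import_module(spec.module)
    return getattr(module, spec.class_name)()


def strategy_module_loaded(runtime_label: str) -> bool:
    """Report whether the strategy's module has been imported."""
    return STRATEGIES[runtime_label].module in sys.modules
```

Because the flag check precedes `importlib.import_module`, a call with the flag OFF fails without loading the strategy module, and a test can confirm that through `sys.modules`. The gating only holds if the package `__init__.py` also avoids importing any concrete strategy.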
+ +## Runtime Completeness + +- **Named capability**: typed Protocol + DTOs + error envelope + composition-root selection (architecture / E-C7 / ADR-001 + ADR-002 + ADR-009). +- **Production code that must exist**: real Protocol declaration, real frozen DTOs, real error hierarchy, real composition-root factory with lazy-import gating, real config-loader extension for the runtime enum. +- **Allowed external stubs**: tests MAY substitute fake strategy classes that conform to the Protocol; production wiring uses the real strategies from AZ-298 / AZ-299 / AZ-300. +- **Unacceptable substitutes**: ABCs instead of `typing.Protocol` (would force inheritance changes downstream), `pydantic.BaseModel` instead of `@dataclass(frozen=True)` (would add a runtime validation layer this task does not need), eager imports of concrete strategies in `__init__.py` (would defeat `BUILD_*` gating), or a `runtime: str` config field without an enum (would lose the load-time validation in AC-6). + +## Contract + +This task produces/implements the contract at `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md`. +Consumers MUST read that file — not this task spec — to discover the interface. diff --git a/_docs/02_tasks/todo/AZ-298_c7_tensorrt_runtime.md b/_docs/02_tasks/todo/AZ-298_c7_tensorrt_runtime.md new file mode 100644 index 0000000..571fd84 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-298_c7_tensorrt_runtime.md @@ -0,0 +1,196 @@ +# C7 TensorrtRuntime — Engine Compile + Deserialize + Infer + GPU Memory + +**Task**: AZ-298_c7_tensorrt_runtime +**Name**: C7 TensorrtRuntime +**Description**: Implement `TensorrtRuntime`, the production-default `InferenceRuntime` strategy on JetPack 6.2 + TensorRT 10.3 (per D-C7-9). 
Owns the full TRT lifecycle: engine compilation via the Polygraphy / trtexec / IBuilderConfig hybrid (FP16 + INT8 + Mixed precision; INT8 calibration cache trust enforcement); engine deserialization at F2 takeoff load (delegating manifest content-hash + filename schema validation to AZ-301 EngineGate); per-flight resident `EngineHandle` GPU memory management; sync per-call `infer` on the F3 hot path with per-model latency budgets from C7-PT-01; release on flight end; CUDA stream ownership. +**Complexity**: 5 points +**Dependencies**: AZ-297_c7_runtime_protocol, AZ-301_c7_engine_gate, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module +**Component**: c7_inference (epic AZ-249 / E-C7) +**Tracker**: AZ-298 +**Epic**: AZ-249 (E-C7) + +### Document Dependencies + +- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — the Protocol this task implements; produced by AZ-297. +- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — the schema parser used at deserialise time (delegated to EngineGate). +- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — engine sidecar trust check (delegated to EngineGate). +- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config object that provides engine cache dir, calibration dataset path, precision selection. + +## Problem + +Without a real TensorRT 10.3 strategy, the on-Jetson hot path cannot meet the C7-PT-01 per-model latency targets (UltraVPR ≤ 60 ms, LightGlue ≤ 30 ms, AdHoP ≤ 90 ms, DISK ≤ 50 ms p95) and the AC-4.1 system E2E budget < 400 ms p95 collapses. The Protocol from AZ-297 is just types; this task is what actually runs ML on the Jetson Orin Nano Super. + +Concretely, without this task: + +- F1 pre-flight engine compilation has no production producer; C10 CacheProvisioner cannot build the engine cache. +- F2 takeoff load has no engine deserialiser; flights cannot start. 
+- F3 hot path has no `infer`; every consumer (C2 / C2.5 / C3 / C3.5) has nothing to call. +- INT8 calibration cache trust (D-C10-6) has no production gate at compile time. +- Per-flight GPU memory budget (NFT-LIM-01: ≤ 4 GB resident across all engines, C7-PT-02) has no enforcement point. + +This task delivers the canonical production runtime — every other strategy (AZ-299 ONNX-RT, AZ-300 PyTorch) is a fallback against this one's numbers. + +## Outcome + +- A `TensorrtRuntime` class at `src/gps_denied_onboard/components/c7_inference/tensorrt_runtime.py` conforming to the `InferenceRuntime` Protocol from AZ-297; `current_runtime_label() == "tensorrt"`. +- `compile_engine(model_path, build_config) -> EngineCacheEntry` produces an FP16 / INT8 / Mixed engine using the Polygraphy + trtexec + IBuilderConfig hybrid: Polygraphy for orchestration + ONNX import; trtexec for build profiling and binary outputs where appropriate; IBuilderConfig for fine-grained TRT controls (workspace size, optimization profiles, INT8 calibrator). The output `.engine` file lands at the cache path with the D-C10-7 filename schema and a sha256 sidecar from AZ-280. +- The INT8 calibration cache is trusted iff the dataset hash stamped in its header matches the current calibration-dataset hash and its `.calib_cache.sha256` sidecar matches the cache file. A dataset-hash mismatch forces a full rebuild (AC-3); a sidecar mismatch raises `CalibrationCacheError` and builds nothing (AC-4). A stale or corrupted cache is never silently reused. +- `deserialize_engine(EngineCacheEntry) -> EngineHandle` calls AZ-301 `EngineGate` first (raises `EngineHashMismatchError` / `EngineSchemaMismatchError` / `EngineSidecarMissingError` on refusal); on success, builds an `IExecutionContext`, allocates GPU buffers per the optimization profile's `opt_shape`, allocates one CUDA stream, and returns a `TrtEngineHandle` (`EngineHandle` subclass) that holds the runtime-resident state. +- `infer(handle, inputs) -> outputs` does sync GPU stream execution: H2D copy of every named input → `enqueueV3` on the engine's exec context → D2H copy of every named output → stream sync → return. 
No pinned-memory pool reuse beyond what TRT itself allocates (keeps the simple-baseline path viable). +- `release_engine(handle)` frees GPU buffers, destroys the exec context, releases the CUDA stream. Called once per `EngineHandle` at flight end (or on `OutOfMemoryError`-driven cleanup). +- Concurrent `infer` calls against different engines are serialised on a single CUDA stream per Runtime instance (matches the description.md "typically one stream because the F3 hot path is single-threaded"). +- `EngineBuildError` on compile failure (e.g., ONNX op unsupported by TRT 10.3 plugins); `EngineDeserializeError` on engine corruption; `InferenceError` on transient CUDA fault; `OutOfMemoryError` on GPU OOM with the offending engine name in the message. +- The `current_runtime_label()` returns `"tensorrt"` exactly; the FDR-stamped runtime label is consistent across logs, FDR records, and operator post-flight inspection. + +## Scope + +### Included + +- `TensorrtRuntime` class implementation conforming to the AZ-297 Protocol. +- `compile_engine`: Polygraphy network creation + ONNX parser + IBuilderConfig setup (workspace MB from `BuildConfig`, optimization profiles, FP16 / INT8 / Mixed flags) + INT8 calibrator wiring with calibration-dataset path from `BuildConfig.calibration_dataset` + trtexec invocation for build (subprocess) when `BuildConfig.use_trtexec=True` (faster on FP16; IBuilderConfig direct path is the default for INT8). Output: `.engine` file written via `helpers.sha256_sidecar.atomic_write_with_sidecar`. +- INT8 calibration cache trust: at compile time, write a `.calib_cache` file alongside the `.engine` and a sidecar `.calib_cache.sha256`. The calibration dataset's content hash is stamped into the cache header. On reuse, the calibrator reads the existing cache iff the dataset hash matches; otherwise rebuilds and overwrites — never silently uses a stale cache. 
+- `deserialize_engine`: invokes AZ-301 `EngineGate.validate(EngineCacheEntry)` first (raising the gate's documented errors); then loads the engine via `IRuntime.deserialize_cuda_engine`, builds `IExecutionContext`, allocates GPU buffers from `opt_shape`, returns `TrtEngineHandle`. +- `infer`: sync GPU stream execution per the description.md § 5; H2D and D2H via `cudaMemcpyAsync` on the owned stream; `enqueueV3`; stream sync; per-frame DEBUG log with backbone name and elapsed milliseconds. +- `release_engine`: destroys the exec context, frees GPU buffers, releases the stream; idempotent on a handle that was already released (returns silently). +- `thermal_state()` is delegated to AZ-302's publisher via a constructor-injected `ThermalStatePublisher` reference; this task wires the delegation, AZ-302 owns the polling loop. +- Per-engine GPU memory budget enforcement: at deserialize time, sum the buffer allocations against `config.inference.gpu_memory_budget_bytes` (default 4 GB per C7-PT-02). Refuse to deserialize if the budget would be exceeded — raises `OutOfMemoryError` BEFORE allocating, with a message identifying which engine pushed it over. +- Error envelope per the AZ-297 Protocol: `EngineBuildError`, `EngineDeserializeError`, `CalibrationCacheError`, `InferenceError`, `OutOfMemoryError`. +- Diagnostic INFO log on `compile_engine` start/end with elapsed seconds and output path; INFO log on `deserialize_engine` with engine identity + warm-up confirmation; per-frame DEBUG log on `infer` (off by default; enabled by config). +- A standalone CLI entry point `python -m c7_inference.tensorrt_runtime compile ` is exposed for C10 CacheProvisioner to invoke pre-flight without holding a runtime instance. The CLI is a thin wrapper around `compile_engine`; it is not a separate compile path. + +### Excluded + +- AZ-301 EngineGate validation logic (filename-schema parse, manifest content-hash check) — this task only INVOKES the gate. 
+- AZ-302 ThermalState polling thread — this task only delegates to a publisher reference. +- AZ-299 OnnxTrtEpRuntime fallback — separate task. +- AZ-300 PytorchFp16Runtime baseline — separate task. +- C10 CacheProvisioner orchestration of compile_engine — owned by E-C10. This task exposes the API; C10 calls it. +- Engine warm-up beyond the deserialize-side buffer allocation (the warm-up that AC-NEW-1 measures end-to-end is owned by C10's pre-flight orchestration; the per-engine warm-up cost lives here). +- ONNX op unsupported by TRT 10.3 — out of scope to "fix" via plugins; raises `EngineBuildError` and the operator chooses ONNX-RT fallback (AZ-299). +- Multi-stream concurrent execution — out of scope this cycle (description.md notes the F3 hot path is single-threaded; future task if needed). + +## Acceptance Criteria + +**AC-1: compile_engine produces an FP16 .engine + sidecar at the canonical path** +Given a valid ONNX model and `BuildConfig(precision=Fp16, ...)` +When `compile_engine(model_path, build_config)` runs to completion +Then a `.engine` file exists at the D-C10-7 filename schema path with a matching `.sha256` sidecar from AZ-280; the returned `EngineCacheEntry` carries that path, the sha256, and the `(SM, JP, TRT, precision)` tuple + +**AC-2: compile_engine produces an INT8 .engine with a calibration cache + sidecar** +Given a valid ONNX model, a calibration dataset directory with at least 100 images, and `BuildConfig(precision=Int8, calibration_dataset=...)` +When `compile_engine` runs +Then a `.engine` file is produced AND a `.calib_cache` file is produced with a `.calib_cache.sha256` sidecar; the cache header records the calibration-dataset content hash; a second `compile_engine` call with the same dataset reuses the cache (verifiable: the second call is < 30 s, the first is minutes) + +**AC-3: stale calibration cache forces rebuild** +Given an existing `.calib_cache` with dataset hash A, and the dataset on disk now hashes to B +When 
`compile_engine` runs with the same calibration_dataset path +Then the calibrator detects the mismatch, rebuilds the cache from scratch (cache header now records hash B), and writes a new sidecar; the prior cache is overwritten — no silent reuse, no `CalibrationCacheError` + +**AC-4: corrupted calibration cache raises CalibrationCacheError** +Given an existing `.calib_cache` whose sidecar `.calib_cache.sha256` does not match the cache file's actual sha256 +When `compile_engine` runs +Then `CalibrationCacheError` is raised; no engine is built; the cache file is NOT silently overwritten (operator must explicitly delete or replace) + +**AC-5: deserialize_engine invokes EngineGate before any GPU allocation** +Given an `EngineCacheEntry` whose engine file's filename schema does not match the running Jetson tuple +When `deserialize_engine(entry)` is called +Then `EngineGate.validate(entry)` is invoked first, raises `EngineSchemaMismatchError`; no `IRuntime.deserialize_cuda_engine` call is made; no GPU memory is allocated (verifiable via NVML before/after delta) + +**AC-6: deserialize_engine refuses when GPU memory budget would be exceeded** +Given a deserialize that would allocate 1.2 GB and a runtime instance currently holding 3.0 GB resident with budget = 4 GB +When `deserialize_engine` is called +Then `OutOfMemoryError` is raised with the engine name in the message; no partial allocation occurs; the prior 3.0 GB resident state is unchanged + +**AC-7: infer round-trips H2D + enqueueV3 + D2H on the owned stream** +Given a deserialised `TrtEngineHandle` and a fixed input dict +When `infer(handle, inputs)` is called and a profiling tool counts CUDA API events +Then the call sequence is: `cudaMemcpyAsync(H2D)` × N inputs, `enqueueV3`, `cudaMemcpyAsync(D2H)` × M outputs, `cudaStreamSynchronize`; output dict has the M expected named tensors + +**AC-8: per-model latency p95 budget** +Given the production-default UltraVPR / LightGlue / AdHoP / DISK FP16 engines deserialised on 
Tier-2 (Jetson Orin Nano Super) +When the C7-PT-01 latency benchmark runs (scripted load matching production) +Then per-model p95 latencies are: UltraVPR ≤ 60 ms, LightGlue ≤ 30 ms, AdHoP ≤ 90 ms, DISK ≤ 50 ms; failure thresholds (100 / 60 / 150 / 90 ms respectively) are NOT crossed + +**AC-9: GPU memory budget compliance** +Given all production-default engines resident concurrently +When the C7-PT-02 memory benchmark runs +Then GPU resident memory across engines ≤ 4 GB (failure threshold 5 GB) and process RAM ≤ 1.5 GB (failure threshold 2 GB) + +**AC-10: release_engine fully frees GPU state and is idempotent** +Given a deserialised handle holding 1 GB GPU memory +When `release_engine(handle)` is called once and then again +Then after the first call NVML reports 1 GB freed; after the second call no error is raised and NVML state is unchanged + +## Non-Functional Requirements + +**Performance** +- Per-model p95 latencies as in AC-8. +- GPU memory ≤ 4 GB resident, RAM ≤ 1.5 GB process — AC-9 / NFT-LIM-01 / NFT-LIM-04. +- `compile_engine` for FP16 sub-minute typical; INT8 minutes (calibration dominates) — bounded by hardware, not budget. +- `deserialize_engine` p95 ≤ 5 s per engine on Tier-2 (engineering sanity bound; AC-NEW-1 cold-start budget is end-to-end and is asserted by C7-IT-01 in the test phase). + +**Compatibility** +- TensorRT 10.3 only this cycle (per D-C7-9 JetPack 6.2 lock). No 9.x / 8.x compatibility shims. +- Built on Polygraphy + trtexec + IBuilderConfig — three TRT-supported entry points; no custom TRT builder calls outside that surface. + +**Reliability** +- All errors are caught and rewrapped into the AZ-297 family. No raw `RuntimeError` or TRT-specific exception leaks to consumers. +- `infer` NEVER blocks on the producer side; sync stream execution is on the consumer's calling thread. +- INT8 calibration cache trust is the lurking foot-gun (per description.md § 7); the AC-2 / AC-3 / AC-4 flow is the only protection. 
A new strategy or refactor MUST preserve these three. + +**Concurrency** +- One CUDA stream per `TensorrtRuntime` instance. The F3 hot path is single-threaded (per description.md); concurrent `infer` calls from different threads are NOT supported this cycle (would corrupt the stream). Documented as a constraint. +- Compile path (`compile_engine`) is allowed to run while a runtime instance is also serving `infer` calls only if the compile is in a separate process (the C10 CLI entry point); same-process concurrent compile + infer is NOT supported. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | compile FP16 with a tiny ONNX (sanity) | Engine + sidecar produced at canonical path; EngineCacheEntry shape correct | +| AC-2 | compile INT8 with a 100-image calibration dataset; rerun | Cache + sidecar produced; rerun reuses cache (under 30 s); cache header has dataset hash | +| AC-3 | mutate calibration dataset; rerun compile | Cache rebuilt; new dataset hash in header | +| AC-4 | corrupt calibration sidecar; rerun compile | `CalibrationCacheError`; engine NOT built | +| AC-5 | deserialise an engine with a mismatched filename SM | `EngineSchemaMismatchError` from gate; no GPU allocation (NVML diff = 0) | +| AC-6 | deserialise that would push past budget | `OutOfMemoryError`; prior resident state unchanged | +| AC-7 | infer with a profiler attached | Exact CUDA call sequence: N H2D, enqueueV3, M D2H, stream sync | +| AC-8 | C7-PT-01 microbench against the four production engines | p95 within budgets per the table | +| AC-9 | C7-PT-02 memory test with all engines resident | ≤ 4 GB GPU, ≤ 1.5 GB RAM | +| AC-10 | release_engine called twice | First call frees memory; second is a no-op | +| NFR-perf-deserialize | Microbench deserialise per engine | p95 ≤ 5 s on Tier-2 | +| NFR-reliability-error-rewrap | Inject a TRT C++ exception via mock | Rewrapped into the AZ-297 family; original message preserved | + +## 
Constraints + +- TensorRT 10.3 + JetPack 6.2 lock per D-C7-9. +- Polygraphy + trtexec + IBuilderConfig only — no custom TRT builder API outside this surface. +- `helpers.sha256_sidecar` (AZ-280) for atomic write + sidecar pattern. +- `helpers.engine_filename_schema` (AZ-281) for the D-C10-7 filename schema (consumed at deserialise via the gate). +- The `EngineHandle` subclass `TrtEngineHandle` is opaque to consumers; consumers MUST NOT introspect its fields. Implementation may add diagnostic fields (e.g., `last_infer_elapsed_ms`) but they are NOT part of the AZ-297 Protocol. +- The CLI entry point `python -m c7_inference.tensorrt_runtime compile ...` is for C10 to invoke; not part of any consumer's public API. +- This task introduces no new third-party dependencies beyond TensorRT 10.3, Polygraphy, and the trtexec binary that ships with TRT. +- Per-frame DEBUG logging defaults to OFF (would flood at 39 Hz aggregate per description.md); enabled only via `config.inference.per_frame_debug_log = true`. + +## Risks & Mitigation + +**Risk 1: ONNX op unsupported by TRT 10.3** +- *Risk*: A backbone author exports an op TRT cannot lower; `compile_engine` raises `EngineBuildError` and the operator has nothing to fall back to. +- *Mitigation*: `EngineBuildError` is surfaced to the operator pre-flight (per description.md error-handling spec); the operator switches the runtime config to `onnx_trt_ep` (AZ-299) which has wider op coverage. Documented in the operator playbook (out of scope here). + +**Risk 2: INT8 calibration silently uses a stale cache** +- *Risk*: AC-3 / AC-4 fails in a corner case (dataset mutated atomically, hash check races with compile start). +- *Mitigation*: The calibration-dataset hash is computed at the START of `compile_engine` and compared to the cache header's hash; the dataset is treated as immutable for the duration of the call. 
AC-3 + AC-4 cover the obvious cases; the corner case requires a separate test where the dataset is replaced mid-compile, which is operator error and out of scope. + +**Risk 3: GPU memory budget is exceeded under sustained use due to TRT internal scratch** +- *Risk*: `deserialize_engine` reports buffer size based on `opt_shape`, but TRT may allocate additional internal scratch beyond what is reported. +- *Mitigation*: AC-9 measures actual NVML resident memory, not reported buffer size; a regression here fails the test. The budget includes a safety margin (4 GB target with 5 GB hard fail) that absorbs typical TRT scratch. + +**Risk 4: trtexec subprocess hangs on a malformed ONNX** +- *Risk*: trtexec can hang silently on certain malformed inputs. +- *Mitigation*: The trtexec invocation has a config-driven timeout (default 600 s = 10 minutes); on timeout, the subprocess is killed and `EngineBuildError("trtexec timeout after 600s")` is raised. The IBuilderConfig direct path is preferred for INT8 anyway; trtexec is mainly for FP16 build profiling. + +## Runtime Completeness + +- **Named capability**: TensorRT 10.3 production runtime + Polygraphy / trtexec / IBuilderConfig hybrid + INT8 calibration cache trust + GPU memory budget enforcement (architecture / E-C7 / D-C7-9 / NFT-PERF-01 / NFT-LIM-01). +- **Production code that must exist**: real `TensorrtRuntime` class implementing the AZ-297 Protocol; real Polygraphy + trtexec + IBuilderConfig compile path; real `IRuntime.deserialize_cuda_engine` + `IExecutionContext` + GPU buffer allocations; real `enqueueV3` + sync-stream execution; real INT8 calibrator with dataset-hash cache trust; real GPU memory budget check at deserialise. +- **Allowed external stubs**: tests MAY substitute a `FakeCudaProfiler` to count CUDA events (AC-7); production wiring uses real TRT + real CUDA. C10 CacheProvisioner is the production caller of `compile_engine`; tests MAY drive the CLI directly. 
+- **Unacceptable substitutes**: a Python-level fake "engine" that bypasses TRT (would defeat the whole point), a calibration cache "always trusted" path (would break D-C10-6), an `infer` that uses `torch.compile` instead of `enqueueV3` (that's AZ-300 PyTorch baseline territory), or an in-memory engine that skips the file-on-disk path (would break the engine cache lifecycle that C10 + F2 takeoff load depend on). diff --git a/_docs/02_tasks/todo/AZ-299_c7_onnxrt_fallback.md b/_docs/02_tasks/todo/AZ-299_c7_onnxrt_fallback.md new file mode 100644 index 0000000..af8ce5d --- /dev/null +++ b/_docs/02_tasks/todo/AZ-299_c7_onnxrt_fallback.md @@ -0,0 +1,167 @@ +# C7 OnnxTrtEpRuntime — ONNX Runtime + TensorRT EP Fallback + +**Task**: AZ-299_c7_onnxrt_fallback +**Name**: C7 OnnxTrtEpRuntime +**Description**: Implement `OnnxTrtEpRuntime`, the fallback `InferenceRuntime` strategy that uses ONNX Runtime with the TensorRT execution provider. Triggered by config selection or by operator escalation when the TRT-direct path (AZ-298) cannot deserialise the cached engine for a given model — produces correct results with a degraded-latency WARN log (covers C7-IT-05). Conforms to the same AZ-297 Protocol so the composition root can select either strategy at startup. +**Complexity**: 3 points +**Dependencies**: AZ-297_c7_runtime_protocol, AZ-301_c7_engine_gate, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module +**Component**: c7_inference (epic AZ-249 / E-C7) +**Tracker**: AZ-299 +**Epic**: AZ-249 (E-C7) + +### Document Dependencies + +- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — the Protocol this task implements; produced by AZ-297. +- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — used at deserialise time when reusing a cached `.engine` via ORT TRT EP. +- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — sidecar trust check via the gate. 
+- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config object that provides ORT cache dir + ONNX model paths. + +## Problem + +Two scenarios force a fallback off `TensorrtRuntime`: + +1. The cached `.engine` for a given model fails to deserialise on this Jetson (filename-schema mismatch outside the gate's tolerance, an op TRT 10.3 dropped between calibrator runs, or operator-driven engine rename) — the system must keep flying without dropping the request. +2. The operator deliberately selects ORT + TRT EP via config (e.g., during a debugging session, or on a Tier-1 workstation where TRT-direct hard-fails). + +Without `OnnxTrtEpRuntime`: + +- C7-IT-05 (ONNX-RT fallback when TRT engine unavailable) has nothing to verify. +- The "simple-baseline" engineering rule that every fancy strategy must have a working fallback is violated. +- The ADR-001 "runtime selectable at startup" promise has only the production strategy and the PyTorch baseline; ORT is the middle ground users expect. + +## Outcome + +- An `OnnxTrtEpRuntime` class at `src/gps_denied_onboard/components/c7_inference/onnx_trt_runtime.py` conforming to the AZ-297 Protocol; `current_runtime_label() == "onnx_trt_ep"`. +- `compile_engine` is a no-op for ORT — it returns an `EngineCacheEntry` whose `engine_path` is the underlying `.onnx` file path. ORT will lazy-compile a TRT subgraph in-session on first use; the EP cache directory holds those subgraph caches transparently. +- `deserialize_engine(EngineCacheEntry) -> EngineHandle`: if the entry's `engine_path` is a `.engine` file, invoke AZ-301 EngineGate first (cached engine reuse via ORT TRT EP cache); if it is a `.onnx`, skip the gate and load the ONNX directly. In either case, build an `InferenceSession` with the TRT EP provider list `["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]` (provider fallback chain) and warm up by running one zero-input through the session. 
+- `infer(handle, inputs) -> outputs` calls `session.run(output_names, inputs)`; the call is sync; ORT manages its own CUDA stream internally. +- `release_engine(handle)` calls `session.end_profiling()` and drops the session reference; ORT releases EP resources on garbage collection. +- A degraded-latency WARN log (`kind="c7.fallback_to_onnx_trt_ep"`) is emitted ONCE on first `infer` call when the runtime was selected as a fallback (signalled by the composition root via a constructor flag); the operator post-flight FDR shows the fallback was used. +- `thermal_state()` is delegated to the same `ThermalStatePublisher` reference used by `TensorrtRuntime` (AZ-302 owns the publisher). +- ORT version pin matches the project's `requirements.txt` / equivalent; this task does NOT introduce a new ORT version. + +## Scope + +### Included + +- `OnnxTrtEpRuntime` class implementing the AZ-297 Protocol. +- `compile_engine`: returns an `EngineCacheEntry` pointing at the source `.onnx` file (no separate engine binary). The `.onnx`'s sha256 is computed via AZ-280 and stamped into the entry; the `(SM, JP, TRT, precision)` tuple is set to the host's running tuple (since ORT will lazy-compile per host). +- `deserialize_engine`: provider list construction (`["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]`), provider options (`trt_engine_cache_enable=True`, `trt_engine_cache_path=config.inference.ort_trt_cache_dir`, `trt_max_workspace_size=config.inference.gpu_memory_budget_bytes // 4`), session creation, single-shot warm-up. Returns `OnnxTrtEpEngineHandle` (`EngineHandle` subclass) wrapping the session. +- Engine-cache reuse path: if `EngineCacheEntry.engine_path.suffix == ".engine"`, invoke `EngineGate.validate(entry)` first; the engine binary is then placed at the `trt_engine_cache_path` so ORT's TRT EP picks it up on session creation. 
If gate refuses, raise the gate's error (no fallback to ONNX direct — the operator must explicitly switch to a different cache or restart with a different config). +- ONNX direct path: if `engine_path.suffix == ".onnx"`, skip the gate (no engine to validate); ORT compiles the TRT subgraph in-session. +- `infer`: `session.run(output_names, {name: ndarray for name, ndarray in inputs.items()})`; output is a dict matching the Protocol shape. Per-frame DEBUG log gated by config. +- `release_engine`: idempotent session drop. +- Constructor flag `is_fallback: bool = False` set by the composition root when this strategy is wired as the fallback (vs. selected directly). On first `infer`, if `is_fallback`, emit the `kind="c7.fallback_to_onnx_trt_ep"` WARN log + an FDR record (via `gcs_alert` in the AZ-291 sense — the actual GCS wire is C8's job; this task calls a constructor-injected callback). +- Error envelope: `EngineBuildError` (ONNX validation failure), `EngineDeserializeError` (session creation failure), `InferenceError` (mid-flight ORT runtime exception), `OutOfMemoryError` (mid-session OOM), the gate's errors when applicable. + +### Excluded + +- AZ-298 TensorrtRuntime (production-default) — separate task. +- AZ-300 PytorchFp16Runtime — separate task. +- AZ-301 EngineGate validation logic — this task INVOKES the gate. +- AZ-302 ThermalState polling — this task delegates. +- ORT version upgrade or new ORT dependency — pinned to the project's existing version. +- Custom ORT EPs (e.g., TVM, OpenVINO) — out of scope this cycle. +- Multi-session pooling — one session per `EngineHandle` this cycle. +- Engine cache directory cleanup — operator-initiated; handled by C12 operator tooling. 
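The provider list and provider options listed under Included could be assembled by one small helper, which is also the shape Risk 3 asks for. The following is a minimal sketch, not the task's implementation: the function name `build_trt_ep_providers` and its plain-argument signature are assumptions for illustration, while the three option keys and the `// 4` workspace rule are taken from the task text. The returned list has the form that `onnxruntime.InferenceSession(..., providers=...)` accepts (names, or `(name, options)` tuples).

```python
from typing import Any, Dict, List, Tuple, Union

# Fixed provider order from the Constraints section: TRT EP first, CPU last.
PROVIDER_ORDER = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)

ProviderSpec = Union[str, Tuple[str, Dict[str, Any]]]


def build_trt_ep_providers(
    ort_trt_cache_dir: str,
    gpu_memory_budget_bytes: int,
) -> List[ProviderSpec]:
    """Hypothetical helper: build the providers argument for an ORT session.

    Only the TRT EP carries options; CUDA and CPU use ORT defaults.
    """
    trt_options: Dict[str, Any] = {
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": ort_trt_cache_dir,
        # Workspace is capped at a quarter of the GPU memory budget (AC-8).
        "trt_max_workspace_size": gpu_memory_budget_bytes // 4,
    }
    return [
        (PROVIDER_ORDER[0], trt_options),
        PROVIDER_ORDER[1],
        PROVIDER_ORDER[2],
    ]


if __name__ == "__main__":
    providers = build_trt_ep_providers("/tmp/ort_trt_cache", 4 * 1024**3)
    for provider in providers:
        print(provider)
```

Keeping the option keys in one place means a unit test can pin them, so an ORT upgrade that renames a key fails before merge rather than at session creation.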
+ +## Acceptance Criteria + +**AC-1: Protocol conformance** +Given `runtime_checkable(InferenceRuntime)` +When `isinstance(OnnxTrtEpRuntime(...), InferenceRuntime)` is evaluated +Then result is `True`; `current_runtime_label() == "onnx_trt_ep"` + +**AC-2: deserialize from `.onnx` skips the gate** +Given an `EngineCacheEntry(engine_path=)` +When `deserialize_engine(entry)` is called +Then `EngineGate.validate` is NOT called; an ORT `InferenceSession` is created with the TRT EP at the head of the provider list; a single warm-up `session.run` succeeds + +**AC-3: deserialize from `.engine` invokes the gate** +Given an `EngineCacheEntry(engine_path=)` whose filename schema mismatches the host +When `deserialize_engine(entry)` is called +Then `EngineGate.validate` is invoked first and raises `EngineSchemaMismatchError`; no `InferenceSession` is created + +**AC-4: infer round-trips through ORT and returns named outputs** +Given a deserialised handle for a UltraVPR-shaped model and a fixed input dict +When `infer(handle, inputs)` is called +Then `session.run` is invoked with the model's declared output names; the returned dict matches the Protocol shape; numerical outputs match the TRT-direct strategy within a documented tolerance (FP16 round-trip) + +**AC-5: fallback WARN log fires once on first infer** +Given a runtime constructed with `is_fallback=True` +When `infer` is called for the first time +Then exactly one `kind="c7.fallback_to_onnx_trt_ep"` WARN log is emitted AND one `gcs_alert(...)` callback is invoked; subsequent `infer` calls do NOT emit the log again + +**AC-6: provider fallback chain respects ORT order** +Given an environment where TRT EP refuses to load (e.g., TRT version mismatch outside ORT's tolerance) but CUDA EP is available +When `deserialize_engine` is called +Then session creation succeeds with `CUDAExecutionProvider` as the active provider; an INFO log records the actual provider in use; `current_runtime_label()` STILL returns `"onnx_trt_ep"` 
(the runtime label is the strategy, not the EP) + +**AC-7: release_engine drops the session and is idempotent** +Given a deserialised handle +When `release_engine(handle)` is called once and then again +Then the first call drops the session reference and the GPU memory ORT held is released; the second call returns silently + +**AC-8: workspace budget is respected** +Given `config.inference.gpu_memory_budget_bytes = 4 GB` +When `deserialize_engine` is called and ORT's TRT EP attempts to allocate workspace +Then the workspace size is capped at `gpu_memory_budget_bytes // 4 = 1 GB` per the provider option; an attempt to exceed this raises `OutOfMemoryError` + +## Non-Functional Requirements + +**Performance** +- Per-call latency budget is the same as TRT (C7-PT-01) but the test fixture allows up to 1.5× slack (the WARN log notes "degraded latency" — the operator has been informed). Failure thresholds match TRT's failure thresholds (100 / 60 / 150 / 90 ms). +- Session creation budget: ≤ 30 s p95 on Tier-2 for the first deserialise (ORT lazy-compile dominates the first call); subsequent deserialises that hit the EP cache should be ≤ 5 s p95. + +**Compatibility** +- ORT version pinned to the project's `requirements.txt` (this task does NOT change it). +- TRT EP requires the same TRT 10.3 / CUDA stack as `TensorrtRuntime`; if absent, `CUDAExecutionProvider` carries the inference (degraded but functional) per AC-6. + +**Reliability** +- Errors rewrapped into the AZ-297 family. +- ORT-internal exceptions (e.g., `onnxruntime.OrtInvalidArgument`) are caught and rewrapped as `InferenceError`. +- Session lifetime is bound to the `EngineHandle`; the runtime never holds a session beyond an explicit `release_engine` or process exit. 
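The once-only fallback alert (AC-5) and the error rewrapping rule under Reliability can be sketched together. This is an illustrative sketch under stated assumptions: `FallbackAwareInfer` is a hypothetical wrapper name, `InferenceError` is a local stand-in for the AZ-297 type, `run` stands in for a bound `session.run`, and the `gcs_alert` callback is assumed to take the log kind as its only argument. A real implementation would also keep `OutOfMemoryError` distinct instead of folding everything into `InferenceError`.

```python
import logging
from typing import Any, Callable, Mapping, Optional

logger = logging.getLogger("c7_inference")

FALLBACK_KIND = "c7.fallback_to_onnx_trt_ep"


class InferenceError(RuntimeError):
    """Local stand-in for the AZ-297 InferenceError (assumption)."""


class FallbackAwareInfer:
    """Hypothetical wrapper: alert once on first infer, rewrap failures."""

    def __init__(
        self,
        run: Callable[[Mapping[str, Any]], Mapping[str, Any]],
        is_fallback: bool = False,
        gcs_alert: Optional[Callable[[str], None]] = None,
    ) -> None:
        self._run = run
        self._is_fallback = is_fallback
        self._gcs_alert = gcs_alert
        self._alerted = False

    def infer(self, inputs: Mapping[str, Any]) -> Mapping[str, Any]:
        if self._is_fallback and not self._alerted:
            # Flip the flag first so a failing callback cannot re-trigger it.
            self._alerted = True
            logger.warning(
                "degraded latency: ORT + TRT EP fallback in use",
                extra={"kind": FALLBACK_KIND},
            )
            if self._gcs_alert is not None:
                self._gcs_alert(FALLBACK_KIND)
        try:
            return self._run(inputs)
        except Exception as exc:
            # Preserve the original exception as __cause__ for diagnostics.
            raise InferenceError(f"inference failed: {exc}") from exc
```

Setting the flag before logging keeps the alert at exactly one emission even if the callback raises on the first call.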
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Protocol conformance + label | `isinstance` True; label string match | +| AC-2 | Deserialise from `.onnx` | Gate NOT called; session created with TRT EP head | +| AC-3 | Deserialise from `.engine` with mismatched schema | Gate raises; no session | +| AC-4 | Numerical comparison against TRT-direct (UltraVPR sample input) | Outputs match within FP16 tolerance | +| AC-5 | First infer with `is_fallback=True` | Exactly one WARN log; one gcs_alert; second infer silent | +| AC-6 | Force TRT EP refusal, allow CUDA EP | Session creates with CUDA EP; label still `"onnx_trt_ep"` | +| AC-7 | release_engine called twice | First drops; second is no-op | +| AC-8 | Workspace cap | Provider option set to budget // 4 | +| NFR-perf-session-create | Microbench session creation × 5 | First p95 ≤ 30 s; subsequent ≤ 5 s | +| NFR-reliability-error-rewrap | Inject ORT internal error | Rewrapped to `InferenceError` | + +## Constraints + +- ORT version pinned at the project default; no upgrade in this task. +- Provider list order is fixed: `["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]`. CPU is the bottom-of-list fallback so a misconfigured Jetson still serves results (slowly) instead of hard-failing — the WARN log will scream, but the system stays alive. +- ORT EP's `trt_engine_cache_path` is a config field; defaults to a per-flight subdirectory of `config.inference.engine_cache_dir`. +- The `OnnxTrtEpEngineHandle` is opaque to consumers. +- This task introduces no new third-party dependencies beyond what ORT already requires. + +## Risks & Mitigation + +**Risk 1: ORT TRT EP cache poisons across precisions** +- *Risk*: The TRT EP cache is keyed by ORT-internal hashes, not by D-C10-7 schema. Reusing a stale cache could load a wrong-precision subgraph. 
+- *Mitigation*: The EP cache directory is per-flight (under `engine_cache_dir`) and cleaned on flight end by C12 operator tooling. The cache lives only across a single flight; cross-flight reuse is operator-initiated and explicit. + +**Risk 2: Provider fallback to CPU EP is silent** +- *Risk*: If both TRT and CUDA EPs refuse to load, ORT falls back to CPU; latency budgets explode and the system "works" at 100× target latency. +- *Mitigation*: The active provider list is logged at INFO on session creation; if CPU is the active provider, an additional WARN log fires (`kind="c7.cpu_fallback"`). The composition root MAY install a hard refusal hook that raises `EngineDeserializeError` on CPU fallback (operator-configurable; default is "warn but allow"). + +**Risk 3: ORT version drift breaks TRT EP option keys** +- *Risk*: ORT 1.16 → 1.18 changed some TRT EP option key names; a future upgrade silently regresses. +- *Mitigation*: The TRT EP provider option dict is built behind a single helper function; the helper has a unit test that pins the option-key names. Upgrade-time changes fail the unit test before merge. + +## Runtime Completeness + +- **Named capability**: ONNX Runtime + TensorRT execution-provider fallback path (architecture / E-C7 / C7-IT-05). +- **Production code that must exist**: real `OnnxTrtEpRuntime` class implementing the AZ-297 Protocol; real ORT `InferenceSession` creation with the TRT-EP-led provider list; real ORT `session.run` on the F3 hot path; real fallback WARN + GCS alert wiring. +- **Allowed external stubs**: tests MAY substitute a recording wrapper around `session.run` to verify call sequence (AC-4); production wiring uses real ORT. 
+- **Unacceptable substitutes**: a stub that just defers to `TensorrtRuntime` (would defeat the fallback's purpose), an ORT session without the TRT EP at the head of the provider list (would silently use CUDA EP and break C7-IT-05's design intent), or hardcoded CPU-EP-only mode (would silently meet the test outcome with absurd latency). diff --git a/_docs/02_tasks/todo/AZ-300_c7_pytorch_baseline.md b/_docs/02_tasks/todo/AZ-300_c7_pytorch_baseline.md new file mode 100644 index 0000000..4f01c73 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-300_c7_pytorch_baseline.md @@ -0,0 +1,161 @@ +# C7 PytorchFp16Runtime — Mandatory Simple Baseline + +**Task**: AZ-300_c7_pytorch_baseline +**Name**: C7 PytorchFp16Runtime +**Description**: Implement `PytorchFp16Runtime`, the mandatory simple-baseline `InferenceRuntime` strategy. Loads each backbone's canonical PyTorch checkpoint, calls `.half().cuda()`, and conforms to the AZ-297 Protocol — no engine compile, no engine deserialize, no calibration cache. Used as the numerical reference every fancier strategy is measured against (engine simplicity rule), and as the only viable runtime for Tier-1 workstation Docker (where TRT installation is non-trivial). +**Complexity**: 2 points +**Dependencies**: AZ-297_c7_runtime_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module +**Component**: c7_inference (epic AZ-249 / E-C7) +**Tracker**: AZ-300 +**Epic**: AZ-249 (E-C7) + +### Document Dependencies + +- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — the Protocol this task implements; produced by AZ-297. +- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config carries the PyTorch checkpoint paths and the runtime selection. + +## Problem + +A "simple baseline" is mandatory because: + +- Without a numerical reference, the FP16 / INT8 outputs from `TensorrtRuntime` (AZ-298) and `OnnxTrtEpRuntime` (AZ-299) cannot be sanity-checked. 
Every strategy must produce results that agree with the PyTorch reference within a documented tolerance. +- Tier-1 workstation Docker runs research / debugging / training-vs-deployed comparison workloads where TRT is not installed. Without `PytorchFp16Runtime`, these workflows have no executable path through the C7 component. +- The ENG-RULE (engine simplicity) demands every complex strategy can be ablated to a simple one; PyTorch is that simple one. + +Without this task, the AZ-297 Protocol has only fancy implementations and no ground truth. + +## Outcome + +- A `PytorchFp16Runtime` class at `src/gps_denied_onboard/components/c7_inference/pytorch_fp16_runtime.py` conforming to the AZ-297 Protocol; `current_runtime_label() == "pytorch_fp16"`. +- `compile_engine` is a no-op — returns an `EngineCacheEntry` with `engine_path` set to the source PyTorch checkpoint path (`.pt` / `.pth`). The `(SM, JP, TRT, precision)` tuple is set to `(None, None, None, "fp16")` since PyTorch is hardware-portable across SM levels. +- `deserialize_engine(EngineCacheEntry) -> EngineHandle` loads the checkpoint with `torch.load(map_location="cuda")`, calls `.half().cuda().eval()`, returns a `PytorchEngineHandle` wrapping the model. +- `infer(handle, inputs) -> outputs` does sync GPU forward pass with `torch.no_grad() + torch.inference_mode()`; converts input numpy arrays to `torch.Tensor.half().cuda()`, runs the forward, converts outputs back to numpy. No torch.compile, no scripting, no tracing — straight eager FP16. +- `release_engine(handle)` deletes the model reference and calls `torch.cuda.empty_cache()` to free GPU memory. +- `thermal_state()` delegates to the constructor-injected `ThermalStatePublisher` (AZ-302). +- `BUILD_PYTORCH_RUNTIME=ON` is the default Tier-1 setting per ADR-002; airborne (Tier-2 default Jetson) is OFF; operator (`BUILD_C10_PROVISIONING=ON`) is OFF; replay (Tier-3) is OFF — but airborne can still load this strategy IF an operator explicitly switches the config. 
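The no-op `compile_engine` described in the Outcome above can be sketched with the standard library alone. This is a sketch under assumptions, not the contract: the `EngineCacheEntry` field layout shown here is hypothetical (the real type is defined by the AZ-297 Protocol document), and stamping a stdlib SHA-256 of the checkpoint into the entry is assumed for illustration.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class EngineCacheEntry:
    # Hypothetical field layout; the real type comes from the AZ-297 contract.
    engine_path: Path
    sha256: str
    sm: Optional[int]
    jp: Optional[str]
    trt: Optional[str]
    precision: str


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints are not read into RAM at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compile_engine_noop(checkpoint_path: Path) -> EngineCacheEntry:
    """Return an entry that points at the checkpoint itself; nothing is built."""
    return EngineCacheEntry(
        engine_path=checkpoint_path,
        sha256=sha256_of(checkpoint_path),
        # PyTorch is portable across SM levels, so the host tuple is left empty.
        sm=None,
        jp=None,
        trt=None,
        precision="fp16",
    )
```

Because the entry's `engine_path` is the checkpoint, no `.engine` file ever appears on disk for this strategy.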
+ +## Scope + +### Included + +- `PytorchFp16Runtime` class implementation conforming to the AZ-297 Protocol. +- `compile_engine`: no-op; returns `EngineCacheEntry` whose `engine_path` is the checkpoint path; sha256 computed via AZ-280 (the helper, but invoked transitively — this task does not directly depend on AZ-280's API; the entry is built using stdlib hashing and the same algorithm). +- `deserialize_engine`: `torch.load(map_location="cuda") → .half().cuda().eval() → wrap → return`. Single warm-up forward with zero-shaped input to allocate buffers. +- `infer`: input dict → `{name: torch.from_numpy(arr).half().cuda() for name, arr in inputs.items()}` → forward pass under `torch.no_grad() + torch.inference_mode()` → output dict via `.cpu().numpy()`. Synchronous (the sync barrier is implicit in the `.cpu()` transfer). +- `release_engine`: drop the model reference, call `torch.cuda.empty_cache()`. +- Diagnostic INFO log on `deserialize_engine` with checkpoint path + parameter count + estimated GPU footprint (`sum(p.numel() * p.element_size() for p in model.parameters())`). +- Per-frame DEBUG log on `infer` (off by default, gated by config). +- Error envelope: `EngineDeserializeError` (checkpoint missing or incompatible state dict), `InferenceError` (forward-pass exception), `OutOfMemoryError` (CUDA OOM during forward). +- Constructor accepts a `ThermalStatePublisher` reference for the `thermal_state()` delegation. + +### Excluded + +- AZ-298 TensorrtRuntime — separate task. +- AZ-299 OnnxTrtEpRuntime — separate task. +- AZ-301 EngineGate (no engine binaries to validate; PyTorch is checkpoint-based, not engine-based). +- AZ-302 ThermalState polling — delegated. +- `torch.compile` / `torch.jit.trace` / `torch.jit.script` — explicitly out of scope; this is the SIMPLE baseline. +- Mixed-precision autocast — explicitly FP16 only; no `torch.cuda.amp.autocast`. +- Multi-GPU support — single Jetson GPU only. 
+- Engine cache for PyTorch — there is no engine cache; the checkpoint IS the artifact. + +## Acceptance Criteria + +**AC-1: Protocol conformance** +Given `runtime_checkable(InferenceRuntime)` +When `isinstance(PytorchFp16Runtime(...), InferenceRuntime)` is evaluated +Then result is `True`; `current_runtime_label() == "pytorch_fp16"` + +**AC-2: compile_engine is a no-op** +Given a checkpoint path on disk +When `compile_engine(path, build_config)` is called +Then no `.engine` file is produced; the returned `EngineCacheEntry` has `engine_path == path` and the `(SM, JP, TRT)` tuple components are `None`; the call returns within 100 ms + +**AC-3: deserialize loads, half-casts, GPU-moves, eval-mode** +Given a valid checkpoint +When `deserialize_engine(entry)` is called +Then the loaded model has `model.training == False`; every parameter has `dtype == torch.float16` and `device.type == "cuda"`; one warm-up forward has succeeded; the returned handle is a `PytorchEngineHandle` + +**AC-4: infer produces numpy output dict matching the Protocol** +Given a deserialised handle for a UltraVPR-shaped model and a fixed input numpy dict +When `infer(handle, inputs)` is called +Then the returned value is a `dict[str, np.ndarray]`; every output is FP16-cast or FP32-cast per the model's actual output dtypes (no silent type coercion); the numerical output is within a documented tolerance of the FP32 reference (when running the same model in FP32 mode for the test's reference path) + +**AC-5: release frees GPU memory** +Given a deserialised handle holding K MB GPU memory +When `release_engine(handle)` is called +Then NVML reports K MB freed within 1 s (the freed memory may not return to OS immediately, but `torch.cuda.memory_allocated()` decreases to zero for that handle's allocations) + +**AC-6: missing checkpoint raises EngineDeserializeError** +Given a non-existent checkpoint path +When `deserialize_engine(entry)` is called +Then `EngineDeserializeError` is raised with the path in 
the message; no GPU memory is allocated + +**AC-7: incompatible state dict raises EngineDeserializeError** +Given a checkpoint whose state-dict keys do not match the architecture the runtime expects +When `deserialize_engine` is called +Then `EngineDeserializeError` is raised; the original `RuntimeError` from `load_state_dict(strict=True)` is preserved as `__cause__` + +**AC-8: CUDA OOM during infer surfaces as OutOfMemoryError** +Given a deserialised model and an input tensor large enough to OOM +When `infer(handle, inputs)` is called +Then `OutOfMemoryError` is raised (rewrapped from `torch.cuda.OutOfMemoryError`); the model is NOT silently moved to CPU + +## Non-Functional Requirements + +**Performance** +- Per-call latency budget is the simple-baseline reference; no specific p95 target. PyTorch FP16 typically runs 3–5× slower than TRT FP16 on Jetson; that is acceptable because this strategy is not the production-default airborne choice. +- `deserialize_engine` p95 ≤ 10 s on Tier-2 (checkpoint load + half-cast + GPU move + warm-up). + +**Compatibility** +- PyTorch version pinned to the project default; this task does NOT change it. +- Torch checkpoint format only — `.pt` / `.pth` files saved via `torch.save`. + +**Reliability** +- Errors rewrapped into the AZ-297 family. +- `eval()` mode is set unconditionally; this is the simple baseline, not a training runtime. Even if a checkpoint accidentally ships in training mode, the runtime forces `eval()`. +- `torch.no_grad()` + `torch.inference_mode()` are applied inside `infer`; the forward never accumulates gradients. 
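AC-4 compares FP16 output against an FP32 reference "within a documented tolerance". A minimal, dependency-free sketch of such a check follows. The helper names `to_fp16` and `within_tolerance` and the `atol + rtol * |reference|` rule are assumptions for illustration; the actual per-model tolerance is whatever the implementation report records. `struct`'s `"e"` format is used only to emulate FP16 rounding without PyTorch or NumPy.

```python
import struct
from typing import Sequence


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def within_tolerance(
    candidate: Sequence[float],
    reference: Sequence[float],
    rtol: float,
    atol: float = 0.0,
) -> bool:
    """True if every element satisfies |c - r| <= atol + rtol * |r|."""
    if len(candidate) != len(reference):
        return False
    return all(
        abs(c - r) <= atol + rtol * abs(r)
        for c, r in zip(candidate, reference)
    )


if __name__ == "__main__":
    reference = [0.1, 1.5, -3.25]
    candidate = [to_fp16(v) for v in reference]
    print(within_tolerance(candidate, reference, rtol=1e-3))
```

FP16 carries a 10-bit mantissa, so a single rounding step introduces a relative error of at most about 4.9e-4; a tolerance tighter than that cannot pass for values that are not exactly representable.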
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Protocol conformance + label | `isinstance` True; label match | +| AC-2 | compile_engine returns quickly with checkpoint path | No `.engine` produced; entry shape correct; ≤ 100 ms | +| AC-3 | deserialize a small test model | Eval mode True; FP16 dtype on GPU; warm-up succeeded | +| AC-4 | infer numerical comparison vs FP32 reference | Output within tolerance | +| AC-5 | release after deserialise | NVML / `torch.cuda.memory_allocated` shows freed | +| AC-6 | deserialise non-existent path | `EngineDeserializeError`; no GPU alloc | +| AC-7 | deserialise mismatched state dict | `EngineDeserializeError`; `__cause__` preserved | +| AC-8 | infer with deliberately oversized input | `OutOfMemoryError`; no CPU fallback | +| NFR-perf-deserialize | Microbench deserialise × 5 | p95 ≤ 10 s on Tier-2 | +| NFR-reliability-eval-mode | After deserialise, check `model.training` | False unconditionally | + +## Constraints + +- PyTorch version pinned at project default. +- Eager FP16 only — no `torch.compile`, no JIT, no autocast. The "simple baseline" is in the name. +- The model architecture loader is per-backbone; this task wires in a registry mapping `model_name` (from `BuildConfig`) to the architecture class. The registry is populated by per-backbone modules (UltraVPR, LightGlue, etc.) — that registry is owned by E-C2 / E-C2.5 / E-C3 / E-C3.5 component code and outside this task's scope; this task only provides the mechanism (a single dict registered at composition time). +- The `PytorchEngineHandle` is opaque. +- This task introduces no new third-party dependencies beyond what PyTorch already requires. + +## Risks & Mitigation + +**Risk 1: Model architecture registry leaks across components** +- *Risk*: To load a `UltraVPR` checkpoint, this runtime needs the `UltraVPR` class. 
Importing it from `c2_vpr.ultra_vpr` would create a back-edge from C7 (Layer 2) to C2 (Layer 3), violating module-layout layering. +- *Mitigation*: The composition root registers each backbone class into `PytorchFp16Runtime`'s registry at startup (dependency injection). The runtime never imports component code directly. The injection is wired in `runtime_root` (Layer 5), which is allowed to depend on every layer. + +**Risk 2: Checkpoint deserialization is a security risk** +- *Risk*: `torch.load` can execute arbitrary code via pickle when loading untrusted checkpoints. +- *Mitigation*: Checkpoints are cosigned via the deployment manifest's signature (per `_docs/02_document/risk_mitigations.md`). This task uses `torch.load(weights_only=True)` (the default only from PyTorch 2.6 onward, so the flag is passed explicitly), which restricts pickle to known-safe types. A non-weights-only checkpoint raises `EngineDeserializeError`. + +**Risk 3: FP16 numerical mismatch with FP32 reference outside tolerance** +- *Risk*: Some model architectures lose accuracy when half-cast (FP16) without autocast. +- *Mitigation*: AC-4 documents the tolerance per model (recorded in the implementation report). If a backbone exceeds tolerance, the runtime is unfit for that backbone and the operator switches to TRT (which uses calibration to recover accuracy). The accuracy-vs-runtime trade-off is a per-backbone property documented during integration testing; this task accepts the result and does not work around it. + +## Runtime Completeness + +- **Named capability**: PyTorch FP16 simple-baseline runtime (architecture / E-C7 / ENG-RULE). +- **Production code that must exist**: real `PytorchFp16Runtime` class implementing the AZ-297 Protocol; real `torch.load` + `.half().cuda().eval()` + sync forward; real release path. +- **Allowed external stubs**: tests MAY substitute a tiny `nn.Linear` checkpoint as the "model"; production wiring uses the actual backbones registered by the composition root.
+- **Unacceptable substitutes**: a CPU-only mode (would defeat the GPU-first invariant the AZ-297 Protocol implies via `EngineHandle`); `torch.compile` (would silently change the simple-baseline contract); autocast (would change the "FP16 only" guarantee that downstream comparisons rely on). diff --git a/_docs/02_tasks/todo/AZ-301_c7_engine_gate.md b/_docs/02_tasks/todo/AZ-301_c7_engine_gate.md new file mode 100644 index 0000000..07bb718 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-301_c7_engine_gate.md @@ -0,0 +1,161 @@ +# C7 EngineGate — D-C10-3 Content-Hash + D-C10-7 Filename Schema Enforcement + +**Task**: AZ-301_c7_engine_gate +**Name**: C7 EngineGate +**Description**: Implement the takeoff-side `EngineGate` validator that every `InferenceRuntime` strategy invokes before deserialising a cached `.engine` file. Two refusal paths: (1) D-C10-7 filename-schema mismatch raises `EngineSchemaMismatchError` at parse time; (2) D-C10-3 manifest content-hash mismatch (or missing sidecar) raises `EngineHashMismatchError` / `EngineSidecarMissingError`. Pure validation — no GPU ops, no I/O beyond reading the sidecar + the deployed manifest. Isolated so all three runtime strategies (TRT / ONNX-RT / PyTorch) call the same validator. +**Complexity**: 3 points +**Dependencies**: AZ-297_c7_runtime_protocol, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-266_log_module +**Component**: c7_inference (epic AZ-249 / E-C7) +**Tracker**: AZ-301 +**Epic**: AZ-249 (E-C7) + +### Document Dependencies + +- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — defines `EngineCacheEntry` (input) and the gate's error types. +- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — sidecar verification contract. +- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — filename-schema parser. + +## Problem + +D-C10-3 (manifest content-hash takeoff gate) and D-C10-7 (engine filename schema) are two of the project's safety-critical gates. 
Both fire at takeoff (F2) when an `.engine` file is about to be deserialised. Without a centralised validator: + +- Each runtime strategy (TRT / ONNX-RT / PyTorch) would re-implement the gate logic: three copies, three drift surfaces, three places to fix bugs. +- C7-IT-03 (D-C10-3 takeoff gate refuses mismatched engine) and C7-IT-04 (filename-schema enforcement) cannot share test fixtures. +- A future runtime (a hypothetical CUDA EP path) could silently skip the gate by forgetting to invoke it. + +This task delivers ONE validator, used by every strategy, with a single point of refusal. + +## Outcome + +- An `EngineGate` class at `src/gps_denied_onboard/components/c7_inference/engine_gate.py` with a single public method: + `validate(entry: EngineCacheEntry, host_tuple: HostTuple, manifest: DeploymentManifest) -> None` + that returns silently on success and raises one of `EngineSchemaMismatchError`, `EngineHashMismatchError`, `EngineSidecarMissingError` on refusal. +- `HostTuple` is a small frozen dataclass `(sm: int, jp: str, trt: str, precision: PrecisionMode)` derived from `nvidia-smi` / `pynvml` + the runtime's pinned TRT version + the engine's intended precision (read from the entry). +- `DeploymentManifest` is a typed wrapper over the deployed `manifest.json` (an ordered map of `engine_relative_path → sha256_hex`) that C10 produces during F1 and the deployment process delivers alongside the engines. The manifest schema is owned by E-C10; this task DEPENDS on its existence but does not write the manifest. +- The two refusal paths (filename schema, then content hash) break down into five checks, evaluated in this order: + 1. Schema parse: `helpers.engine_filename_schema.parse(entry.engine_path.name)` returns the `(sm, jp, trt, precision)` quadruple; if the parse fails, raise `EngineSchemaMismatchError(reason="parse failure: ...")`. + 2. Schema match: if the parsed quadruple does not match `host_tuple`, raise `EngineSchemaMismatchError(expected=host_tuple, got=parsed)`. + 3.
Sidecar presence: if no `.sha256` sidecar exists alongside `entry.engine_path`, raise `EngineSidecarMissingError`. + 4. Sidecar trust: `helpers.sha256_sidecar.verify(entry.engine_path)` — if the sidecar's recorded hash does not match the engine's actual sha256, raise `EngineHashMismatchError(stage="sidecar")`. + 5. Manifest match: if `manifest[entry.engine_path.relative_to(manifest.root)]` does not equal the verified sha256, raise `EngineHashMismatchError(stage="manifest")`. +- Every refusal includes structured fields in the exception message (engine path, expected vs. got tuple / hash, manifest entry if any) suitable for the takeoff-abort log path that C10 / runtime_root wires up downstream. +- Diagnostic INFO log on success (`kind="c7.gate.pass"`, engine path, host tuple, manifest hash); ERROR log on each refusal (`kind="c7.gate.refuse"`, refusal reason, engine path). + +## Scope + +### Included + +- `EngineGate` class with `validate(entry, host_tuple, manifest) -> None`. +- `HostTuple` dataclass and a stateless `read_host_tuple() -> HostTuple` helper that calls `nvidia-smi --query-gpu=compute_cap,driver_version,...` (via `pynvml` where possible). The helper is in this task because the gate is the only consumer; future consumers can lift it out. +- The five refusal paths above, in the documented order. The order is deterministic so test fixtures can target each step. +- Parse-error vs. tuple-mismatch differentiation: the gate distinguishes "we could not parse the filename" from "we parsed it and it does not match" (via `EngineSchemaMismatchError(reason=...)` vs `(expected=..., got=...)`). Both are the same exception class; the kwargs differ. +- Manifest reader: a thin typed wrapper at `src/gps_denied_onboard/components/c7_inference/manifest.py` that reads the deployed `manifest.json` and exposes `__getitem__` and `root`. 
The actual manifest schema is owned by E-C10; this task implements only the reader sufficient for the gate's needs and references the canonical schema location. +- INFO-on-pass and ERROR-on-refuse logs. +- Constructor-injectable `ManifestReader` for tests (the production reader reads from disk; tests inject a dict-backed fake). + +### Excluded + +- AZ-298 / AZ-299 / AZ-300 strategy implementations — they CALL the gate. +- AZ-302 ThermalState publisher — unrelated. +- The deployment manifest's schema — owned by E-C10 (this task writes the reader, not the writer). +- F2 takeoff abort orchestration (the gate raises an error; the runtime caller propagates; the takeoff path catches and aborts) — owned by `runtime_root` and E-C10. +- C12 operator tooling diagnostics for refused engines — out of scope. +- A "tolerant mode" that allows minor SM differences — explicitly out of scope this cycle (would defeat the safety gate). + +## Acceptance Criteria + +**AC-1: filename-schema parse failure refused at parse time** +Given an engine file named `bogus_name.engine` (no schema) +When `validate(entry, ...)` is called +Then `EngineSchemaMismatchError(reason="parse failure: ...")` is raised; subsequent gate steps are NOT executed (no sidecar read, no manifest lookup); no GPU memory allocated by the caller (verifiable via NVML diff = 0 around the call) + +**AC-2: filename-schema tuple mismatch refused at parse time** +Given an engine `ultravpr__sm86_jp6.2_trt10.3_fp16.engine` and a host with `sm=87` +When `validate` is called +Then `EngineSchemaMismatchError(expected=HostTuple(sm=87, ...), got=ParsedTuple(sm=86, ...))` is raised; no sidecar / manifest checks execute + +**AC-3: missing sidecar refused before manifest lookup** +Given a schema-matched engine whose `.sha256` sidecar does NOT exist on disk +When `validate` is called +Then `EngineSidecarMissingError(engine_path=...)` is raised; the manifest is NOT read + +**AC-4: sidecar trust failure** +Given a schema-matched engine 
whose sidecar exists but records a hash that does NOT match the engine's actual sha256 +When `validate` is called +Then `EngineHashMismatchError(stage="sidecar", engine_path=..., expected=..., got=...)` is raised; the manifest is NOT consulted + +**AC-5: manifest mismatch** +Given a schema-matched engine whose sidecar verifies (sidecar hash == file hash) but the deployment manifest's entry for this engine path records a DIFFERENT hash +When `validate` is called +Then `EngineHashMismatchError(stage="manifest", engine_path=..., manifest_hash=..., file_hash=...)` is raised + +**AC-6: full-success path returns silently and logs INFO** +Given an engine that passes all five steps +When `validate` is called +Then the call returns `None` silently; one `kind="c7.gate.pass"` INFO log record was emitted; the caller proceeds to deserialise + +**AC-7: refusal order is deterministic for fixture targeting** +Given a fixture engine that is BOTH schema-mismatched AND has a missing sidecar +When `validate` is called +Then `EngineSchemaMismatchError` is raised (NOT `EngineSidecarMissingError`) — the schema check runs first; this property is documented and tested + +**AC-8: read_host_tuple matches the running Jetson** +Given a Jetson Orin Nano Super running JetPack 6.2 + TRT 10.3 +When `read_host_tuple()` is called +Then the returned tuple has `sm=87, jp="6.2", trt="10.3"`; on a workstation (Tier-1 Docker) the values reflect that environment instead + +## Non-Functional Requirements + +**Performance** +- `validate` p99 ≤ 50 ms (sidecar read + manifest dict lookup + sha256 streaming over the engine file). Sha256 over a 500 MB engine file dominates; that is the design budget. +- `read_host_tuple` p99 ≤ 100 ms (one `pynvml` call). + +**Reliability** +- The gate is deterministic — same inputs always produce the same outcome. Test fixtures rely on this property. +- The gate makes NO network calls and NO writes; it is read-only. 
+- Errors carry structured fields suitable for the post-flight FDR record (the runtime caller forwards them). + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Bogus filename | `EngineSchemaMismatchError(reason=...)`; no further gate steps | +| AC-2 | Mismatched SM in filename | `EngineSchemaMismatchError(expected=..., got=...)` | +| AC-3 | Missing sidecar | `EngineSidecarMissingError`; manifest not read | +| AC-4 | Sidecar hash != file hash | `EngineHashMismatchError(stage="sidecar")` | +| AC-5 | Manifest hash != sidecar hash | `EngineHashMismatchError(stage="manifest")` | +| AC-6 | Happy path | Returns None; INFO log emitted | +| AC-7 | Both schema-fail AND sidecar-missing | Schema error wins (deterministic order) | +| AC-8 | host tuple read on Jetson Orin Nano Super | (sm=87, jp="6.2", trt="10.3") | +| NFR-perf-validate | Microbench validate × 100 with a 500 MB engine | p99 ≤ 50 ms | +| NFR-reliability-no-write | Run validate against a read-only directory | No writes attempted (sidecar stays untouched) | + +## Constraints + +- The five refusal paths execute in the documented order; the order is part of the public contract (AC-7 verifies it). +- The gate is read-only. NEVER writes to the engine file, sidecar, or manifest. +- The `ManifestReader` is constructor-injectable; the production reader reads `manifest.json` from disk; tests inject a dict-backed fake. +- The `read_host_tuple` helper uses `pynvml` first; falls back to parsing `nvidia-smi` output if `pynvml` is unavailable. NEVER returns a synthetic / default tuple — if the GPU cannot be queried, raises `RuntimeError("cannot read host tuple")` and the takeoff path aborts. +- Sha256 is computed using stdlib `hashlib.sha256` with chunked reads via `helpers.sha256_sidecar`; this task does NOT introduce a new sha256 library. 
+- This task introduces no new third-party dependencies beyond `pynvml` (which is already a project dependency for jetson-stats / NVML telemetry per the C7 description.md). + +## Risks & Mitigation + +**Risk 1: Sha256 over a 500 MB engine dominates takeoff latency** +- *Risk*: Per-engine 50 ms × 10 engines = 500 ms blocking takeoff. +- *Mitigation*: Sidecar's recorded hash is the trust anchor; once the sidecar verifies, the manifest match is a dict lookup. The actual file-streaming sha256 happens during sidecar verification (one streaming pass per engine). Per-engine 50 ms is the budget; the test asserts it. If a future regression pushes past this, the gate is fast-pathed by reusing a cached file-hash computed at compile time (out of scope this cycle). + +**Risk 2: Manifest reader silently treats missing entry as pass** +- *Risk*: A typo in the manifest produces `KeyError` swallowed somewhere; the gate "passes" without checking. +- *Mitigation*: The manifest reader's `__getitem__` raises `EngineHashMismatchError(stage="manifest", reason="missing manifest entry for ...")` on missing key — NEVER returns None or treats absence as pass. AC-5 covers the mismatch case; an additional negative test covers the missing case. + +**Risk 3: Refusal order changes silently across refactors** +- *Risk*: A future refactor reorders the five steps; AC-7's "schema wins over sidecar-missing" property regresses. +- *Mitigation*: AC-7 is a deterministic ordering test; any reorder fails it. The refusal order is part of the public contract documented in this task spec. + +## Runtime Completeness + +- **Named capability**: D-C10-3 takeoff content-hash gate + D-C10-7 filename-schema enforcement (architecture / E-C7 / E-C10 / risk_mitigations.md R04). +- **Production code that must exist**: real `EngineGate.validate` calling real `helpers.engine_filename_schema.parse`, real `helpers.sha256_sidecar.verify`, real `ManifestReader` reading the deployed manifest.json from disk. 
+- **Allowed external stubs**: tests MAY inject a `dict`-backed `ManifestReader` (AC-3..AC-7); production wiring reads the on-disk manifest. +- **Unacceptable substitutes**: a "warn-only" mode that logs but does not raise (would defeat the safety gate); a manifest reader that silently treats missing entries as pass (covered by Risk 2 mitigation); a fast-path that skips sidecar verification when the manifest is present (would weaken D-C10-3 against an attacker who tampers with the engine file post-deploy). diff --git a/_docs/02_tasks/todo/AZ-302_c7_thermal_publisher.md b/_docs/02_tasks/todo/AZ-302_c7_thermal_publisher.md new file mode 100644 index 0000000..5c82bab --- /dev/null +++ b/_docs/02_tasks/todo/AZ-302_c7_thermal_publisher.md @@ -0,0 +1,177 @@ +# C7 ThermalState Publisher — jetson-stats / NVML 1 Hz Background + +**Task**: AZ-302_c7_thermal_publisher +**Name**: C7 ThermalState Publisher +**Description**: Implement the 1 Hz background polling loop that reads jetson-stats / pynvml (CPU/GPU temperature, throttle bit, measured clock MHz), produces a lock-free atomic `ThermalState` snapshot, exposes it via `InferenceRuntime.thermal_state()` for C4's D-CROSS-LATENCY-1 hybrid covariance-mode decision. Emits an FDR record on every throttle-state transition; emits a WARN log on first throttle entry and on telemetry unavailability; defaults `thermal_throttle_active = false` on `TelemetryUnavailableError`. Throttle-detection latency must be ≤ 1 s end-to-end so C4 reacts within 1 frame (C7-IT-02). +**Complexity**: 3 points +**Dependencies**: AZ-297_c7_runtime_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-273_fdr_client_ringbuf +**Component**: c7_inference (epic AZ-249 / E-C7) +**Tracker**: AZ-302 +**Epic**: AZ-249 (E-C7) + +### Document Dependencies + +- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — defines `ThermalState` and `thermal_state()`; produced by AZ-297. 
+- `_docs/02_document/contracts/shared_fdr_client/fdr_client_protocol.md` — used to emit thermal-transition FDR records via `FdrClient.publish`. +- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — `kind="c7.thermal_transition"` record envelope. +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — WARN log shape on throttle entry / telemetry loss. +- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — Config carries `thermal_poll_hz` and a fallback flag for telemetry unavailability behaviour. + +## Problem + +C4 Pose's D-CROSS-LATENCY-1 hybrid switches between two covariance modes (steady-state vs. JACOBIAN) based on whether the Jetson is thermal-throttling. Without a publisher: + +- C4 has no canonical `ThermalState` source — the AC-NEW-5 throttle-reaction-within-1-frame guarantee has nothing to read. +- The F3 hot path threads cannot poll `jetson-stats` themselves (would race with `infer` on the GIL-protected critical section per description.md § 7). +- Thermal transitions are not recorded in the FDR — post-flight tooling cannot correlate degraded poses with thermal events. +- `TelemetryUnavailableError` (jetson-stats hung or absent) has no documented degraded-mode behaviour; without explicit handling, the system would either crash or silently lie. + +This task is the SINGLE owner of thermal telemetry across the companion process. + +## Outcome + +- A `ThermalStatePublisher` class at `src/gps_denied_onboard/components/c7_inference/thermal_publisher.py` with `start() / stop()` lifecycle and `read() -> ThermalState` accessor. +- A background thread (NORMAL priority, daemonic) polls jetson-stats / pynvml at `config.inference.thermal_poll_hz` Hz (default 1.0); each successful poll updates a single `_atomic_snapshot: ThermalState` field via simple-assignment (Python's GIL covers atomic-store of a single object reference — documented in the implementation report). 
+- `read()` is the lock-free reader: returns the current `_atomic_snapshot`. F3 hot-path callers MAY call `read()` from any thread; the call is wait-free and ≤ 1 µs. +- Throttle-state transitions (the boolean `thermal_throttle_active` flips between two consecutive polls) emit: + - One `kind="c7.thermal_transition"` FDR record via the constructor-injected `FdrClient` (AZ-273) carrying `previous_state`, `new_state`, `gpu_temp_c`, `cpu_temp_c`, `measured_clock_mhz`, `measured_at_ns`. + - One WARN log on entry-to-throttle (NOT on exit-from-throttle — the exit is INFO). +- `TelemetryUnavailableError` (raised by jetson-stats / pynvml internals): the publisher catches, sets `thermal_throttle_active = false` and `gpu_temp_c = None / cpu_temp_c = None`, emits ONE WARN log per occurrence (rate-limited to ≤ 1/sec), and continues polling. The runtime never silently lies — `ThermalState.is_telemetry_available` is set to `False` so C4 can choose to ignore the throttle bit. +- The publisher is a singleton constructed by the composition root and passed by reference into every `InferenceRuntime` strategy that needs `thermal_state()`. The strategies do NOT each construct their own publisher. +- The publisher is started during composition (after `FdrClient` is registered, before any consumer calls `read()`); stopped by the composition root's process-exit hook. + +## Scope + +### Included + +- `ThermalStatePublisher` class with `__init__(config, fdr_client, logger)`, `start()`, `stop()`, `read() -> ThermalState`. +- Background polling thread: while `_running`, sleep `1.0 / config.inference.thermal_poll_hz` seconds, read `jetson-stats` (`from jtop import jtop` context manager OR `pynvml.nvmlDeviceGetTemperature` / `nvmlDeviceGetCurrentClocksThrottleReasons` direct calls — implementation chooses based on availability), build a fresh `ThermalState`, atomically replace `_atomic_snapshot`. 
+- Source-selection logic: at `start()`, attempt to import `jtop` (jetson-stats); if unavailable, attempt `pynvml`; if both unavailable, raise `TelemetryUnavailableError` from `start()` itself (composition aborts cleanly — operator chooses to disable thermal-aware paths). +- Throttle-transition detection: compare `previous._atomic_snapshot.thermal_throttle_active` vs. new value; on flip, emit FDR record + log per Outcome. +- WARN log on first telemetry-unavailable in a window (rate-limited to 1/sec): `kind="c7.thermal.unavailable"`. +- INFO log on `start()` and `stop()`; INFO log on throttle-exit transitions; WARN log on throttle-entry transitions. +- `ThermalState` extension: add `is_telemetry_available: bool` to the DTO defined in AZ-297. (NOTE: this is a Protocol-touching change; AZ-297's contract MUST list this field and AZ-302 SHOULD coordinate with AZ-297 at decompose time. Documented in the AZ-297 contract's `## Test Cases`.) +- Constructor-injected `Clock` for testability — tests inject a fake clock that advances on demand; production wires `time.monotonic_ns`. +- The `read()` accessor is wait-free and re-entrant; can be called from any thread including the F3 hot path. +- A `ThermalStatePublisher.is_running() -> bool` introspection accessor for the composition root and tests. + +### Excluded + +- AZ-297 InferenceRuntime Protocol — this task adds the `is_telemetry_available` field to the existing `ThermalState` DTO; the Protocol method `thermal_state()` is owned by the strategies that delegate to this publisher. +- AZ-298 / AZ-299 / AZ-300 strategies — they delegate to this publisher; they do NOT poll themselves. +- AZ-301 EngineGate — unrelated. +- C4 Pose's covariance-mode switching logic — owned by E-C4. This task PUBLISHES `ThermalState`; C4 consumes. +- Cooling controller / fan curve adjustments — out of scope; the companion process is read-only on thermal telemetry. +- Cross-flight thermal trend analysis — operator post-flight tooling owns it. 
+- A SECOND telemetry source for redundancy (two pynvml clients, etc.) — out of scope this cycle. + +## Acceptance Criteria + +**AC-1: read() is wait-free and returns the latest snapshot** +Given a started publisher with two completed polls +When `read()` is called from the F3 hot path +Then the call returns within 1 µs (microbenched); the returned `ThermalState` matches the most recent poll's data + +**AC-2: throttle entry within 1 s** +Given a running publisher and a simulated jetson-stats spoof flipping `throttle_active` from False to True at time T +When the test waits 1 s and calls `read()` +Then the returned `ThermalState.thermal_throttle_active == True` (latency ≤ 1 s end-to-end at 1 Hz poll rate); ONE FDR `kind="c7.thermal_transition"` record was emitted with `new_state=True`; ONE WARN log was emitted + +**AC-3: throttle exit within 1 s and INFO log** +Given a publisher currently in throttle and a simulated flip to False +When 1 s elapses and the test calls `read()` +Then `thermal_throttle_active == False`; ONE FDR record with `new_state=False`; ONE INFO log (NOT WARN) records the exit + +**AC-4: telemetry unavailability sets is_telemetry_available=False, defaults throttle to False** +Given a started publisher whose jetson-stats source raises `TelemetryUnavailableError` on every poll +When the test waits 2 s and calls `read()` +Then `ThermalState.is_telemetry_available == False`, `thermal_throttle_active == False` (default-safe), `gpu_temp_c == None`, `cpu_temp_c == None`; the WARN log was emitted at most twice in 2 s (rate-limited to 1/sec) + +**AC-5: cold-start with no source raises** +Given an environment where neither `jtop` nor `pynvml` is importable +When the publisher's `start()` is called +Then `TelemetryUnavailableError` is raised from `start()` itself; the publisher is in `is_running() == False` state; the composition root catches and either aborts startup or proceeds with thermal-aware paths disabled (decision is composition-root's, not this 
task's) + +**AC-6: start/stop lifecycle is idempotent** +Given a publisher +When `start()` is called twice in succession +Then the second call is a no-op (returns silently); the polling thread is NOT duplicated; `is_running() == True` + +When `stop()` is called twice +Then the second `stop()` is a no-op; resources are NOT double-freed + +**AC-7: poll thread does not interfere with infer hot path** +Given a started publisher polling at 1 Hz and an F3 hot-path benchmark running `infer` at 39 Hz aggregate (per description.md) +When the benchmark runs for 60 s +Then the F3 hot-path latency p95 is unchanged compared to a baseline without the publisher (the polling thread does not contend on the CUDA stream or any infer-critical resource); the publisher's poll p99 is ≤ 100 ms + +**AC-8: FDR record envelope matches contract** +Given a throttle transition +When the FDR record is written +Then the record matches `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` shape with `kind="c7.thermal_transition"`, `producer_id="c7_inference.thermal"`, `payload` containing `{previous_state, new_state, gpu_temp_c, cpu_temp_c, measured_clock_mhz, measured_at_ns}` (deep-equal verified against the schema's payload type) + +## Non-Functional Requirements + +**Performance** +- `read()` is wait-free; p99 ≤ 1 µs. +- Poll p99 ≤ 100 ms (the polling thread executes at NORMAL priority; jetson-stats takes 30–80 ms typically). +- F3 hot-path latency unchanged when publisher is running (AC-7). + +**Reliability** +- The publisher NEVER blocks the F3 hot path. The polling thread runs at NORMAL priority on a separate Python thread; `read()` is a single object-reference load (atomic under GIL). +- Telemetry-unavailability defaults are documented and tested (AC-4); the system never lies about throttle state. +- The publisher is one of the FIRST things the composition root starts (after `FdrClient` registration) and one of the LAST things it stops (before `FdrClient.stop`). 
Documented as a startup-order constraint. + +**Concurrency** +- One polling thread per process. The publisher is a singleton. +- `read()` is re-entrant and called from the F3 hot path threads (consumers); they hold no locks the publisher is sensitive to. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Microbench read() × 100k from a worker thread | p99 ≤ 1 µs; returned state matches latest poll | +| AC-2 | Spoof flip False→True at T; wait 1 s; read | throttle_active=True; one FDR + one WARN | +| AC-3 | Spoof flip True→False; wait 1 s; read | throttle_active=False; one FDR + one INFO | +| AC-4 | Spoof unavailable on every poll for 2 s | is_telemetry_available=False; default-safe; ≤ 2 WARN | +| AC-5 | Disable jtop + pynvml; start | TelemetryUnavailableError raised; is_running=False | +| AC-6 | start twice; stop twice | Idempotent; no double-spawn / double-free | +| AC-7 | F3 hot-path bench with publisher running | p95 unchanged vs. baseline; poll p99 ≤ 100 ms | +| AC-8 | Throttle transition FDR record shape | Matches schema deep-equal | +| NFR-perf-poll | Microbench poll body × 100 | p99 ≤ 100 ms | +| NFR-reliability-default-safe | Telemetry unavailable on first poll → read() | is_telemetry_available=False; throttle_active=False | + +## Constraints + +- One polling thread per process; publisher is a singleton. +- `read()` is wait-free; the implementation MUST NOT introduce any lock or condition variable on the read path. +- Source selection at `start()` time: `jtop` first, `pynvml` second; if both fail, raise `TelemetryUnavailableError` from `start()` (do NOT silently default-safe — that hides a misconfigured deployment). +- Once selected, the source does NOT swap mid-flight (e.g., if jtop becomes unavailable mid-flight, the publisher hits AC-4 default-safe behaviour but does NOT switch to pynvml — operator must restart). 
+- WARN log on telemetry-unavailable is rate-limited to ≤ 1/sec via a simple monotonic-clock check; rate limit window is documented. +- `ThermalState.is_telemetry_available` is added to the AZ-297 DTO; this is a coordinated change documented in the AZ-297 contract's change log. +- This task introduces no new third-party dependencies — `jtop` (jetson-stats) and `pynvml` are both already pinned by the description.md key dependencies table. + +## Risks & Mitigation + +**Risk 1: GIL atomicity assumption is wrong on a future Python** +- *Risk*: Python 3.13+ free-threaded mode (PEP 703) removes the GIL; simple-assignment is no longer atomic. +- *Mitigation*: Implementation report documents the GIL assumption; the project's Python version is pinned and the implementation is correct under the pinned version. If/when the project moves to free-threaded mode, this task is revisited (would add an `atomic` library wrapper or threading.Lock around the snapshot). + +**Risk 2: Polling thread starves under thermal-throttle (the very condition it is observing)** +- *Risk*: Under heavy throttle, NORMAL-priority threads may not get scheduled; the publisher's poll latency exceeds 1 s and the throttle-detection latency contract violates. +- *Mitigation*: The polling thread is daemonic but at NORMAL priority; jetson-stats internal calls take 30–80 ms typically — well under 1 s budget. AC-2 is the canonical test; if it fails on real hardware, the operator may bump the priority via config (out of scope this cycle). + +**Risk 3: jetson-stats internal threading interferes with our polling thread** +- *Risk*: `jtop` runs its own background thread; concurrent access from our polling thread + an operator tool could corrupt jtop's state. +- *Mitigation*: This publisher is the SINGLE owner of `jtop` access in the companion process. Operator tools (C12) read FDR records, not the live publisher. Documented in the implementation report. 
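A minimal sketch of the single-reference snapshot swap that Risk 1 reasons about, assuming the pinned CPython build with the GIL enabled. `SnapshotCell` is a hypothetical name used only for illustration, and the `ThermalState` fields mirror the ones this spec lists; the authoritative DTO is the one in the AZ-297 contract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThermalState:
    """Immutable snapshot; field names follow this spec (DTO owned by AZ-297)."""
    thermal_throttle_active: bool
    is_telemetry_available: bool
    gpu_temp_c: Optional[float]
    cpu_temp_c: Optional[float]
    measured_clock_mhz: Optional[int]
    measured_at_ns: int


class SnapshotCell:
    """Hypothetical holder: one writer (the poll thread), many readers."""

    def __init__(self, initial: ThermalState) -> None:
        self._atomic_snapshot = initial

    def publish(self, new_state: ThermalState) -> None:
        # One attribute store of a fully built, immutable object.
        # Readers see either the old snapshot or the new one, never a mix.
        self._atomic_snapshot = new_state

    def read(self) -> ThermalState:
        # One attribute load; there is no lock on the read path.
        return self._atomic_snapshot
```

Because each snapshot is frozen and replaced whole, a reader never observes a half-updated state. If the project later adopts a free-threaded build, this is the one place where the revisit described in the mitigation would apply.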
+ +**Risk 4: FDR record emission rate spikes during rapid throttle oscillation** +- *Risk*: A pathological thermal scenario could oscillate at the poll rate, emitting one record per second sustained — affecting AC-NEW-3 segment-size budgets. +- *Mitigation*: 1 record per second sustained is well within the C13 writer's throughput budget (200 Hz aggregate per AZ-291); even worst-case oscillation is benign. No additional rate limit is needed for FDR transition records. + +## Runtime Completeness + +- **Named capability**: ThermalState publisher + lock-free atomic snapshot + 1 Hz background polling (architecture / E-C7 / AC-NEW-5 / D-CROSS-LATENCY-1). +- **Production code that must exist**: real `ThermalStatePublisher` class with real background thread, real `jtop` / `pynvml` poll, real lock-free `_atomic_snapshot` reference swap, real FDR record emission via the injected `FdrClient`. +- **Allowed external stubs**: tests MAY substitute a `FakeJtopSource` and a `FakeFdrClient` (AC-2..AC-8); production wiring uses real `jtop` + real AZ-273 `FdrClient`. +- **Unacceptable substitutes**: a polling loop that uses `time.sleep` without a real `Clock` injection (would break test determinism); a snapshot field guarded by a `threading.Lock` on the read path (would violate the wait-free read requirement under F3 hot path); a "warn but always default-safe" mode that never raises `TelemetryUnavailableError` from `start()` (would hide misconfigured deployments — exactly the failure mode AC-5 prevents). 
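The poll body specified in this task (transition detection, default-safe fallback, and the 1/sec WARN rate limit) can be sketched as a single step driven by an injected clock. This is a sketch under stated assumptions, not the implementation: `PollStep`, the dict-shaped `source` reading, and the in-memory `fdr_records` / `logs` lists are hypothetical stand-ins for the real source adapter, `FdrClient.publish`, and the logger. The record envelope only imitates the field names quoted in AC-8.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

ONE_SECOND_NS = 1_000_000_000


class TelemetryUnavailableError(Exception):
    """Stand-in for the error type named in this spec."""


@dataclass(frozen=True)
class ThermalState:
    thermal_throttle_active: bool
    is_telemetry_available: bool
    gpu_temp_c: Optional[float]
    cpu_temp_c: Optional[float]
    measured_clock_mhz: Optional[int]
    measured_at_ns: int


class PollStep:
    """One iteration of the polling loop, kept free of threading so tests stay deterministic."""

    def __init__(self, source: Callable[[], Dict], clock_ns: Callable[[], int]) -> None:
        self._source = source        # hypothetical adapter over jtop / pynvml
        self._clock_ns = clock_ns    # production would pass time.monotonic_ns
        self._last_warn_ns: Optional[int] = None
        self.snapshot = ThermalState(False, False, None, None, None, clock_ns())
        self.fdr_records: List[Dict] = []       # stands in for FdrClient.publish
        self.logs: List[Tuple[str, str]] = []   # (level, kind) pairs

    def poll_once(self) -> None:
        now = self._clock_ns()
        try:
            raw = self._source()
            new = ThermalState(bool(raw["throttle"]), True, raw["gpu_temp_c"],
                               raw["cpu_temp_c"], raw["clock_mhz"], now)
        except TelemetryUnavailableError:
            # Default-safe values, with the availability flag telling consumers why.
            new = ThermalState(False, False, None, None, None, now)
            if self._last_warn_ns is None or now - self._last_warn_ns >= ONE_SECOND_NS:
                self.logs.append(("WARN", "c7.thermal.unavailable"))
                self._last_warn_ns = now
        previous = self.snapshot
        # Assumption: any flip of the throttle bit counts as a transition,
        # including a flip caused by the default-safe fallback.
        if previous.thermal_throttle_active != new.thermal_throttle_active:
            self.fdr_records.append({
                "kind": "c7.thermal_transition",
                "producer_id": "c7_inference.thermal",
                "payload": {
                    "previous_state": previous.thermal_throttle_active,
                    "new_state": new.thermal_throttle_active,
                    "gpu_temp_c": new.gpu_temp_c,
                    "cpu_temp_c": new.cpu_temp_c,
                    "measured_clock_mhz": new.measured_clock_mhz,
                    "measured_at_ns": new.measured_at_ns,
                },
            })
            level = "WARN" if new.thermal_throttle_active else "INFO"
            self.logs.append((level, "c7.thermal_transition"))
        self.snapshot = new  # single reference swap, as on the read path
```

A test in the style of AC-2 through AC-4 would advance the fake clock by hand and call `poll_once()` directly, which avoids any real sleeping.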
diff --git a/_docs/02_tasks/todo/AZ-303_c6_storage_interfaces.md b/_docs/02_tasks/todo/AZ-303_c6_storage_interfaces.md new file mode 100644 index 0000000..8ab8367 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-303_c6_storage_interfaces.md @@ -0,0 +1,211 @@ +# C6 Storage Interfaces — Protocols + DTOs + Composition-Root Factories + +**Task**: AZ-303_c6_storage_interfaces +**Name**: C6 Storage Interfaces +**Description**: Define the three `c6_tile_cache` Protocols (`TileStore`, `TileMetadataStore`, `DescriptorIndex`), their shared DTOs (`TileId`, `TileMetadata`, `TileQualityMetadata`, `TilePixelHandle`, `Bbox`, `SectorBoundary`, `HnswParams`, `IndexMetadata`), the `TileSource` / `FreshnessLabel` / `VotingStatus` / `SectorClassification` enums, the runtime error taxonomy (`TileCacheError` family + `IndexBuildError`), and the composition-root factory triple `build_tile_store / build_tile_metadata_store / build_descriptor_index`. This is the foundational shared-API task for E-C6 — five external components (C2, C2.5, C3, C10, C11) plus C12 operator tooling depend on the contracts this task freezes. +**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-280_sha256_sidecar +**Component**: c6_tile_cache (epic AZ-250 / E-C6) +**Tracker**: AZ-303 +**Epic**: AZ-250 (E-C6) + +### Document Dependencies + +- `_docs/02_document/contracts/c6_tile_cache/tile_store.md` — produced by this task. +- `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md` — produced by this task. +- `_docs/02_document/contracts/c6_tile_cache/descriptor_index.md` — produced by this task. +- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — `TileMetadata.content_sha256_hex`, `IndexMetadata.sidecar_sha256_hex`, and the atomic-write/sidecar pattern. 
+- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — adds `config.tile_cache.{store_runtime, metadata_runtime, descriptor_index_runtime, root_dir, postgres_dsn, lru_eviction_threshold_bytes}` fields. +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — error events emitted by Protocol implementations use this log shape. + +## Problem + +Five different components (C2 VPR, C2.5 ReRanker, C3 CrossDomainMatcher, C10 CacheProvisioner, C11 TileDownloader/Uploader) and one operator-side consumer (C12) all need a single, frozen interface to the persistent imagery store. Without it: + +- Each consumer would import the concrete `PostgresFilesystemStore` / `FaissDescriptorIndex` directly, hard-coding the storage choice and breaking ADR-009 interface-first DI. +- A future swap of the descriptor index (FAISS → ScaNN) or the metadata store (Postgres → SQLite for Tier-0 dev) would ripple across every consumer. +- Error handling would diverge per consumer; `FreshnessRejectionError`, `ContentHashMismatchError`, and `IndexUnavailableError` would have different shapes per impl, making the F2 takeoff abort path and the F4 mid-flight insert path fragile. +- The composition root would have to know per-component which storage runtime is acceptable; today only ADR-001 (config) + ADR-009 (interface DI) decide. +- No canonical place to declare the `TileMetadata` shape — every consumer would re-derive `flight_id`, `companion_id`, `quality_metadata`, `voting_status` field names, leading to drift. + +This task delivers the typed boundary every consumer reads against and every implementation conforms to. It writes no storage logic — concrete `PostgresFilesystemStore` is owned by the postgres-filesystem-store task; concrete `FaissDescriptorIndex` is owned by the faiss-descriptor-index task; the freshness gate logic is its own task; the LRU eviction is its own task. 
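As an illustration of the typed boundary described above, the following sketch shows how a structural `typing.Protocol` lets a consumer depend on `TileStore` without importing a concrete store. The method names are the ones this task specifies; the parameter and return types, and the `InMemoryTileStore` / `IncompleteStore` classes, are hypothetical placeholders rather than the frozen contract.

```python
from typing import Dict, Protocol, runtime_checkable


@runtime_checkable
class TileStore(Protocol):
    # Method names come from this task; the signatures here are placeholders.
    def read_tile_pixels(self, tile_id: str) -> bytes: ...
    def write_tile(self, tile_id: str, pixels: bytes) -> None: ...
    def tile_exists(self, tile_id: str) -> bool: ...
    def delete_tile(self, tile_id: str) -> None: ...


class InMemoryTileStore:
    """Hypothetical test fake; note that it does not inherit from TileStore."""

    def __init__(self) -> None:
        self._tiles: Dict[str, bytes] = {}

    def read_tile_pixels(self, tile_id: str) -> bytes:
        return self._tiles[tile_id]

    def write_tile(self, tile_id: str, pixels: bytes) -> None:
        self._tiles[tile_id] = pixels

    def tile_exists(self, tile_id: str) -> bool:
        return tile_id in self._tiles

    def delete_tile(self, tile_id: str) -> None:
        self._tiles.pop(tile_id, None)


class IncompleteStore:
    """Omits three of the four methods, so the structural check fails."""

    def tile_exists(self, tile_id: str) -> bool:
        return False


def count_present(store: TileStore, tile_ids: list) -> int:
    # A consumer written against the Protocol only, never a concrete class.
    return sum(1 for tile_id in tile_ids if store.tile_exists(tile_id))
```

One caveat for the conformance tests: `isinstance` against a runtime-checkable Protocol verifies only that the methods are present, not that their signatures match, so signature drift still needs a static type checker.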
+ +## Outcome + +- Three Protocols at `src/gps_denied_onboard/components/c6_tile_cache/interface.py` (re-exported from `__init__.py`): + - `TileStore` — `read_tile_pixels`, `write_tile`, `tile_exists`, `delete_tile`. + - `TileMetadataStore` — `query_by_bbox`, `insert_metadata`, `update_voting_status`, `mark_uploaded`, `pending_uploads`, `record_lru_access`, `lru_candidates`, `total_disk_bytes`, `get_by_id`. + - `DescriptorIndex` — `search_topk`, `descriptor_dim`, `mmap_handle`, `rebuild_from_descriptors`, `index_metadata`. +- All three Protocols are `typing.Protocol` subclasses decorated with `@typing.runtime_checkable`. +- DTOs at `src/gps_denied_onboard/components/c6_tile_cache/_types.py` (re-exported from `__init__.py`): `TileId`, `TileMetadata`, `TileMetadataPersistent`, `TileQualityMetadata`, `Bbox`, `SectorBoundary`, `HnswParams`, `IndexMetadata`, plus the enums `TileSource`, `FreshnessLabel`, `VotingStatus`, `SectorClassification`. All `@dataclass(frozen=True)` except `TilePixelHandle` (opaque context-manager class). +- A `TilePixelHandle` ABC at the same path that the concrete impl subclasses; consumers use `with handle as memview:` and treat the underlying bytes as read-only. +- The runtime error hierarchy under `c6_tile_cache.errors`: + - `TileCacheError` ← {`TileNotFoundError`, `TileFsError`, `TileMetadataError`, `ContentHashMismatchError`, `FreshnessRejectionError`, `IndexUnavailableError`}. + - `IndexBuildError` (NOT a subclass of `TileCacheError` — offline build envelope only; raised by `rebuild_from_descriptors`). +- Composition-root factories at `src/gps_denied_onboard/runtime_root/storage_factory.py`: + - `build_tile_store(config) -> TileStore` + - `build_tile_metadata_store(config) -> TileMetadataStore` + - `build_descriptor_index(config) -> DescriptorIndex` + - Each respects compile-time `BUILD_*` gating (today only `BUILD_FAISS_INDEX` for `DescriptorIndex`; the metadata + filesystem store has no build flag).
+ - Requesting an impl whose flag is OFF raises `RuntimeNotAvailableError` (reused from AZ-297) at composition time, NOT at first call. +- A `ConfigSchemaError` extension to AZ-269's config loader for the new `config.tile_cache.{store_runtime, metadata_runtime, descriptor_index_runtime, root_dir, postgres_dsn, lru_eviction_threshold_bytes}` fields. +- Three frozen contract files at `_docs/02_document/contracts/c6_tile_cache/{tile_store, tile_metadata_store, descriptor_index}.md` carry the full shapes; consumers read those files, not this task spec. +- Type-only unit tests verify each future concrete impl module's class actually conforms to the Protocol via `runtime_checkable` + `isinstance` (catches drift at CI time, not deployment). + +## Scope + +### Included + +- All three Protocols, all DTOs, all enums, the error taxonomy, the composition-root factory triple, and the config-loader extension. +- Three contract files (already drafted alongside this task); the producer task is responsible for keeping them in sync with the code. +- Type-only conformance tests at `tests/unit/c6_tile_cache/test_protocol_conformance.py` that import each concrete impl class and assert `isinstance(impl, ProtocolClass)`. The tests stand up no Postgres / FAISS — they only exercise structural typing. +- `RuntimeNotAvailableError` reuse from AZ-297 (do NOT define a new error type). +- `TilePixelHandle` ABC (so the concrete impl can subclass; tests can substitute a fake handle that wraps a `bytes` buffer). +- DTO field validation at construction time: e.g., `TileId(zoom_level=22)` (out-of-range) raises `ValueError`; `Bbox` with `min_lat > max_lat` raises `ValueError`. These are NOT in `TileCacheError` — they are stdlib `ValueError` for bad caller input. +- The `FreshnessRejectionError`, `ContentHashMismatchError`, and `IndexBuildError` types (defined here even though only the impl tasks raise them — keeps the family / taxonomy in one place). 
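The included pieces above (a `runtime_checkable` Protocol, a frozen DTO with constructor-time validation, and the error family with `IndexBuildError` kept outside it) can be sketched together as follows. This is a minimal illustration, not the contract: the method parameter lists, the two fake classes, and the error messages are assumptions, and the frozen contract files remain the source of the real shapes. Note that a `runtime_checkable` `isinstance` check only confirms that the named members exist; it does not compare signatures.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Protocol, runtime_checkable


class TileCacheError(Exception):
    """Root of the runtime error family that consumers catch."""


class TileNotFoundError(TileCacheError):
    """One of the six runtime subtypes named in the spec."""


class IndexBuildError(Exception):
    """Deliberately NOT a TileCacheError: offline build envelope only."""


@dataclass(frozen=True)
class Bbox:
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float

    def __post_init__(self) -> None:
        # Bad caller input is a stdlib ValueError, not a TileCacheError.
        if self.min_lat > self.max_lat:
            raise ValueError("min_lat must not exceed max_lat")
        if self.min_lon > self.max_lon:
            raise ValueError("min_lon must not exceed max_lon")


@runtime_checkable
class TileStore(Protocol):
    # Parameter lists are placeholders (assumed); the contract file owns the real ones.
    def read_tile_pixels(self, tile_id): ...
    def write_tile(self, tile_id, jpeg_bytes): ...
    def tile_exists(self, tile_id): ...
    def delete_tile(self, tile_id): ...


class FullFake:
    """Hypothetical test double that provides every TileStore method."""

    def read_tile_pixels(self, tile_id):
        return b""

    def write_tile(self, tile_id, jpeg_bytes):
        return None

    def tile_exists(self, tile_id):
        return False

    def delete_tile(self, tile_id):
        return None


class PartialFake:
    """Hypothetical test double that omits delete_tile."""

    def read_tile_pixels(self, tile_id):
        return b""

    def write_tile(self, tile_id, jpeg_bytes):
        return None

    def tile_exists(self, tile_id):
        return False
```

In a conformance test, `isinstance(FullFake(), TileStore)` would be expected to hold while `isinstance(PartialFake(), TileStore)` would not, which is the drift signal AC-1 relies on.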
+ +### Excluded + +- `PostgresFilesystemStore` implementation — separate task (`c6_postgres_filesystem_store`). +- `FaissDescriptorIndex` implementation — separate task (`c6_faiss_descriptor_index`). +- Postgres schema migration (`_alembic/0001_initial.sql`) — separate task (`c6_postgres_schema`). +- Freshness gate logic (active_conflict reject / stable_rear downgrade) — separate task (`c6_freshness_gate`); this task only declares `FreshnessRejectionError` and the `freshness_label` field. +- 10 GB LRU cache eviction — separate task (`c6_cache_budget_eviction`); this task only declares `lru_candidates` / `record_lru_access` / `total_disk_bytes` Protocol methods. +- C10 CacheProvisioner consumer wiring of `rebuild_from_descriptors` — owned by E-C10. +- C11 `TileUploader` consumer wiring of `pending_uploads` / `mark_uploaded` — owned by E-C11. +- C2 / C2.5 / C3 consumer wiring of read paths — owned by their respective epics. +- Sector boundary CRUD — owned by C12 operator tooling. This task only declares the read-side `SectorBoundary` DTO. +- Test infrastructure (Postgres test container, FAISS test fixtures) — owned by E-BBT (test infrastructure task). + +## Acceptance Criteria + +**AC-1: Three Protocols are conformance-checkable** +Given a class that implements every method on `TileStore` (or `TileMetadataStore`, or `DescriptorIndex`) with matching signatures +When `isinstance(impl, TileStore)` is evaluated under `runtime_checkable` +Then the result is `True`; for a class that omits any method, the result is `False` for that Protocol + +**AC-2: Frozen DTOs reject mutation** +Given a constructed `TileId(...)`, `TileMetadata(...)`, `Bbox(...)`, or `HnswParams(...)` instance +When the test attempts any field reassignment +Then `dataclasses.FrozenInstanceError` is raised; the original value is preserved + +**AC-3: Error hierarchy catchable as a single family** +Given any of the six `TileCacheError` subtypes +When the consumer wraps a Protocol method call in `try: ... 
except c6_tile_cache.errors.TileCacheError` +Then every documented subtype is caught; an unrelated `Exception` is NOT caught; `IndexBuildError` is also NOT caught (it is intentionally out of the runtime-read envelope) + +**AC-4: Composition-root factory honours config** +Given `config.tile_cache.descriptor_index_runtime = "faiss_hnsw"` and `BUILD_FAISS_INDEX=ON` +When `build_descriptor_index(config)` is called +Then a `FaissDescriptorIndex` instance is returned (the test substitutes a fake satisfying the Protocol; production wiring is the same call site) + +**AC-5: Composition-root factory honours BUILD flag gate** +Given `config.tile_cache.descriptor_index_runtime = "faiss_hnsw"` and `BUILD_FAISS_INDEX=OFF` +When `build_descriptor_index(config)` is called +Then `RuntimeNotAvailableError` is raised at composition time with a message naming `"faiss_hnsw"`; no module-level import of FAISS symbols has occurred (verifiable by asserting that `sys.modules` does NOT contain `c6_tile_cache.faiss_descriptor_index`) + +**AC-6: Unknown runtime label rejected at config load** +Given `config.tile_cache.descriptor_index_runtime = "scann"` (not in the enum) +When the config is loaded via AZ-269's loader +Then `ConfigSchemaError` is raised at load time with a message listing the valid values; `build_descriptor_index` is never reached + +**AC-7: Constructor-time validation rejects bad input** +Given `TileId(zoom_level=22, lat=0.0, lon=0.0)` (out-of-range zoom) or `Bbox(min_lat=10, min_lon=0, max_lat=5, max_lon=10)` (inverted box) +When the DTO is constructed +Then `ValueError` is raised with a message naming the offending field; no DTO instance is produced + +**AC-8: TilePixelHandle is read-only by contract** +Given a concrete `TilePixelHandle` subclass that exposes `memoryview` over mmap'd bytes +When `with handle as memview: memview[0] = 0xff` +Then `TypeError: cannot modify read-only memory` is raised; the underlying file is not mutated + +**AC-9: Contract files match Protocol shapes** 
+Given the three contract files at `_docs/02_document/contracts/c6_tile_cache/` +When a contract-test parses each file's Shape section's method/field tables and compares against the runtime Protocol via introspection +Then every method, every field, every error type is present and consistent in both directions + +**AC-10: VotingStatus transitions are policy-aware (declared, not enforced)** +Given the `VotingStatus` enum +When the consumer test asserts the documented forward-transitions table (`PENDING → TRUSTED`, `PENDING → REJECTED`, `TRUSTED → REJECTED`) +Then the table matches the contract; the actual enforcement lives in the `update_voting_status` impl (NOT this task), so the test only verifies the enum exposes the three documented states (`PENDING`, `TRUSTED`, `REJECTED`) + +## Non-Functional Requirements + +**Compatibility** +- Protocols use stdlib `typing.Protocol` (PEP 544); no third-party Protocol library is introduced. +- DTOs use stdlib `dataclasses` with `frozen=True`; no `pydantic` or `attrs` dependency. +- Errors subclass `Exception` (not `BaseException`); upstream `except Exception:` continues to work. + +**Performance** +- The factory triple `build_*` returns within 50 ms each (lazy-imports the concrete impl on first call; subsequent calls << 1 ms). +- DTO construction is the bare-cost dataclass `__init__` plus the constructor-time validation (AC-7). + +**Reliability** +- Implementations MUST raise only members of `c6_tile_cache.errors.TileCacheError` from runtime Protocol methods; third-party library exceptions (psycopg / FAISS C++ exceptions / OS errors from filesystem syscalls) MUST be caught and rewrapped. +- Versioning: any breaking change to a Protocol or DTO MUST bump the corresponding contract file's `Version` and notify every consumer task listed in the contract header. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | `runtime_checkable` Protocol vs. 
fully-implementing fakes for each of the three Protocols; vs. fakes missing one method | `isinstance` returns True for full, False for partial | +| AC-2 | Mutation attempt on each frozen DTO | `FrozenInstanceError`; original value preserved | +| AC-3 | Raise each of the six error subtypes; catch as `c6_tile_cache.errors.TileCacheError` | All caught; unrelated `ValueError` is NOT caught; `IndexBuildError` is NOT caught by the family handler | +| AC-4 | `build_descriptor_index` with `faiss_hnsw` + flag ON → fake `FaissDescriptorIndex` | Returned instance satisfies the Protocol | +| AC-5 | `build_descriptor_index` with `faiss_hnsw` + flag OFF | `RuntimeNotAvailableError`; `sys.modules` does NOT contain `c6_tile_cache.faiss_descriptor_index` | +| AC-6 | Config load with invalid `descriptor_index_runtime` value | `ConfigSchemaError`; valid values listed in message | +| AC-7 | `TileId(zoom_level=22, ...)`, `Bbox(min_lat > max_lat, ...)` | `ValueError` with offending field named | +| AC-8 | `TilePixelHandle` write attempt through `memoryview` | `TypeError`; underlying file unchanged | +| AC-9 | Contract introspection vs. Protocol introspection for each of the three contracts | Shape parity test passes for all three | +| AC-10 | `VotingStatus` enum surface | `{PENDING, TRUSTED, REJECTED}` exactly | +| NFR-perf-factory | Microbench `build_*` × 1000 | p99 ≤ 50 ms each | +| NFR-reliability-error-family | All six subtypes inherit from `c6_tile_cache.errors.TileCacheError` | Verified via `issubclass` for each | + +## Constraints + +- The Protocols are stdlib `typing.Protocol`; no third-party Protocol library is introduced. +- DTOs are stdlib `@dataclass(frozen=True)`; no `pydantic` / `attrs`. +- `TilePixelHandle` is an ABC — concrete impls subclass with mmap-backed state; consumers MUST treat the bytes as read-only (enforced via `memoryview` `readonly=True`). +- The error hierarchy is the boundary of acceptable runtime errors. 
Implementations rewrap third-party exceptions; consumers catch the family. +- Lazy import of concrete impls is mandatory in the composition-root factory triple. The package `__init__.py` re-exports ONLY the Protocols, DTOs, enums, errors, and `TilePixelHandle` ABC — no concrete impl module is imported at package load time. +- The three contract files at `_docs/02_document/contracts/c6_tile_cache/` are the source of truth for shape; if the Protocol changes here without the contract updating, that is a Spec-Gap finding (High) per code-review skill Phase 2. +- This task introduces no new third-party dependencies — `typing.Protocol`, `dataclasses`, `enum`, `pathlib`, `numpy` (already pinned for the project) are all that's used. +- `numpy` arrays in the `DescriptorIndex` Protocol surface MUST be C-contiguous `float32`; the impl validates this at runtime (raises `IndexUnavailableError` on mismatch per the contract). This task only declares the type annotations; validation logic lives in the impl task. + +## Risks & Mitigation + +**Risk 1: Protocol drift between contract and code** +- *Risk*: Implementations diverge from the contract over time; consumers cannot tell which is canonical. +- *Mitigation*: AC-9 contract-introspection test runs in CI; any drift fails the test before merge. Each contract's `## Test Cases` section names this exact test. + +**Risk 2: Lazy-import gating bypassed by transitively-imported module** +- *Risk*: A consumer imports `c6_tile_cache` (the package) and the package's `__init__.py` eagerly imports the concrete impl, triggering FAISS load even when `BUILD_FAISS_INDEX=OFF`. +- *Mitigation*: The package `__init__.py` re-exports ONLY the Protocols, DTOs, enums, errors, and `TilePixelHandle` ABC — it does NOT import any concrete impl. AC-5 verifies via `sys.modules`. 
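The lazy-import gating described in Risk 2 and AC-5 might look roughly like the sketch below. It is an assumption-laden illustration: the `_REGISTRY` table, the plain `runtime_label` / `build_flags` arguments (standing in for the real `config` object), and the locally defined `RuntimeNotAvailableError` (the spec reuses the AZ-297 type) are all hypothetical. The point it shows is that the flag check happens before any `import_module` call, so a flag that is OFF never pulls the concrete module into `sys.modules`.

```python
import importlib


class RuntimeNotAvailableError(Exception):
    """Stand-in for the AZ-297 error type (assumed shape)."""


# Hypothetical registry: runtime label -> (build flag, module path, class name).
_REGISTRY = {
    "faiss_hnsw": (
        "BUILD_FAISS_INDEX",
        "c6_tile_cache.faiss_descriptor_index",
        "FaissDescriptorIndex",
    ),
}


def build_descriptor_index(runtime_label: str, build_flags: dict):
    """Resolve and construct the descriptor index named by runtime_label.

    Unknown labels are expected to be rejected earlier, at config load
    (AC-6); the ValueError here is only a defensive fallback.
    """
    if runtime_label not in _REGISTRY:
        raise ValueError(f"unknown descriptor index runtime: {runtime_label!r}")

    flag_name, module_path, class_name = _REGISTRY[runtime_label]

    # Gate BEFORE importing, so an OFF flag never loads the concrete impl.
    if not build_flags.get(flag_name, False):
        raise RuntimeNotAvailableError(
            f"runtime {runtime_label!r} requested but {flag_name} is OFF"
        )

    # Lazy import: the concrete module is only touched on this path.
    module = importlib.import_module(module_path)
    impl_class = getattr(module, class_name)
    return impl_class()
```

A test in the spirit of AC-5 would call the factory with the flag OFF, expect `RuntimeNotAvailableError`, and then assert that the concrete module path is absent from `sys.modules`.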
+ +**Risk 3: Three Protocols cluttering the public surface** +- *Risk*: A consumer that needs only `TileStore` is forced to import the whole `c6_tile_cache` package; if the package eagerly evaluates the other two Protocols' DTOs, the import cost is wasteful. +- *Mitigation*: Stdlib dataclasses + `typing.Protocol` evaluation is essentially free (one class statement each); the AC-5 sys-modules test covers the only meaningful cost (concrete impls). No further mitigation needed. + +**Risk 4: TileMetadata field set drifts as new sources or quality fields are added** +- *Risk*: Adding a field to `TileMetadata` is a contract change rippling to every consumer. +- *Mitigation*: Versioning rules in `tile_store.md` § Versioning Rules require a minor bump for new optional fields with defaults; consumers tolerate. A required-field addition is a major bump and triggers the user-Choose-format coordination per `decompose/templates/api-contract.md`. + +**Risk 5: `IndexBuildError` outside the family confuses catchers** +- *Risk*: A consumer doing `except TileCacheError` MIGHT expect to catch a build-time corruption; instead the error escapes. +- *Mitigation*: Documented as Non-Goal in `descriptor_index.md` and as a separate test in AC-3. The build path lives in C10 pre-flight; in flight the only descriptor-index errors are read-side (`IndexUnavailableError`, which IS in the family). + +## Runtime Completeness + +- **Named capability**: typed Protocols + DTOs + error envelope + composition-root selection for `c6_tile_cache` (architecture / E-C6 / ADR-001 + ADR-009). +- **Production code that must exist**: real Protocol declarations, real frozen DTOs, real error hierarchy, real composition-root factory triple with lazy-import gating, real config-loader extension for the runtime enum, real constructor-time DTO validation (AC-7), real `TilePixelHandle` ABC. 
+- **Allowed external stubs**: tests MAY substitute fake impl classes that conform to the Protocols; production wiring uses the real impls from the postgres-filesystem-store and faiss-descriptor-index tasks. +- **Unacceptable substitutes**: ABCs instead of `typing.Protocol` (would force inheritance changes downstream), `pydantic.BaseModel` instead of `@dataclass(frozen=True)` (adds a runtime validation layer this task does not need), eager imports of concrete impls in `__init__.py` (would defeat `BUILD_FAISS_INDEX` gating), or a `descriptor_index_runtime: str` config field without an enum (would lose the load-time validation in AC-6). + +## Contract + +This task produces/implements the contracts at: +- `_docs/02_document/contracts/c6_tile_cache/tile_store.md` +- `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md` +- `_docs/02_document/contracts/c6_tile_cache/descriptor_index.md` + +Consumers MUST read those files — not this task spec — to discover the interfaces. diff --git a/_docs/02_tasks/todo/AZ-304_c6_postgres_schema.md b/_docs/02_tasks/todo/AZ-304_c6_postgres_schema.md new file mode 100644 index 0000000..d5acf48 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-304_c6_postgres_schema.md @@ -0,0 +1,271 @@ +# C6 Postgres Schema — Tiles Table + Sector Boundaries + Migration Script + +**Task**: AZ-304_c6_postgres_schema +**Name**: C6 Postgres Schema +**Description**: Author the canonical Postgres schema for `c6_tile_cache`: `tiles` (composite key + spatial btree + LRU + voting state + onboard-ingest provenance + per-row JPEG disk size + content-hash chain), `sector_boundaries` (operator-set classification rectangles), `tile_freshness_rules` (per-flight thresholds the freshness gate reads). Ship the initial Alembic migration `_alembic/0001_initial.sql` (forward + reversible down), the schema dataclass mappings used by `PostgresFilesystemStore`, and the per-flight bootstrap migration runner that the composition root invokes at startup. 
+**Complexity**: 2 points +**Dependencies**: AZ-303_c6_storage_interfaces, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module +**Component**: c6_tile_cache (epic AZ-250 / E-C6) +**Tracker**: AZ-304 +**Epic**: AZ-250 (E-C6) + +### Document Dependencies + +- `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md` — defines the `TileMetadata` / `Bbox` / `SectorBoundary` shapes the schema must persist; defines the LRU + disk-budget contract. +- `_docs/02_document/contracts/c6_tile_cache/tile_store.md` — defines the `content_sha256_hex` invariant the `tiles.content_sha256` column carries. +- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — `config.tile_cache.postgres_dsn` field. +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — INFO log shape on migration apply / no-op. +- `_docs/02_document/data_model.md` — system-wide data model the schema must align with (`tiles`, `flight_id` provenance, `quality_metadata` JSONB shape). + +## Problem + +Without a frozen Postgres schema: + +- `PostgresFilesystemStore` has nothing to insert against — `insert_metadata` cannot land any row. +- `query_by_bbox` has no btree to index against — even a 1k-row corpus will table-scan, blowing the C6-PT-01 latency budget. +- The composite-key uniqueness invariant from `tile_metadata_store.md` § I-1 is unenforced — duplicate-key inserts would silently corrupt the cache. +- `lru_candidates` cannot order by `accessed_at` without a column; `total_disk_bytes` cannot SUM without a `disk_bytes` column. +- The freshness gate (separate task) cannot read sector boundaries without a `sector_boundaries` table. +- The C11 `TileUploader` cannot drive its loop off `pending_uploads()` without an `uploaded_at` column. +- Re-running the companion against a stale DB has no migration runner — the operator would have to manually rebuild. + +This task delivers the on-disk shape that every other C6 task and every consumer depends on. 
It writes no Python logic beyond the Alembic env + the schema-validation helper — concrete `PostgresFilesystemStore` is a separate task. + +## Outcome + +- A migration script at `src/gps_denied_onboard/components/c6_tile_cache/_alembic/versions/0001_initial.py` (Alembic Python migration; the project's existing Alembic env is bootstrap-task-owned per AZ-263). Forward migration `upgrade()` creates three tables and four indexes; reverse `downgrade()` drops them in reverse order. The migration is idempotent against a clean DB and is rejected (Alembic's standard behaviour) if applied to a DB at a later revision. +- A migration runner `apply_migrations(config) -> MigrationResult` at `src/gps_denied_onboard/components/c6_tile_cache/migrations.py` invoked by the composition root at startup AFTER config load and BEFORE `PostgresFilesystemStore` construction. Returns `MigrationResult(applied: list[str], current_revision: str, no_op: bool)`. Logs INFO on every applied revision; logs INFO with `no_op=True` when the DB is already at head. +- Three tables exist after `upgrade()`: + 1. `tiles` — see Schema below. + 2. `sector_boundaries` — see Schema below. + 3. `tile_freshness_rules` — see Schema below. +- Four indexes exist after `upgrade()`: + - `tiles_pkey` — `PRIMARY KEY (zoom_level, lat, lon, source)` (composite, enforces I-1 from the metadata-store contract). + - `idx_tiles_spatial` — btree over `(zoom_level, lat, lon)` for `query_by_bbox`. + - `idx_tiles_pending_upload` — partial btree over `(uploaded_at) WHERE source = 'onboard_ingest' AND uploaded_at IS NULL` for `pending_uploads`. + - `idx_tiles_lru` — btree over `accessed_at` for `lru_candidates`. +- `quality_metadata` is JSONB (NOT a separate table) — matches description.md § 2 and `data_model.md`. The JSONB shape is validated at the application layer (the `TileQualityMetadata` dataclass). 
+- A schema fixture `tests/fixtures/c6_postgres_schema_v1.sql` is the human-readable expected DDL used by the schema-shape test (AC-3). + +## Scope + +### Included + +- The Alembic migration `0001_initial.py` covering three tables + four indexes. +- A `MigrationResult` dataclass `@dataclass(frozen=True)`. +- The `apply_migrations(config)` runner using the project-pinned Alembic version (already in the bootstrap dependency set per AZ-263). +- The schema-shape test (`tests/unit/c6_tile_cache/test_postgres_schema.py`) that introspects a freshly-migrated test DB and asserts the documented column types, nullable flags, default values, primary keys, and indexes (Postgres `information_schema` queries; no FAISS / no Python logic). +- The `_alembic/env.py` bootstrap (registers the migration directory with the existing project Alembic env; no NEW alembic config). +- The schema fixture `tests/fixtures/c6_postgres_schema_v1.sql` — copy-pastable DDL the test diffs against. +- Postgres connection helper `c6_tile_cache.connection.psycopg_pool(config) -> psycopg_pool.ConnectionPool` (used by both this task's runner and the future `PostgresFilesystemStore`); the helper is a thin wrapper over `psycopg_pool.ConnectionPool` that takes the DSN from config. + +### Excluded + +- Concrete `PostgresFilesystemStore` (insert / query / mark methods) — separate task (`c6_postgres_filesystem_store`). +- The freshness gate logic that reads `sector_boundaries` / `tile_freshness_rules` — separate task (`c6_freshness_gate`). +- The LRU eviction policy that reads `accessed_at` — separate task (`c6_cache_budget_eviction`). +- FAISS index file format — separate task (`c6_faiss_descriptor_index`). +- Sector-boundary CRUD (operator-side INSERT/UPDATE) — owned by C12. +- Per-flight DB lifecycle (drop-and-rebuild between flights, freshness-rules reload) — owned by the composition root's startup orchestration; this task only applies migrations idempotently. 
+- A second migration revision — every future schema change is a NEW migration file; this task only ships `0001_initial.py`. +- Postgres tuning (work_mem, shared_buffers) — handled by the deployment / Dockerfile (E-DEPLOY); the schema is portable across reasonable Postgres 16 configurations. +- Postgres-version migration (16 → 17) — out of scope this cycle; the schema MUST work on 16.x. + +## Schema + +### Table: `tiles` + +| Column | Type | Nullable | Default | Notes | +|--------|------|----------|---------|-------| +| `zoom_level` | `INTEGER` | NO | — | composite PK | +| `lat` | `DOUBLE PRECISION` | NO | — | composite PK; centre latitude | +| `lon` | `DOUBLE PRECISION` | NO | — | composite PK; centre longitude | +| `source` | `TEXT` | NO | — | composite PK; CHECK `source IN ('googlemaps', 'onboard_ingest')` | +| `tile_size_meters` | `DOUBLE PRECISION` | NO | — | | +| `tile_size_pixels` | `INTEGER` | NO | — | | +| `capture_timestamp` | `TIMESTAMPTZ` | NO | — | UTC | +| `content_sha256` | `TEXT` | NO | — | 64 hex chars; matches the JPEG body hash from AZ-280's atomic-write/sidecar pattern | +| `freshness_label` | `TEXT` | NO | `'fresh'` | CHECK `freshness_label IN ('fresh', 'stale_active_conflict', 'stale_rear', 'downgraded')` | +| `flight_id` | `UUID` | YES | NULL | non-NULL when `source = 'onboard_ingest'` (CHECK enforces) | +| `companion_id` | `TEXT` | YES | NULL | non-NULL when `source = 'onboard_ingest'` (CHECK enforces) | +| `quality_metadata` | `JSONB` | YES | NULL | non-NULL when `source = 'onboard_ingest'` (CHECK enforces); shape validated app-side | +| `voting_status` | `TEXT` | NO | `'trusted'` for googlemaps; `'pending'` for onboard_ingest | CHECK `voting_status IN ('pending', 'trusted', 'rejected')`; default per-source via trigger | +| `disk_bytes` | `BIGINT` | NO | — | byte size of the on-disk JPEG; populated by `write_tile` | +| `accessed_at` | `TIMESTAMPTZ` | NO | `now()` | LRU clock — updated by `record_lru_access` | +| `uploaded_at` | 
`TIMESTAMPTZ` | YES | NULL | set by `mark_uploaded`; remains NULL until C11 `TileUploader` confirms post-flight upload | +| `created_at` | `TIMESTAMPTZ` | NO | `now()` | row-create timestamp; immutable | + +Constraints: + +- `PRIMARY KEY (zoom_level, lat, lon, source)` +- `CHECK (zoom_level BETWEEN 0 AND 21)` +- `CHECK (source IN ('googlemaps', 'onboard_ingest'))` +- `CHECK (freshness_label IN ('fresh', 'stale_active_conflict', 'stale_rear', 'downgraded'))` +- `CHECK (voting_status IN ('pending', 'trusted', 'rejected'))` +- `CHECK (disk_bytes >= 0)` +- `CHECK (length(content_sha256) = 64)` +- `CHECK ((source = 'onboard_ingest' AND flight_id IS NOT NULL AND companion_id IS NOT NULL AND quality_metadata IS NOT NULL) OR (source = 'googlemaps'))` + +### Table: `sector_boundaries` + +| Column | Type | Nullable | Default | Notes | +|--------|------|----------|---------|-------| +| `boundary_id` | `UUID` | NO | `gen_random_uuid()` | PK | +| `min_lat` | `DOUBLE PRECISION` | NO | — | | +| `min_lon` | `DOUBLE PRECISION` | NO | — | | +| `max_lat` | `DOUBLE PRECISION` | NO | — | | +| `max_lon` | `DOUBLE PRECISION` | NO | — | | +| `classification` | `TEXT` | NO | — | CHECK `classification IN ('active_conflict', 'stable_rear')` | +| `set_by_operator` | `TEXT` | NO | — | operator handle for audit | +| `set_at` | `TIMESTAMPTZ` | NO | `now()` | | + +Constraints: + +- `PRIMARY KEY (boundary_id)` +- `CHECK (min_lat <= max_lat AND min_lon <= max_lon)` +- `CHECK (classification IN ('active_conflict', 'stable_rear'))` + +NO spatial index this cycle — the row count is small (≤ a few hundred per flight), and the freshness gate reads them all into memory at flight start. 
+ +### Table: `tile_freshness_rules` + +| Column | Type | Nullable | Default | Notes | +|--------|------|----------|---------|-------| +| `classification` | `TEXT` | NO | — | PK; matches `sector_boundaries.classification` | +| `max_age_seconds` | `BIGINT` | NO | — | seconds; per `STABLE_REAR` is the downgrade threshold; per `ACTIVE_CONFLICT` is the rejection threshold | +| `action` | `TEXT` | NO | — | CHECK `action IN ('reject', 'downgrade')` | +| `set_at` | `TIMESTAMPTZ` | NO | `now()` | | + +Constraints: + +- `PRIMARY KEY (classification)` +- `CHECK (action IN ('reject', 'downgrade'))` +- `CHECK (max_age_seconds > 0)` + +Default rows seeded by the migration: +- `('active_conflict', 6 * 30 * 86400, 'reject')` — 6 months, AC-8.2. +- `('stable_rear', 12 * 30 * 86400, 'downgrade')` — 12 months, AC-8.2. + +## Acceptance Criteria + +**AC-1: Migration is idempotent against a clean DB** +Given a fresh Postgres 16 database with no `alembic_version` row +When `apply_migrations(config)` runs +Then all three tables and all four indexes exist; the `alembic_version` row carries `0001_initial`; `MigrationResult.applied == ['0001_initial']`; `MigrationResult.no_op == False` + +**AC-2: Migration is no-op when at head** +Given a Postgres DB already at `0001_initial` +When `apply_migrations(config)` runs again +Then `MigrationResult.applied == []`; `MigrationResult.no_op == True`; no DDL is emitted (verifiable via `pg_stat_user_tables` row counts unchanged) + +**AC-3: Schema shape matches the documented DDL** +Given a freshly-migrated DB +When the schema-shape test introspects `information_schema.columns` and `pg_indexes` +Then every column matches the `Schema` section above (name, data type, nullability, default expression); every index matches (name, columns, partial-index predicate where applicable); every CHECK constraint exists with the documented expression + +**AC-4: Composite primary key enforces uniqueness** +Given an empty `tiles` table +When two INSERTs with the same 
`(zoom_level, lat, lon, source)` are attempted with different `content_sha256` values +Then the second INSERT raises a Postgres unique-constraint violation; the first row is unaffected; the application layer translates this to `TileMetadataError` (in the `PostgresFilesystemStore` task — this task surfaces only the raw Postgres error) + +**AC-5: CHECK constraint enforces source-aware mandatory fields** +Given an `onboard_ingest` row with `flight_id = NULL` +When the INSERT is attempted +Then the row is rejected by the CHECK constraint at the DB layer + +**AC-6: Down migration reverses cleanly** +Given a DB at `0001_initial` +When `alembic downgrade -1` runs (operator-only command; not exercised by the runtime) +Then all three tables and all four indexes are dropped; the DB returns to the empty pre-migration state; subsequent `upgrade` re-applies cleanly + +**AC-7: Default freshness rules are seeded** +Given a freshly-migrated DB +When the schema-shape test queries `tile_freshness_rules` +Then exactly two rows exist: `('active_conflict', 15552000, 'reject')` and `('stable_rear', 31104000, 'downgrade')` + +**AC-8: Migration runner logs INFO on apply and no-op** +Given a clean DB +When `apply_migrations` runs and then runs again +Then the first call emits an INFO log with `kind="c6.migration.applied"` carrying `revisions=['0001_initial']`; the second call emits an INFO log with `kind="c6.migration.no_op"` + +**AC-9: Quality metadata JSONB is validated app-side, NOT DB-side** +Given an `onboard_ingest` row with `quality_metadata = '{}'::jsonb` (empty JSONB but non-NULL) +When the INSERT runs at the DB layer +Then the INSERT succeeds (DB CHECK does not validate the JSONB shape); the application-layer validation (in `PostgresFilesystemStore`'s `insert_metadata`) is what would reject it. This task documents the boundary: the schema enforces presence/non-NULL only; shape is the impl task's responsibility. 
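The seeded thresholds checked in AC-7 follow from the 30-day-month convention used in the schema section. The arithmetic can be spelled out as below; the helper and constant names are illustrative only and are not part of the migration.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30  # convention used by the seed rows, not a calendar month


def months_to_seconds(months: int) -> int:
    """Convert a month count to seconds under the 30-day-month convention."""
    return months * DAYS_PER_MONTH * SECONDS_PER_DAY


# (classification, max_age_seconds, action), mirroring the documented seed rows.
SEED_ROWS = [
    ("active_conflict", months_to_seconds(6), "reject"),
    ("stable_rear", months_to_seconds(12), "downgrade"),
]
```

Under this convention 12 months is 360 days, so the `stable_rear` threshold sits five days short of a 365-day year.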
+ +## Non-Functional Requirements + +**Performance** +- Migration apply ≤ 5 s on an empty Postgres 16 database. Schema is small (3 tables, 4 indexes) and the runner uses a single connection. +- `apply_migrations` no-op call (DB at head) ≤ 100 ms. +- Idempotency: re-running `apply_migrations` is bound only by the head-detection query (single SELECT against `alembic_version`). + +**Compatibility** +- Postgres 16.x (matches `satellite-provider`'s pin per description.md § 5). +- `psycopg_pool` 3.x — already pinned by AZ-263 bootstrap. +- Alembic 1.13+ — already pinned by AZ-263 bootstrap. + +**Reliability** +- The migration is wrapped in a single transaction (Alembic's default for non-DDL-batched migrations on Postgres). A crash mid-migration leaves the DB at the prior revision. +- The runner catches `psycopg.errors.SerializationFailure` and retries once with exponential backoff; after the second failure, raises a `MigrationError` (NEW error type defined here, NOT in `TileCacheError` — migrations are bootstrap-time, not runtime). 
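The retry-once behaviour in the Reliability requirement could be sketched as follows. This is a generic illustration under stated assumptions: `run_with_single_retry`, its parameters, and the injected `sleep` are hypothetical, and the retriable exception type is passed in rather than importing `psycopg.errors.SerializationFailure` directly, so the sketch stays runnable without a database driver. With only one retry allowed, the exponential backoff reduces to a single base delay.

```python
import time
from typing import Callable, Type, TypeVar

T = TypeVar("T")


class MigrationError(Exception):
    """Bootstrap-time failure; intentionally outside the TileCacheError family."""


def run_with_single_retry(
    operation: Callable[[], T],
    retriable: Type[BaseException],
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying exactly once if it raises the retriable error."""
    try:
        return operation()
    except retriable:
        # First failure: wait for the base delay before the single retry.
        sleep(base_delay_s)

    try:
        return operation()
    except retriable as exc:
        # Second failure: surface a bootstrap-time error to the caller.
        raise MigrationError("migration failed after one retry") from exc
```

Injecting `sleep` keeps a unit test fast: the test can pass a recorder function and assert that exactly one delay was requested.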
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | `apply_migrations` against fresh testcontainer DB | Three tables + four indexes exist; alembic_version='0001_initial'; result.applied=['0001_initial'] | +| AC-2 | `apply_migrations` against already-migrated DB | result.applied=[]; result.no_op=True; no DDL emitted | +| AC-3 | Introspect information_schema after migration; diff against `tests/fixtures/c6_postgres_schema_v1.sql` | Zero diff; every column / index / CHECK matches | +| AC-4 | Two INSERTs with same `(zoom, lat, lon, source)` | Second INSERT raises `psycopg.errors.UniqueViolation` | +| AC-5 | INSERT `onboard_ingest` row with `flight_id=NULL` | Raises `psycopg.errors.CheckViolation` | +| AC-6 | `alembic downgrade -1` then `upgrade` | DB returns to empty state then re-applies cleanly | +| AC-7 | SELECT `tile_freshness_rules` after migration | Exactly 2 rows with documented values | +| AC-8 | Capture log records during migration apply + no-op | Two INFO records with `kind="c6.migration.applied"` and `kind="c6.migration.no_op"` | +| AC-9 | INSERT row with `quality_metadata='{}'::jsonb` | DB-layer accepts; documented as app-side responsibility | +| NFR-perf-apply | Migration apply on empty 16.x | Wall ≤ 5 s | +| NFR-perf-noop | `apply_migrations` no-op timing | Wall ≤ 100 ms | +| NFR-reliability-retry | Inject `SerializationFailure` once, then succeed | Migration succeeds on retry; on second failure raises `MigrationError` | + +## Constraints + +- Postgres 16.x ONLY this cycle; no SQLite / no MySQL fallback. +- Alembic + `psycopg_pool` are already pinned by AZ-263; this task does NOT introduce new third-party dependencies. +- The migration MUST be reversible (`downgrade` drops cleanly) — operator post-flight tooling depends on it for "drop-and-rebuild" flows. +- The schema MUST mirror `data_model.md` exactly (especially the `quality_metadata` JSONB shape and the `voting_status` enum). 
Any deviation requires a `data_model.md` update first; this task does NOT silently extend the data model. +- The `quality_metadata` JSONB shape is NOT validated at the DB layer (no domain types, no CHECK on JSON structure). That validation belongs to `PostgresFilesystemStore.insert_metadata` (separate task) — documented in AC-9. +- `gen_random_uuid()` is built into core Postgres since version 13, so on 16.x it does not itself need `pgcrypto`; the migration's `upgrade()` still runs `CREATE EXTENSION IF NOT EXISTS pgcrypto` as its first statement. +- `MigrationError` is NOT a member of the `TileCacheError` family — migrations run before any `c6_tile_cache.errors` consumer is constructed. +- The schema-fixture file `tests/fixtures/c6_postgres_schema_v1.sql` is the diff target; updating it without a migration revision is a Spec-Gap finding (High) at code-review time. + +## Risks & Mitigation + +**Risk 1: `quality_metadata` JSONB silently malformed** +- *Risk*: An impl task writes a `quality_metadata` JSONB that doesn't match `TileQualityMetadata` shape; the DB accepts it; downstream consumers crash on read. +- *Mitigation*: AC-9 documents the boundary — DB only enforces presence; shape is `insert_metadata`'s job. The future `c6_postgres_filesystem_store` task's tests cover round-trip of every documented shape. + +**Risk 2: Alembic version drift between dev and CI** +- *Risk*: A developer pins a different Alembic minor version, and migrations apply differently in CI. +- *Mitigation*: AZ-263 bootstrap pins Alembic to a single minor; this task adds no version constraints of its own. + +**Risk 3: Down-migration data loss is irreversible** +- *Risk*: Operator runs `alembic downgrade -1` on a DB with live data; tiles are lost. +- *Mitigation*: Down-migration is documented as operator-only and destructive; the runner does NOT auto-downgrade. The composition root's startup runner only ever calls `upgrade head`. 
+ +**Risk 4: Spatial-index strategy is wrong for high-zoom queries** +- *Risk*: `(zoom_level, lat, lon)` btree may not be optimal for a tight bbox at zoom 21. +- *Mitigation*: AC-3 fixes the index shape; if `query_by_bbox` benchmarks fail at takeoff load, a follow-up migration adds a GIST index. Not blocking this cycle (description.md notes the row count is bounded; btree is sufficient). + +**Risk 5: `pgcrypto` extension not available on a deployment** +- *Risk*: A Tier-1 Postgres deployment ships without `pgcrypto`; `gen_random_uuid()` fails. +- *Mitigation*: The migration's first statement is `CREATE EXTENSION IF NOT EXISTS pgcrypto`; if the deployment lacks the extension package, `apply_migrations` raises `MigrationError` early — surfaced to the operator at composition. + +## Runtime Completeness + +- **Named capability**: Postgres 16 spatial metadata index + per-flight schema bootstrap + LRU/upload bookkeeping columns + sector-boundary classification table + per-classification freshness rules table (description.md / data_model.md / AC-NEW-3 / AC-NEW-6 / RESTRICT-SAT-2). +- **Production code that must exist**: real Alembic migration `0001_initial.py`, real `apply_migrations` runner, real schema-fixture diff test, real `psycopg_pool` connection helper. +- **Allowed external stubs**: tests use `testcontainers`-managed Postgres 16 instances (already in the project's test infra per AZ-263); production wiring uses the operator's deployed Postgres. +- **Unacceptable substitutes**: SQLite "for testing only" — `production` and `test` environments MUST both be Postgres 16 (test environment as close to production as possible per coderule.mdc); raw SQL DDL applied without Alembic (would defeat the version-tracking the runner depends on); a `quality_metadata` validation at the DB layer (would lock the schema to the JSONB shape — the application-side validation is the single source of truth). 
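The retry behaviour named in the NFR-reliability-retry row of the Unit Tests table (one injected `SerializationFailure`, then success; a second failure surfaces as `MigrationError`) can be sketched as follows. This is a minimal illustration under stated assumptions, not the task's runner: `apply_with_single_retry` and its `apply_once` callable are hypothetical names, and both exception classes are local stand-ins so the sketch stays stdlib-only (the real ones would be `psycopg.errors.SerializationFailure` and the task's own `MigrationError`).

```python
from typing import Callable, List


class SerializationFailure(Exception):
    """Local stand-in for psycopg.errors.SerializationFailure (assumption)."""


class MigrationError(Exception):
    """Local stand-in for the task's MigrationError (assumption)."""


def apply_with_single_retry(apply_once: Callable[[], List[str]]) -> List[str]:
    """Run apply_once; retry exactly once if it hits a serialization failure."""
    try:
        return apply_once()
    except SerializationFailure:
        pass  # first failure: fall through to the single retry below
    try:
        return apply_once()
    except SerializationFailure as exc:
        # Second failure: surface one domain error and keep the cause chained.
        raise MigrationError("migration failed after one retry") from exc
```

Any exception other than the serialization failure propagates unchanged on the first attempt, so the retry cannot mask a genuine DDL error.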
+ +## Contract + +This task does NOT produce a new contract file — it implements the `tile_metadata_store.md` contract's persistence surface. The schema-fixture file `tests/fixtures/c6_postgres_schema_v1.sql` is the diff target referenced in `tile_metadata_store.md` § Test Cases (`schema-shape-fixture-diff`) — but the contract document of record stays the Protocol contract. diff --git a/_docs/02_tasks/todo/AZ-305_c6_postgres_filesystem_store.md b/_docs/02_tasks/todo/AZ-305_c6_postgres_filesystem_store.md new file mode 100644 index 0000000..8e302d8 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-305_c6_postgres_filesystem_store.md @@ -0,0 +1,254 @@ +# C6 PostgresFilesystemStore — TileStore + TileMetadataStore Production Impl + +**Task**: AZ-305_c6_postgres_filesystem_store +**Name**: C6 PostgresFilesystemStore +**Description**: Implement `PostgresFilesystemStore`, the single production class that satisfies BOTH the `TileStore` Protocol (filesystem-backed JPEG I/O byte-identical to `satellite-provider`) AND the `TileMetadataStore` Protocol (Postgres-backed spatial / LRU / voting state). Owns the full insert path (atomic-write + sha256 sidecar via AZ-280, content-hash gate, single-transaction row insert), the read path (mmap-backed `TilePixelHandle`, btree-indexed bbox query, LRU access stamp), and the bookkeeping path (`mark_uploaded`, `update_voting_status`, `lru_candidates`, `total_disk_bytes`). The freshness gate's `FreshnessRejectionError` raise point is wired here but the rule-evaluation logic lives in the freshness-gate task; the LRU eviction policy lives in the cache-budget-eviction task — this store exposes the primitives both consume. 
+**Complexity**: 5 points +**Dependencies**: AZ-303_c6_storage_interfaces, AZ-304_c6_postgres_schema, AZ-280_sha256_sidecar, AZ-279_wgs_converter, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-273_fdr_client_ringbuf +**Component**: c6_tile_cache (epic AZ-250 / E-C6) +**Tracker**: AZ-305 +**Epic**: AZ-250 (E-C6) + +### Document Dependencies + +- `_docs/02_document/contracts/c6_tile_cache/tile_store.md` — Protocol this task implements; produced by AZ-303. +- `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md` — Protocol this task implements; produced by AZ-303. +- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — atomic-write + sidecar pattern used by `write_tile`; produced by AZ-280. +- `_docs/02_document/contracts/shared_helpers/wgs_converter.md` — `(lat, lon, zoom_level)` → `(x, y)` Web-Mercator tile-coordinate conversion the byte-identity invariant depends on; produced by AZ-279. +- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — `kind="c6.write"`, `kind="c6.write_failed"`, `kind="c6.evicted"` records the store emits. +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — INFO/WARN/ERROR log shapes. + +## Problem + +Without a real `PostgresFilesystemStore`: + +- The C2.5 / C3 read path has no production impl — `read_tile_pixels` cannot return any bytes; F3 hot path stalls before C2.5. +- The F4 mid-flight tile-gen write path has nothing to land tiles in — `write_tile` is a hole; AC-8.4 fails. +- The C11 `TileDownloader` has no insert target — F1 pre-flight provisioning cannot persist Google Maps tiles; takeoff blocks. +- The C11 `TileUploader` has no `pending_uploads()` source — F10 post-landing upload is silent. +- The C10 manifest builder has no `query_by_bbox` — F1 cannot enumerate trusted tiles. 
+- The byte-identity invariant from `tile_store.md` § I-1 is unenforced — F10 upload would be a re-encode rather than a copy, breaking the upload contract with `satellite-provider`. + +This task is the production-default impl. The Protocols, contracts, schema, and helpers are now ready; this is the integration point. + +## Outcome + +- A `PostgresFilesystemStore` class at `src/gps_denied_onboard/components/c6_tile_cache/postgres_filesystem_store.py` that conforms to BOTH the `TileStore` and `TileMetadataStore` Protocols (verified by AZ-303's `runtime_checkable` test). +- Constructor signature: `__init__(self, *, root_dir: Path, postgres_pool: psycopg_pool.ConnectionPool, sha256_sidecar: Sha256SidecarHelper, wgs_converter: WgsConverter, fdr_client: FdrClient, logger: Logger)`. The composition root wires the dependencies; this class owns no globals. +- `read_tile_pixels(tile_id) -> TilePixelHandle`: computes the path via the injected `wgs_converter`, opens the JPEG via `mmap.mmap(fileno, 0, prot=mmap.PROT_READ)`, returns a `MmapTilePixelHandle` (subclass of the `TilePixelHandle` ABC from AZ-303). Raises `TileNotFoundError` when both row and file are absent; `TileMetadataError` when row exists but file is missing (or vice-versa); `TileFsError` on syscall failure. +- `write_tile(tile_blob, metadata) -> None`: + 1. Validates `sha256(tile_blob) == metadata.content_sha256_hex`; mismatch → `ContentHashMismatchError` (no I/O performed). + 2. Computes the canonical path via `wgs_converter`. + 3. (Freshness gate hook — implemented by the freshness-gate task; this task wires the call site so the gate raises `FreshnessRejectionError` BEFORE filesystem write; if the gate is OFF in config, the check is skipped.) + 4. Writes the JPEG + `.sha256` sidecar via the injected `sha256_sidecar.atomic_write_with_sidecar(path, tile_blob)`. + 5. Inserts the metadata row in a single Postgres transaction. 
On unique-violation → roll back, delete the just-written file/sidecar (compensating action), raise `TileMetadataError`. + 6. Emits an FDR record `kind="c6.write"` with `tile_id`, `source`, `disk_bytes`, `content_sha256`. +- `tile_exists(tile_id) -> bool`: SELECT 1 from `tiles` for the composite key; returns True iff a row exists. Does NOT touch the filesystem (the row is the source of truth; consistency is checked on `read_tile_pixels`). +- `delete_tile(tile_id) -> bool`: deletes the row, the JPEG file, and the sidecar — in that order — inside a single transaction with the filesystem ops. Returns `False` if the row was already absent (no-error path per `tile_store.md` § I-6). +- `query_by_bbox(bbox, zoom, *, voting_filter, source_filter) -> list[TileMetadata]`: parameterised SELECT against the composite btree; respects the optional filters. Returns rows ordered by `(lat, lon)` for deterministic test output. +- `insert_metadata(metadata) -> None`: SELECT-side equivalent of `write_tile`'s row insert (used by the rebuild path that has the file already on disk — e.g., F1 pre-flight when `TileDownloader` has placed JPEGs and only the row needs to land). Validates the file exists at the canonical path and that its sha256 matches `metadata.content_sha256_hex`; raises `ContentHashMismatchError` on mismatch, `TileFsError` on absent file. Does NOT re-execute the freshness gate (callers that bypass `write_tile` must run the gate themselves). +- `update_voting_status(tile_id, status) -> None`: UPDATE with the I-8 forward-only transition validation in app-layer (PENDING→TRUSTED, PENDING→REJECTED, TRUSTED→REJECTED allowed; backwards raises `TileMetadataError`). +- `mark_uploaded(tile_id, uploaded_at) -> None`: UPDATE `uploaded_at`; raises `TileNotFoundError` if the row is absent. +- `pending_uploads() -> list[TileMetadata]`: SELECT against the partial index `idx_tiles_pending_upload`. 
+- `record_lru_access(tile_id, accessed_at) -> None`: UPDATE `accessed_at = GREATEST(accessed_at, $1)` (monotonic per `tile_metadata_store.md` § I-4). +- `lru_candidates(*, max_count) -> list[TileMetadata]`: SELECT ORDER BY `accessed_at ASC` LIMIT `max_count`. +- `total_disk_bytes() -> int`: SELECT COALESCE(SUM(disk_bytes), 0) FROM tiles WHERE voting_status != 'rejected'. +- `get_by_id(tile_id) -> Optional[TileMetadata]`: returns `None` on absence (NOT `TileNotFoundError`). +- All third-party exceptions (psycopg errors, OS errors) are caught and rewrapped into the `TileCacheError` family. +- ERROR log on every `TileMetadataError` / `ContentHashMismatchError`; WARN log on every `write_tile` retry; INFO log on store construction with row count + disk bytes; DEBUG log on every read/write (off by default per `tile_store.md` perf table). + +## Scope + +### Included + +- `PostgresFilesystemStore` class implementation conforming to both Protocols. +- `MmapTilePixelHandle` subclass of the `TilePixelHandle` ABC. +- The compensating-delete on insert failure: if the filesystem write succeeded but the row insert failed, the file + sidecar are deleted before the exception propagates; this preserves I-2 (atomic write + sidecar invariant means atomic file+row pair from the consumer's perspective). +- Per-call query parameter binding via psycopg's parameterised query API (no string interpolation; SQL injection is not a vector but the parameterised path is also faster). +- Connection pool sizing per `config.tile_cache.postgres_pool_size` (default 4 — bounded so a runaway query loop cannot DoS the pool). +- Freshness-gate call site (a method `_evaluate_freshness(metadata) -> Optional[FreshnessLabel]` that the freshness-gate task substitutes; this task ships the trivial pass-through `return metadata.freshness_label` so existing tests pass; the gate task replaces the body). 
+- Transaction boundaries: every multi-statement operation is in a single transaction; SAVEPOINT only for the `read_tile_pixels` consistency check (read-side does not need its own transaction since the schema enforces the one-row invariant). +- Filesystem path computation via the injected `wgs_converter` ONLY — NEVER hardcode the Web-Mercator math here; the byte-identity invariant tracks `wgs_converter`'s output directly. +- Idempotent constructor: a re-constructed store against an existing DB + filesystem reads the existing state; does NOT truncate or rebuild. +- A standalone CLI `python -m c6_tile_cache.tools dump ` for operator post-flight inspection (no formal contract — just calls `read_tile_pixels` + writes JPEG to stdout). +- Connection-failure handling: if the pool is unreachable on construction, raises `TileMetadataError` (NOT a separate `ConnectionError` — keeps the family flat). + +### Excluded + +- The freshness gate's rule-evaluation logic — separate task (`c6_freshness_gate`); this task ships only the pass-through hook. +- The 10 GB LRU eviction loop — separate task (`c6_cache_budget_eviction`); this task exposes `lru_candidates` / `delete_tile` / `total_disk_bytes` as the primitives the eviction policy consumes. +- The FAISS descriptor index — separate task (`c6_faiss_descriptor_index`); this task does NOT implement the `DescriptorIndex` Protocol. +- The orthorectifier (used by F4 to project nav-camera frames into tile-space JPEG bytes) — owned by C5; this task receives the bytes via `write_tile`. +- The C11 `TileDownloader` / `TileUploader` HTTP clients — separate epic. +- C10's manifest builder — separate epic. +- Postgres tuning, server config, replica setup — handled by E-DEPLOY. +- Multi-process producer/consumer — single-process per flight per `tile_store.md` Non-Goals. +- Tile orthorectification math — NOT here. 
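The `write_tile` ordering described in the Outcome (hash gate before any I/O, then file plus sidecar, then the row insert with a compensating delete on failure) can be sketched with the standard library alone. Everything named here is illustrative: `write_tile_sketch`, `_atomic_write`, and the `insert_row` callable are hypothetical, the sidecar name `<file>.sha256` is an assumption, and the freshness-gate hook and FDR emission are left out. The real store delegates the atomic write to the AZ-280 helper and the insert to Postgres.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable


class ContentHashMismatchError(Exception):
    """Local stand-in for the TileCacheError-family error of the same name."""


class TileMetadataError(Exception):
    """Local stand-in for the TileCacheError-family error of the same name."""


def _atomic_write(path: Path, data: bytes) -> None:
    # Write a sibling temp file, then rename it over the target.
    # (Simplified: a production helper would also fsync.)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(data)
    os.replace(tmp, path)


def write_tile_sketch(path: Path, blob: bytes, expected_sha256_hex: str,
                      insert_row: Callable[[], None]) -> None:
    # Step 1: hash gate. Nothing touches disk or the database on mismatch.
    actual = hashlib.sha256(blob).hexdigest()
    if actual != expected_sha256_hex:
        raise ContentHashMismatchError(
            f"expected {expected_sha256_hex}, got {actual}")

    # Step 4: JPEG plus sidecar (hex digest followed by a newline).
    sidecar = path.with_name(path.name + ".sha256")
    _atomic_write(path, blob)
    _atomic_write(sidecar, (actual + "\n").encode("ascii"))

    # Step 5: row insert; on failure, undo the filesystem side.
    try:
        insert_row()
    except Exception as exc:
        for artefact in (path, sidecar):
            try:
                artefact.unlink()
            except OSError:
                pass  # the spec logs ERROR here and does not raise (Risk 4)
        raise TileMetadataError("row insert failed; file and sidecar removed") from exc
```

Passing the insert as a callable keeps the sketch independent of psycopg. In the store itself that callable would correspond to the single-transaction INSERT.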
+ +## Acceptance Criteria + +**AC-1: Round-trip write + read is byte-identical** +Given a JPEG body `B` of N bytes with `content_sha256_hex = sha256(B)`, and `metadata` with a known `(zoom_level, lat, lon)` +When `write_tile(B, metadata)` returns and a subsequent `read_tile_pixels(metadata.tile_id)` is called +Then `read_tile_pixels(...).__enter__()` exposes a `memoryview` whose bytes equal `B`; the filesystem path equals the path `wgs_converter` computes for the same coordinate; the `.sha256` sidecar file's content equals `content_sha256_hex` followed by a newline (per AZ-280 contract) + +**AC-2: Content-hash mismatch is rejected before any I/O** +Given `metadata.content_sha256_hex` deliberately set to a wrong value +When `write_tile(B, metadata)` is called +Then `ContentHashMismatchError` is raised; no JPEG file is written; no sidecar is written; no Postgres row is inserted; an ERROR log records the rejection with `tile_id` and the expected vs actual hashes + +**AC-3: Composite-key duplicate raises TileMetadataError + compensating delete** +Given a `tiles` row already exists for `(zoom=18, lat, lon, source='googlemaps')` +When a second `write_tile` is attempted with the same key but different `content_sha256_hex` and different `B` +Then `TileMetadataError` is raised; the second JPEG file is NOT left on disk (compensating delete ran); the original row + file are unchanged; an ERROR log records the duplicate + +**AC-4: Row-without-file consistency fault is fail-fast** +Given a `tiles` row for `(zoom, lat, lon, source)` whose JPEG file has been deleted out-of-band +When `read_tile_pixels(tile_id)` is called +Then `TileMetadataError` (NOT `TileNotFoundError`) is raised with a message identifying both the row and the missing path; the operator's signal that the cache is in a degraded state + +**AC-5: query_by_bbox returns deterministic results** +Given 100 inserted rows uniformly distributed across a 1°×1° bbox at zoom=18 +When `query_by_bbox(bbox=that_1deg, zoom=18, 
voting_filter=None, source_filter=None)` is called +Then exactly 100 rows are returned, ordered by `(lat ASC, lon ASC)`; the EXPLAIN plan uses `idx_tiles_spatial` (verifiable via `EXPLAIN (BUFFERS, FORMAT JSON)` parsing in the test) + +**AC-6: query_by_bbox honours filters** +Given the same 100 rows with mixed `voting_status` (50 PENDING, 50 TRUSTED) +When `query_by_bbox(..., voting_filter=VotingStatus.TRUSTED)` is called +Then exactly the 50 TRUSTED rows are returned; PENDING rows are excluded; an analogous test holds for `source_filter` + +**AC-7: update_voting_status enforces the forward-transitions table** +Given a row with `voting_status = TRUSTED` +When `update_voting_status(tile_id, VotingStatus.PENDING)` is called +Then `TileMetadataError` is raised with a message naming the disallowed transition; the row is unchanged +And: TRUSTED → REJECTED is allowed (covers cache-poisoning recall); PENDING → TRUSTED and PENDING → REJECTED are allowed + +**AC-8: mark_uploaded sets uploaded_at and pending_uploads excludes it** +Given an `onboard_ingest` row with `uploaded_at = NULL` +When `mark_uploaded(tile_id, datetime.utcnow())` is called and then `pending_uploads()` is called +Then `pending_uploads()` does NOT contain that tile_id; the row's `uploaded_at` matches the supplied timestamp within 1 ms + +**AC-9: record_lru_access is monotonic** +Given a row with `accessed_at = T1` +When `record_lru_access(tile_id, T0 < T1)` is called +Then the row's `accessed_at` is unchanged (`T1`); a subsequent `record_lru_access(tile_id, T2 > T1)` updates to `T2` + +**AC-10: total_disk_bytes excludes rejected rows** +Given 5 rows with `disk_bytes = 100, 200, 300, 400, 500` and `voting_status = (TRUSTED, TRUSTED, TRUSTED, TRUSTED, REJECTED)` +When `total_disk_bytes()` is called +Then the result is `1000` (the rejected row's 500 bytes are excluded per I-5) + +**AC-11: delete_tile is idempotent and removes filesystem artefacts** +Given a row + JPEG + sidecar at canonical path +When 
`delete_tile(tile_id)` is called once and then again +Then the first call returns `True` and removes row + file + sidecar; the second call returns `False` (no exception); subsequent `tile_exists` returns `False` + +**AC-12: third-party exceptions are rewrapped** +Given a Postgres pool that is intentionally killed mid-call (testcontainer stop) +When any Protocol method is called +Then a `TileMetadataError` is raised (NOT a raw `psycopg.OperationalError`); the original error message is preserved in the rewrapped exception's `__cause__` + +**AC-13: read_tile_pixels p95 budget** +Given a warmed-up store (page cache hot for the queried tile) +When `read_tile_pixels` is called 1000 times +Then `__enter__()` returns within 0.5 ms p95 (failure threshold 5 ms); cold first read is within 50 ms (failure threshold 200 ms) — matches C6-PT-01 + +**AC-14: write_tile sustains 5 Hz peak F4 burst without dropping** +Given an idle store and `wgs_converter`-mocked path +When 100 `write_tile` calls are issued at 5 Hz from a single producer +Then all 100 land within 30 s; the metadata-store's `total_disk_bytes` reports the sum of all 100 `disk_bytes`; no INSERT failed (matches C6-IT-04 / AC-NEW-3) + +**AC-15: FDR record on every write** +Given a successful `write_tile` +When the FDR record is captured +Then a single `kind="c6.write"` record is emitted with `producer_id="c6_tile_cache.store"`, payload `{tile_id, source, disk_bytes, content_sha256}` matching `fdr_record_schema.md`; on `write_tile` failure, `kind="c6.write_failed"` is emitted with the failure reason + +## Non-Functional Requirements + +**Performance** +- `read_tile_pixels` p95 ≤ 0.5 ms warm; ≤ 50 ms cold (AC-13 / C6-PT-01). +- `write_tile` sustains 5 Hz burst (AC-14 / AC-NEW-3). +- `query_by_bbox` ≤ 50 ms typical for a single sector at zoom=18 (≤ a few hundred matched rows). +- `total_disk_bytes` ≤ 100 ms even at 100k rows (single SUM). +- Pool size default 4; pool checkout p99 ≤ 5 ms under nominal load. 
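The warm-read budget above depends on `read_tile_pixels` handing out a read-only mapping instead of copying bytes. A minimal sketch of such a handle follows, assuming a context-manager shape for `TilePixelHandle` (AC-1 calls `__enter__()` on it). The class name `MmapHandleSketch` is hypothetical and it does not subclass the AZ-303 ABC. It uses `access=mmap.ACCESS_READ`, the portable spelling, which on POSIX should behave like the `prot=mmap.PROT_READ` form named in the Constraints.

```python
import mmap
from pathlib import Path


class MmapHandleSketch:
    """Context manager exposing a file's bytes as a read-only memoryview."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._file = None
        self._map = None
        self._view = None

    def __enter__(self) -> memoryview:
        self._file = open(self._path, "rb")
        try:
            # Length 0 maps the whole file; an empty file raises ValueError.
            self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        except Exception:
            self._file.close()
            raise
        self._view = memoryview(self._map)
        return self._view

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Release the view first: closing a mapping that still has exported
        # buffers raises BufferError.
        self._view.release()
        self._map.close()
        self._file.close()
        return False  # never swallow the caller's exception
```

Because the view is released on exit, a consumer that keeps it past the `with` block gets a `ValueError` on the next access, which matches the rule that handles are not cached across calls.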
+ +**Compatibility** +- Postgres 16.x. +- `psycopg_pool` 3.x and `psycopg` 3.x — already pinned. +- `mmap` from stdlib — no Python C-extension shenanigans. + +**Reliability** +- All errors rewrap third-party exceptions into `TileCacheError` family. +- Insert is transactional with a compensating filesystem delete on row-side failure (preserves the file+row pair invariant). +- The store NEVER blocks the F3 hot path on its own internal locks — `read_tile_pixels` acquires no locks beyond the OS page cache. +- The pool is bounded; a connection leak is detected at process exit with a WARN log and the leak count. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Round-trip write + read of a known JPEG | mmap bytes equal source; sidecar content matches; path matches wgs_converter | +| AC-2 | write_tile with bad content_sha256_hex | ContentHashMismatchError; no fs/db effects; ERROR log | +| AC-3 | Duplicate composite-key write | TileMetadataError; compensating delete leaves original intact | +| AC-4 | Row exists; file deleted out-of-band; read | TileMetadataError (not TileNotFoundError) | +| AC-5 | 100 rows in 1° bbox; query | 100 results in lat,lon ASC order; EXPLAIN uses idx_tiles_spatial | +| AC-6 | query with voting_filter / source_filter | Only matching rows returned | +| AC-7 | TRUSTED → PENDING attempt | TileMetadataError; TRUSTED → REJECTED allowed | +| AC-8 | mark_uploaded then pending_uploads | Tile excluded; uploaded_at matches | +| AC-9 | record_lru_access with backwards timestamp | accessed_at unchanged; forward-only | +| AC-10 | total_disk_bytes with 1 REJECTED among 5 rows | Sum excludes REJECTED | +| AC-11 | delete_tile twice | First returns True, removes artefacts; second returns False | +| AC-12 | Pool killed mid-call | TileMetadataError; original psycopg error in __cause__ | +| AC-13 | Microbench read_tile_pixels × 1000 warm | p95 ≤ 0.5 ms | +| AC-14 | 100 writes at 5 Hz | All land 
within 30 s; no drops | +| AC-15 | FDR capture on write success/failure | Record kind matches; payload fields match schema | +| NFR-perf-pool-checkout | Microbench pool checkout × 1000 | p99 ≤ 5 ms | +| NFR-reliability-leak-detection | Force-leak a connection then exit | WARN log on exit with leak count | + +## Constraints + +- The class implements BOTH Protocols on a single instance — splitting them across two classes would force the composition root to wire two near-identical objects against the same `(root_dir, postgres_pool)`. The single-instance pattern is documented in `tile_metadata_store.md` (the impl note that `PostgresFilesystemStore` also implements `TileStore`). +- Filesystem path computation MUST go through the injected `wgs_converter` — NEVER duplicate the Web-Mercator math here. Direct math is an `Architecture` finding (High) at code-review time. +- `mmap` opens the file with `prot=mmap.PROT_READ`; tests assert the resulting `memoryview` is `readonly=True`. +- Postgres parameterised queries via psycopg's `cursor.execute(sql, params)` ONLY; no string-interpolated SQL. +- Compensating delete on insert failure is mandatory — leaving an orphan JPEG would skew `total_disk_bytes` and silently violate the cache budget. +- The class is NOT thread-safe for `write_tile` — concurrent writes to the same tile_id from two threads are undefined behaviour. Single-writer-per-tile is the F4 path's contract; any future multi-writer scenario is a separate task. +- The store does NOT log per-frame DEBUG by default — `read_tile_pixels` is in the F3 hot path and DEBUG would flood at 9 Hz aggregate. +- This task introduces no new third-party dependencies — `psycopg`, `psycopg_pool`, `mmap` (stdlib), and the AZ-280 / AZ-279 helpers are sufficient. + +## Risks & Mitigation + +**Risk 1: Filesystem writes survive but row insert fails (or vice-versa) in a partial-failure scenario** +- *Risk*: Crash between filesystem write and row insert leaves an orphan file with no row reference.
On next read, `read_tile_pixels` reports a `TileMetadataError` per AC-4, but `total_disk_bytes` doesn't account for the orphan, and the cache budget is silently inflated. +- *Mitigation*: At store construction, an O(N) startup scan reconciles filesystem vs. metadata: orphan files (file present, no row) are deleted on construction with a WARN log naming the path. The companion process restarts on every flight, so the reconciliation runs at known-quiescent boundaries. + +**Risk 2: mmap'd `TilePixelHandle` outlives the file** +- *Risk*: A consumer holds the handle past a `delete_tile` call; the underlying fd is invalidated; reading through the mmap faults with `SIGBUS`. +- *Mitigation*: `delete_tile` does NOT actively invalidate live mmaps; the OS keeps the fd alive until the consumer's `__exit__`. Documented as a constraint: consumers MUST NOT cache `TilePixelHandle` instances across calls — use them inside a `with` block and release. + +**Risk 3: Postgres pool checkout latency spikes under burst** +- *Risk*: 5 Hz F4 burst exhausts the default 4-connection pool; subsequent writes wait, AC-14 fails. +- *Mitigation*: Pool size is config-driven (`config.tile_cache.postgres_pool_size`); benchmarks run at default 4 (which the description's hot-path estimate easily fits); the operator can raise it if needed. The bench in NFR-perf-pool-checkout pins the regression. + +**Risk 4: Compensating delete itself fails** +- *Risk*: After a row insert fails, the compensating filesystem delete also fails (e.g., disk full, permission flip); orphan JPEG persists. +- *Mitigation*: The compensating delete logs ERROR if it fails, but does NOT raise — the original `TileMetadataError` is the operator-visible signal. The reconciliation scan at next start (Risk 1) will clean up. WARN-on-orphan is the steady-state visibility.
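The startup reconciliation scan from Risk 1, which is also the backstop for Risk 4, might look roughly like the sketch below. It is a stdlib-only illustration under assumptions: `reconcile_orphans` is a hypothetical name, the `.jpg` extension and `<file>.sha256` sidecar naming are guesses at the on-disk layout, and `known_paths` stands in for the canonical paths derived from the `tiles` rows. The WARN log per orphan is reduced to the returned list.

```python
from pathlib import Path
from typing import Iterable, List, Set


def reconcile_orphans(root_dir: Path, known_paths: Iterable[Path]) -> List[Path]:
    """Delete JPEGs under root_dir that no metadata row refers to.

    Returns the removed JPEG paths so the caller can log one WARN per orphan.
    """
    known: Set[Path] = {p.resolve() for p in known_paths}
    removed: List[Path] = []
    # sorted() materialises the listing before anything is deleted.
    for jpeg in sorted(root_dir.rglob("*.jpg")):
        if jpeg.resolve() in known:
            continue
        sidecar = jpeg.with_name(jpeg.name + ".sha256")
        jpeg.unlink()
        sidecar.unlink(missing_ok=True)  # Python 3.8+; sidecar may be absent
        removed.append(jpeg)
    return removed
```

The scan only handles the file-without-row direction. A row whose file is missing is left for `read_tile_pixels` to report, as AC-4 describes.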
+ +**Risk 5: Forward-only voting transitions block legitimate operator overrides** +- *Risk*: Operator decides a TRUSTED tile should go back to PENDING for re-validation; the I-8 invariant blocks them. +- *Mitigation*: I-8 is documented; backward transitions are intentionally a separate operator-tooling concern (delete + re-insert as PENDING is the supported workflow). If operator demand is real, a future contract bump adds a `force_reset_voting` admin method — not in this cycle. + +## Runtime Completeness + +- **Named capability**: Postgres-backed spatial metadata index + filesystem JPEG store byte-identical to satellite-provider + atomic-write/sidecar via SHA-256 + LRU/voting/upload bookkeeping (description.md / E-C6 / AC-8.1 / AC-8.4 / AC-NEW-3 / AC-NEW-7 / RESTRICT-SAT-2 / D-C10-3). +- **Production code that must exist**: real `PostgresFilesystemStore` class implementing both Protocols; real `mmap`-backed `MmapTilePixelHandle`; real Postgres connection-pool checkout / parameterised query / single-transaction insert / compensating filesystem delete; real path computation via the injected `wgs_converter`; real sha256 sidecar via the injected `sha256_sidecar` helper; real FDR emission via the injected `FdrClient`. +- **Allowed external stubs**: tests MAY use a `testcontainers`-managed Postgres 16, a `tmp_path` filesystem, and fake `FdrClient` / `Logger` / `WgsConverter` (where the wgs_converter fake just returns a fixed path); production wiring uses real implementations from AZ-279 / AZ-280 / AZ-273 / AZ-266. 
+- **Unacceptable substitutes**: an in-memory dict masquerading as the metadata store (would defeat the byte-identity invariant + the EXPLAIN-plan check + the C6-PT-01 latency benchmark — none would be meaningful); a SQLite shim "for testing only" (test environment must mirror production per coderule.mdc); a path computation that bypasses `wgs_converter` (would duplicate the Web-Mercator math and is the exact byte-identity-drift failure mode the architecture forbids); skipping the compensating delete on row failure (would silently inflate `total_disk_bytes`); a non-rewrapping handler that lets `psycopg.OperationalError` escape (would break the family invariant from AZ-303 § I-2). + +## Contract + +This task implements the contracts at: +- `_docs/02_document/contracts/c6_tile_cache/tile_store.md` +- `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md` + +Consumers MUST read those files — not this task spec — to discover the interfaces. diff --git a/_docs/02_tasks/todo/AZ-306_c6_faiss_descriptor_index.md b/_docs/02_tasks/todo/AZ-306_c6_faiss_descriptor_index.md new file mode 100644 index 0000000..aaf5c09 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-306_c6_faiss_descriptor_index.md @@ -0,0 +1,236 @@ +# C6 FaissDescriptorIndex — HNSW Search + Atomic Rebuild + pybind11 Wrapper + +**Task**: AZ-306_c6_faiss_descriptor_index +**Name**: C6 FaissDescriptorIndex +**Description**: Implement `FaissDescriptorIndex`, the production-default `DescriptorIndex` Protocol strategy. Owns the F1 pre-flight `rebuild_from_descriptors` path (atomic `.index` file write + sidecar via AZ-280), the F2 takeoff load (mmap with `IO_FLAG_MMAP_IFC`), the F3 hot-path `search_topk` (HNSW; ≤ 5 ms p95 warm; sole consumer is C2 VPR), the `index_metadata` sidecar block, and the `cpp/faiss_index/` pybind11 wrapper that links FAISS HEAD-pinned per Plan-phase under the `BUILD_FAISS_INDEX` flag. 
+**Complexity**: 5 points +**Dependencies**: AZ-303_c6_storage_interfaces, AZ-280_sha256_sidecar, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module +**Component**: c6_tile_cache (epic AZ-250 / E-C6) +**Tracker**: AZ-306 +**Epic**: AZ-250 (E-C6) + +### Document Dependencies + +- `_docs/02_document/contracts/c6_tile_cache/descriptor_index.md` — Protocol this task implements; produced by AZ-303. +- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — atomic-write + sidecar pattern for the `.index` file. +- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — `config.tile_cache.descriptor_index_runtime`, `config.tile_cache.faiss_index_path`, `config.tile_cache.faiss_warmup_query` fields. +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — INFO/WARN log shapes for load + warm-up + corruption events. + +## Problem + +Without a real `FaissDescriptorIndex`: + +- C2 VPR has no production retrieval path — `search_topk` is a hole; F3 hot path fails before C2.5. +- C10 CacheProvisioner has no production index builder — F1 pre-flight cannot persist a `.index` file; takeoff blocks. +- The F2 takeoff cold-start budget (AC-NEW-1 ≤ 12 s end-to-end) cannot be measured — without a real warm-up query, the first per-frame `search_topk` would pay the multi-second mmap page-in (description.md § 7). +- The `IndexUnavailableError` raise points (mismatched sidecar, dimension mismatch, mmap'd file replaced concurrently) are unenforced — silent corruption is possible. +- The `BUILD_FAISS_INDEX=OFF` Tier-0 dev path has no test surface — build matrix coverage is missing. + +This task is the production-default impl. The Protocol, contract, and the pinned FAISS dependency are now ready; this is the integration point. + +## Outcome + +- A `FaissDescriptorIndex` class at `src/gps_denied_onboard/components/c6_tile_cache/faiss_descriptor_index.py` conforming to the `DescriptorIndex` Protocol from AZ-303. 
+- A pybind11 wrapper at `cpp/faiss_index/` (CMake `BUILD_FAISS_INDEX` flag) that exposes only the methods this task needs: `read_index_mmap(path)`, `write_index(path, index_handle)`, `add_with_ids(index, vectors, ids)`, `search(index, query, k)`, `set_ef_search(index, ef)`. The wrapper holds NO Python state — all state is in the Python `FaissDescriptorIndex` class. +- Constructor signature: `__init__(self, *, index_path: Path, sha256_sidecar: Sha256SidecarHelper, logger: Logger, warmup_query: Optional[np.ndarray] = None)`. The composition root wires the dependencies; the warm-up query is loaded from `config.tile_cache.faiss_warmup_query` at startup if present. +- `search_topk(query, k) -> list[tuple[TileId, float]]`: + 1. Validates `query.shape == (descriptor_dim,)`, `query.dtype == np.float32`, `query.flags.c_contiguous`. Mismatch → `IndexUnavailableError` (per `descriptor_index.md` § I-3). + 2. Calls `cpp/faiss_index/search` with `k`. + 3. Maps the returned int64 ids back to `TileId` via the in-memory `_id_to_tile_id` map (built at load time from the sidecar metadata block). + 4. Returns up to `k` `(TileId, float)` pairs ordered by ascending distance. Fewer-than-k results are tolerated per § I-2. +- `descriptor_dim() -> int`: returns the cached `IndexMetadata.descriptor_dim` from load time. Constant-time. +- `mmap_handle() -> Path`: returns the `index_path` constructor arg. Raises `IndexUnavailableError` if the index is not currently loaded (e.g., construction failed and the operator caught the exception). +- `rebuild_from_descriptors(descriptors, tile_ids, hnsw_params) -> None`: + 1. Validates `descriptors.shape == (len(tile_ids), descriptor_dim)`, `descriptors.dtype == np.float32`, `descriptors.flags.c_contiguous`, `len(tile_ids) > 0`. Mismatch → `IndexBuildError`. + 2. Builds the HNSW index in C++ via `cpp/faiss_index/add_with_ids` with the supplied params. + 3. Serialises to a temp path under `index_path.parent` via `cpp/faiss_index/write_index`. + 4. 
Writes the sidecar metadata block (a separate `.meta.json` file carrying `IndexMetadata` JSON: `descriptor_dim`, `n_vectors`, `backbone_label`, `backbone_sha256_hex`, `built_at`, `hnsw_params`, plus the `tile_id` ↔ int64 mapping). + 5. Runs `sha256_sidecar.atomic_write_with_sidecar(index_path, temp_index_bytes)` for atomic rename of the `.index` file + `.sha256` sidecar. + 6. Reloads the in-memory index from the new file (so subsequent `search_topk` calls hit the fresh data). + 7. Emits an INFO log on success: `kind="c6.faiss.rebuilt"` with `n_vectors`, `descriptor_dim`, elapsed seconds. +- `index_metadata() -> IndexMetadata`: parses the `.meta.json` sidecar; raises `IndexUnavailableError` if missing or corrupt. +- Load flow at construction: + 1. Validates `index_path` exists; if missing, raises `IndexUnavailableError` (composition root catches and decides — Tier-0 dev may proceed with thermal-aware paths disabled, similar to AZ-302's pattern). + 2. Reads `.sha256` and validates it matches `sha256()`; mismatch → `IndexUnavailableError`. + 3. Reads `.meta.json` and validates it parses to `IndexMetadata`; corruption → `IndexUnavailableError`. + 4. Calls `cpp/faiss_index/read_index_mmap(index_path)` with `IO_FLAG_MMAP_IFC` (FAISS's mmap-backed read path). + 5. Caches `descriptor_dim`, `n_vectors`, the `_id_to_tile_id` map, and the FAISS index handle. + 6. If `warmup_query` is supplied, runs ONE `search_topk(warmup_query, k=1)` to page in the mmap'd file. +- `cpp/faiss_index/` is a thin pybind11 module — no Python-level state, no GIL holds beyond what FAISS itself does. The build is gated by CMake `BUILD_FAISS_INDEX=ON`; with the flag off, the Python `FaissDescriptorIndex` class is not even importable (the `from cpp_faiss_index import ...` line at module top fails import-time, exactly as `BUILD_TENSORRT_RUNTIME=OFF` makes `tensorrt_runtime.py` unimportable). 
+- All third-party FAISS exceptions (C++ exceptions surfaced via pybind11 as `RuntimeError`) are caught and rewrapped into `IndexUnavailableError` (read path) or `IndexBuildError` (rebuild path). + +## Scope + +### Included + +- `FaissDescriptorIndex` class implementation conforming to AZ-303's Protocol. +- `cpp/faiss_index/` pybind11 module with the five-method surface above. +- The `.meta.json` sidecar format — a JSON document carrying `IndexMetadata` plus the `tile_id` ↔ int64 mapping. +- The HNSW int64-id assignment scheme: a stable, deterministic mapping from `TileId` (composite tuple) to int64 id at rebuild time. The mapping function is `int64(sha256(zoom|lat|lon|source).first8bytes)` — collisions are detected at rebuild time (rebuild raises `IndexBuildError` on collision). +- Construction-time mmap of the existing `.index` file (or `IndexUnavailableError` if absent / corrupted). +- Optional construction-time warm-up query (no warm-up if `warmup_query=None`). +- Lazy-import gating: the `cpp_faiss_index` import lives at module top, so `BUILD_FAISS_INDEX=OFF` makes the module unimportable. The composition-root factory's `if BUILD_FAISS_INDEX:` guard prevents the import attempt under the OFF flag. +- Diagnostic INFO log on construction with `n_vectors`, `descriptor_dim`, sidecar SHA-256, build timestamp; INFO on `rebuild_from_descriptors` start + end with elapsed seconds. +- Standalone CLI `python -m c6_tile_cache.faiss_descriptor_index inspect ` for operator post-flight inspection (prints `IndexMetadata` + the first 5 vectors' ids). + +### Excluded + +- The C10 CacheProvisioner orchestration that calls `rebuild_from_descriptors` — owned by E-C10. This task exposes the API; C10 calls it. +- The C2 VPR consumer wiring of `search_topk` — owned by E-C2. +- A second `DescriptorIndex` impl (e.g., `FlatDescriptorIndex` for unit tests that don't want HNSW overhead) — out of scope this cycle. Tests use a fake satisfying the Protocol. 
+- GPU FAISS variants — explicitly forbidden by AZ-303 § I-4. +- Incremental updates / online learning — F1 pre-flight is full-rebuild only per `descriptor_index.md` Non-Goals. +- Descriptor compression / PQ quantisation — out of scope this cycle (HNSW32 raw float32). +- Cross-flight `.index` sharing — parent-suite concern (D-PROJ-2). +- Backbone retraining — owned by E-C7 / E-C10. + +## Acceptance Criteria + +**AC-1: search_topk returns ordered ids on a known corpus** +Given a freshly-rebuilt index from 1000 known descriptors with deterministic int64 ids +When `search_topk(query=descriptors[0], k=5)` is called +Then the result is a list of 5 `(TileId, float)` pairs; the first pair's `TileId` matches `tile_ids[0]`; the first pair's distance is < 1e-6 (self-match); pairs are ordered by ascending distance + +**AC-2: search_topk returns fewer-than-k when corpus is small** +Given a 3-vector corpus and `k=10` +When `search_topk(query, k=10)` is called +Then the result has length 3; every pair's `TileId` matches one of the 3 corpus tile_ids; no exception + +**AC-3: search_topk rejects shape / dtype / contiguity mismatch** +Given a query with `shape=(descriptor_dim+1,)` (wrong dim), or `dtype=float64`, or `flags.c_contiguous=False` +When `search_topk(query, k=5)` is called +Then `IndexUnavailableError` is raised with a message naming the violation; no FAISS call is made (verifiable via the C++ wrapper's call counter staying flat) + +**AC-4: rebuild_from_descriptors atomic on crash** +Given an existing valid `.index` and `.meta.json` and `.sha256` sidecars +When `rebuild_from_descriptors` is called and the test simulates `os._exit` AFTER the temp file is written but BEFORE the atomic rename +Then on next construction the original `.index` and sidecars are intact and loadable; the temp file is left behind for cleanup at next start (cleanup is the construction-time scan's responsibility) + +**AC-5: rebuild_from_descriptors writes correct sidecars** +Given a successful 
rebuild +When the test inspects the resulting files +Then the `.index` file's sha256 matches the `.sha256` sidecar content; the `.meta.json` `descriptor_dim` matches `descriptors.shape[1]`; `n_vectors` matches `len(tile_ids)`; `built_at` is within 1 s of the call time; `hnsw_params` matches the input + +**AC-6: Construction validates sidecar coherence** +Given an `.index` whose `.sha256` sidecar content is mutated to a wrong value +When `FaissDescriptorIndex(index_path=..., sha256_sidecar=..., ...)` is constructed +Then `IndexUnavailableError` is raised with a message naming the path; the FAISS handle is not loaded (verifiable via `mmap_handle()` raising `IndexUnavailableError` on the partially-constructed object) + +**AC-7: Construction validates meta.json** +Given an `.index` whose `.meta.json` is missing or contains malformed JSON +When the index is constructed +Then `IndexUnavailableError` is raised; the FAISS handle is not loaded + +**AC-8: Warm-up query pages the mmap on construction** +Given a freshly-loaded index whose mmap'd file is NOT in the OS page cache and a `warmup_query` is supplied +When the construction returns +Then a subsequent `search_topk` p95 < 5 ms (warm); without the warm-up, the first `search_topk` would be ≥ 100 ms (cold). The test fakes the cold-state by `posix_fadvise(POSIX_FADV_DONTNEED)` on the mapped file before construction. 
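The cold-state setup and the p95 check that AC-8 describes could be sketched roughly as below. This is a minimal illustration that assumes a Linux host; `evict_from_page_cache`, `time_calls_ms` and `p95_ms` are hypothetical helper names introduced here, not part of the task's API, and nearest-rank is only one of several reasonable percentile definitions.

```python
import os
import time
from typing import Callable, List, Sequence


def evict_from_page_cache(path: str) -> None:
    """Advise the kernel to drop cached pages for `path` (Linux/POSIX only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and len=0 cover the whole file. The call is advisory:
        # dirty or currently mapped pages may stay resident.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


def time_calls_ms(fn: Callable[[], object], repeats: int) -> List[float]:
    """Call `fn` `repeats` times and return each wall-clock duration in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def p95_ms(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of the samples."""
    if not samples_ms:
        raise ValueError("p95 of an empty sample set is undefined")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]
```

In a test shaped like AC-8, one would call `evict_from_page_cache` on the `.index` path before constructing the index, then compare `p95_ms(time_calls_ms(lambda: index.search_topk(query, 5), 1000))` against the 5 ms bound. Because the eviction is only advisory, a test built this way may need a tolerance for hosts where the pages stay cached.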
+ +**AC-9: search_topk p95 latency budget** +Given a 100k-vector corpus, page cache warm +When `search_topk` is called 1000 times with random queries +Then p95 ≤ 5 ms (failure threshold 50 ms — but this is a sanity bound, NOT the C2 budget; the canonical C2-PT-01 measurement is in C2's test phase) + +**AC-10: BUILD_FAISS_INDEX=OFF makes the module unimportable** +Given a build with `BUILD_FAISS_INDEX=OFF` (the `cpp_faiss_index` shared lib is not built) +When `from gps_denied_onboard.components.c6_tile_cache import faiss_descriptor_index` is attempted +Then `ImportError` is raised at the `from cpp_faiss_index import ...` line; the composition-root factory's `if BUILD_FAISS_INDEX:` guard MUST prevent the import attempt. The factory raises `RuntimeNotAvailableError` instead. + +**AC-11: int64-id collision detection at rebuild** +Given two `tile_ids` whose deterministic int64 mapping collides (synthetic test using a hash-seed mock) +When `rebuild_from_descriptors` is called +Then `IndexBuildError` is raised with a message naming both colliding tile_ids; no `.index` is written; the original index (if any) is untouched + +**AC-12: index_metadata round-trip** +Given a rebuild with known `(descriptor_dim, n_vectors, backbone_label, backbone_sha256_hex, hnsw_params)` +When the post-rebuild `index_metadata()` is called +Then the returned `IndexMetadata` matches every field; `sidecar_sha256_hex` matches `sha256(.index)` content + +## Non-Functional Requirements + +**Performance** +- `search_topk` p95 ≤ 5 ms warm at 100k corpus (AC-9 / sanity bound; canonical budget is C2-PT-01). +- Construction with warm-up ≤ 10 s for a 100k-vector index (mmap page-in dominates; warm-up is a single search). +- `rebuild_from_descriptors` is bound by FAISS HNSW build time — minutes for 100k vectors. NOT a hot-path operation; F1 pre-flight only. + +**Compatibility** +- FAISS HEAD pinned per Plan-phase (description.md § 5). No version negotiation. 
+- pybind11 stable ABI as already pinned by AZ-263 bootstrap. +- numpy float32 C-contiguous arrays only on the search surface. + +**Reliability** +- All FAISS C++ exceptions are caught and rewrapped into `IndexUnavailableError` / `IndexBuildError`. +- The mmap'd file lifetime is bound to the `FaissDescriptorIndex` instance lifetime; the composition root holds the singleton for the flight. +- `rebuild_from_descriptors` is atomic — partial failure preserves the prior index. +- `.index` is never modified in place — always written to a temp path then atomically renamed. + +**Concurrency** +- `search_topk` is NOT re-entrant per AZ-303 § I-8. The F3 hot path is single-threaded (description.md). Multi-threaded callers MUST use a per-thread instance (out of scope this cycle; documented as a constraint). +- `rebuild_from_descriptors` is offline; never runs concurrently with `search_topk` in the same process. F1 pre-flight is in C10's pre-flight binary; F3 is in the airborne binary. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | rebuild + search_topk on 1000 descriptors | First result self-matches at distance < 1e-6; ordered by distance | +| AC-2 | search_topk with k > corpus size | Returns corpus-size results; no exception | +| AC-3 | search_topk with wrong shape / dtype / non-contiguous | IndexUnavailableError; no FAISS call | +| AC-4 | rebuild crash mid-rename (simulated) | Original index intact on next load | +| AC-5 | Inspect post-rebuild sidecars | `.sha256` matches; `.meta.json` matches input | +| AC-6 | Sidecar content corrupted | IndexUnavailableError on construct | +| AC-7 | `.meta.json` missing/malformed | IndexUnavailableError on construct | +| AC-8 | Warm-up forces mmap page-in | Subsequent search p95 < 5 ms even after fadvise DONTNEED | +| AC-9 | Microbench search × 1000 on 100k corpus | p95 ≤ 5 ms | +| AC-10 | Build with BUILD_FAISS_INDEX=OFF | ImportError; factory raises 
RuntimeNotAvailableError | +| AC-11 | Two tile_ids whose int64 mapping collides | IndexBuildError; no `.index` written | +| AC-12 | Round-trip IndexMetadata after rebuild | Every field matches input | +| NFR-perf-rebuild | 100k vectors, time the rebuild | Wall ≤ 5 minutes (sanity bound; F1 pre-flight runs offline) | +| NFR-reliability-facade-rewrap | Inject a FAISS C++ exception | Rewrapped into IndexUnavailableError; original message in `__cause__` | + +## Constraints + +- FAISS HEAD pinned per Plan-phase (description.md § 5); no version-negotiation logic. +- The `cpp/faiss_index/` wrapper exposes EXACTLY the five methods listed in Outcome — adding methods is a separate task. +- The pybind11 module holds NO Python state — all state is in Python; the wrapper is a stateless façade. +- numpy float32 C-contiguous on all array surfaces; no auto-casting. +- HNSW only this cycle — no `IndexFlat`, no `IndexIVF*`, no GPU variants. +- `.index` files are NEVER modified in place — always temp + atomic-rename. +- The int64-id deterministic mapping `int64(sha256(zoom|lat|lon|source).first8bytes)` is a project convention; if a future task changes it, every prior `.index` is invalidated and the operator must rebuild. +- The `.meta.json` sidecar is the source of truth for `tile_id` ↔ int64 mapping; the `.index` file alone is insufficient (FAISS HNSW stores int64 ids only). +- Lazy-import gating is mandatory — the `cpp_faiss_index` import at module top is the gate; the composition-root factory's `if BUILD_FAISS_INDEX:` block is what skips the import in OFF builds. +- This task adds no new third-party dependencies beyond FAISS HEAD (already pinned by description.md) and pybind11 (already pinned by AZ-263). +- The CLI inspect mode is for operators; not part of any consumer's public API. + +## Risks & Mitigation + +**Risk 1: FAISS HEAD breaks API across pin updates** +- *Risk*: An operator bumps FAISS pin; the C++ surface changes; the pybind11 wrapper fails to compile. 
+- *Mitigation*: FAISS pin is recorded in `description.md` § 5; the wrapper is the only place that depends on the C++ surface. Pin updates are a separate task with its own AC. Documented at the wrapper top. + +**Risk 2: Mmap'd file is replaced concurrently** +- *Risk*: An out-of-band process renames the `.index` file mid-flight; the mmap reads now hit corrupted bytes. +- *Mitigation*: AZ-303 § I-1 forbids mid-flight modification. The composition root holds the singleton for the flight; out-of-band renames are operator-error. A future defensive task could add a periodic sidecar re-check; out of scope this cycle. + +**Risk 3: Int64-id collision (cryptographic-hash) under adversarial inputs** +- *Risk*: With ~10k tiles per provisioning, the birthday-paradox collision probability for an 8-byte truncation of SHA-256 is ~10^-12; effectively zero, but adversarial inputs could engineer a collision. +- *Mitigation*: AC-11 detects collisions at rebuild time and aborts (raises `IndexBuildError`). Operator surfaces the error and either tweaks the corpus or bumps to a 16-byte id mapping — both are out-of-cycle, but the detection point is hard. + +**Risk 4: HNSW first-query cold latency exceeds AC-NEW-1 budget** +- *Risk*: The 100k-vector index's mmap takes seconds to page in; without warm-up, the first F3 search blocks for ≥ 1 s. +- *Mitigation*: AC-8 forces a warm-up at construction; the operator's pre-flight `config.tile_cache.faiss_warmup_query` ensures it's not None in production. C10's pre-flight orchestrator is responsible for ensuring the warm-up query is supplied. + +**Risk 5: pybind11 ABI mismatch between dev and CI** +- *Risk*: A developer compiles against a different Python minor than CI; the `.so` has a different ABI tag. +- *Mitigation*: AZ-263 pins Python minor + pybind11 version; CMake reads the same versions. The CI matrix's per-binary build job rebuilds the wrapper from source. 
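The id mapping and the collision check behind Risk 3 and AC-11 could look roughly like the sketch below. The spec fixes only the formula `int64(sha256(zoom|lat|lon|source).first8bytes)`; the field encoding (UTF-8, `repr` of the floats), the big-endian signed interpretation, the `TileId` and `IndexBuildError` stand-ins, and the `id_fn` injection point are all assumptions made here for illustration.

```python
import hashlib
from typing import Callable, Dict, Iterable, NamedTuple


class TileId(NamedTuple):
    """Hypothetical stand-in for the project's composite TileId."""
    zoom: int
    lat: float
    lon: float
    source: str


class IndexBuildError(Exception):
    """Stand-in for the error type the contract defines."""


def tile_id_to_int64(tile_id: TileId) -> int:
    """Map a TileId to a signed 64-bit id from the first 8 SHA-256 bytes."""
    # Assumed encoding: fields joined with '|' and encoded as UTF-8.
    key = f"{tile_id.zoom}|{tile_id.lat!r}|{tile_id.lon!r}|{tile_id.source}"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Assumed byte order and signedness; FAISS ids are signed 64-bit.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def build_id_map(
    tile_ids: Iterable[TileId],
    id_fn: Callable[[TileId], int] = tile_id_to_int64,
) -> Dict[int, TileId]:
    """Build the int64 -> TileId map, aborting on the first collision."""
    id_map: Dict[int, TileId] = {}
    for tile_id in tile_ids:
        int_id = id_fn(tile_id)
        other = id_map.get(int_id)
        if other is not None and other != tile_id:
            raise IndexBuildError(
                f"int64 id collision between {other} and {tile_id}"
            )
        id_map[int_id] = tile_id
    return id_map


def birthday_collision_bound(n: int, bits: int = 64) -> float:
    """Upper bound n(n-1)/2 / 2^bits on the chance of any collision."""
    return n * (n - 1) / 2 / 2 ** bits
```

For `n = 10_000` the bound evaluates to about 2.7e-12, the same order of magnitude as the figure quoted in Risk 3. Passing a constant `id_fn` is one way a test could force the collision path that AC-11 exercises.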
+ +## Runtime Completeness + +- **Named capability**: FAISS HNSW retrieval + atomic `.index` rebuild + sidecar coherence + mmap-backed read + pybind11 wrapper (description.md / E-C6 / NFT-LIM-01 / D-C10-3 / AC-NEW-1). +- **Production code that must exist**: real `FaissDescriptorIndex` Python class implementing AZ-303's Protocol; real `cpp/faiss_index/` pybind11 wrapper linking real FAISS; real HNSW build via FAISS's `add_with_ids`; real mmap'd read via `IO_FLAG_MMAP_IFC`; real atomic rename via the AZ-280 sidecar helper; real warm-up query at construction; real third-party-exception rewrap. +- **Allowed external stubs**: tests MAY use a fake `Sha256SidecarHelper` (where `atomic_write_with_sidecar` writes to a tmp path); production wiring uses the real AZ-280 helper. Tests MAY use synthetic descriptors and tile_ids; production uses real C10 CacheProvisioner output. +- **Unacceptable substitutes**: a Python-level fake "FAISS" that bypasses the C++ wrapper (would defeat AC-9 latency, the byte-identity of the `.index` file, and the mmap behaviour); a SciPy / scikit-learn `NearestNeighbors` shim "for testing" (different algorithm, different latency profile, different file format — would invalidate the rebuild contract); skipping the warm-up query "to keep construction fast" (would break AC-NEW-1 cold-start budget); an in-memory id map without the `.meta.json` sidecar (would lose the tile_id ↔ int64 mapping across process restarts); a non-rewrapping handler that lets FAISS C++ exceptions escape (would break the family invariant from AZ-303). + +## Contract + +This task implements the contract at `_docs/02_document/contracts/c6_tile_cache/descriptor_index.md`. +Consumers MUST read that file — not this task spec — to discover the interface. 
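The temp-file-then-rename pattern this spec relies on for `.index` writes (the Reliability NFRs, AC-4 and AC-5) could be sketched with the standard library alone, as below. This is not the AZ-280 helper: the function names, the `<name>.sha256` sidecar naming, and the order in which the two files are replaced are assumptions made for illustration only.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def sidecar_path(index_path: Path) -> Path:
    """Assumed naming: `<index file name>.sha256` next to the index."""
    return index_path.with_name(index_path.name + ".sha256")


def _atomic_write_bytes(target: Path, payload: bytes) -> None:
    """Write to a temp file in the same directory, then rename over target."""
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # make the bytes durable before the rename
        os.replace(tmp_name, target)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def atomic_write_with_sidecar(index_path: Path, payload: bytes) -> str:
    """Replace the index file and its SHA-256 sidecar; return the hex digest."""
    digest = hashlib.sha256(payload).hexdigest()
    _atomic_write_bytes(index_path, payload)
    _atomic_write_bytes(sidecar_path(index_path), digest.encode("ascii"))
    return digest


def sidecar_matches(index_path: Path) -> bool:
    """True when the sidecar exists and matches the index file's digest."""
    side = sidecar_path(index_path)
    if not index_path.exists() or not side.exists():
        return False
    actual = hashlib.sha256(index_path.read_bytes()).hexdigest()
    return side.read_text(encoding="ascii").strip() == actual
```

A crash between the two renames in this sketch leaves an index whose sidecar no longer matches, which a load-time check like `sidecar_matches` would surface as a failed coherence check rather than silent corruption. Whether the real helper orders the two writes this way is not stated in the spec.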
diff --git a/_docs/02_tasks/todo/AZ-307_c6_freshness_gate.md b/_docs/02_tasks/todo/AZ-307_c6_freshness_gate.md new file mode 100644 index 0000000..f8cba06 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-307_c6_freshness_gate.md @@ -0,0 +1,202 @@ +# C6 Freshness Gate — Active-Conflict Reject + Stable-Rear Downgrade + +**Task**: AZ-307_c6_freshness_gate +**Name**: C6 Freshness Gate +**Description**: Implement the freshness gate that runs at every `write_tile` and `insert_metadata` call site: looks up the target `(lat, lon)`'s sector classification from `sector_boundaries`, reads the per-classification rule from `tile_freshness_rules` (`max_age_seconds`, `action`), and either raises `FreshnessRejectionError` (active_conflict + stale → reject) or stamps `freshness_label = DOWNGRADED` (stable_rear + stale → downgrade) before the row lands. Replaces the pass-through `_evaluate_freshness` hook the `PostgresFilesystemStore` ships in AZ-305. Reads the rules table once at construction (rules are per-flight; the flight is the lifetime). Caches sector boundaries in an in-memory R-tree (operator sets ≤ a few hundred per flight). Emits an FDR record on every rejection and every downgrade. +**Complexity**: 2 points +**Dependencies**: AZ-303_c6_storage_interfaces, AZ-304_c6_postgres_schema, AZ-305_c6_postgres_filesystem_store, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-273_fdr_client_ringbuf +**Component**: c6_tile_cache (epic AZ-250 / E-C6) +**Tracker**: AZ-307 +**Epic**: AZ-250 (E-C6) + +### Document Dependencies + +- `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md` — Invariants I-2 (active_conflict reject) and I-3 (stable_rear downgrade) are the canonical statement of this task's behaviour. +- `_docs/02_document/contracts/c6_tile_cache/tile_store.md` — defines `FreshnessRejectionError` and `FreshnessLabel`. 
+- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — `kind="c6.freshness.rejected"` / `kind="c6.freshness.downgraded"` envelopes. +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — INFO log shape on rule load and on downgrade + WARN log shape on rejection. + +## Problem + +Without a real freshness gate: + +- AC-8.2 (active_conflict ≤ 6 mo, stable_rear ≤ 12 mo) is unenforced — stale tiles in active sectors land silently and downstream consumers cannot tell them apart. +- AC-NEW-6 (system rejects/downgrades stale tiles) collapses — the tests C6-IT-02 / C6-IT-05 cannot pass. +- The pass-through hook in `PostgresFilesystemStore` (AZ-305) accepts every freshness label as-is — an attacker could feed `freshness_label=FRESH` for a 10-year-old tile and bypass the safety budget. +- The cache-poisoning safety budget (AC-NEW-7) loses one of its layers — the rule-evaluation point is a defence boundary, not a label-trust point. +- Operator sector classifications (set via C12) have no read consumer at the C6 layer — the classifications would be data-only, never policy. + +This task wires the rule-evaluation logic into the AZ-305 store's `_evaluate_freshness` hook. + +## Outcome + +- A `FreshnessGate` class at `src/gps_denied_onboard/components/c6_tile_cache/freshness_gate.py` with a single public method `evaluate(metadata: TileMetadata) -> TileMetadata` that has three outcomes: + - Returns the same `metadata` if FRESH applies (no policy intervention). + - Returns a new `metadata` with `freshness_label=DOWNGRADED` if stable_rear-stale. + - Raises `FreshnessRejectionError` if active_conflict-stale. +- Constructor signature: `__init__(self, *, postgres_pool: psycopg_pool.ConnectionPool, fdr_client: FdrClient, logger: Logger, clock: Clock)`. The `clock` injection lets tests advance time deterministically. +- At construction: + 1. 
Reads `sector_boundaries` once, builds an in-memory R-tree (using the `rtree` library; if description.md / the requirements file does not already pin it, this task adds the pin). + 2. Reads `tile_freshness_rules` once, caches the two rules in a frozen dict `{SectorClassification: FreshnessRule}`. + 3. Emits an INFO log: `kind="c6.freshness.loaded"` with `n_sectors`, `rules`. +- `evaluate(metadata)`: + 1. Computes `tile_age_seconds = now - metadata.capture_timestamp` via the injected `clock`. + 2. Queries the R-tree for the sector containing `(metadata.tile_id.lat, metadata.tile_id.lon)`. If multiple sectors match (overlap), the smallest by area wins (deterministic tie-break). + 3. If no sector matches → treats as `STABLE_REAR` default (per data_model.md convention; documented as the implicit default). + 4. Looks up the rule for that classification. + 5. If `tile_age_seconds <= rule.max_age_seconds` → returns `metadata` unchanged (FRESH). + 6. Else if `rule.action == 'reject'` → emits FDR `kind="c6.freshness.rejected"` and WARN log; raises `FreshnessRejectionError(tile_id, age_seconds, classification, rule)`. + 7. Else if `rule.action == 'downgrade'` → emits FDR `kind="c6.freshness.downgraded"` and INFO log; returns `dataclasses.replace(metadata, freshness_label=FreshnessLabel.DOWNGRADED)`. +- The `PostgresFilesystemStore`'s `_evaluate_freshness` hook is replaced — instead of `return metadata.freshness_label`, it now calls `freshness_gate.evaluate(metadata).freshness_label`. This is a wiring change in AZ-305's class — implemented as a small constructor argument addition (`freshness_gate: Optional[FreshnessGate] = None`) so AZ-305 remains testable in isolation. +- The composition root constructs `FreshnessGate` and passes it to `PostgresFilesystemStore` AFTER the migration runner (AZ-304) has populated the rules table. + +## Scope + +### Included + +- `FreshnessGate` class with `evaluate(metadata)` method. +- Construction-time R-tree build over `sector_boundaries`. 
+- Construction-time rules-table cache. +- FDR emission on every rejection and every downgrade. +- WARN log on rejection (per `tile_store.md` § log table); INFO log on downgrade (downgrade is recoverable, not an error). +- The smallest-area tie-break for overlapping sector boundaries (deterministic, documented). +- The implicit STABLE_REAR default for `(lat, lon)` outside any sector. +- A constructor `Optional[FreshnessGate]` arg on `PostgresFilesystemStore` so AZ-305 stays unit-testable without this gate. +- Composition-root wiring (the factory `build_tile_store` becomes `build_tile_store(config, freshness_gate)`). +- A standalone CLI `python -m c6_tile_cache.freshness_gate explain ` for operators to dry-run the gate. + +### Excluded + +- Sector-boundary CRUD — owned by C12 operator tooling. +- Tile-freshness-rule CRUD beyond the migration's seeded defaults — operators can edit at the DB level today; a future task adds an admin API. +- Rule reload mid-flight — out of scope this cycle. The flight is the lifetime; rules change requires a process restart. +- Cross-sector pose-error voting (the parent-suite D-PROJ-2 voting layer) — that lives in `satellite-provider`. +- Time-of-day or seasonal freshness adjustments — not in description.md, out of scope. +- Per-tile freshness override (operator manually marks one tile fresh) — out of scope; operator workaround is to delete + re-insert with a fresh capture_timestamp. 
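The decision flow of `evaluate`, including the smallest-area tie-break and the implicit STABLE_REAR default listed in the scope above, could be sketched as follows. This is an illustration under stated assumptions, not the task's implementation: all types are simplified stand-ins, a month is approximated as 30 days, FDR and log emission are omitted, and a plain list scan replaces the R-tree purely to keep the sketch dependency-free (the spec requires the R-tree in production code).

```python
import dataclasses
from dataclasses import dataclass
from enum import Enum
from typing import Mapping, Sequence


class SectorClassification(Enum):
    ACTIVE_CONFLICT = "active_conflict"
    STABLE_REAR = "stable_rear"


class FreshnessLabel(Enum):
    FRESH = "fresh"
    DOWNGRADED = "downgraded"


@dataclass(frozen=True)
class FreshnessRule:
    max_age_seconds: float
    action: str  # 'reject' or 'downgrade'


@dataclass(frozen=True)
class Sector:
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float
    classification: SectorClassification

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)

    def area_deg2(self) -> float:
        # Degrees squared: adequate for ranking, not a true area.
        return (self.max_lat - self.min_lat) * (self.max_lon - self.min_lon)


@dataclass(frozen=True)
class TileMetadata:  # simplified stand-in for the real dataclass
    lat: float
    lon: float
    capture_timestamp: float
    freshness_label: FreshnessLabel = FreshnessLabel.FRESH


class FreshnessRejectionError(Exception):
    pass


def classify(sectors: Sequence[Sector], lat: float, lon: float) -> SectorClassification:
    """Smallest matching sector wins; no match falls back to STABLE_REAR."""
    hits = [s for s in sectors if s.contains(lat, lon)]
    if not hits:
        return SectorClassification.STABLE_REAR
    return min(hits, key=Sector.area_deg2).classification


def evaluate(
    metadata: TileMetadata,
    sectors: Sequence[Sector],
    rules: Mapping[SectorClassification, FreshnessRule],
    now: float,
) -> TileMetadata:
    """Return metadata unchanged, return a downgraded copy, or raise."""
    age_seconds = now - metadata.capture_timestamp
    classification = classify(sectors, metadata.lat, metadata.lon)
    rule = rules[classification]
    if age_seconds <= rule.max_age_seconds:
        return metadata
    if rule.action == "reject":
        raise FreshnessRejectionError(
            f"Tile rejected by freshness gate: age={age_seconds:.0f}s, "
            f"classification={classification.value}"
        )
    if rule.action == "downgrade":
        return dataclasses.replace(
            metadata, freshness_label=FreshnessLabel.DOWNGRADED
        )
    raise ValueError(f"unknown rule action: {rule.action!r}")
```

Passing `now` as an argument plays the role of the injected `Clock` here, which keeps the function deterministic under test.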
+ +## Acceptance Criteria + +**AC-1: Active-conflict stale tile is rejected** +Given a sector classified `ACTIVE_CONFLICT` with the default 6-month rule, and a tile inside it with `capture_timestamp = now - 7 months` +When `evaluate(metadata)` is called +Then `FreshnessRejectionError` is raised with a message naming the tile_id, the age, and the rule; ONE FDR `kind="c6.freshness.rejected"` record is emitted; ONE WARN log is emitted + +**AC-2: Active-conflict fresh tile passes** +Given the same sector and a tile with `capture_timestamp = now - 5 months` +When `evaluate(metadata)` is called +Then the call returns `metadata` unchanged; no FDR record is emitted; no WARN log is emitted + +**AC-3: Stable-rear stale tile is downgraded** +Given a sector classified `STABLE_REAR` with the default 12-month rule, and a tile inside it with `capture_timestamp = now - 13 months` +When `evaluate(metadata)` is called +Then the returned `TileMetadata` has `freshness_label = FreshnessLabel.DOWNGRADED`; the rest of the metadata is unchanged; ONE FDR `kind="c6.freshness.downgraded"` record is emitted; ONE INFO log is emitted + +**AC-4: Stable-rear fresh tile passes** +Given the same sector and a tile with `capture_timestamp = now - 10 months` +When `evaluate(metadata)` is called +Then the call returns `metadata` unchanged; no FDR; no log + +**AC-5: Tile outside all sectors defaults to STABLE_REAR** +Given a tile at `(lat, lon)` not contained in any `sector_boundaries` row +When `evaluate(metadata)` is called with a 13-month-old `capture_timestamp` +Then the result is `freshness_label = DOWNGRADED` (the implicit STABLE_REAR default applies); FDR `kind="c6.freshness.downgraded"` is emitted + +**AC-6: Overlapping sectors resolve by smallest area** +Given two `sector_boundaries` rows: a 1°×1° ACTIVE_CONFLICT box and a 0.1°×0.1° STABLE_REAR box, with the smaller box fully inside the larger +When `evaluate(metadata)` is called for a tile inside the smaller (and thus also the larger) box 
+Then the STABLE_REAR rule applies (smallest area wins); a 13-month-old tile is downgraded, NOT rejected + +**AC-7: Rules and sectors are loaded once at construction** +Given a `FreshnessGate` instance +When 10000 `evaluate` calls are made +Then no `sector_boundaries` or `tile_freshness_rules` SELECT is observed (verifiable via psycopg query log capture); only the construction-time SELECT pair is observed + +**AC-8: FreshnessRejectionError carries diagnostic fields** +Given an active_conflict rejection +When the test inspects the raised exception +Then `exc.tile_id`, `exc.age_seconds`, `exc.classification`, `exc.rule` are populated; the exception message starts with `"Tile rejected by freshness gate"` + +**AC-9: FDR record envelopes match contract** +Given a rejection or downgrade +When the FDR record is captured +Then the record matches `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` shape with the documented `kind`, `producer_id="c6_tile_cache.freshness"`, payload `{tile_id, age_seconds, classification, rule_action, rule_max_age_seconds}` + +**AC-10: Composition wiring change works end-to-end** +Given a `PostgresFilesystemStore` constructed WITH a `FreshnessGate` argument +When `write_tile` is called with a stale active_conflict tile +Then `FreshnessRejectionError` is raised; no JPEG / row / sidecar is written (verifiable via filesystem + DB inspection); the rejection FDR is emitted via the same `FdrClient` AZ-305 already holds + +## Non-Functional Requirements + +**Performance** +- `evaluate` p99 ≤ 100 µs (R-tree point-in-rect lookup is sub-microsecond; the hot bottleneck is the `now - capture_timestamp` arithmetic and the FDR emission, both fast). +- Construction takes ≤ 50 ms even for a few-hundred-sector flight (R-tree build is O(N log N) on a small N). + +**Compatibility** +- `rtree` Python library — verify the project pin already includes it; if not, this task adds it (compatible with the project's existing geospatial stack). 
+- `dataclasses.replace` is stdlib. + +**Reliability** +- Construction failure is fail-fast: a malformed `tile_freshness_rules` row (e.g., unknown `action` enum value) raises a `ConfigSchemaError` extension at construction; the composition root catches and aborts startup with a clear operator message. +- The gate is idempotent — calling `evaluate` on the same `metadata` twice returns deep-equal results (no hidden state changes). +- The injected `Clock` MUST be the same singleton used by AZ-305's `record_lru_access` and AZ-302's thermal publisher (already a project-wide singleton). + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Active-conflict + 7-month tile | FreshnessRejectionError; one FDR; one WARN log | +| AC-2 | Active-conflict + 5-month tile | Returns unchanged; no FDR; no WARN | +| AC-3 | Stable-rear + 13-month tile | Returns with freshness_label=DOWNGRADED; one FDR; one INFO | +| AC-4 | Stable-rear + 10-month tile | Returns unchanged; no FDR; no log | +| AC-5 | Tile outside all sectors + 13-month | Defaults to STABLE_REAR; downgraded | +| AC-6 | Overlapping sectors (smaller STABLE_REAR inside larger ACTIVE_CONFLICT) | Smaller wins; downgrade, not reject | +| AC-7 | 10k evaluate calls + query-log capture | Only construction-time SELECTs observed | +| AC-8 | Inspect raised FreshnessRejectionError fields | tile_id, age_seconds, classification, rule populated | +| AC-9 | FDR record shape on reject and downgrade | Matches schema deep-equal | +| AC-10 | E2E PostgresFilesystemStore + FreshnessGate write | FreshnessRejectionError; no fs/db effects | +| NFR-perf-evaluate | Microbench evaluate × 100k | p99 ≤ 100 µs | +| NFR-reliability-malformed-rule | Inject `tile_freshness_rules` row with `action='unknown'` | ConfigSchemaError at construction | + +## Constraints + +- The R-tree is built ONCE at construction; mid-flight sector boundary changes are NOT honoured (process restart required). 
+- The implicit STABLE_REAR default for tiles outside all sectors is documented and is the safer default (downgrade, not reject — operator may add an explicit `whole_world` ACTIVE_CONFLICT sector if they want fail-closed behaviour). +- Tie-break for overlapping sectors is "smallest area wins" — deterministic and documented; bbox area is computed via `(max_lat - min_lat) * (max_lon - min_lon)` (degrees² — adequate for ranking, not for actual area). +- The gate raises `FreshnessRejectionError` (defined in AZ-303); this task does NOT define new error types. +- The gate's `evaluate` method MUST be idempotent and side-effect-free except for FDR + log emissions; future code-review treats internal state mutation as a `Reliability` finding (High). +- `Clock` injection is mandatory — no `time.time()` direct calls; tests assert deterministic output by advancing the fake clock. +- This task does NOT introduce new third-party dependencies beyond `rtree` (verify in requirements). + +## Risks & Mitigation + +**Risk 1: R-tree library API drift across pins** +- *Risk*: `rtree` minor version bump changes API; constructor calls fail at runtime. +- *Mitigation*: Pin recorded in requirements; the wrapper isolates `rtree` to this single class; future breaks fail-fast at construction. + +**Risk 2: Sector-boundary update mid-flight is silently ignored** +- *Risk*: Operator updates sector_boundaries via SQL during a flight; the gate's R-tree is stale; new tile classifications use old boundaries. +- *Mitigation*: Documented constraint — process restart required for boundary changes. Operator workflow: pre-flight sector setup is C12's responsibility; in-flight boundary changes are not in scope. + +**Risk 3: STABLE_REAR-default for tiles outside all sectors is too lenient** +- *Risk*: A tile from an unmapped area lands as DOWNGRADED rather than rejected, leaking past the safety budget. 
+- *Mitigation*: Documented as the safer default (operator adds explicit ACTIVE_CONFLICT whole_world sector for fail-closed). FDR `kind="c6.freshness.downgraded"` carries the classification, so the FDR-trace shows operators which tiles fell through. A future task could add a `config.tile_cache.freshness_gate.no_sector_default` config field — out of scope this cycle. + +**Risk 4: Smallest-area tie-break interacts badly with adversarial sector layouts** +- *Risk*: An operator (or attacker) inserts a tiny STABLE_REAR sector inside a large ACTIVE_CONFLICT box to bypass rejections. +- *Mitigation*: Sector boundary CRUD is C12-only and operator-authenticated (per architecture's threat model). The smallest-area rule is documented; if abused, the operator audit log (set_by_operator + set_at columns in `sector_boundaries`) surfaces the change. + +**Risk 5: Clock-injection mistake — fake clock used in production** +- *Risk*: Composition root accidentally wires `FakeClock` instead of `WallClock` to the gate; freshness ages are computed against a fixed time; everything looks fresh forever. +- *Mitigation*: AZ-265's `Clock` interface owns the WallClock vs. fake choice via the same composition-root selection that owns thermal-state polling. The factory's per-binary CMake `BUILD_*` flags already separate live (WallClock) from replay (TlogDerivedClock); test wiring is the only place fakes appear. Code review's wiring check (Phase 6 / Architecture) is the canonical guard. + +## Runtime Completeness + +- **Named capability**: per-sector freshness gate enforcing AC-8.2 / AC-NEW-6 (description.md / E-C6 / data_model.md). +- **Production code that must exist**: real `FreshnessGate` class with R-tree-backed sector lookup, real `tile_freshness_rules` query at construction, real `dataclasses.replace` for the downgrade label, real FDR emission on every reject and downgrade, real WARN/INFO logs. 
+- **Allowed external stubs**: tests MAY use a fake `Clock`, fake `FdrClient`, fake `Logger`, and an in-memory psycopg fake (testcontainer is also fine — both are equivalent under AZ-304's schema fixture); production wiring uses real WallClock + real AZ-273 `FdrClient` + real AZ-266 `Logger` + real Postgres pool. +- **Unacceptable substitutes**: a hardcoded "everything is fresh" pass-through (defeats the entire point); a Python in-memory boundary list ignoring `sector_boundaries` (would diverge from the operator's source of truth in C12); `time.time()` direct calls without Clock injection (would break test determinism); skipping the R-tree and doing a linear scan over sectors (works at small N but invites future regression at larger N — R-tree is pre-emptively the right shape per coderule.mdc's "the simplest solution that satisfies all requirements, including maintainability"). + +## Contract + +This task implements behaviour mandated by `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md` § Invariants I-2 + I-3. No new contract file — the gate is a policy implementation behind an existing Protocol surface (`PostgresFilesystemStore.write_tile / insert_metadata` already raise `FreshnessRejectionError` per the contract; this task supplies the rule-evaluation logic). diff --git a/_docs/02_tasks/todo/AZ-308_c6_cache_budget_eviction.md b/_docs/02_tasks/todo/AZ-308_c6_cache_budget_eviction.md new file mode 100644 index 0000000..62e276a --- /dev/null +++ b/_docs/02_tasks/todo/AZ-308_c6_cache_budget_eviction.md @@ -0,0 +1,207 @@ +# C6 Cache Budget Eviction — 10 GB Hard Cap with LRU Sweep + +**Task**: AZ-308_c6_cache_budget_eviction +**Name**: C6 Cache Budget Eviction +**Description**: Implement the 10 GB cache-budget enforcer per RESTRICT-SAT-2 (cache budget across operational area). 
Wraps `PostgresFilesystemStore.write_tile` with a pre-write head-room check; on overflow, drives an LRU sweep using the store's `lru_candidates(max_count) -> list[TileMetadata]` and `delete_tile(tile_id) -> bool` primitives until enough head-room is freed. Emits an INFO log per eviction (`kind="c6.evicted"` with `tile_id`, `disk_bytes`, `accessed_at`) and an FDR record per eviction batch. Hard cap is config-driven (`config.tile_cache.cache_budget_bytes`, default 10 GB). Defends against the silent-overflow failure mode where a runaway F4 burst would push past the cap and either fill the disk or get arbitrarily evicted by the OS. +**Complexity**: 3 points +**Dependencies**: AZ-303_c6_storage_interfaces, AZ-305_c6_postgres_filesystem_store, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-273_fdr_client_ringbuf +**Component**: c6_tile_cache (epic AZ-250 / E-C6) +**Tracker**: AZ-308 +**Epic**: AZ-250 (E-C6) + +### Document Dependencies + +- `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md` — defines `lru_candidates`, `record_lru_access`, `total_disk_bytes`, `delete_tile` primitives this task consumes; produced by AZ-303. +- `_docs/02_document/contracts/c6_tile_cache/tile_store.md` — `delete_tile` semantics (idempotent, returns `False` on missing). +- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — `kind="c6.eviction_batch"` envelope. +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — INFO log shape per evicted tile. + +## Problem + +Without budget enforcement: + +- RESTRICT-SAT-2 (10 GB cache budget) collapses — the cache grows unboundedly under sustained F4 mid-flight ingest; eventually the disk fills. +- An adversarial mid-flight ingest path (compromised companion sending bogus tiles via the F4 boundary, even before the parent-suite trust layer applies) could DoS the disk. 
+- The OS would silently evict the OS page cache to make room, blowing the C6-PT-01 read-side latency budget for legitimate tiles. +- C6-IT-06 (synthetic 10 GB fill + 100 MB overrun → eviction count matches insert overrun) cannot pass — the eviction is a hole. +- Operators cannot tell which tiles got dropped — there's no per-eviction log or FDR trace. + +This task is the budget enforcer. The store's primitives are now ready; this is the policy layer that consumes them. + +## Outcome + +- A `CacheBudgetEnforcer` class at `src/gps_denied_onboard/components/c6_tile_cache/cache_budget_enforcer.py` with one public method `reserve_headroom(needed_bytes: int) -> EvictionResult` that, when called pre-write: + 1. Reads `total_disk_bytes()` from the metadata store. + 2. Computes `available_bytes = budget_bytes - current_bytes`. + 3. If `available_bytes >= needed_bytes` → returns `EvictionResult(evicted=[], freed_bytes=0)` immediately (no eviction needed). + 4. Else → runs an LRU sweep: in batches of `eviction_batch_size` (default 32), calls `lru_candidates(max_count=batch_size)`, iterates the candidates, calls `delete_tile(tile_id)` for each, accumulates `freed_bytes` (from each candidate's `disk_bytes`), and stops when `freed_bytes >= (needed_bytes - available_bytes)`. If the candidate list is exhausted before the freed budget is hit, raises `CacheBudgetExhaustedError` (NEW error type subclassing `TileCacheError`). + 5. Returns `EvictionResult(evicted: list[TileMetadata], freed_bytes: int)` listing what was evicted. +- Constructor signature: `__init__(self, *, store: TileMetadataStore, fdr_client: FdrClient, logger: Logger, budget_bytes: int, eviction_batch_size: int = 32)`. The `store` reference also provides `delete_tile` (since the impl is `PostgresFilesystemStore`, both Protocols are on one instance — the type hint is `TileMetadataStore` for clarity but the runtime object also provides `delete_tile` via the `TileStore` Protocol). 
+- The `PostgresFilesystemStore.write_tile` method is wrapped at the composition root: a `BudgetEnforcedTileStore` decorator class wraps the store and calls `enforcer.reserve_headroom(len(tile_blob))` before delegating to the wrapped store's `write_tile`. This is the same decorator pattern AZ-307's freshness gate uses (composition-root wiring). +- Per-eviction INFO log: `kind="c6.evicted"`, payload `{tile_id, disk_bytes, accessed_at, evicted_at}` — one log entry per evicted tile. +- Per-batch FDR record: `kind="c6.eviction_batch"`, payload `{trigger_tile_id, freed_bytes, evicted_count, evicted_tile_ids[:5]}` (first 5 evicted ids — keeps the FDR record bounded; the full list is in the logs). +- The `record_lru_access` primitive is wired into the read path of `PostgresFilesystemStore.read_tile_pixels` (AZ-305 already declares it; this task's wiring change ensures every read updates the LRU clock so eviction picks the right candidates). + +## Scope + +### Included + +- `CacheBudgetEnforcer` class with the `reserve_headroom` method. +- `EvictionResult` dataclass `@dataclass(frozen=True)`. +- `CacheBudgetExhaustedError` (subclass of `TileCacheError` — added to `c6_tile_cache.errors`). +- `BudgetEnforcedTileStore` decorator class wrapping a `TileStore` and calling the enforcer pre-write. +- Composition-root wiring: the factory `build_tile_store` returns a `BudgetEnforcedTileStore` wrapping `PostgresFilesystemStore` when `config.tile_cache.cache_budget_bytes > 0`; returns the bare store otherwise (for Tier-0 dev where the budget is irrelevant). +- INFO log per evicted tile. +- FDR record per eviction batch. +- A wiring change in `PostgresFilesystemStore.read_tile_pixels` to call `record_lru_access(tile_id, now)` on every read (so the LRU clock stays current). 
This is a small additive change to AZ-305's class — implemented as a constructor argument `lru_clock: Clock | None = None` so AZ-305 stays unit-testable; when `lru_clock` is provided, every read appends to the LRU clock. +- A standalone CLI `python -m c6_tile_cache.cache_budget_enforcer dry-run --pretend-needed-bytes N` for operators to inspect what would be evicted without performing the eviction. +- A construction-time INFO log with `budget_bytes`, `current_disk_bytes`, `headroom_bytes`. + +### Excluded + +- Voting-status-aware eviction (e.g., "evict PENDING before TRUSTED") — out of scope this cycle. The LRU-only policy is the simplest enforcement and matches RESTRICT-SAT-2 directly. A future task can add voting-tier-weighted eviction. +- Eviction-throttling under sustained-burst pressure (e.g., "stop accepting new writes if eviction-rate exceeds threshold") — out of scope; the budget is a hard cap, not a soft one. +- Per-zoom or per-source quota — out of scope; the budget is global. +- Background-sweep eviction (eager eviction on a timer) — out of scope; eviction runs only on write-side budget pressure. +- `delete_tile` failure handling beyond logging — if `delete_tile` returns `False` (already missing) or raises `TileFsError`, the enforcer logs and continues; the budget calculation still subtracts the row's `disk_bytes` because the row is gone. +- Cross-flight cache state — every flight starts with whatever the prior flight's persistent state was; eviction is per-flight bookkeeping. 
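
The `reserve_headroom` flow described in the Outcome section can be sketched as below. This is a minimal illustration under stated assumptions, not the production class: `InMemoryStore` is a hypothetical stand-in for the AZ-303 primitives, `accessed_at` is modelled as a plain float, the method is written as a free function, and the INFO log and FDR emission are left out.

```python
from dataclasses import dataclass


class TileCacheError(Exception):
    """Stand-in for the C6 error family parent."""


class CacheBudgetExhaustedError(TileCacheError):
    """Evicting every candidate still did not free enough bytes."""


@dataclass(frozen=True)
class TileMetadata:
    tile_id: str
    disk_bytes: int
    accessed_at: float  # assumption: plain timestamp; the real field type may differ


@dataclass(frozen=True)
class EvictionResult:
    evicted: list
    freed_bytes: int


class InMemoryStore:
    """Hypothetical fake exposing the three primitives the enforcer consumes."""

    def __init__(self, tiles):
        self._tiles = {t.tile_id: t for t in tiles}

    def total_disk_bytes(self) -> int:
        return sum(t.disk_bytes for t in self._tiles.values())

    def lru_candidates(self, max_count: int) -> list:
        oldest_first = sorted(self._tiles.values(), key=lambda t: t.accessed_at)
        return oldest_first[:max_count]

    def delete_tile(self, tile_id: str) -> bool:
        return self._tiles.pop(tile_id, None) is not None


def reserve_headroom(store, budget_bytes, needed_bytes, batch_size=32):
    available = budget_bytes - store.total_disk_bytes()
    if available >= needed_bytes:
        # Fast path: one disk-bytes query, no candidate lookup.
        return EvictionResult(evicted=[], freed_bytes=0)

    shortfall = needed_bytes - available
    evicted, freed = [], 0
    while freed < shortfall:
        batch = store.lru_candidates(max_count=batch_size)
        if not batch:
            # Raised only after every candidate has already been evicted.
            raise CacheBudgetExhaustedError(
                f"needed_bytes={needed_bytes} available_bytes={available} "
                f"evicted_count={len(evicted)}"
            )
        for tile in batch:
            # A False return (already gone) still counts toward freed bytes.
            store.delete_tile(tile.tile_id)
            evicted.append(tile)
            freed += tile.disk_bytes
            if freed >= shortfall:
                break
    return EvictionResult(evicted=evicted, freed_bytes=freed)
```

With ten 5 MB candidates at the cap and a 30 MB request, this sketch stops after the six oldest tiles, which is the shape AC-3 describes.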
+
+## Acceptance Criteria
+
+**AC-1: No-eviction fast path**
+Given `budget_bytes = 10 GB`, `current_disk_bytes = 1 GB`
+When `reserve_headroom(needed_bytes=10 MB)` is called
+Then the result is `EvictionResult(evicted=[], freed_bytes=0)`; no `lru_candidates` call is made (verifiable via mock counter); no INFO log; no FDR record
+
+**AC-2: Single-tile eviction frees enough**
+Given `budget_bytes = 10 GB`, `current_disk_bytes = 9.99 GB` (10 MB headroom), and an LRU candidate with `disk_bytes = 50 MB`
+When `reserve_headroom(needed_bytes = 30 MB)` is called
+Then the candidate is deleted; result: `evicted=[that tile]`, `freed_bytes=50 MB`; one INFO log per evicted tile; one FDR `kind="c6.eviction_batch"` record
+
+**AC-3: Multi-tile eviction iterates LRU candidates**
+Given `current_disk_bytes` at the cap and 10 LRU candidates each `disk_bytes = 5 MB`
+When `reserve_headroom(needed_bytes = 30 MB)` is called
+Then exactly the 6 oldest candidates are evicted (6 × 5 MB = 30 MB matches the need); the 7th (and onwards) are left alone
+
+**AC-4: Eviction batches respect `eviction_batch_size`**
+Given 100 LRU candidates and an `eviction_batch_size=32`
+When the eviction needs to free 50 candidates' worth
+Then `lru_candidates` is called with `max_count=32` first, then again with `max_count=32` for the remaining; total 2 SELECTs (verifiable via psycopg query log)
+
+**AC-5: Insufficient candidates raise CacheBudgetExhaustedError**
+Given a budget so small that even evicting every existing tile won't free `needed_bytes`
+When `reserve_headroom(needed_bytes = 50 GB)` is called
+Then `CacheBudgetExhaustedError` is raised AFTER all candidates have been evicted (the eviction loop runs to completion before raising; this is so the operator's recovery path has the maximum head-room possible); the error message names `needed_bytes`, `available_bytes`, `evicted_count`
+
+**AC-6: BudgetEnforcedTileStore decorator integrates with write_tile**
+Given a `BudgetEnforcedTileStore` wrapping a
`PostgresFilesystemStore`, with `current_disk_bytes` near the cap
+When `write_tile(tile_blob, metadata)` is called
+Then the enforcer's `reserve_headroom(len(tile_blob))` runs first; if eviction was triggered, the evicted tiles are gone before the new write proceeds; the new tile lands successfully
+
+**AC-7: BudgetEnforcedTileStore propagates TileCacheError unchanged**
+Given the wrapped store raises a `ContentHashMismatchError`
+When `write_tile` is called via the decorator
+Then the same `ContentHashMismatchError` propagates unchanged; the decorator does NOT swallow or rewrap
+
+**AC-8: read_tile_pixels updates the LRU clock**
+Given a `PostgresFilesystemStore` constructed WITH the `lru_clock` injection
+When `read_tile_pixels(tile_id)` is called
+Then `record_lru_access(tile_id, now())` is invoked exactly once with `now() = clock.utcnow()`; the row's `accessed_at` is updated (verifiable via subsequent `lru_candidates` ordering)
+
+**AC-9: Single eviction is O(1) extra disk-bytes query**
+Given a no-eviction-needed call and an eviction-needed call
+When the test counts SELECT queries
+Then the no-eviction path executes 1 SELECT (`total_disk_bytes()`); the eviction path executes 1 SELECT for `total_disk_bytes` + N SELECTs for `lru_candidates` (each batch) + N UPDATEs for `delete_tile` row deletes; no quadratic blowup
+
+**AC-10: 10 GB budget enforcement under synthetic load**
+Given `budget_bytes = 10 GB` and `current_disk_bytes = 10 GB - 50 MB`, then 100 MB of new tiles inserted in 5 MB chunks
+When the test runs
+Then total disk usage stays ≤ 10 GB at all times (verifiable via `total_disk_bytes` between every write); the eviction count matches the insert overrun (at least 100 MB - 50 MB = 50 MB must be freed; the exact figure depends on the prior `current_disk_bytes`, and the test's exact pre-state is documented in C6-IT-06); every eviction is logged at INFO
+
+**AC-11: FDR eviction-batch payload bounded**
+Given a single `reserve_headroom` call that triggers 100 evictions
+When the FDR record is captured
+Then the record contains
`evicted_count=100`, `evicted_tile_ids` of length AT MOST 5 (first 5 in eviction order); the record size stays bounded regardless of how many evictions occurred + +**AC-12: Construction-time disk-bytes report** +Given a `CacheBudgetEnforcer` constructed against a non-empty store +When the construction completes +Then an INFO log `kind="c6.budget.loaded"` is emitted with `budget_bytes`, `current_disk_bytes`, `headroom_bytes`; if `current_disk_bytes > budget_bytes` (over-budget at startup), an additional WARN log is emitted naming the overage + +## Non-Functional Requirements + +**Performance** +- No-eviction path p99 ≤ 5 ms (one `total_disk_bytes` query). +- Eviction path: per-evicted-tile cost is dominated by the `delete_tile` UPDATE + filesystem unlink (~5–10 ms each on Tier-2 SSD); a typical 5 Hz F4 burst evicting 1–2 tiles per write keeps the write-side latency under 30 ms. +- The eviction loop does NOT block the F3 hot path — eviction runs synchronously inside `write_tile`, which is on the F4 producer thread (not C2 / C2.5 / C3 reads). + +**Reliability** +- Eviction is idempotent — `delete_tile` returning `False` is a no-op (the candidate was already evicted by a concurrent path); the enforcer logs and continues. +- Construction-time over-budget detection (AC-12 WARN log) catches the case where the prior flight ended over-budget; the enforcer does NOT proactively evict on construction (operator may want to inspect the over-budget state first). +- The enforcer is the SOLE eviction path during a flight — no other component evicts tiles. Code-review's Architecture phase treats unauthorised `delete_tile` callers as findings. + +**Compatibility** +- Reuses the AZ-303 Protocols + AZ-273 FdrClient + AZ-266 logger. +- No new third-party dependencies. 
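
AC-6, AC-7, and AC-11 describe two small mechanisms that can be illustrated together: a decorator that reserves head-room before delegating, and an FDR payload capped at five tile ids. The sketch below is illustrative only. `eviction_batch_payload` is a hypothetical helper name, the stand-in `ContentHashMismatchError` is not the real C6 class, and forwarding of the remaining `TileStore` methods is omitted.

```python
class ContentHashMismatchError(Exception):
    """Stand-in for the C6 error named in AC-7."""


class BudgetEnforcedTileStore:
    """Wraps a tile store and reserves head-room before every write."""

    def __init__(self, inner, enforcer):
        self._inner = inner
        self._enforcer = enforcer

    def write_tile(self, tile_blob, metadata):
        # Reserve first, so any eviction has finished before the write starts.
        self._enforcer.reserve_headroom(len(tile_blob))
        # No try/except here: errors from the wrapped store propagate unchanged.
        return self._inner.write_tile(tile_blob, metadata)


def eviction_batch_payload(trigger_tile_id, evicted_tile_ids, freed_bytes):
    """Build the bounded `c6.eviction_batch` payload (hypothetical helper)."""
    return {
        "trigger_tile_id": trigger_tile_id,
        "freed_bytes": freed_bytes,
        "evicted_count": len(evicted_tile_ids),
        # Only the first five ids, in eviction order; the full list lives in logs.
        "evicted_tile_ids": list(evicted_tile_ids[:5]),
    }
```

Keeping the decorator free of exception handling is what makes AC-7 hold by construction: there is no code path that could swallow or rewrap an error from the wrapped store.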
+
+## Unit Tests
+
+| AC Ref | What to Test | Required Outcome |
+|--------|-------------|-----------------|
+| AC-1 | reserve_headroom with ample budget | EvictionResult empty; no lru_candidates call; no logs |
+| AC-2 | reserve_headroom with tight budget; one 50 MB candidate | One eviction; freed_bytes=50 MB; one INFO; one FDR |
+| AC-3 | 10 candidates × 5 MB; need 30 MB | Exactly 6 evicted; 7th-10th untouched |
+| AC-4 | 100 candidates available; 50 evictions needed; batch_size=32 | Exactly 2 lru_candidates calls, each with max_count=32 |
+| AC-5 | Need 50 GB; budget far smaller | All candidates evicted; CacheBudgetExhaustedError raised AFTER |
+| AC-6 | BudgetEnforcedTileStore + write_tile near cap | Pre-write eviction; new tile lands |
+| AC-7 | Wrapped store raises ContentHashMismatchError | Same exception propagates from decorator |
+| AC-8 | read_tile_pixels with lru_clock injected | record_lru_access called exactly once with clock.utcnow() |
+| AC-9 | Query-count tally for no-evict and evict paths | 1 select for no-evict; 1+N+N for evict |
+| AC-10 | C6-IT-06 synthetic 10 GB fill + 100 MB overrun | Disk usage ≤ 10 GB throughout; eviction count matches overrun; logs emitted |
+| AC-11 | reserve_headroom triggering 100 evictions | FDR record has evicted_count=100; evicted_tile_ids length ≤ 5 |
+| AC-12 | Construct enforcer over a non-empty store | INFO log on construction; WARN if over-budget |
+| NFR-perf-no-evict | Microbench reserve_headroom × 10000 (no-evict path) | p99 ≤ 5 ms |
+| NFR-reliability-delete-already-gone | reserve_headroom with a candidate that's racing-deleted by a concurrent caller | delete_tile returns False; enforcer logs at INFO and continues |
+
+## Constraints
+
+- LRU is the only eviction policy this cycle — voting-tier-aware is a future task.
+- `eviction_batch_size` is config-driven; default 32 is a reasonable balance between query overhead and memory residence of the candidate list.
+- `CacheBudgetExhaustedError` is raised AFTER the eviction loop completes — partial eviction is preferable to no eviction even when the budget cannot be met (frees up as much head-room as possible for whatever the operator decides to do next). +- The decorator pattern (`BudgetEnforcedTileStore`) is mandatory — modifying `PostgresFilesystemStore.write_tile` to do the budget check directly would couple the policy to the impl, breaking the single-responsibility design. +- The `record_lru_access` injection into `read_tile_pixels` is OPT-IN (constructor arg `lru_clock: Clock | None = None`) so AZ-305's tests can run the store WITHOUT the LRU update; production wiring always passes the clock. +- The FDR `evicted_tile_ids` cap (first 5) keeps the record bounded; the full list is in the INFO logs which can be replayed post-flight. +- This task does NOT introduce new third-party dependencies. + +## Risks & Mitigation + +**Risk 1: LRU thrashing under sustained F4 burst** +- *Risk*: Under a 5 Hz mid-flight ingest sustained near the cap, every write evicts an old tile; the cache becomes a sliding window and the operational area shrinks. +- *Mitigation*: This is the intended behaviour — the cap is hard, and freshness wins. Operator can bump the cap via `config.tile_cache.cache_budget_bytes`. The FDR's eviction-batch records show post-flight whether thrashing occurred. + +**Risk 2: Eviction races a concurrent read** +- *Risk*: A reader (C2/C2.5/C3) holds a `TilePixelHandle` for tile T; the enforcer evicts T; the reader's mmap goes stale. +- *Mitigation*: Per `tile_store.md` Constraints, consumers MUST NOT cache `TilePixelHandle` across calls — use within a `with` block and release. The OS keeps the fd alive until the consumer's `__exit__`, so the mmap is read-correct even after the file is unlinked. Documented and tested in AZ-305's read-handle lifecycle test. 
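
The Risk 2 mitigation relies on POSIX semantics: an unlinked file stays readable through descriptors and mappings that were opened before the unlink. The following is a small standalone demonstration of that behaviour, not a piece of `TilePixelHandle`. The function name is hypothetical, it assumes a POSIX platform (the unlink would fail on Windows), and the payload must be non-empty.

```python
import mmap
import os
import tempfile


def read_after_unlink(payload: bytes) -> bytes:
    """Map a temp file, unlink it, then read it back through the mapping."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        with mmap.mmap(fd, len(payload), access=mmap.ACCESS_READ) as view:
            os.unlink(path)  # simulates the enforcer evicting the tile
            assert not os.path.exists(path)
            # The directory entry is gone, but the mapped pages remain valid.
            return bytes(view[:])
    finally:
        os.close(fd)
```

The storage is reclaimed by the OS only once the last descriptor and mapping are released, which is why the contract asks consumers to keep handles inside a `with` block.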
+ +**Risk 3: Construction-time over-budget triggers cascade eviction** +- *Risk*: The previous flight ended over-budget; on next start, the enforcer sees `current_disk_bytes > budget_bytes`; the first `reserve_headroom` evicts a lot to get back under cap. +- *Mitigation*: AC-12 WARN log surfaces the over-budget state at construction. The first F4 write triggers normal eviction; AC-5 covers the worst case (`CacheBudgetExhaustedError` if even all candidates can't fit the new tile). + +**Risk 4: `delete_tile` partially fails (filesystem unlink succeeds, row delete fails) leaving a dangling row** +- *Risk*: AZ-305's `delete_tile` is supposed to be a single transaction with the filesystem op; if the filesystem unlink succeeds but the row delete fails, the row claims `disk_bytes > 0` but the file is gone. `total_disk_bytes` is now wrong. +- *Mitigation*: This is the same partial-failure window AZ-305 § Risk 1 covers via the construction-time reconciliation scan. The enforcer doesn't add new risk here; the reconciliation is the fix-up point. + +**Risk 5: Operator bumps `cache_budget_bytes` mid-flight** +- *Risk*: Operator edits config to raise the cap mid-flight; the enforcer's `budget_bytes` is fixed at construction; the change is silent. +- *Mitigation*: Documented constraint — config is per-flight; mid-flight changes require a process restart. Future task could add a SIGHUP-driven reload — out of scope this cycle. + +## Runtime Completeness + +- **Named capability**: 10 GB hard-cap eviction with LRU sweep enforcing RESTRICT-SAT-2 (description.md / E-C6 / RESTRICT-SAT-2 / C6-IT-06). +- **Production code that must exist**: real `CacheBudgetEnforcer` class with real `total_disk_bytes` query, real `lru_candidates` iteration, real `delete_tile` calls, real INFO logs per eviction, real FDR records per batch, real `BudgetEnforcedTileStore` decorator wrapping the production store, real `record_lru_access` wiring on every read. 
+- **Allowed external stubs**: tests MAY use a fake `TileMetadataStore` (in-memory implementation of the AZ-303 Protocol with simulated `lru_candidates` ordering) and fake `FdrClient` / `Logger`; production wiring uses real `PostgresFilesystemStore` + real AZ-273 `FdrClient` + real AZ-266 `Logger`. +- **Unacceptable substitutes**: a "soft cap" that logs but doesn't actually evict (would defeat RESTRICT-SAT-2 — the cap is hard); a background-sweep timer that evicts asynchronously (would race with `write_tile` and lose the head-room guarantee at the call site); skipping the LRU update on read (would make eviction pick wrong candidates); rewrapping/swallowing `TileCacheError` in the decorator (would hide insert-side errors from the F4 path). + +## Contract + +This task implements behaviour mandated by `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md` § I-4 (LRU clock) + § I-5 (disk-budget invariant) + the methods `lru_candidates`, `record_lru_access`, `total_disk_bytes`. No new contract file — the enforcer is a policy implementation behind the existing Protocol surface. `CacheBudgetExhaustedError` is added to the documented `TileCacheError` family in `tile_store.md` § Error types via a minor contract version bump (1.0.0 → 1.1.0); the producer task (this one) updates the contract's Change Log when shipping. diff --git a/_docs/02_tasks/todo/AZ-316_c11_tile_downloader.md b/_docs/02_tasks/todo/AZ-316_c11_tile_downloader.md new file mode 100644 index 0000000..d5f96af --- /dev/null +++ b/_docs/02_tasks/todo/AZ-316_c11_tile_downloader.md @@ -0,0 +1,229 @@ +# C11 TileDownloader — GET + Resolution Gate + C6 Write + +**Task**: AZ-316_c11_tile_downloader +**Name**: C11 TileDownloader +**Description**: Implement the `TileDownloader` Protocol — C11's operator-side download path. 
`download_tiles_for_area` issues authenticated `httpx` GETs against `satellite-provider`, enforces RESTRICT-SAT-4 (≥ 0.5 m/px) at the C11 boundary, writes accepted tiles via the AZ-303 `TileStore` + `TileMetadataStore` Protocols (which run AZ-307's freshness gate at write), pre-checks cache headroom against AZ-308's budget enforcer, and journals download progress for idempotent re-runs. `enumerate_remote_coverage` is the read-only enumeration helper used by C12 for pre-flight area sizing. Honours `Retry-After` on 429s, fails fast on TLS / auth errors, retries with backoff on 5xx. Surfaces freshness, resolution, downgrade, and outcome counts in `DownloadBatchReport`. +**Complexity**: 5 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-303_c6_storage_interfaces, AZ-305_c6_postgres_filesystem_store, AZ-307_c6_freshness_gate, AZ-308_c6_cache_budget_eviction +**Component**: c11_tilemanager (epic AZ-251 / E-C11) +**Tracker**: AZ-316 +**Epic**: AZ-251 (E-C11) + +### Document Dependencies + +- `_docs/02_document/contracts/c11_tilemanager/tile_downloader.md` — produced by this task (frozen Protocol + DTO shape, invariants, test cases). +- `_docs/02_document/contracts/c6_tile_cache/tile_store.md` — consumed: `write_tile`, `tile_exists` are the write-side endpoints used by this task. +- `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md` — consumed: `insert_metadata`, `query_by_bbox` for idempotence. +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — INFO/WARN/ERROR log shapes for download events. +- `_docs/02_document/components/12_c11_tilemanager/description.md` — § 3.1 download API, § 5 error handling, § 7 caveats (R02 enforcement). + +## Problem + +Without a real `TileDownloader`: + +- F1 pre-flight cache build cannot start (C10 has no source; C12 has no operation to invoke). 
+- AC-8.1 (imagery via Suite Sat Service offline cache) and AC-8.3 (pre-loaded onto companion before flight) collapse — both require C11 populating C6 first. +- RESTRICT-SAT-4 (≥ 0.5 m/px) has no enforcement point — sub-spec tiles would silently land in C6 and downstream pose estimation would degrade. +- AC-NEW-6 (system rejects/downgrades stale tiles) is unenforced for the download path; C6's gate (AZ-307) catches it but the operator has no visibility into how many were rejected per batch. +- A re-run of an already-completed download would re-issue every GET against `satellite-provider`, wasting bandwidth and fighting the operator workflow which expects idempotence (D-C10-1 manifest hash check downstream relies on stable downloads). +- Network failures (TLS, auth, 5xx, 429) without a structured handler would either silently drop tiles or crash the operator-tooling CLI mid-batch. + +This task delivers the production downloader. It writes no C6 storage logic (AZ-303/305 own that), no freshness rule (AZ-307 owns), and no eviction (AZ-308 owns) — it composes them. + +## Outcome + +- A `TileDownloader` Protocol + concrete `HttpTileDownloader` class at `src/gps_denied_onboard/components/c11_tilemanager/`: + - `interface.py` exposes `TileDownloader` Protocol (`runtime_checkable`). + - `tile_downloader.py` houses `HttpTileDownloader`. + - `_types.py` houses `DownloadRequest`, `DownloadBatchReport`, `TileSummary`, `DownloadOutcome` (all `@dataclass(frozen=True)` for the data DTOs; `DownloadOutcome` is a `StrEnum`). + - `errors.py` houses `SatelliteProviderError`, `RateLimitedError`, `ResolutionRejectionError`, `CacheBudgetExceededError`, `TileManagerError` (family parent). +- Constructor signature: + `__init__(self, *, http_client: httpx.Client, tile_store: TileStore, tile_metadata_store: TileMetadataStore, budget_enforcer: CacheBudgetEnforcer, logger: Logger, clock: Clock)`. Injected dependencies — no module-level singletons. 
+- `download_tiles_for_area(request)` flow: + 1. Validates the `DownloadRequest` (bbox ordering, zoom levels in `[0, 21]`, `cache_root` is writable). + 2. Looks up `cache_root/.c11/journal/__.json`. If present and complete → `outcome = idempotent_no_op`; return early with zero counts. + 3. Calls `enumerate_remote_coverage(bbox, zoom_levels)` to size the batch. + 4. Calls `budget_enforcer.reserve_headroom(estimated_bytes)`. If insufficient → `CacheBudgetExceededError`; `outcome = failure`. + 5. For each tile in the batch, fetches the `httpx` GET response (per-tile streaming to bound memory): + - Auth: TLS + `Authorization: Bearer `. + - On 429 → honour `Retry-After` (sleep then retry once; second 429 → `RateLimitedError`). + - On 5xx → exponential backoff with cap (1s, 2s, 4s; 4 retries max); persistent 5xx → `SatelliteProviderError`. + - On TLS / 401 / 403 → fail fast → `SatelliteProviderError`. + 6. Resolution gate: read `resolution_m_per_px` from response headers / metadata; if `< 0.5` → increment `tiles_rejected_resolution`; do NOT write to C6. + 7. Calls `tile_store.write_tile(...)` and `tile_metadata_store.insert_metadata(...)`. C6's freshness gate (AZ-307) may raise `FreshnessRejectionError` here — catch, increment `tiles_rejected_freshness`, continue. + 8. On `FreshnessLabel.DOWNGRADED` (returned by C6 when stable_rear-stale) → increment `tiles_downgraded`. + 9. After every successful tile write, append the tile id to the journal (`fsync` per atomicwrites pattern from description.md). + 10. On batch completion: write the journal's terminal record, return `DownloadBatchReport(outcome=success)`. +- `enumerate_remote_coverage(bbox, zoom_levels)` issues a `GET /api/satellite/tiles?bbox=...&zoom=...&list-only=true` (matches the existing parent-suite GET surface; the response carries the per-tile `produced_at` + `resolution_m_per_px` + `estimated_bytes`); returns `list[TileSummary]`. 
+- INFO log: `kind="c11.download.session.start"` / `kind="c11.download.session.end"` with batch counts.
+- WARN log: per retry, per freshness rejection batch, per resolution rejection.
+- ERROR log: persistent `SatelliteProviderError`, `CacheBudgetExceededError`, TLS/auth failures.
+- The composition root constructs `HttpTileDownloader` via `build_tile_downloader(config) -> TileDownloader` at `src/gps_denied_onboard/runtime_root/c11_factory.py`.
+- Configuration extension to AZ-269 loader: `config.c11.satellite_provider_url`, `config.c11.service_api_key`, `config.c11.cache_root`, `config.c11.http_timeout_s`, `config.c11.max_5xx_retries`.
+- Type-only conformance test verifies `isinstance(HttpTileDownloader(...), TileDownloader)` via `runtime_checkable`.
+- The contract file at `_docs/02_document/contracts/c11_tilemanager/tile_downloader.md` is frozen as part of this task; consumers (C12) read it, not this spec.
+
+## Scope
+
+### Included
+
+- `TileDownloader` Protocol declaration + `HttpTileDownloader` concrete class.
+- All three DTOs (`DownloadRequest`, `DownloadBatchReport`, `TileSummary`) and the `DownloadOutcome` + `FreshnessLabel`-consumption enums.
+- The download-progress journal under `cache_root/.c11/journal/` (atomicwrites + fsync; one file per `(flight_id, request_hash)` pair).
+- The resolution gate at C11 boundary (RESTRICT-SAT-4).
+- The cache-headroom pre-check via AZ-308's `CacheBudgetEnforcer.reserve_headroom`.
+- HTTP retry / backoff / `Retry-After` handling.
+- Composition-root factory `build_tile_downloader`.
+- Config schema extension for the C11 download fields.
+- Conformance test at `tests/unit/c11_tilemanager/test_protocol_conformance.py`.
+
+### Excluded
+
+- The `TileUploader` Protocol and concrete impl — separate task in this epic.
+- The per-flight signing key — separate task; downloads use a static service-internal API key, NOT the per-flight key.
+- The flight-state gate — download has no `flight_state` precondition (it runs operator-side, never airborne). The gate task applies to upload only. +- Idempotent-retry-on-partial-success — that's the upload concern; download idempotence is handled here via the journal. +- The `mock-suite-sat-service` fixture — the e2e-test fixture in `tests/fixtures/` is for the upload path. Download tests run against the REAL `satellite-provider` GET surface (or its existing test Docker fixture). +- C12 CLI plumbing — owned by E-C12. +- Sector boundary CRUD (C12) — the `DownloadRequest.sector_class` is supplied per request; C11 does NOT look up sector boundaries. +- The R02 ADR-004 build-time exclusion of `c11_tilemanager` from the airborne image — owned by E-BOOT (build-system task). + +## Acceptance Criteria + +**AC-1: Downloader writes accepted tiles to C6** +Given a `DownloadRequest` for a Derkachi-bbox with 100 tiles all fresh and ≥ 0.5 m/px +When `download_tiles_for_area` is called +Then `DownloadBatchReport.tiles_downloaded == 100`; every tile is present in `TileStore` (verifiable via `tile_exists`); every tile has a `TileMetadata` row; `outcome = success` + +**AC-2: Resolution-gate rejects sub-spec tiles at C11 boundary** +Given a batch of 50 tiles where 10 have `resolution_m_per_px = 0.3` +When `download_tiles_for_area` is called +Then `tiles_rejected_resolution == 10`; the 10 are NOT in `TileStore`; no `tile_store.write_tile` was attempted for them (verifiable via spy); ONE `c11.download.resolution_rejected` WARN log per rejected tile + +**AC-3: Freshness rejections from C6 are counted, not propagated** +Given C6's AZ-307 raises `FreshnessRejectionError` for 5 tiles in an active_conflict batch +When `download_tiles_for_area` is called +Then `tiles_rejected_freshness == 5`; the run continues for the remaining tiles; no exception escapes to the caller; ONE summary WARN log at session end + +**AC-4: Stable_rear stale tiles are surfaced as downgraded** +Given C6 returns 
`FreshnessLabel.DOWNGRADED` for 3 tiles (stable_rear stale) +When `download_tiles_for_area` is called +Then `tiles_downgraded == 3`; those tiles ARE present in `TileStore` with the `DOWNGRADED` label; no exception is raised + +**AC-5: 429 honours Retry-After** +Given `satellite-provider` returns 429 with `Retry-After: 30` +When the downloader hits the 429 +Then the downloader sleeps ≥ 30s before the retry (verifiable via injected `Clock.sleep` capture); on success the run proceeds normally; no `RateLimitedError` is raised + +**AC-6: Persistent 5xx aborts with structured error** +Given `satellite-provider` returns 503 for 5 consecutive attempts (exceeds the retry budget) +When the downloader exhausts retries +Then `SatelliteProviderError` is raised with the response body, the URL, and the attempt count; `outcome = failure`; partial writes are journaled (the run is resumable on next call) + +**AC-7: TLS / 401 / 403 fail fast (no retry)** +Given the first GET returns 401 +When the downloader processes the response +Then `SatelliteProviderError` is raised on the FIRST attempt (zero retries); no plaintext fallback is attempted; the API key is NOT logged (security) + +**AC-8: Idempotent re-run after success** +Given a successful prior `download_tiles_for_area(R)` whose journal recorded all 100 tiles +When `download_tiles_for_area(R)` is called again +Then `outcome = idempotent_no_op`; zero GETs observed; zero `tile_store.write_tile` calls; the report's counts are zero except `tiles_downloaded` which equals the journaled count (transparency for the caller) + +**AC-9: Cache-budget pre-check aborts before any write** +Given `budget_enforcer.reserve_headroom(estimated_bytes)` returns `EvictionResult.insufficient` +When `download_tiles_for_area` is called +Then `CacheBudgetExceededError` is raised with the budget delta; zero GETs are issued; zero writes attempted; ONE ERROR log + +**AC-10: Conformance — concrete impl satisfies Protocol** +Given a `HttpTileDownloader` instance 
+When `isinstance(impl, TileDownloader)` is checked under `runtime_checkable` +Then the result is `True`; a fake omitting `enumerate_remote_coverage` returns `False` + +**AC-11: Service API key never logged** +Given any log path (INFO, WARN, ERROR) +When the downloader logs the request URL or headers +Then the `service_api_key` value is redacted (`Bearer ***`); no test log capture observes the raw key + +**AC-12: Journal survives mid-batch crash** +Given the process is killed after 30 of 100 tiles have been written and journaled +When `download_tiles_for_area(R)` is called on restart +Then the journal is read; the 30 already-journaled tiles are skipped; only the remaining 70 are GET-fetched; final report shows `tiles_downloaded = 100` (30 prior + 70 new); `outcome = success` + +## Non-Functional Requirements + +**Performance** +- Download throughput ≥ 50 MB/s on a 1 Gbps link (C11-PT-01); the bottleneck is the network, not the writer. +- Per-tile resolution-gate check ≤ 50 µs (header parse + comparison). +- Journal append ≤ 5 ms per tile (atomicwrites + fsync; bounded by disk). + +**Compatibility** +- `httpx` per project pin (verify before adding); no per-task version bump unless absolutely required. +- `atomicwrites` for journal — already used elsewhere (per description.md § 5). + +**Reliability** +- The downloader MUST tolerate process kill at any point and recover via journal on restart (AC-12). +- The downloader MUST NOT corrupt C6 — the AZ-303 Protocol guarantees atomic writes; this task does not add new fs operations. +- The injected `Clock` MUST be the same singleton used by AZ-308's enforcer and AZ-307's gate. 
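The redaction rule in AC-11 could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `redact_headers`, the `SENSITIVE_HEADERS` set, and the idea that the key travels as a bearer token in an `Authorization` header are hypothetical and are not taken from the contract.

```python
from collections.abc import Mapping

# Hypothetical: header names assumed to carry the service API key.
SENSITIVE_HEADERS = frozenset({"authorization"})
REDACTED_VALUE = "Bearer ***"


def redact_headers(headers: Mapping[str, str]) -> dict[str, str]:
    """Return a copy of `headers` that is safe to hand to a logger.

    Header names are matched case-insensitively; every other header is
    passed through unchanged.
    """
    return {
        name: REDACTED_VALUE if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }
```

Redacting at a single choke point before any log call keeps the AC-11 byte-search test simple, since no log path ever sees the raw value.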
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | 100-tile happy path against fake `httpx.Client` | All in C6; `tiles_downloaded=100`; `outcome=success` | +| AC-2 | Mixed batch, 10 sub-spec tiles | `tiles_rejected_resolution=10`; spy on `write_tile` shows 40 calls | +| AC-3 | C6 raises `FreshnessRejectionError` 5x | `tiles_rejected_freshness=5`; run completes | +| AC-4 | C6 returns DOWNGRADED for 3 | `tiles_downgraded=3`; tiles present with DOWNGRADED label | +| AC-5 | 429 + `Retry-After: 30` | `Clock.sleep` captured ≥ 30s; success path resumes | +| AC-6 | 5x 503 | `SatelliteProviderError` with attempt count | +| AC-7 | 401 first attempt | `SatelliteProviderError` on first call; zero retries | +| AC-8 | Re-run identical request post-success | `outcome=idempotent_no_op`; zero GETs | +| AC-9 | `reserve_headroom` returns insufficient | `CacheBudgetExceededError`; zero GETs | +| AC-10 | `isinstance` check on impl + on partial fake | True / False | +| AC-11 | Log capture during full session | Header values redacted; no raw key in any record | +| AC-12 | Kill mid-batch + restart | 30 prior journaled + 70 new = 100 final | +| NFR-perf-throughput | 10 GB synthetic batch over loopback | ≥ 50 MB/s | + +## Constraints + +- The `service_api_key` is logged ONLY redacted (`Bearer ***`); the raw value MUST never appear in INFO/WARN/ERROR logs (security finding High in code-review otherwise). +- The downloader runs operator-side ONLY; the airborne build target excludes the entire `c11_tilemanager/` source tree (E-BOOT enforces; this task does NOT add the exclusion but its tests assert via R02 that the module's import surface is honest). +- The journal location is `cache_root/.c11/journal/__.json` — `request_hash` is `sha256(bbox|zoom_levels|sector_class|service_api_key_hash).hex()[:16]`. 
+- The hot path uses `httpx.Client` (sync), not `httpx.AsyncClient` — the operator workflow is offline minutes; async adds complexity without a measurable win. +- Per-tile streaming is mandatory; loading an entire batch into memory is rejected (a 10 GB Derkachi area would OOM on operator workstations). +- This task introduces no new runtime dependencies beyond `httpx` and `atomicwrites` (verify both are in the project's `requirements.txt`). +- The R02 ADR-004 exclusion is enforced at build time by E-BOOT; this task adds no runtime self-check (security tasks C11-ST-01/02/03 cover that in Step 9). + +## Risks & Mitigation + +**Risk 1: `Retry-After` header format ambiguity (HTTP-date vs. seconds)** +- *Risk*: Some servers send `Retry-After: Wed, 21 Oct 2026 07:28:00 GMT`; naïve integer parsing crashes. +- *Mitigation*: Parse both forms (`int` first, `email.utils.parsedate_to_datetime` fallback); cap the wait at `config.c11.max_retry_after_s` (default 300s) to prevent server-pinned hangs. + +**Risk 2: API key leakage via exception messages** +- *Risk*: An unhandled exception's repr includes the `httpx.Request` whose headers carry the raw `service_api_key`. +- *Mitigation*: Wrap every `httpx` call in try/except → re-raise `SatelliteProviderError` with sanitised message; never include `httpx.Request.headers` in the error payload. + +**Risk 3: Journal corruption on power-loss** +- *Risk*: Process killed mid-`fsync` leaves a torn JSON line; restart fails to parse. +- *Mitigation*: `atomicwrites` provides write-then-rename semantics; one journal record per file segment with a checksum; corrupt records are skipped on read with a WARN log and the affected tile is re-fetched. Documented behaviour. + +**Risk 4: `satellite-provider`'s GET surface returns tiles in a format C11 doesn't expect** +- *Risk*: The parent suite changes the response schema (new field, renamed metadata key); C11 silently misparses. 
+- *Mitigation*: Strict pydantic-free DTO parsing: extra fields are ignored, and a missing required field raises `SatelliteProviderError("response schema mismatch", field=...)`. Versioning for the GET API is owned by `satellite-provider`'s contract; this task pins to the current shape. + +**Risk 5: Concurrent C11 invocations corrupt the journal** +- *Risk*: Operator launches two `download_tiles_for_area` runs against the same `cache_root` simultaneously; both write to the same journal. +- *Mitigation*: Per `description.md` § 7, C12 owns a filesystem lockfile that gates concurrent invocations. This task asserts the lockfile exists at construction (`cache_root/.c11/lock`) and refuses to start otherwise. The lockfile creation is C12's job. + +## Runtime Completeness + +- **Named capability**: operator-side download from `satellite-provider` GET surface, RESTRICT-SAT-4 enforcement, AC-NEW-6 freshness counting, AC-8.1/8.3 cache provisioning input. +- **Production code that must exist**: real `httpx.Client`-backed `HttpTileDownloader`, real journal write/read with atomicwrites, real resolution gate, real composition-root factory wiring AZ-303/305/307/308 dependencies, real config schema extension. +- **Allowed external stubs**: tests MAY use a fake `httpx.Client` (transport-level) to script responses, fake `Clock` for deterministic retries, fake `TileStore`/`TileMetadataStore` (already provided by AZ-303's conformance fakes); production wiring uses real httpx + real Postgres-backed C6. +- **Unacceptable substitutes**: a hardcoded "every tile passes" resolution gate (defeats RESTRICT-SAT-4); skipping the journal and re-issuing every GET (defeats I-2 idempotence); silently catching `FreshnessRejectionError` without counting (loses operator visibility for AC-NEW-6); using `requests` instead of `httpx` (project pinned to httpx; switching mid-component is an `Architecture` finding). 
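The Risk 1 mitigation for this task (integer parse first, HTTP-date fallback, capped wait) might look something like the sketch below. The function name `parse_retry_after` and the explicit `now` parameter are assumptions made for illustration; `max_wait_s` stands in for `config.c11.max_retry_after_s`, and in the real downloader the current time would presumably come from the injected `Clock`.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value: str, *, now: datetime, max_wait_s: float = 300.0) -> float:
    """Turn a Retry-After header value into a bounded wait in seconds.

    `now` must be timezone-aware. Delta-seconds are tried first; anything
    that is not an integer is treated as an HTTP-date.
    """
    try:
        wait_s = float(int(value.strip()))
    except ValueError:
        target = parsedate_to_datetime(value)
        if target.tzinfo is None:
            # Assumption: a date without an offset is interpreted as UTC.
            target = target.replace(tzinfo=timezone.utc)
        wait_s = (target - now).total_seconds()
    # Never negative, and never longer than the configured cap.
    return min(max(wait_s, 0.0), max_wait_s)
```

Clamping at both ends means a date in the past yields an immediate retry, while a very large value cannot pin the downloader longer than the configured cap.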
+ +## Contract + +This task produces/implements the contract at `_docs/02_document/contracts/c11_tilemanager/tile_downloader.md`. +Consumers MUST read that file — not this task spec — to discover the interface. diff --git a/_docs/02_tasks/todo/AZ-317_c11_flight_state_gate.md b/_docs/02_tasks/todo/AZ-317_c11_flight_state_gate.md new file mode 100644 index 0000000..4a1987b --- /dev/null +++ b/_docs/02_tasks/todo/AZ-317_c11_flight_state_gate.md @@ -0,0 +1,171 @@ +# C11 Flight-State Gate — ON_GROUND Defence-in-Depth for Upload + +**Task**: AZ-317_c11_flight_state_gate +**Name**: C11 Flight-State Gate +**Description**: Implement the `flight_state == ON_GROUND` precondition check that `TileUploader.upload_pending_tiles` calls before any network egress. Defines a thin C11-internal `FlightStateSource` Protocol with one method `current_flight_state() -> FlightStateSignal`; the concrete impl is supplied by E-C8 later (subscribes to the FC adapter's flight-state stream). The gate raises `FlightStateNotOnGroundError` if the current state is anything other than `ON_GROUND` (`IN_FLIGHT`, `UNKNOWN`, `TAKING_OFF`, `LANDING` all block). Logs an ERROR with the observed state and refuses to proceed; this is defence-in-depth atop ADR-004's process-level isolation, NOT the primary control. +**Complexity**: 2 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module +**Component**: c11_tilemanager (epic AZ-251 / E-C11) +**Tracker**: AZ-317 +**Epic**: AZ-251 (E-C11) + +### Document Dependencies + +- `_docs/02_document/components/12_c11_tilemanager/description.md` — § 2 `confirm_flight_state` method, § 5 `FlightStateNotOnGroundError`, § 7 ADR-004 process isolation as the primary control. +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — ERROR log shape on refusal. 
+ +## Problem + +Without an ON_GROUND gate at the upload entry point: + +- AC-8.4 collapses partially: ADR-004 process isolation alone protects the airborne process from importing C11, but if the operator workstation accidentally triggers `upload_pending_tiles` while the FC reports `IN_FLIGHT` (e.g. operator started the upload during a pre-landing approach window), the upload would proceed — which is the exact scenario the safety case forbids. +- `RESTRICT-SAT-1` (no in-flight Service calls) loses one of its enforcement points; the operator workflow assumes uploads only happen when wheels are on the ground. +- `TileUploader` has no place to reach for "what is the FC saying right now?" without coupling tightly to E-C8's full FC adapter surface. +- The Risk-7 mitigation in description.md ("The FC believes it's airborne") becomes a documentation-only claim with no test surface. + +This task delivers the gate as a thin pre-call hook. It does NOT implement the FC subscription itself (that's E-C8's job); it consumes whatever C8 ships via the `FlightStateSource` Protocol declared here. + +## Outcome + +- A `FlightStateSource` Protocol at `src/gps_denied_onboard/components/c11_tilemanager/interface.py` (re-exported from `__init__.py`): + ```python + @runtime_checkable + class FlightStateSource(Protocol): + def current_flight_state(self) -> FlightStateSignal: ... + ``` +- `FlightStateSignal` enum at `src/gps_denied_onboard/components/c11_tilemanager/_types.py`: + ```python + class FlightStateSignal(StrEnum): + ON_GROUND = "on_ground" + TAKING_OFF = "taking_off" + IN_FLIGHT = "in_flight" + LANDING = "landing" + UNKNOWN = "unknown" + ``` +- A `FlightStateGate` class at `src/gps_denied_onboard/components/c11_tilemanager/flight_state_gate.py`: + - Constructor: `__init__(self, *, source: FlightStateSource, logger: Logger)`. + - One public method: `confirm_on_ground() -> FlightStateSignal`. 
Returns `FlightStateSignal.ON_GROUND` on pass; raises `FlightStateNotOnGroundError(observed: FlightStateSignal, observed_at: datetime)` on fail. + - Emits an ERROR log on every refusal with `kind="c11.upload.refused.flight_state"` carrying `{observed, observed_at_iso}`. + - Emits an INFO log on pass with `kind="c11.upload.flight_state_confirmed"`. +- `FlightStateNotOnGroundError` defined at `src/gps_denied_onboard/components/c11_tilemanager/errors.py`. Subclasses `TileManagerError` (the C11 error family parent declared in AZ-316). +- The gate is integrated by the TileUploader task (separate task; called once per `upload_pending_tiles` invocation BEFORE any C6 read or network setup). +- Composition root constructs `FlightStateGate` with a `FlightStateSource` impl supplied by the C8 adapter wiring (when E-C8 ships). For now, a fake-source pattern is documented in this task's tests; the production wiring is a one-line factory swap. +- A `Clock` injection is NOT needed here — the gate reads "now" via `datetime.utcnow()` at the call site for the error's `observed_at` timestamp, which is purely diagnostic and not a control surface. + +## Scope + +### Included + +- `FlightStateSource` Protocol (single method `current_flight_state`). +- `FlightStateSignal` enum (5 states). +- `FlightStateGate` class with `confirm_on_ground()` method. +- `FlightStateNotOnGroundError` definition. +- ERROR log on refusal; INFO log on pass. +- Composition-root entry for the gate (factory `build_flight_state_gate(source) -> FlightStateGate`). +- Conformance test for `FlightStateSource` Protocol against a fake. + +### Excluded + +- The actual FC subscription (subscribing to MAVLink heartbeat or equivalent) — owned by E-C8. +- The TileUploader integration of this gate — owned by the TileUploader task in this epic. +- ADR-004 build-time exclusion enforcement — owned by E-BOOT. +- Sector boundary or any geographic awareness — gate is state-only. 
+- Mid-upload re-checks (the gate fires once at start; in-progress uploads are NOT torn down if the FC transitions mid-upload, per the operator workflow which expects atomic batches). + +## Acceptance Criteria + +**AC-1: ON_GROUND passes** +Given a `FlightStateSource` returning `ON_GROUND` +When `confirm_on_ground()` is called +Then the call returns `FlightStateSignal.ON_GROUND`; no exception is raised; ONE INFO log `kind="c11.upload.flight_state_confirmed"` is emitted + +**AC-2: IN_FLIGHT raises** +Given a `FlightStateSource` returning `IN_FLIGHT` +When `confirm_on_ground()` is called +Then `FlightStateNotOnGroundError` is raised with `observed = IN_FLIGHT`; ONE ERROR log `kind="c11.upload.refused.flight_state"` is emitted; the exception message names the observed state + +**AC-3: UNKNOWN raises (fail-closed)** +Given a `FlightStateSource` returning `UNKNOWN` +When `confirm_on_ground()` is called +Then `FlightStateNotOnGroundError` is raised; the gate is fail-closed by design (UNKNOWN is treated as "not safe to upload") + +**AC-4: TAKING_OFF and LANDING raise** +Given a `FlightStateSource` returning `TAKING_OFF` or `LANDING` +When `confirm_on_ground()` is called +Then `FlightStateNotOnGroundError` is raised in both cases; transition states are NOT treated as ON_GROUND + +**AC-5: Source exception propagates with context** +Given a `FlightStateSource` whose `current_flight_state()` raises `RuntimeError("FC disconnected")` +When `confirm_on_ground()` is called +Then `FlightStateNotOnGroundError` is raised with `observed = UNKNOWN` (the gate maps source failure to UNKNOWN, not to the raw exception); the original `RuntimeError` is set as `__cause__` on the new exception; ONE ERROR log carries the original exception's message + +**AC-6: FlightStateSource Protocol is conformance-checkable** +Given a class implementing `current_flight_state` returning `FlightStateSignal` +When `isinstance(impl, FlightStateSource)` is evaluated under `runtime_checkable` +Then the result is 
`True`; for a class missing the method, the result is `False` + +**AC-7: Error carries diagnostic fields** +Given a refusal +When the test inspects the raised `FlightStateNotOnGroundError` +Then `exc.observed`, `exc.observed_at` (datetime, UTC, second-precision) are populated; the message starts with `"Upload refused: flight state is "` followed by the observed state name + +**AC-8: Gate does not retry** +Given the source returns `IN_FLIGHT` then `ON_GROUND` on a hypothetical second call +When `confirm_on_ground()` is called once +Then `current_flight_state()` is called EXACTLY once (verifiable via spy); the gate does NOT poll-and-retry + +## Non-Functional Requirements + +**Performance** +- `confirm_on_ground` p99 ≤ 1 ms when the source returns synchronously (the gate is a thin wrapper; its cost is dominated by the source's own implementation, which this task does not constrain). + +**Compatibility** +- `FlightStateSignal` is a stdlib `StrEnum` (Python 3.11+); no `pydantic` or `attrs`. +- No new third-party dependencies. + +**Reliability** +- The gate is fail-closed: UNKNOWN, transition states, and source-failures all block the upload. The cost of a false-block (skipped upload) is small; the cost of a false-pass (upload during flight) is unbounded per the safety case. +- The gate does NOT cache state across calls; each `confirm_on_ground()` invocation re-queries the source. 
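The acceptance criteria above could be satisfied by a gate along these lines. It is a minimal sketch, not the project's implementation: it uses the stdlib `logging.Logger` with `extra=` fields in place of the project's `Logger`, defines a local `TileManagerError` stand-in for the AZ-316 parent class, and takes the timestamp with `datetime.now(timezone.utc)` so that `observed_at` is an aware UTC value. The `StrEnum` fallback exists only so the sketch runs on interpreters older than 3.11.

```python
import logging
from datetime import datetime, timezone
from typing import Optional, Protocol, runtime_checkable

try:
    from enum import StrEnum  # Python 3.11+, as the spec requires
except ImportError:  # fallback only so the sketch runs on older interpreters
    from enum import Enum

    class StrEnum(str, Enum):
        pass


class FlightStateSignal(StrEnum):
    ON_GROUND = "on_ground"
    TAKING_OFF = "taking_off"
    IN_FLIGHT = "in_flight"
    LANDING = "landing"
    UNKNOWN = "unknown"


@runtime_checkable
class FlightStateSource(Protocol):
    def current_flight_state(self) -> FlightStateSignal: ...


class TileManagerError(Exception):
    """Stand-in for the C11 error family parent declared in AZ-316."""


class FlightStateNotOnGroundError(TileManagerError):
    def __init__(self, observed: FlightStateSignal, observed_at: datetime) -> None:
        super().__init__(f"Upload refused: flight state is {observed.name}")
        self.observed = observed
        self.observed_at = observed_at


class FlightStateGate:
    def __init__(self, *, source: FlightStateSource, logger: logging.Logger) -> None:
        self._source = source
        self._logger = logger

    def confirm_on_ground(self) -> FlightStateSignal:
        cause: Optional[Exception] = None
        try:
            observed = self._source.current_flight_state()  # queried exactly once
        except Exception as exc:
            # A failing source is mapped to UNKNOWN so the gate stays fail-closed.
            observed, cause = FlightStateSignal.UNKNOWN, exc

        if observed is FlightStateSignal.ON_GROUND:
            self._logger.info(
                "flight state confirmed",
                extra={"kind": "c11.upload.flight_state_confirmed"},
            )
            return observed

        observed_at = datetime.now(timezone.utc).replace(microsecond=0)
        self._logger.error(
            "upload refused: flight state is %s (%s)",
            observed.name,
            cause if cause is not None else "no source error",
            extra={
                "kind": "c11.upload.refused.flight_state",
                "observed": observed.value,
                "observed_at_iso": observed_at.isoformat(),
            },
        )
        raise FlightStateNotOnGroundError(observed, observed_at) from cause
```

Because the only passing branch is an identity check against `ON_GROUND`, any new enum member added later is blocked by default, which keeps the fail-closed property without a per-state list.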
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Fake source returns ON_GROUND | `confirm_on_ground` returns `ON_GROUND`; INFO log emitted | +| AC-2 | Fake source returns IN_FLIGHT | `FlightStateNotOnGroundError`; ERROR log; observed=IN_FLIGHT | +| AC-3 | Fake source returns UNKNOWN | `FlightStateNotOnGroundError` | +| AC-4 | Fake source returns TAKING_OFF; then LANDING | Both raise | +| AC-5 | Fake source raises `RuntimeError` | `FlightStateNotOnGroundError` with `observed=UNKNOWN`; `__cause__` set; ERROR log carries original message | +| AC-6 | `isinstance` check on conforming + non-conforming fakes | True / False | +| AC-7 | Inspect raised exception fields | `observed`, `observed_at` populated; message format correct | +| AC-8 | Spy on `current_flight_state` call count | Exactly 1 call per `confirm_on_ground` | +| NFR-perf | Microbench gate × 100k with synchronous fake | p99 ≤ 1 ms | +| NFR-reliability-fail-closed | Each non-ON_GROUND state | All raise; coverage matrix complete | + +## Constraints + +- The gate is fail-closed for UNKNOWN and any source exception. Documented; opening this default would require a Choose A/B/C/D coordination with the safety reviewer. +- Transition states (`TAKING_OFF`, `LANDING`) are NOT treated as ON_GROUND — operators must wait until the FC reports `ON_GROUND`. This is intentional and documented; the operator workflow's typical pause between landing and upload-trigger covers it. +- The gate calls `current_flight_state()` exactly once per `confirm_on_ground` — no polling, no retries. Documented behaviour; the upper-layer TileUploader handles retries at the upload-batch level if the operator wants to retry after a fail. +- This task introduces no new third-party dependencies. 
+ +## Risks & Mitigation + +**Risk 1: `FlightStateSource` Protocol surface diverges from C8's eventual impl** +- *Risk*: When E-C8 (AZ-261) ships its FC adapter, the natural public method might not be `current_flight_state` — could be `latest_heartbeat()` or `state_stream`. +- *Mitigation*: Document the Protocol as a thin C11-facing adapter; if C8's natural surface differs, an adapter class wraps it (`FlightStateSourceAdapter(c8_fc_adapter)` — owned by E-C8's wiring task). The Protocol's narrow surface (one method, one return type) makes adapting trivial. + +**Risk 2: UNKNOWN state during FC link recovery is too aggressive** +- *Risk*: A transient FC connection blip causes UNKNOWN; operator sees a refusal during a perfectly-on-ground state. +- *Mitigation*: Documented as fail-closed; the operator workflow tolerates re-triggering the upload after the FC recovers (the upload journal preserves pending tiles between attempts). E-C8's FC adapter is responsible for state debouncing if the false-UNKNOWN rate is operationally too high; not C11's concern. + +**Risk 3: Fail-closed during a real on-ground emergency upload** +- *Risk*: A safety officer urgently needs to trigger an upload but the FC is reporting UNKNOWN. +- *Mitigation*: Per architecture, no operational scenario exists where an upload MUST succeed during FC-disconnect. The pending-upload journal preserves data; the upload runs after the FC reconnects. No override flag is provided — adding one would weaken the safety case and require Choose A/B/C/D approval. + +## Runtime Completeness + +- **Named capability**: defence-in-depth ON_GROUND check at upload entry (description.md § 5; ADR-004; AC-8.4). +- **Production code that must exist**: real `FlightStateSource` Protocol declaration, real `FlightStateSignal` enum, real `FlightStateGate` class with logging, real `FlightStateNotOnGroundError` in the C11 error family. 
+- **Allowed external stubs**: tests MAY use a fake `FlightStateSource` impl (synchronous return value); production wiring uses the real C8 FC-adapter source via the composition root (when E-C8 ships). +- **Unacceptable substitutes**: a hardcoded "always ON_GROUND" source (defeats the entire point); polling the source N times to "average" the state (introduces TOCTOU windows where the FC transitions mid-poll); silently mapping `UNKNOWN` to `ON_GROUND` (defeats fail-closed); reading FC state from a static config file (the FC's actual telemetry IS the source of truth). diff --git a/_docs/02_tasks/todo/AZ-318_c11_signing_key.md b/_docs/02_tasks/todo/AZ-318_c11_signing_key.md new file mode 100644 index 0000000..0d25e03 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-318_c11_signing_key.md @@ -0,0 +1,205 @@ +# C11 Per-Flight Signing Key — Generation + Sign + Zeroise + +**Task**: AZ-318_c11_signing_key +**Name**: C11 Per-Flight Signing Key +**Description**: Implement the per-flight ephemeral signing key used by `TileUploader` to authenticate each uploaded tile against the parent suite's D-PROJ-2 ingest contract. `PerFlightKeyManager` generates one fresh Ed25519 keypair per flight at upload-session start, signs the multipart payload per tile, and zeroises the secret-key buffer in memory after the session completes (success OR failure). The public key is recorded in the FDR (`kind="c11.upload.session.key.public"`) so the safety officer can later correlate which key signed which tiles. On `SignatureRejectedError` from `satellite-provider`, the manager emits an FDR alert (`kind="c11.upload.signature_rejected"`) — security-critical event, never silently dropped. Uses the project-pinned `cryptography` library; no custom crypto. 
+**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-273_fdr_client_ringbuf +**Component**: c11_tilemanager (epic AZ-251 / E-C11) +**Tracker**: AZ-318 +**Epic**: AZ-251 (E-C11) + +### Document Dependencies + +- `_docs/02_document/components/12_c11_tilemanager/description.md` — § 3.2 D-PROJ-2 contract sketch (signature requirement), § 5 `SignatureRejectedError`, § 7 R09 key-compromise mitigation. +- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — `kind="c11.upload.session.key.public"` and `kind="c11.upload.signature_rejected"` envelopes. +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — INFO/ERROR log shapes for key lifecycle events. +- `_docs/_process_leftovers/2026-05-09_satellite-provider-design-tasks.md` — D-PROJ-2 design task #1 (parent-suite ingest contract), specifically the `signature` field requirement. + +## Problem + +Without a per-flight ephemeral signing key: + +- D-PROJ-2 contract sketch demands every uploaded tile carry a `signature` field; without it, `satellite-provider`'s ingest endpoint will reject every payload. +- The R09 risk (key compromise) is unmitigated — a single static API key would compromise every flight's uploads on first leak; per-flight keys bound the blast radius to one flight. +- The "ingest-side voting layer" (D-PROJ-2 design task #2) cannot trust uploaded tiles without a way to associate each tile with its source flight; the public key is the binding. +- AC-NEW-7 (cache-poisoning safety budget) loses one of its layers — the voting layer relies on per-flight keys to detect collusion (multiple compromised companions colluding becomes detectable when their key fingerprints differ from the safety officer's pre-flight enrolment record). +- Per `description.md` § 5: `SignatureRejectedError` is a security-critical event; without a structured handler, it would either crash the upload run or be silently caught. 
+- The C11-ST-03 security test (key zeroised after upload) has no implementation to verify against — without zeroisation, the secret-key bytes remain in heap memory long after the upload completes, increasing exfil window. + +This task delivers the key lifecycle. It does NOT plumb the key into the upload payload (TileUploader task does that); it provides `sign(payload)` as the boundary. + +## Outcome + +- A `PerFlightKeyManager` class at `src/gps_denied_onboard/components/c11_tilemanager/signing_key.py`: + - Constructor: `__init__(self, *, fdr_client: FdrClient, logger: Logger)`. No state at construction time. + - `start_session(flight_id: uuid.UUID) -> PublicKeyFingerprint`: + 1. Generates a fresh Ed25519 keypair via `cryptography.hazmat.primitives.asymmetric.ed25519.Ed25519PrivateKey.generate()`. + 2. Stores the private key in `self._private_key` (instance state, not module-level). + 3. Computes `public_key_pem = private_key.public_key().public_bytes(...)`. + 4. Computes `fingerprint = sha256(public_key_pem).hex()[:16]`. + 5. Emits FDR `kind="c11.upload.session.key.public"` with `{flight_id, public_key_pem, fingerprint, generated_at_iso}`. + 6. Emits INFO log `kind="c11.upload.session.key.generated"` with `{flight_id, fingerprint}` (NEVER the private key). + 7. Returns `PublicKeyFingerprint(flight_id, public_key_pem, fingerprint, generated_at)`. + - `sign(payload: bytes) -> bytes`: + 1. Raises `SessionNotActiveError` if `self._private_key is None`. + 2. Returns `self._private_key.sign(payload)` (Ed25519 signature is 64 bytes). + 3. No log emission per call (would flood at upload throughput). + - `end_session() -> None`: + 1. If `self._private_key is None`, no-op. + 2. Calls `self._zeroise_private_key()` (overwrites the secret-key bytes with zeros via `cryptography`'s key-deletion guidance, then sets `self._private_key = None`). + 3. Emits INFO log `kind="c11.upload.session.key.zeroised"`. + - `record_signature_rejection(flight_id, tile_id) -> None`: + 1. 
Emits FDR `kind="c11.upload.signature_rejected"` with `{flight_id, tile_id, fingerprint, observed_at_iso}`. + 2. Emits ERROR log with the same payload. +- `PublicKeyFingerprint` DTO at `src/gps_denied_onboard/components/c11_tilemanager/_types.py` — `@dataclass(frozen=True)` with the four fields above. +- `SessionNotActiveError` defined in `src/gps_denied_onboard/components/c11_tilemanager/errors.py` — subclasses `TileManagerError`. (`SignatureRejectedError` is also defined here, but raised by `TileUploader` after parsing the ingest response, NOT by this task.) +- The TileUploader task (separate) calls: + - `start_session(flight_id)` once per upload run. + - `sign(payload)` once per tile. + - `record_signature_rejection(...)` on each per-tile rejection from the ingest response. + - `end_session()` in a `finally` block guaranteeing zeroisation on success or failure. +- The composition root constructs `PerFlightKeyManager` and injects it into `TileUploader`. Factory: `build_per_flight_key_manager(fdr_client, logger) -> PerFlightKeyManager`. +- A `__del__` safety net calls `end_session()` if it was never explicitly called, with a WARN log noting the leak. This is a belt-and-braces guarantee, not the primary control. + +## Scope + +### Included + +- `PerFlightKeyManager` class (4 public methods + `__del__` safety net). +- `PublicKeyFingerprint` DTO. +- `SessionNotActiveError` definition. +- Ed25519 keypair generation using the project-pinned `cryptography` library. +- Best-effort zeroisation of the secret-key buffer (via `cryptography` library's recommended deletion path; documented as "best-effort" because Python heap zeroisation cannot be guaranteed without ctypes-level pinning). +- FDR emission on session start (public key) and on signature rejection. +- INFO log on session lifecycle events; ERROR log on signature rejection. +- Composition-root factory. 
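The session lifecycle described in the Outcome section could be sketched as below, assuming the pinned `cryptography` version exposes the Ed25519 API in its usual form. The sketch deliberately leaves out FDR emission, logging, `record_signature_rejection`, and the `__del__` safety net, and `SessionNotActiveError` is a local stand-in rather than a `TileManagerError` subclass. Two simplifications are assumptions of this sketch: `hashlib`'s `hexdigest()` stands in for the spec's `.hex()` shorthand, and `end_session` only drops the key reference, because how the secret buffer would be overwritten depends on details the spec labels best-effort.

```python
import hashlib
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


class SessionNotActiveError(Exception):
    """Stand-in; the spec derives this from TileManagerError."""


@dataclass(frozen=True)
class PublicKeyFingerprint:
    flight_id: uuid.UUID
    public_key_pem: bytes
    fingerprint: str
    generated_at: datetime


class PerFlightKeyManager:
    def __init__(self) -> None:
        # No key exists until a session starts.
        self._private_key: Optional[Ed25519PrivateKey] = None

    def start_session(self, flight_id: uuid.UUID) -> PublicKeyFingerprint:
        self._private_key = Ed25519PrivateKey.generate()
        public_key_pem = self._private_key.public_key().public_bytes(
            encoding=serialization.Encoding.PEM,
            format=serialization.PublicFormat.SubjectPublicKeyInfo,
        )
        fingerprint = hashlib.sha256(public_key_pem).hexdigest()[:16]
        return PublicKeyFingerprint(
            flight_id, public_key_pem, fingerprint, datetime.now(timezone.utc)
        )

    def sign(self, payload: bytes) -> bytes:
        if self._private_key is None:
            raise SessionNotActiveError("no active signing session")
        return self._private_key.sign(payload)  # 64-byte Ed25519 signature

    def end_session(self) -> None:
        # Idempotent. Only the reference is dropped here; overwriting the
        # underlying buffer is outside what this sketch attempts.
        self._private_key = None
```

A caller would typically wrap the upload loop in `try`/`finally` so that `end_session()` runs on both the success and failure paths.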
+ +### Excluded + +- The TileUploader integration (signing into multipart payloads) — owned by the TileUploader task. +- Pre-flight key enrolment workflow (the safety officer's record of expected per-flight public keys) — owned by C12 operator tooling. +- HSM / TPM-backed key storage — out of scope this cycle; the assumption is that the operator workstation's process is trusted enough for ephemeral in-memory keys, with zeroisation as the residual hygiene. +- Mid-session key rotation — one key per session; rotation requires `end_session` + `start_session`. +- Key persistence between processes — the key is in-memory ONLY; an upload session must complete in one process lifetime. +- The `SignatureRejectedError` class itself is defined here but raised by TileUploader. + +## Acceptance Criteria + +**AC-1: `start_session` generates a fresh keypair and emits FDR** +Given a fresh `PerFlightKeyManager` +When `start_session(flight_id)` is called +Then the manager holds a non-None `_private_key`; `PublicKeyFingerprint` is returned with a 16-char hex fingerprint; ONE FDR `kind="c11.upload.session.key.public"` is emitted with the public-key PEM; ONE INFO log without the private key + +**AC-2: Two consecutive sessions produce different keys** +Given `start_session(F1)` followed by `end_session()` followed by `start_session(F2)` +When fingerprints are compared +Then `fingerprint_F1 != fingerprint_F2` (cryptographically distinct keys); two FDR records are emitted, one per session + +**AC-3: `sign` returns 64-byte Ed25519 signature** +Given an active session +When `sign(b"hello world")` is called +Then a 64-byte signature is returned; the signature verifies against the session's public key (verifiable via `Ed25519PublicKey.verify`) + +**AC-4: `sign` before `start_session` raises** +Given a fresh `PerFlightKeyManager` +When `sign(b"...")` is called without prior `start_session` +Then `SessionNotActiveError` is raised; no signature is computed + +**AC-5: `sign` after `end_session` 
raises** +Given `start_session(F)` then `end_session()` +When `sign(b"...")` is called +Then `SessionNotActiveError` is raised + +**AC-6: `end_session` zeroises and emits log** +Given an active session +When `end_session()` is called +Then `self._private_key is None`; the underlying secret-key buffer is overwritten with zeros (verifiable via `ctypes.string_at` against the buffer address captured pre-zeroise); ONE INFO log `kind="c11.upload.session.key.zeroised"` + +**AC-7: `__del__` safety net zeroises if `end_session` was missed** +Given an active session whose owner is garbage-collected without calling `end_session` +When the GC runs `__del__` +Then `end_session()` runs implicitly; ONE WARN log `kind="c11.upload.session.key.zeroised_via_finalizer"`; the buffer is zeroised + +**AC-8: `record_signature_rejection` emits FDR + ERROR log** +Given an active session and a tile_id +When `record_signature_rejection(flight_id, tile_id)` is called +Then ONE FDR `kind="c11.upload.signature_rejected"` is emitted with `{flight_id, tile_id, fingerprint, observed_at_iso}`; ONE ERROR log with the same payload + +**AC-9: Private key never logged anywhere** +Given the full session lifecycle +When all log records and all FDR records are captured +Then the private-key PEM does NOT appear in ANY record (verifiable via byte search across the captured stream) + +**AC-10: `end_session` is idempotent** +Given an active session +When `end_session()` is called twice in a row +Then the second call is a no-op; no exception is raised; no second INFO log is emitted + +## Non-Functional Requirements + +**Performance** +- `sign` p99 ≤ 200 µs on the operator workstation (Ed25519 is fast; the bottleneck is the upload network, not signing). +- `start_session` ≤ 5 ms (Ed25519 keygen is sub-millisecond; FDR emission + log emission dominate). + +**Compatibility** +- `cryptography` library at the project-pinned version. Verify before adding; do NOT bump unilaterally. 
+- Ed25519 is available in `cryptography.hazmat.primitives.asymmetric.ed25519` since 2.6 — the project pin must be ≥ 2.6. + +**Reliability** +- The manager guarantees zeroisation on `end_session` AND on `__del__` — both paths converge through the same `_zeroise_private_key` helper. +- The Python heap layer cannot guarantee bit-perfect zeroisation (objects may be relocated by the GC); this is documented. The mitigation is: keep the key buffer's lifetime as short as possible (one upload session) and rely on the OS-level memory protections (no swap on the operator workstation per RESTRICT-OPS-1). + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | `start_session` then capture FDR + log | Public PEM in FDR; fingerprint 16 hex chars; private key not in log | +| AC-2 | Two sessions back-to-back | Different fingerprints | +| AC-3 | Sign + verify roundtrip | 64-byte signature; verifies against public key | +| AC-4 | `sign` without `start_session` | `SessionNotActiveError` | +| AC-5 | `sign` after `end_session` | `SessionNotActiveError` | +| AC-6 | `end_session` and inspect zeroised buffer | Buffer is all zeros; log emitted | +| AC-7 | Drop reference + force GC | `__del__` runs `end_session`; WARN log | +| AC-8 | `record_signature_rejection` | FDR + ERROR log with all fields | +| AC-9 | Capture all logs/FDR for a full session; byte-search private PEM | Not present | +| AC-10 | `end_session` twice | Second call is no-op; no second log | +| NFR-perf-sign | Microbench `sign` × 100k | p99 ≤ 200 µs | +| NFR-reliability-fingerprint-uniqueness | 1000 sessions with unique flight_ids | All 1000 fingerprints distinct (collision-resistant) | + +## Constraints + +- The signing algorithm is Ed25519; no per-task choice (the parent suite's D-PROJ-2 contract requires Ed25519 per the leftover file's design). 
+- The secret-key never leaves the manager — `sign(payload) -> bytes` is the only method that uses it; consumers do NOT touch the private key. +- The public key is logged AND FDR'd (it is public by definition); the private key is NOT logged anywhere — code-review treats any private-key reference outside `signing_key.py` as a `Security` finding (Critical). +- This task pins to the project's existing `cryptography` version. If the version doesn't support `Ed25519PrivateKey.generate()`, ASK the user before bumping (per `coderule.mdc` "verify the API actually exists in the pinned version"). +- `__del__` is a safety net, NOT the primary contract — consumers MUST call `end_session()` explicitly. Code-review treats reliance on `__del__` as a `Reliability` finding. + +## Risks & Mitigation + +**Risk 1: Python heap zeroisation is not bit-perfect** +- *Risk*: The `cryptography` library returns the private key as a Python object; freeing the object's memory does not guarantee zeroisation (the GC may relocate objects). +- *Mitigation*: Documented as "best-effort"; the operator workstation runs no-swap (RESTRICT-OPS-1); the key lifetime is bounded to one upload session (typically minutes); the residual exfil window is minimised. A future task could add ctypes-level pinning if the threat model tightens. + +**Risk 2: `__del__` doesn't run when the process is killed (`SIGKILL`)** +- *Risk*: A SIGKILL during an active session leaves the key buffer in heap memory until the OS reclaims the process pages. +- *Mitigation*: Documented; the OS-level mitigation is process termination → memory pages reclaimed; on Linux with no swap, the bytes never hit disk. No software mitigation is feasible inside the killed process. + +**Risk 3: FDR ringbuffer overrun loses the public-key record** +- *Risk*: Under FDR backpressure (AZ-274 overrun), the `kind="c11.upload.session.key.public"` record might be dropped — the safety officer cannot correlate the upload with a key fingerprint later. 
+- *Mitigation*: AZ-273's ringbuffer is sized per `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md`; this task adds NO new pressure but is documented as critical-priority. Mid-flight FDR loss is already an AC-NEW-1 concern; this task surfaces the dependency. + +**Risk 4: `cryptography` library API drift across pins** +- *Risk*: A minor `cryptography` bump renames `Ed25519PrivateKey.generate()` or changes its signature. +- *Mitigation*: The task verifies the API against the pinned version (per `coderule.mdc`); the pin is recorded in `requirements.txt`; a wrapper isolates the library to this single class. + +**Risk 5: Replay attack — captured signed payloads re-uploaded by an attacker** +- *Risk*: An MITM captures a valid `(payload, signature)` pair and re-uploads to `satellite-provider`'s ingest endpoint. +- *Mitigation*: Out of scope for this task — the parent suite's ingest endpoint owns nonce / timestamp validation per the D-PROJ-2 design. C11 includes `capture_timestamp` in the signed payload (per the leftover file's contract sketch); the parent suite rejects timestamps outside its acceptance window. This task does NOT add a separate nonce. + +## Runtime Completeness + +- **Named capability**: per-flight ephemeral signing key per D-PROJ-2 contract, R09 mitigation, AC-NEW-7 voting-layer enabler (description.md § 7, leftover file design task #1). +- **Production code that must exist**: real `PerFlightKeyManager` with real Ed25519 keypair generation via `cryptography`, real `sign`, real best-effort zeroisation, real FDR emission for public-key + signature-rejection events, real `__del__` safety net. +- **Allowed external stubs**: tests MAY use a fake `FdrClient` (already provided by AZ-275 fake_fdr_sink) and a fake `Logger`; production wiring uses the real AZ-273 ringbuffer + AZ-266 logger. 
+- **Unacceptable substitutes**: a hardcoded shared key reused across flights (defeats R09 mitigation); a pseudo-random "key" generated from `random.getrandbits` instead of `cryptography`'s CSPRNG (rolling our own crypto is rejected per `coderule.mdc`); skipping `end_session` zeroisation (loses C11-ST-03 test surface); logging the private key for "debugging" (Critical Security finding). diff --git a/_docs/02_tasks/todo/AZ-319_c11_tile_uploader.md b/_docs/02_tasks/todo/AZ-319_c11_tile_uploader.md new file mode 100644 index 0000000..91a4bc6 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-319_c11_tile_uploader.md @@ -0,0 +1,250 @@ +# C11 TileUploader — Read Pending + Sign + POST + Mark Uploaded + +**Task**: AZ-319_c11_tile_uploader +**Name**: C11 TileUploader +**Description**: Implement the `TileUploader` Protocol — C11's operator-side post-landing upload path. `upload_pending_tiles` calls AZ-317's `FlightStateGate.confirm_on_ground()` first, starts an AZ-318 signing session, reads pending mid-flight tiles from C6 (`source = onboard_ingest`, `voting_status = pending`) via the AZ-303 metadata store, packages each tile per the D-PROJ-2 multipart contract sketch (tile_blob, geo metadata, capture_timestamp, flight_id, companion_id, quality_metadata, signature), signs each payload, POSTs to `/api/satellite/tiles/ingest`, parses the per-tile response, and marks acknowledged tiles uploaded in C6. Honours `Retry-After` on 429s; fails fast on TLS / auth; surfaces `signature_rejected` per tile via FDR. The signing key is zeroised in a try/finally guarantee. Idempotent-retry across partial-success batches is a separate decorator task in this epic. 
+**Complexity**: 5 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-273_fdr_client_ringbuf, AZ-303_c6_storage_interfaces, AZ-305_c6_postgres_filesystem_store, AZ-317_c11_flight_state_gate, AZ-318_c11_signing_key +**Component**: c11_tilemanager (epic AZ-251 / E-C11) +**Tracker**: AZ-319 +**Epic**: AZ-251 (E-C11) + +### Document Dependencies + +- `_docs/02_document/contracts/c11_tilemanager/tile_uploader.md` — produced by this task (frozen Protocol + DTO shape, invariants, test cases). +- `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md` — consumed: `pending_uploads`, `mark_uploaded`, `get_by_id`. +- `_docs/02_document/contracts/c6_tile_cache/tile_store.md` — consumed: `read_tile_pixels` for the multipart blob. +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — INFO/WARN/ERROR log shapes for upload events. +- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — `kind="c11.upload.tile.queued"` / `kind="c11.upload.tile.rejected"` / `kind="c11.upload.batch.complete"` envelopes. +- `_docs/02_document/components/12_c11_tilemanager/description.md` — § 3.2 D-PROJ-2 contract sketch, § 5 error handling. +- `_docs/_process_leftovers/2026-05-09_satellite-provider-design-tasks.md` — D-PROJ-2 design task #1 ingest endpoint shape. + +## Problem + +Without a real `TileUploader`: + +- AC-8.4 (post-landing upload of mid-flight tiles to the parent suite) collapses — the pending-upload journal grows unboundedly across flights. +- D-PROJ-2's safety-officer correlation cannot work — the public-key + tile-id linkage exists only at upload time. +- The AC-NEW-7 voting / trust layer (parent-suite side) has no inputs — without uploads, no flights ever vote. +- Mid-flight tile generation (E-C13 mid-flight tile snapshot, AZ-294) becomes a leaf system: tiles land in C6 with `voting_status = pending` and stay there forever. 
+- `SignatureRejectedError` from the parent suite has no detection path; a key compromise would not surface to the safety officer until manual log inspection. +- Operators have no observable post-landing operation; the F10 functional flow has no implementation. + +This task delivers the production uploader. It composes AZ-317 (gate) + AZ-318 (signing) + AZ-303/305 (C6) + httpx; it adds no new responsibilities beyond orchestration, so the surface area is tight. + +## Outcome + +- A `TileUploader` Protocol + concrete `HttpTileUploader` class at `src/gps_denied_onboard/components/c11_tilemanager/`: + - `interface.py` exposes `TileUploader` Protocol (`runtime_checkable`). + - `tile_uploader.py` houses `HttpTileUploader`. + - `_types.py` adds `UploadRequest`, `UploadBatchReport`, `PerTileStatus`, `UploadOutcome` (StrEnum), `IngestStatus` (StrEnum) — all `@dataclass(frozen=True)` for the data DTOs. + - `errors.py` adds `SignatureRejectedError` (subclasses `TileManagerError`); `FlightStateNotOnGroundError` and the rest are already declared in AZ-317/AZ-318/AZ-316. +- Constructor signature: + `__init__(self, *, http_client: httpx.Client, tile_store: TileStore, tile_metadata_store: TileMetadataStore, flight_state_gate: FlightStateGate, key_manager: PerFlightKeyManager, fdr_client: FdrClient, logger: Logger, clock: Clock, config: C11Config)`. Injected dependencies — no module-level singletons. +- `upload_pending_tiles(request)` flow: + 1. Calls `flight_state_gate.confirm_on_ground()` (raises if not ON_GROUND; ZERO state-mutation prior to this). + 2. Calls `key_manager.start_session(flight_id_for_session)` — `flight_id_for_session` is `request.flight_id` if provided else `uuid.uuid4()` ("session id" for the multi-flight case). + 3. In a `try` block: + - Calls `tile_metadata_store.pending_uploads(flight_id=request.flight_id)` to enumerate pending tiles. + - If empty → returns `UploadBatchReport(outcome=success, per_tile_status=(), batch_uuid=uuid4())`. 
+ - Splits the pending list into batches of `request.batch_size`. + - For each batch: + - Reads each tile's pixel bytes via `tile_store.read_tile_pixels(tile_id)`. + - Builds the multipart payload per tile: `tile_blob`, `zoomLevel`, `latitude`, `longitude`, `tile_size_meters`, `tile_size_pixels`, `capture_timestamp`, `flight_id`, `companion_id`, `quality_metadata` (JSON), `signature` (`key_manager.sign(canonical_payload_bytes)`). + - Canonical payload bytes for signing: SHA-256 of `tile_blob || zoomLevel || latitude || longitude || capture_timestamp || flight_id || companion_id || quality_metadata_json` (deterministic byte concatenation; documented). + - POSTs the multipart to `{config.satellite_provider_url}/api/satellite/tiles/ingest`. + - On 202: parses `batch_uuid` + `per_tile_status[]` from the response body. For each `queued | duplicate | superseded` tile, calls `tile_metadata_store.mark_uploaded(tile_id, batch_uuid)`. For each `rejected` tile, calls `key_manager.record_signature_rejection(flight_id, tile_id)` if the rejection reason mentions signature; emits FDR `kind="c11.upload.tile.rejected"` with the reason regardless. + - On 429: honours `Retry-After`; on persistent 429 → `RateLimitedError`. + - On 5xx: exponential backoff (1s, 2s, 4s; 4 retries max); persistent → `SatelliteProviderError`. + - On TLS / 401 / 403: fail fast → `SatelliteProviderError`. + - Aggregates `UploadBatchReport`: + - `outcome = success` if ALL tiles are `queued | duplicate | superseded`. + - `outcome = partial` if any `rejected` OR any unparseable response with otherwise-acked tiles. + - `outcome = failure` if the gate blocked, the API key was invalid, or zero tiles could be POSTed. + - `public_key_fingerprint` = the AZ-318 fingerprint from `start_session`. + - `batch_uuid` = the LAST successful batch's UUID (or `uuid4()` if none succeeded; documented). + 4. 
In a `finally` block: + - Calls `key_manager.end_session()` — guaranteed zeroisation regardless of success / failure / exception. + - Emits FDR `kind="c11.upload.batch.complete"` with `{flight_id_for_session, public_key_fingerprint, total_attempted, total_queued, total_rejected, outcome, observed_at_iso}`. +- `enumerate_pending_tiles(flight_id)` returns `tile_metadata_store.pending_uploads(flight_id)` directly (read-only enumeration). +- `confirm_flight_state()` returns `flight_state_gate.confirm_on_ground()` (passes through; raises on non-ON_GROUND). +- INFO log on session start/end with batch counts; WARN log per retry; ERROR log on `SatelliteProviderError` and `FlightStateNotOnGroundError` (caught and re-raised after log). +- Composition root constructs `HttpTileUploader` via `build_tile_uploader(config) -> TileUploader` at `src/gps_denied_onboard/runtime_root/c11_factory.py`. +- Configuration extension to AZ-269 loader: `config.c11.satellite_provider_ingest_url`, `config.c11.upload_batch_size`, `config.c11.upload_http_timeout_s`, `config.c11.companion_id`. +- Type-only conformance test verifies `isinstance(HttpTileUploader(...), TileUploader)`. + +## Scope + +### Included + +- `TileUploader` Protocol declaration + `HttpTileUploader` concrete class. +- `UploadRequest`, `UploadBatchReport`, `PerTileStatus`, `UploadOutcome`, `IngestStatus` DTOs. +- `SignatureRejectedError` definition (subclass of `TileManagerError`). +- The orchestration: gate → start_session → enumerate → batch loop → mark_uploaded / FDR alert → end_session. +- Multipart payload construction + canonical bytes for signing. +- HTTP retry / backoff / `Retry-After` handling for the upload path. +- Composition-root factory `build_tile_uploader`. +- Config schema extension for the C11 upload fields. +- Conformance test at `tests/unit/c11_tilemanager/test_protocol_conformance.py`. + +### Excluded + +- The `TileDownloader` Protocol and concrete impl — separate task (AZ-316).
+- `FlightStateGate` impl — owned by AZ-317. +- `PerFlightKeyManager` impl — owned by AZ-318. +- Idempotent-retry-on-partial-success batch decorator — separate task in this epic (AZ-320_c11_idempotent_retry). +- The R02 ADR-004 build-time exclusion — owned by E-BOOT. +- The pre-flight key enrolment workflow at C12 — owned by E-C12. +- The `mock-suite-sat-service` fixture under `tests/fixtures/` — owned by E-BBT (test infrastructure). +- Voting / trust promotion — owned by D-PROJ-2 / `satellite-provider`. +- E-C8's `FlightStateSource` impl — owned by E-C8 (AZ-261). + +## Acceptance Criteria + +**AC-1: Happy path uploads all pending tiles** +Given 50 pending tiles in C6, ON_GROUND, parent suite returns 202 with all `queued` +When `upload_pending_tiles(request)` is called +Then 50 POSTs issued (one per tile or batched per `batch_size`); all 50 marked `uploaded` in C6 (verifiable via `mark_uploaded` spy); `UploadBatchReport.outcome = success`; one FDR `kind="c11.upload.batch.complete"` with `total_attempted=50, total_queued=50` + +**AC-2: Flight-state gate blocks before any read or POST** +Given `FlightStateGate.confirm_on_ground()` raises `FlightStateNotOnGroundError(IN_FLIGHT)` +When `upload_pending_tiles(request)` is called +Then `FlightStateNotOnGroundError` is raised; ZERO calls to `pending_uploads` (verifiable via spy); ZERO HTTP POSTs; ZERO calls to `key_manager.start_session` (key generation is also gated); `key_manager.end_session()` is NOT called (no session was started) + +**AC-3: Signature rejection per tile is FDR'd and not marked uploaded** +Given parent suite returns `rejected` for 1 tile with reason `"invalid signature"` +When the response is parsed +Then `key_manager.record_signature_rejection(flight_id, tile_id)` is called once; `tile_metadata_store.mark_uploaded` is NOT called for that tile; the tile remains `voting_status = pending`; FDR `kind="c11.upload.tile.rejected"` is emitted with the reason; report's `outcome = partial` + +**AC-4: `duplicate` 
and `superseded` are treated as success** +Given parent suite returns `duplicate` for 5 tiles and `superseded` for 3 tiles +When the response is parsed +Then all 8 are `mark_uploaded`'d in C6 with the batch_uuid; report's per_tile_status reflects the original status; `outcome = success` if no `rejected` + +**AC-5: Signing key is zeroised on success** +Given a successful upload +When `upload_pending_tiles` returns +Then `key_manager.end_session()` was called once (verifiable via spy); the AZ-318 manager's `_private_key is None` + +**AC-6: Signing key is zeroised on failure** +Given the FIRST POST raises a connection-reset error +When `upload_pending_tiles` raises `SatelliteProviderError` +Then `key_manager.end_session()` was called (try/finally executed); the manager's `_private_key is None`; the partial state in C6 is consistent (no half-marked tiles) + +**AC-7: Public-key FDR record precedes any tile FDR** +Given a session with at least one tile +When the FDR stream is captured +Then `kind="c11.upload.session.key.public"` is observed BEFORE any `kind="c11.upload.tile.*"` record + +**AC-8: 429 honours Retry-After** +Given parent suite returns 429 with `Retry-After: 60` on the first POST +When the uploader processes the response +Then `Clock.sleep` is called with ≥ 60s; on success the run proceeds; the report includes `retry_count >= 1` + +**AC-9: Persistent 5xx aborts with structured error** +Given parent suite returns 503 for 5 consecutive attempts +When the uploader exhausts retries +Then `SatelliteProviderError` is raised; the report is NOT returned (the exception propagates); `key_manager.end_session()` was called via finally + +**AC-10: TLS / 401 / 403 fail fast** +Given the first POST returns 401 +When the uploader processes the response +Then `SatelliteProviderError` is raised on the first attempt; zero retries; the public key is NOT logged; the API key (if any TLS auth header) is NOT logged + +**AC-11: Empty pending set is success with no POSTs** +Given 
zero pending tiles in C6 +When `upload_pending_tiles(request)` is called +Then `outcome = success`; `per_tile_status` is empty; `key_manager.start_session` was called (signature still required by D-PROJ-2 for the empty-batch ack record per § 3.2; documented); `end_session` was called; ONE FDR `c11.upload.batch.complete` with `total_attempted=0` + +**AC-12: Conformance — concrete impl satisfies Protocol** +Given an `HttpTileUploader` instance +When `isinstance(impl, TileUploader)` is checked under `runtime_checkable` +Then the result is `True`; a fake omitting `confirm_flight_state` returns `False` + +**AC-13: Canonical signing bytes are deterministic** +Given the same tile metadata + tile bytes +When `_canonical_payload_bytes(tile)` is computed twice +Then the two byte strings are bitwise identical (no map ordering, no JSON whitespace drift); the SHA-256 over them matches; this is asserted via property test with N random tiles + +**AC-14: Partial-success batches return without raising** +Given a 10-tile batch where 7 are `queued`, 3 are `rejected` +When `upload_pending_tiles` returns +Then NO exception is raised; `outcome = partial`; `per_tile_status` has all 10 entries with their respective statuses; the 7 acked tiles are marked uploaded in C6; the 3 rejected stay pending + +## Non-Functional Requirements + +**Performance** +- Upload throughput ≥ 20 tile/s with signing (C11-PT-02); the bottleneck is the network plus signing per tile. +- Per-tile signing ≤ 200 µs (Ed25519 from AZ-318); per-tile multipart construction ≤ 1 ms. + +**Compatibility** +- `httpx` per project pin; `cryptography` per project pin. +- Multipart form encoding per `httpx`'s `files=` parameter — no manual boundary construction. + +**Reliability** +- Try/finally ensures `key_manager.end_session()` runs in EVERY exit path including unexpected exceptions and KeyboardInterrupt. +- The uploader writes to C6 ONLY via the AZ-303 Protocol (`mark_uploaded`); it does NOT touch the metadata table directly. 
+- Concurrent invocations against the same `cache_root` are gated by C12's filesystem lockfile (same lock as TileDownloader); the uploader asserts the lock at construction. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | 50-tile happy path | All `mark_uploaded`'d; `outcome=success`; FDR batch.complete present | +| AC-2 | Gate raises before any work | Zero spies fire on `pending_uploads`, POST, `start_session` | +| AC-3 | One signature rejection in a 5-tile batch | `record_signature_rejection` called once; rejected tile NOT marked uploaded; outcome=partial | +| AC-4 | Mix of `duplicate` and `superseded` responses | All marked uploaded; outcome=success | +| AC-5 | Successful upload | `end_session` called; `_private_key is None` | +| AC-6 | Mid-batch failure | `end_session` called; key zeroised | +| AC-7 | FDR stream order | `key.public` before any `tile.*` | +| AC-8 | 429 + Retry-After: 60 | `Clock.sleep` ≥ 60s; retry succeeds | +| AC-9 | 5x 503 | `SatelliteProviderError`; finally still ran | +| AC-10 | 401 first attempt | Fail-fast; no API-key in any log | +| AC-11 | Empty pending set | outcome=success; zero POSTs; key still session-started/ended | +| AC-12 | `isinstance` check on impl + partial fake | True / False | +| AC-13 | Property test: deterministic canonical bytes | Bitwise equal for N samples | +| AC-14 | Partial-success batch | No exception; outcome=partial; per-tile statuses correct | +| NFR-perf-throughput | 1000 tiles via fake httpx | ≥ 20 tile/s including signing | + +## Constraints + +- The signing canonical-bytes scheme is `sha256(tile_blob || zoomLevel || latitude || longitude || capture_timestamp || flight_id || companion_id || quality_metadata_json)`; the parent suite's D-PROJ-2 ingest endpoint MUST agree on this scheme (the leftover file documents the contract sketch). Any divergence at the parent-suite side surfaces as `signature_rejected` and gets FDR-alerted. 
+- The uploader does NOT modify the multipart payload's tile_blob — bytes go from C6 directly into the POST body. +- The order of operations is gate → start_session → enumerate → batch loop → finally end_session. Reordering is a Reliability finding (High). +- Concurrent C11 invocations are blocked by C12's lockfile; this task asserts the lock exists at construction. +- This task introduces no new third-party dependencies beyond `httpx` and `cryptography` (already used in AZ-316 and AZ-318). +- The `companion_id` field comes from `config.c11.companion_id` — not auto-detected, not derived from hostname; documented because the parent suite's voting layer relies on stable per-companion identifiers. + +## Risks & Mitigation + +**Risk 1: Parent-suite ingest endpoint not yet implemented (D-PROJ-2)** +- *Risk*: Until `satellite-provider` ships the POST endpoint, every upload fails with 404. +- *Mitigation*: The e2e-test `mock-suite-sat-service` fixture (under `tests/fixtures/`, owned by E-BBT) implements the planned POST contract. The C11 unit tests run against a fake `httpx.Client`; integration tests run against the mock fixture. Production switches to the real endpoint when it ships, with no code change in C11. + +**Risk 2: Signature canonical-bytes drift between C11 and parent suite** +- *Risk*: A subtle JSON-ordering or float-formatting drift produces signatures that don't verify on the parent side. +- *Mitigation*: AC-13 property test asserts bitwise determinism on the C11 side; the leftover file documents the canonical scheme; the parent-suite team's Plan cycle will reuse the same scheme. If they diverge, `signature_rejected` surfaces immediately and the safety officer is alerted. + +**Risk 3: `Retry-After` parsing for HTTP-date format** +- *Risk*: The parent suite returns `Retry-After` as an HTTP-date rather than as delta-seconds; naïve integer parsing crashes. +- *Mitigation*: Same as AZ-316 (TileDownloader Risk 1) — parse both forms; cap wait at `config.c11.max_retry_after_s`.
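The Risk 3 mitigation above can be sketched roughly as follows. This is an illustration only: `parse_retry_after` is a hypothetical helper name that does not appear in this spec, `max_wait_s` stands in for `config.c11.max_retry_after_s`, and falling back to the cap on an unparseable header is an assumption rather than a stated requirement.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value: str, now: datetime, max_wait_s: float) -> float:
    """Return the seconds to wait, clamped to [0, max_wait_s].

    Hypothetical helper for illustration; `now` must be timezone-aware.
    """
    text = value.strip()
    if text.isascii() and text.isdigit():
        # Delta-seconds form, e.g. "Retry-After: 60".
        wait_s = float(text)
    else:
        try:
            # HTTP-date form, e.g. "Mon, 11 May 2026 00:01:30 GMT".
            when = parsedate_to_datetime(text)
        except (TypeError, ValueError):
            # Assumption: an unparseable header waits for the full cap.
            return max_wait_s
        if when.tzinfo is None:
            # Treat a date without zone information as UTC.
            when = when.replace(tzinfo=timezone.utc)
        wait_s = (when - now).total_seconds()
    # A date in the past yields a negative wait; clamp it to zero.
    return min(max(wait_s, 0.0), max_wait_s)
```

Passing `now` in explicitly, rather than reading the wall clock inside the helper, keeps it testable with the same injected `Clock` that the uploader tests already fake.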
+ +**Risk 4: try/finally violation (key not zeroised on `KeyboardInterrupt`)** +- *Risk*: A `KeyboardInterrupt` during the batch loop bypasses the finally if poorly written. +- *Mitigation*: The finally is unconditional (Python's `try/finally` runs for `KeyboardInterrupt`); a unit test injects `KeyboardInterrupt` mid-batch and asserts `end_session` ran. + +**Risk 5: Partial-success state inconsistency** +- *Risk*: A tile is marked `uploaded` in C6 but the parent suite later disputes (race between `mark_uploaded` and the safety officer's audit). +- *Mitigation*: `mark_uploaded` records the `batch_uuid` (per AZ-303 contract); audits cross-reference `batch_uuid` + `tile_id` against the parent suite's ingest log. The race window is ≤ 1 sec (mark happens immediately after the per-tile response is parsed). Documented; not addressed in this task. + +## Runtime Completeness + +- **Named capability**: post-landing tile upload to D-PROJ-2 ingest endpoint, AC-8.4 enforcement, F10 functional flow, R09 mitigation via per-flight key (composed from AZ-318), parent-suite voting-layer enabler. +- **Production code that must exist**: real `HttpTileUploader` orchestrating real `httpx` POSTs, real C6 `mark_uploaded` calls, real `try/finally` zeroisation, real composition-root factory, real config schema extension, real canonical-byte scheme. +- **Allowed external stubs**: tests MAY use a fake `httpx.Client`, fake `Clock`, fake C6 stores (already provided by AZ-303's conformance fakes), fake `FlightStateGate` and `PerFlightKeyManager` (so this task's tests don't drag in AZ-317/AZ-318 internals); production wiring uses real all the way down. 
+- **Unacceptable substitutes**: skipping the gate (defeats AC-8.4 defence-in-depth); silently retrying signature rejections without FDR (loses safety officer surface); reusing a static signing key (reintroduces R09); marking a tile uploaded before the parent suite acks (data integrity violation); manually building the multipart boundary (`httpx`'s `files=` is the right interface). + +## Contract + +This task produces/implements the contract at `_docs/02_document/contracts/c11_tilemanager/tile_uploader.md`. +Consumers MUST read that file — not this task spec — to discover the interface. diff --git a/_docs/02_tasks/todo/AZ-320_c11_idempotent_retry.md b/_docs/02_tasks/todo/AZ-320_c11_idempotent_retry.md new file mode 100644 index 0000000..1b6efdd --- /dev/null +++ b/_docs/02_tasks/todo/AZ-320_c11_idempotent_retry.md @@ -0,0 +1,214 @@ +# C11 Idempotent Retry — In-Call Retry Loop on Partial-Success Batches + +**Task**: AZ-320_c11_idempotent_retry +**Name**: C11 Idempotent Retry Decorator +**Description**: Implement `IdempotentRetryTileUploader`, a decorator that wraps the AZ-319 `TileUploader` Protocol impl and adds bounded in-call retry on partial-success batches. After the underlying uploader returns `outcome=partial`, the decorator re-queries C6's `pending_uploads` (already-acked tiles were `mark_uploaded`'d, so the second pass naturally targets only the unacked subset), waits an exponential-backoff delay, and re-invokes the underlying upload. Caps at `config.c11.max_in_call_retries` (default 3); on budget exhaustion, the final report's `outcome` stays `partial` and `next_retry_at_s` carries an operator hint for when to retry later. A per-tile rejection counter in C6 metadata (`upload_attempts`) bounds the per-tile retry budget — after `config.c11.max_per_tile_attempts` (default 5), the tile is moved to `voting_status = upload_giveup` (a new enum value added by this task) and surfaced via FDR for human review. 
+**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-273_fdr_client_ringbuf, AZ-303_c6_storage_interfaces, AZ-319_c11_tile_uploader +**Component**: c11_tilemanager (epic AZ-251 / E-C11) +**Tracker**: AZ-320 +**Epic**: AZ-251 (E-C11) + +### Document Dependencies + +- `_docs/02_document/contracts/c11_tilemanager/tile_uploader.md` — the underlying Protocol this decorator wraps; the decorator itself implements the same Protocol (drop-in replacement). +- `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md` — consumed: `pending_uploads`, `mark_uploaded`, `update_voting_status`. This task adds an `upload_attempts` integer field and an `upload_giveup` value to `VotingStatus` — a contract change that bumps `tile_metadata_store.md` to v1.1.0 (non-breaking minor). +- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — `kind="c11.upload.giveup"` envelope. +- `_docs/02_document/components/12_c11_tilemanager/tests.md` — C11-IT-05 test scenario. + +## Problem + +Without bounded in-call retry: + +- C11-IT-05 ("idempotent uploads on retry — re-running `upload_pending` after a partial-success batch only POSTs the tiles that weren't acknowledged before") relies on the operator manually re-invoking `upload_pending`. Operators tolerate one re-invocation but resent doing it 3-4 times after transient `satellite-provider` flakiness. +- A single tile that ALWAYS fails (e.g., truncated tile_blob in C6 fails ingest validation forever) becomes a poison pill — every retry attempt re-uploads it AND every other unacked tile, wasting bandwidth and signing cycles. Without a per-tile budget, the operator cannot distinguish transient failures from terminal ones. +- The `next_retry_at_s` field of `UploadBatchReport` (per AZ-319 contract) has no producer — without backoff calculation, the field is always None and the operator gets no hint on retry timing. 
+- The parent suite's voting layer assumes uploaded tiles are eventually-consistent; an unbounded retry loop with no per-tile state would create lockstep retry storms. + +This task delivers the retry decorator. It changes NO underlying logic in AZ-319; it composes. + +## Outcome + +- An `IdempotentRetryTileUploader` class at `src/gps_denied_onboard/components/c11_tilemanager/idempotent_retry.py`: + - Implements the `TileUploader` Protocol (drop-in for `HttpTileUploader`). + - Constructor: `__init__(self, *, inner: TileUploader, tile_metadata_store: TileMetadataStore, fdr_client: FdrClient, logger: Logger, clock: Clock, config: C11RetryConfig)`. + - `C11RetryConfig` is a frozen dataclass with `max_in_call_retries: int = 3`, `max_per_tile_attempts: int = 5`, `backoff_base_s: float = 2.0`, `backoff_cap_s: float = 60.0`. +- `upload_pending_tiles(request)` flow: + 1. Calls `inner.upload_pending_tiles(request)` once. + 2. If the inner returns `outcome in (success, failure)` → return as-is. + 3. If `outcome == partial`: + - For each `PerTileStatus.status == rejected`, increments the tile's `upload_attempts` in C6 via a new `tile_metadata_store.increment_upload_attempts(tile_id)` method. + - For each tile whose `upload_attempts >= config.max_per_tile_attempts`, calls `tile_metadata_store.update_voting_status(tile_id, VotingStatus.UPLOAD_GIVEUP)`; emits FDR `kind="c11.upload.giveup"` with `{tile_id, attempts, last_rejection_reason}`; emits ERROR log. + - If `retries_used < config.max_in_call_retries` AND there are still tiles with `voting_status == pending`: + - Sleeps `min(config.backoff_base_s ** (retries_used + 1), config.backoff_cap_s)` seconds via injected `Clock.sleep` (the first retry therefore waits `backoff_base_s`, matching AC-2 and AC-4). + - Repeats the inner call with `retries_used += 1` (a bounded loop inside an internal helper, NOT actual recursion). + - Else (budget exhausted): + - Aggregates the final `UploadBatchReport`: `outcome = partial`; `retry_count = retries_used`; `next_retry_at_s = clock.now() + config.backoff_cap_s` (operator hint).
+ - Returns the aggregated report. +- `enumerate_pending_tiles(flight_id)` and `confirm_flight_state()` pass through to the inner unchanged. +- A new `VotingStatus.UPLOAD_GIVEUP` enum value is added to AZ-303's `VotingStatus` (in C6's `_types.py`); this is a non-breaking minor bump of `tile_metadata_store.md` to v1.1.0 — the producer (AZ-303) stays in v1, but C6's contract file's `Change Log` is appended by this task with a note pointing to the bump. +- A new `tile_metadata_store.increment_upload_attempts(tile_id) -> int` method is added to AZ-303's `TileMetadataStore` Protocol; returns the new attempt count post-increment. This is a Protocol surface addition (minor bump). The implementation lives in AZ-305's `PostgresFilesystemStore`. This task adds: + - The Protocol method declaration in `c6_tile_cache/interface.py`. + - The impl in `c6_tile_cache/postgres_filesystem_store.py` (a single SQL `UPDATE ... SET upload_attempts = upload_attempts + 1 WHERE tile_id = $1 RETURNING upload_attempts`). + - The Postgres column `upload_attempts INTEGER NOT NULL DEFAULT 0` via a NEW alembic migration `_alembic/0002_upload_attempts.sql` (NOT modifying AZ-304's 0001 migration; per `coderule.mdc` migrations are append-only). +- The composition root wraps `HttpTileUploader` with `IdempotentRetryTileUploader` by default. A `config.c11.disable_retry_decorator: bool = false` lets operators bypass the decorator for debugging. +- INFO log on session start with retry config; INFO log per retry attempt with `attempt_number, sleep_s, remaining_pending_count`; ERROR log on per-tile giveup; FDR `kind="c11.upload.giveup"` per tile. + +## Scope + +### Included + +- `IdempotentRetryTileUploader` decorator class. +- `C11RetryConfig` frozen dataclass. +- `VotingStatus.UPLOAD_GIVEUP` enum value addition (in C6's `_types.py`). +- `tile_metadata_store.increment_upload_attempts(tile_id) -> int` Protocol method addition + AZ-305 SQL impl. 
+- `_alembic/0002_upload_attempts.sql` migration — adds `upload_attempts INTEGER NOT NULL DEFAULT 0` column to the tiles table. +- Composition-root wiring (decorate `HttpTileUploader` by default; `config.c11.disable_retry_decorator` lets operators opt out). +- Bumping `tile_metadata_store.md` to v1.1.0 with a Change Log entry. +- INFO/ERROR logs and FDR `c11.upload.giveup` emission. +- Conformance test: `isinstance(IdempotentRetryTileUploader(...), TileUploader)`. + +### Excluded + +- The underlying `HttpTileUploader` impl — owned by AZ-319. +- The decision rule for what counts as a transient vs. terminal rejection — the decorator treats EVERY rejection as transient until the per-tile attempt budget is hit; the operator may manually move `UPLOAD_GIVEUP` tiles back to `pending` after investigation (out-of-band SQL UPDATE; no API surface). +- A separate background-retry daemon — the retry happens within `upload_pending_tiles`; the operator decides when to invoke it. +- Cross-process retry coordination — the C12 lockfile already prevents concurrent C11 invocations. +- Surfacing `UPLOAD_GIVEUP` in the operator-tooling CLI — owned by E-C12. +- Auto-promotion of `UPLOAD_GIVEUP` back to `pending` after manual fixes — operator concern; out of scope. 
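The `upload_pending_tiles` flow listed under Outcome can be sketched as a bounded loop. This is a minimal illustration under stated assumptions, not the AZ-320 implementation: `Report`, `FakeClock`, `backoff_delay`, and `retry_partial` are hypothetical names, the per-tile bookkeeping in C6 and the "tiles still pending" check are omitted, and the delay is computed from a 1-indexed retry number so that the schedule matches the `2.0, 4.0, 8.0` sequence required by AC-4.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Report:
    """Hypothetical stand-in for UploadBatchReport."""
    outcome: str            # "success" | "partial" | "failure"
    retry_count: int = 0


@dataclass
class FakeClock:
    """Hypothetical test clock: records sleeps instead of blocking."""
    sleeps: list = field(default_factory=list)

    def sleep(self, seconds: float) -> None:
        self.sleeps.append(seconds)


def backoff_delay(attempt: int, base_s: float = 2.0, cap_s: float = 60.0) -> float:
    """Delay before the 1-indexed retry `attempt`, capped at cap_s."""
    return min(base_s ** attempt, cap_s)


def retry_partial(inner_call, clock, max_in_call_retries: int = 3) -> Report:
    """Bounded loop (no recursion): re-invoke inner_call while it reports partial.

    Exceptions raised by inner_call are not caught, so they propagate
    without any retry or sleep.
    """
    retries_used = 0
    report = inner_call()
    while report.outcome == "partial" and retries_used < max_in_call_retries:
        retries_used += 1
        clock.sleep(backoff_delay(retries_used))
        report = inner_call()
    return Report(outcome=report.outcome, retry_count=retries_used)


# Usage: an inner uploader that always reports partial exhausts the budget.
clock = FakeClock()
calls = []


def always_partial() -> Report:
    calls.append(1)
    return Report(outcome="partial")


final = retry_partial(always_partial, clock)
```

Because the loop condition checks the retry counter before every sleep, the number of inner calls can never exceed `max_in_call_retries + 1`, which is the property AC-4 asserts.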
+ +## Acceptance Criteria + +**AC-1: Success on first attempt — no retry** +Given the inner uploader returns `outcome = success` on the first call +When `upload_pending_tiles(request)` is called +Then the decorator returns immediately; ZERO calls to `Clock.sleep`; ZERO calls to `increment_upload_attempts`; report passes through unchanged + +**AC-2: Partial-success with retry budget available** +Given inner returns `outcome=partial` with 3 of 10 tiles rejected (per_tile_status), and `max_in_call_retries=3` +When the decorator processes the partial +Then `increment_upload_attempts` is called 3 times (one per rejected tile); `Clock.sleep(2.0)` is called once; inner is re-invoked; if the second attempt is `success`, the final aggregated report shows `outcome = success` and `retry_count = 1` + +**AC-3: Per-tile budget exhausted moves tile to UPLOAD_GIVEUP** +Given a tile whose `upload_attempts` reaches `max_per_tile_attempts=5` +When the decorator increments the counter +Then `update_voting_status(tile_id, UPLOAD_GIVEUP)` is called; ONE FDR `kind="c11.upload.giveup"` is emitted with `{tile_id, attempts=5, last_rejection_reason}`; ONE ERROR log; the tile is NOT re-uploaded in subsequent retries (since `pending_uploads` excludes UPLOAD_GIVEUP) + +**AC-4: In-call retry budget exhausted** +Given inner consistently returns `outcome=partial` with the same rejected tile, and `max_in_call_retries=3` +When the decorator runs out of in-call retries +Then 3 retries are attempted (4 total inner calls including the first); `Clock.sleep` is called 3 times with backoffs `2.0, 4.0, 8.0`; the final report has `outcome=partial`, `retry_count=3`, `next_retry_at_s = clock.now() + backoff_cap_s` + +**AC-5: Backoff cap honoured** +Given `max_in_call_retries=10` and `backoff_cap_s=10` +When the decorator computes the 6th retry delay +Then `Clock.sleep(10.0)` is called (capped at 10s, not `2^6 = 64s`) + +**AC-6: VotingStatus.UPLOAD_GIVEUP enum exposed** +Given the AZ-303 `VotingStatus` enum 
(post-this-task) +When a consumer imports it +Then `VotingStatus.UPLOAD_GIVEUP` is present alongside `PENDING`, `TRUSTED`, `REJECTED`; the contract file's Change Log shows v1.1.0 + +**AC-7: `increment_upload_attempts` returns new count** +Given a tile with `upload_attempts = 2` +When `increment_upload_attempts(tile_id)` is called +Then the SQL row's `upload_attempts` is now 3; the method returns `3`; concurrent invocations on different tiles produce no contention (per-row lock) + +**AC-8: Migration 0002 adds the column** +Given a fresh DB at AZ-304's 0001 migration +When 0002 is applied +Then the `tiles` table has an `upload_attempts INTEGER NOT NULL DEFAULT 0` column; existing rows have `upload_attempts = 0`; the migration is reversible (drops the column on rollback) + +**AC-9: Decorator is a drop-in for the Protocol** +Given an `IdempotentRetryTileUploader` instance +When `isinstance(impl, TileUploader)` is checked under `runtime_checkable` +Then the result is `True`; consumers that depend on the Protocol see no shape difference + +**AC-10: `disable_retry_decorator` config bypass** +Given `config.c11.disable_retry_decorator = true` +When the composition root constructs the uploader +Then `build_tile_uploader(config)` returns the bare `HttpTileUploader` (no decorator); a debug INFO log records the bypass + +**AC-11: Pass-through methods** +Given the decorator +When `enumerate_pending_tiles(flight_id)` and `confirm_flight_state()` are called +Then both delegate to `inner` directly with no added logic + +**AC-12: Inner exception propagates without retry** +Given inner raises `FlightStateNotOnGroundError` or `SatelliteProviderError` +When the decorator catches the exception +Then it re-raises immediately; no retry is attempted (these are not partial-success cases); ZERO `Clock.sleep` calls + +**AC-13: Idempotent across re-invocations (the C11-IT-05 scenario)** +Given a 50-tile batch where 30 succeed and 20 are rejected on first call (no in-call retry to avoid 
mixing); operator re-invokes after 5 minutes +When the second call runs +Then `pending_uploads` returns only the 20 tiles (the 30 are already `voting_status = uploaded`); `inner.upload_pending_tiles` is called with the request; only those 20 are POSTed; the 30 are NOT re-sent + +## Non-Functional Requirements + +**Performance** +- Decorator overhead per `upload_pending_tiles` call ≤ 5 ms (plus `Clock.sleep` time, which is intentional). +- `increment_upload_attempts` SQL call ≤ 5 ms p99 against the local Postgres. + +**Compatibility** +- The new migration is append-only (NOT a modification of AZ-304's 0001 migration). +- The new `VotingStatus.UPLOAD_GIVEUP` value is additive (non-breaking). +- `increment_upload_attempts` is a Protocol method addition; existing AZ-303 conformance tests pass (the method has a default impl that raises `NotImplementedError` if a future implementation forgets it — but the AZ-305 impl provides the SQL version). + +**Reliability** +- The retry loop is bounded by BOTH `max_in_call_retries` AND `max_per_tile_attempts`; neither alone can produce unbounded behaviour. +- The decorator does NOT swallow exceptions from `inner`; only `outcome=partial` results are eligible for retry. +- The injected `Clock.sleep` makes retry timing deterministic in tests. 
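The per-tile half of the two budgets named under Reliability can be illustrated with an in-memory fake. The sketch below is an assumption-laden illustration: `InMemoryTileStore` and `record_rejections` are hypothetical names, the voting status is modelled as a plain string rather than the `VotingStatus` enum, and FDR emission is reduced to appending a dict to a list.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryTileStore:
    """Hypothetical in-memory fake of the C6 metadata store (tests only)."""
    attempts: dict = field(default_factory=dict)
    status: dict = field(default_factory=dict)

    def increment_upload_attempts(self, tile_id: str) -> int:
        # Mirrors the documented contract: return the post-increment count.
        self.attempts[tile_id] = self.attempts.get(tile_id, 0) + 1
        return self.attempts[tile_id]

    def update_voting_status(self, tile_id: str, new_status: str) -> None:
        self.status[tile_id] = new_status


def record_rejections(store, rejected, fdr_events, max_per_tile_attempts=5):
    """Apply the per-tile budget to one batch of rejected tiles.

    `rejected` maps tile_id -> rejection reason. Returns the tile ids that
    remain eligible for another attempt.
    """
    still_pending = []
    for tile_id, reason in rejected.items():
        attempts = store.increment_upload_attempts(tile_id)
        if attempts >= max_per_tile_attempts:
            store.update_voting_status(tile_id, "UPLOAD_GIVEUP")
            fdr_events.append({
                "kind": "c11.upload.giveup",
                "tile_id": tile_id,
                "attempts": attempts,
                "last_rejection_reason": reason,
            })
        else:
            still_pending.append(tile_id)
    return still_pending


# Usage: tile "a" is rejected five times in total, tile "b" only once.
store = InMemoryTileStore()
fdr_events = []
pending = record_rejections(store, {"a": "http 422", "b": "http 503"}, fdr_events)
for _ in range(4):
    pending_a = record_rejections(store, {"a": "http 422"}, fdr_events)
```

After the fifth rejection the tile leaves the pending set, so a later retry has nothing left to send for it even if the in-call budget still has room.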
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Inner success | Pass-through; zero retries | +| AC-2 | Partial → retry → success | One `Clock.sleep(2.0)`; retry_count=1; final outcome=success | +| AC-3 | Per-tile attempts hits 5 | `update_voting_status` called; FDR + ERROR log emitted | +| AC-4 | Persistent partial across 4 attempts | `Clock.sleep(2.0)`, `(4.0)`, `(8.0)`; retry_count=3; final partial | +| AC-5 | Cap=10 with high attempt number | `Clock.sleep(10.0)` not `Clock.sleep(64.0)` | +| AC-6 | Import `VotingStatus.UPLOAD_GIVEUP` | Present; tile_metadata_store.md v1.1.0 | +| AC-7 | Concurrent `increment_upload_attempts` on different tiles | No deadlock; correct counts | +| AC-8 | Apply 0002 migration | Column added; default 0; rollback drops | +| AC-9 | `isinstance` check | True | +| AC-10 | Config bypass | Bare impl; debug log | +| AC-11 | Pass-through methods | Delegated unchanged | +| AC-12 | Inner raises | Re-raised; zero retries | +| AC-13 | Two-call scenario across operator re-invocations | First call: 30 acked / 20 rejected; second call: only 20 POSTed | +| NFR-perf-overhead | Microbench decorator with success-on-first | ≤ 5 ms overhead | + +## Constraints + +- The decorator MUST be a drop-in for `TileUploader`; the composition root selects via `config.c11.disable_retry_decorator` only. +- The retry budget is per-call (in-call) and per-tile (across calls); neither budget alone fully bounds — both are required. +- `increment_upload_attempts` is the ONLY method that mutates `upload_attempts`; consumers do NOT directly UPDATE the column. This is a contract invariant; code-review treats direct UPDATEs as `Architecture` finding (High). +- The `UPLOAD_GIVEUP` voting status is a HUMAN-decision boundary — automated promotion back to `pending` is forbidden in this task. An out-of-band SQL UPDATE by the operator is the documented recovery path. 
+- The migration 0002 is APPEND-ONLY relative to 0001; it does NOT alter existing column types. +- This task introduces no new third-party dependencies. + +## Risks & Mitigation + +**Risk 1: AZ-303 contract bump cascades to other consumers** +- *Risk*: Adding `increment_upload_attempts` to the Protocol forces every existing C6 consumer (C2 VPR, C2.5 ReRanker, C3 CrossDomainMatcher, C10 CacheProvisioner, C12 OperatorTooling) to re-confirm conformance. +- *Mitigation*: The new method is OPTIONAL via a Protocol default impl that raises `NotImplementedError`; consumers that don't call it are unaffected. The conformance test verifies only that AZ-305's impl provides it. + +**Risk 2: Backoff cap interacts badly with operator workflows** +- *Risk*: A 60-second cap means the operator may walk away during retries; the visible CLI hangs. +- *Mitigation*: The decorator emits an INFO log per retry attempt with `attempt_number, sleep_s, remaining_pending_count`; C12's CLI surfaces this so the operator sees progress. Cap is configurable. + +**Risk 3: `UPLOAD_GIVEUP` tiles accumulating without operator visibility** +- *Risk*: A subtle data corruption in C6 causes 100% of tiles to hit `UPLOAD_GIVEUP`; the operator notices only when they manually inspect C6. +- *Mitigation*: Each `UPLOAD_GIVEUP` event emits FDR `kind="c11.upload.giveup"` AND ERROR log; C12's CLI summary surfaces the count post-upload-run. This task adds NO direct UI; C12's task list will include surfacing. + +**Risk 4: Clock.sleep blocking on KeyboardInterrupt** +- *Risk*: A long backoff (60s) blocks the process; Ctrl+C aborts mid-sleep but might leave state inconsistent. +- *Mitigation*: The decorator uses the injected `Clock` which is the same singleton as AZ-307/AZ-308; KeyboardInterrupt propagates upward and AZ-319's try/finally still runs `key_manager.end_session()`; the decorator's own state is just the retry counter (in-memory; no on-disk side effects between retries). 
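Risk 1's mitigation and the AC-9 conformance check both depend on how `typing.Protocol` treats a method with a default body. The sketch below uses a trimmed, hypothetical version of the AZ-303 Protocol (the real one has more methods) to show the two cases: a class that explicitly subclasses the Protocol inherits the raising default, while a purely structural implementer that lacks the method fails the `runtime_checkable` `isinstance` check.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TileMetadataStore(Protocol):
    """Trimmed, hypothetical version of the AZ-303 Protocol."""

    def update_voting_status(self, tile_id: str, status: str) -> None: ...

    def increment_upload_attempts(self, tile_id: str) -> int:
        # Default body: only inherited by classes that subclass the Protocol.
        raise NotImplementedError("increment_upload_attempts not implemented")


class LegacyStore(TileMetadataStore):
    """Explicit subclass written before the new method existed."""

    def update_voting_status(self, tile_id: str, status: str) -> None:
        pass


class DuckTypedStore:
    """Structural implementer (no inheritance) that lacks the new method."""

    def update_voting_status(self, tile_id: str, status: str) -> None:
        pass


legacy = LegacyStore()
legacy_ok = isinstance(legacy, TileMetadataStore)
duck_ok = isinstance(DuckTypedStore(), TileMetadataStore)

try:
    legacy.increment_upload_attempts("tile-1")
    raised = False
except NotImplementedError:
    raised = True
```

If the existing AZ-303 conformance fakes are duck-typed rather than explicit subclasses, they would likely need the new method added before an `isinstance` check passes; whether that applies depends on how those fakes are written, which this task description does not state.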
+ +## Runtime Completeness + +- **Named capability**: bounded in-call retry on partial-success uploads, per-tile retry budget with `UPLOAD_GIVEUP` terminal state, operator-friendly `next_retry_at_s` hint (description.md § 5, C11-IT-05). +- **Production code that must exist**: real `IdempotentRetryTileUploader` decorator, real `increment_upload_attempts` SQL, real migration 0002, real `VotingStatus.UPLOAD_GIVEUP` enum value, real composition-root wiring with the bypass flag. +- **Allowed external stubs**: tests MAY use a fake `inner` (mock TileUploader implementing the Protocol with scripted responses), fake `Clock`, fake `tile_metadata_store` (already provided by AZ-303 conformance fakes); production wiring uses real all the way down. +- **Unacceptable substitutes**: a recursive Python implementation of the retry loop (stack-explosion risk; bounded iteration is required); skipping the per-tile budget (lets one bad tile poison every retry); silently moving tiles to `UPLOAD_GIVEUP` without FDR (loses safety officer surface); modifying AZ-304's 0001 migration in place (breaks deployment idempotence — migrations are append-only). diff --git a/_docs/02_tasks/todo/AZ-321_c10_engine_compiler.md b/_docs/02_tasks/todo/AZ-321_c10_engine_compiler.md new file mode 100644 index 0000000..e9fd90f --- /dev/null +++ b/_docs/02_tasks/todo/AZ-321_c10_engine_compiler.md @@ -0,0 +1,197 @@ +# C10 Engine Compiler — Per-Model TRT Compile + Hardware-Tied Cache Reuse + +**Task**: AZ-321_c10_engine_compiler +**Name**: C10 Engine Compiler +**Description**: Implement `EngineCompiler`, the C10-internal phase that compiles or re-uses TensorRT engines for every backbone the corpus needs (DINOv2 reduced for VPR, LightGlue, ALIKED descriptor head, plus any C7-runtime-required model). 
For each backbone, computes the AZ-281 self-describing filename `{model}_{sm}_{jp}_{trt}_{precision}.engine`, looks for an existing engine + sidecar at that path, and either re-uses it (cache hit, D-C10-6) or invokes AZ-298's TensorRT runtime to compile from the ONNX source + calibration cache. Writes each new engine via AZ-280's `Sha256Sidecar` for the takeoff content-hash gate. Returns a `list[EngineCacheEntry]` recording the per-backbone outcome (built / reused) plus the cache hit ratio. The compile is hardware-tied: SM, Jetpack, TRT version, and precision flags are baked into the filename so re-running on a different device produces a cache miss (correct behaviour, not a bug). +**Complexity**: 5 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-298_c7_tensorrt_runtime +**Component**: c10_provisioning (epic AZ-252 / E-C10) +**Tracker**: AZ-321 +**Epic**: AZ-252 (E-C10) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — filename shape + parser (AZ-281). +- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — atomic write + sidecar pattern (AZ-280). +- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — engine compile API (AZ-298). +- `_docs/02_document/components/11_c10_provisioning/description.md` — § 5 error handling, § 7 caveats (D-C10-6 hardware-tied). + +## Problem + +Without a real engine compiler: + +- AC-NEW-1 (no engine deserialization at takeoff before manifest verify) collapses on the build side — F1 cannot produce the `.engine` artifacts the airborne C7 deserialise step expects. +- D-C10-6 (calibration cache reuse on identical hardware) is unobservable — every build re-compiles from scratch, blowing the C10-PT-01 ≤ 12 min cold target on warm runs. 
+- D-C10-7 (self-describing engine filename) has no producer — without `{model}_{sm}_{jp}_{trt}_{precision}.engine`, hardware mismatches between operator workstation and Jetson airborne would silently load wrong-arch engines. +- The C10-PT-01 warm idempotent re-run target (≤ 1 min) cannot be hit; engines dominate build time. +- C10-IT-05 (Tier-2 build produces SM 87 / JP 6.2 / TRT 10.3 / FP16 engines) has no implementation. +- Operators have no way to inspect which engines came from cache vs. were rebuilt — a critical signal for diagnosing GPU-OOM or calibration regressions. + +This task delivers the per-model compile + cache-reuse logic. It does NOT own the orchestration (T5 owns `build_cache_artifacts`), the descriptor batching (T2), or the manifest writing (T3). + +## Outcome + +- An `EngineCompiler` class at `src/gps_denied_onboard/components/c10_provisioning/engine_compiler.py`: + - Constructor: `__init__(self, *, inference_runtime: InferenceRuntime, sidecar: Sha256Sidecar, filename_schema: EngineFilenameSchema, logger: Logger)`. + - Public method: `compile_engines_for_corpus(request: EngineCompileRequest) -> list[EngineCacheEntry]`. + - `EngineCompileRequest` (`@dataclass(frozen=True)`): `backbones: tuple[BackboneSpec, ...]`, `calibration_path: Path`, `cache_root: Path`, `precision: enum {fp16, int8}`. + - `BackboneSpec` (`@dataclass(frozen=True)`): `model_name: str`, `onnx_path: Path`, `expected_input_shape: tuple[int, ...]`. + - `EngineCacheEntry` (`@dataclass(frozen=True)`): `model_name: str`, `engine_path: Path`, `sidecar_path: Path`, `outcome: enum {built, reused}`, `compile_duration_s: float | None`, `engine_sha256_hex: str`. +- Method flow: + 1. For each `BackboneSpec`: + a. Detect runtime hardware (SM, JP, TRT version) via `inference_runtime.host_info()`. + b. Compute the target filename via `filename_schema.format(...)`: `{model}_{sm}_{jp}_{trt}_{precision}.engine`. + c. Compute the target path: `{cache_root}/engines/{filename}`. + d. 
If `target_path.exists()` AND `sidecar.verify(target_path)` returns `True`: + - Outcome = `reused`; emit INFO log `kind="c10.engine.cache.hit"`; append `EngineCacheEntry`; continue. + e. Else (cache miss): + - Emit WARN log `kind="c10.engine.cache.miss"` with `{model_name, target_filename}`. + - Call `inference_runtime.compile_engine(onnx_path, calibration_path, precision, expected_input_shape) -> bytes` (raises `EngineBuildError` or `CalibrationCacheError` on failure — propagate). + - Write the engine bytes via `sidecar.write_with_sidecar(target_path, engine_bytes)` (atomic write + SHA-256 sidecar at `{target_path}.sha256`). + - Outcome = `built`; record `compile_duration_s` from `time.monotonic()` deltas; append `EngineCacheEntry`. + 2. Return the list. Aggregate count: `engines_built`, `engines_reused`, total cache hit ratio. INFO log `kind="c10.engine.compile.summary"` with the totals. +- The composition root constructs `EngineCompiler` and injects it into the T5 CacheProvisioner. Factory: `build_engine_compiler(config) -> EngineCompiler`. +- A `BackboneSpec` registry at `src/gps_denied_onboard/runtime_root/c10_factory.py` enumerates the project's backbones (initially DINOv2-VPR + LightGlue + ALIKED — cross-referenced against E-C2/E-C2.5/E-C3 component descriptions). The list is config-driven via `config.c10.backbones: list[BackboneSpec]` so a future model addition does not require code change. +- INFO log on every cache hit; WARN on every cache miss; ERROR on `EngineBuildError` / `CalibrationCacheError` with the offending model name. + +## Scope + +### Included + +- `EngineCompiler` class with the single public method. +- The 3 DTOs (`EngineCompileRequest`, `BackboneSpec`, `EngineCacheEntry`) plus their enum types. +- Hardware-tied filename construction via AZ-281's schema. +- Cache-hit detection via `sidecar.verify` (sha256 sidecar matches). +- Cache-miss compile via AZ-298's `InferenceRuntime.compile_engine`. +- Atomic engine write + sidecar via AZ-280. 
+- Composition-root factory. +- Conformance test: a fake `InferenceRuntime` returns scripted engine bytes; the test asserts cache hit / miss outcomes for the documented matrix. +- Per-cache-entry timing instrumentation. +- `config.c10.backbones` schema extension on AZ-269's loader. + +### Excluded + +- The orchestration of when to compile (T5 owns `build_cache_artifacts`). +- Descriptor generation (T2 owns). +- Manifest writing (T3 owns). +- TensorRT internals — owned by AZ-298 (the `compile_engine` impl); this task only consumes the protocol. +- Engine deserialization at takeoff — owned by AZ-298 (load side) + the C7 component runtime self-check. +- Engine version compatibility checks across deployments — out of scope; the filename schema (AZ-281) carries enough signal that mismatches surface as cache miss. +- Multi-GPU compile — operator workstation is single-GPU per RESTRICT-OPS-2. +- A re-build-now CLI flag — operator workflow goes through T5; force-rebuild is achieved by deleting the engine cache directory. 
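The cache-hit / cache-miss decision in the method flow can be sketched with the standard library alone. Everything here is a hypothetical stand-in: `engine_filename` infers its formatting from the AC-8 example name, `write_with_sidecar` and `verify` only imitate what AZ-280 is described as doing (temp file plus `os.replace`, SHA-256 sidecar at `{target_path}.sha256`), and `compile_fn` plays the role of a fake `InferenceRuntime.compile_engine`. Logging, timing, and the `EngineCacheEntry` DTO are left out.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def engine_filename(model: str, sm: int, jp: str, trt: str, precision: str) -> str:
    """Assumed formatting, inferred from the AC-8 example filename."""
    return (f"{model}_sm{sm}_jp{jp.replace('.', '')}"
            f"_trt{trt.replace('.', '')}_{precision}.engine")


def _atomic_write(path: Path, data: bytes) -> None:
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)


def write_with_sidecar(target: Path, data: bytes) -> None:
    """Hypothetical stand-in for the AZ-280 helper."""
    target.parent.mkdir(parents=True, exist_ok=True)
    _atomic_write(target, data)
    digest = hashlib.sha256(data).hexdigest().encode("ascii")
    _atomic_write(Path(str(target) + ".sha256"), digest)


def verify(target: Path) -> bool:
    """True only if engine and sidecar both exist and the digest matches."""
    sidecar = Path(str(target) + ".sha256")
    if not (target.exists() and sidecar.exists()):
        return False
    actual = hashlib.sha256(target.read_bytes()).hexdigest()
    return actual == sidecar.read_text().strip()


def compile_engines(models, host, cache_root: Path, compile_fn):
    """Return (model, outcome) pairs; compile_fn(model) -> bytes is the fake runtime."""
    results = []
    for model in models:
        target = cache_root / "engines" / engine_filename(model, **host)
        if verify(target):
            results.append((model, "reused"))
            continue
        write_with_sidecar(target, compile_fn(model))
        results.append((model, "built"))
    return results


# Usage: cold run, warm run, then a run on different hardware.
host = {"sm": 87, "jp": "6.2", "trt": "10.3", "precision": "fp16"}
compile_calls = []


def fake_compile(model: str) -> bytes:
    compile_calls.append(model)
    return f"engine-bytes-for-{model}".encode()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cold = compile_engines(["dinov2_vpr", "lightglue"], host, root, fake_compile)
    warm = compile_engines(["dinov2_vpr", "lightglue"], host, root, fake_compile)
    moved = compile_engines(["dinov2_vpr"], dict(host, sm=89), root, fake_compile)
    name = engine_filename("dinov2_vpr", **host)
```

The sketch reads the whole engine into memory to hash it, which is fine for a few bytes of test data; a real helper hashing a 200 MB engine would more plausibly stream the file in chunks.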
+ +## Acceptance Criteria + +**AC-1: Cold cache compiles every backbone** +Given an empty `cache_root/engines/` and 3 backbones in `BackboneSpec[]` +When `compile_engines_for_corpus(request)` is called +Then 3 `EngineCacheEntry` are returned, all with `outcome = built`; 3 `.engine` files + 3 `.sha256` sidecars are present at `cache_root/engines/`; ONE WARN log per backbone (`c10.engine.cache.miss`); ONE INFO log summary with `engines_built=3, engines_reused=0` + +**AC-2: Warm cache reuses every backbone** +Given the same `cache_root/engines/` populated by a prior cold run +When `compile_engines_for_corpus(request)` is called with identical request +Then 3 `EngineCacheEntry` are returned, all `outcome = reused`; ZERO calls to `inference_runtime.compile_engine` (verifiable via spy); ONE INFO log per backbone (`c10.engine.cache.hit`); summary log shows `engines_reused=3` + +**AC-3: Mixed cache (1 hit + 2 miss)** +Given the cache contains only the DINOv2 engine; LightGlue and ALIKED are missing +When `compile_engines_for_corpus(request)` is called +Then DINOv2 → reused, LightGlue + ALIKED → built; the report shows `engines_built=2, engines_reused=1` + +**AC-4: Hardware change invalidates cache** +Given a cache populated for `(sm=87, jp=6.2, trt=10.3, fp16)` and the runtime now reports `(sm=89, jp=6.3, trt=10.5, fp16)` +When `compile_engines_for_corpus(request)` is called +Then ALL backbones have `outcome = built` (the filename differs, so the existing engines are not even consulted); the existing engines remain on disk (this task does NOT delete stale engines — that's the orchestrator's call) + +**AC-5: Tampered sidecar invalidates that one engine** +Given a `.engine` file matches its sidecar but a malicious actor flipped a bit in the sidecar (or the engine bytes drifted) +When `compile_engines_for_corpus(request)` is called +Then `sidecar.verify` returns `False` for that entry; that backbone is recompiled (`outcome = built`); ONE WARN log 
`kind="c10.engine.sidecar.mismatch"` with the offending path + +**AC-6: `EngineBuildError` propagates without partial state** +Given `inference_runtime.compile_engine` raises `EngineBuildError("CUDA OOM")` on the second of 3 backbones +When `compile_engines_for_corpus(request)` is called +Then `EngineBuildError` is raised; the first backbone's engine + sidecar ARE present (already-written cache reuse from prior runs); the second backbone's engine is NOT half-written (atomic write); the third backbone is NOT attempted; ONE ERROR log with the model name + +**AC-7: `CalibrationCacheError` propagates with diagnostic** +Given `inference_runtime.compile_engine` raises `CalibrationCacheError("calibration table missing for INT8")` +When the compiler hits the failing backbone +Then the error propagates; ONE ERROR log with `{model_name, calibration_path}`; partial state is consistent (atomic writes guarantee no half-engine on disk) + +**AC-8: Filename schema + sidecar layout matches spec** +Given a freshly-built DINOv2 engine on Tier-2 hardware (SM 87, JP 6.2, TRT 10.3, FP16) +When inspecting `cache_root/engines/` +Then the file is named `dinov2_vpr_sm87_jp62_trt103_fp16.engine`; the sidecar at `dinov2_vpr_sm87_jp62_trt103_fp16.engine.sha256` contains the 64-char hex digest; both match `EngineFilenameSchema.parse` and `Sha256Sidecar.verify` + +**AC-9: `compile_duration_s` recorded for built; None for reused** +Given a mix of hits and misses +When inspecting `EngineCacheEntry` +Then `compile_duration_s is not None` for every `built` entry; `compile_duration_s is None` for every `reused` entry; built durations are positive floats + +**AC-10: Empty `backbones` list returns empty result** +Given `request.backbones == ()` +When `compile_engines_for_corpus(request)` is called +Then `[]` is returned; ZERO calls to `inference_runtime.compile_engine`; ZERO files written; ONE INFO log summary with all-zero counts + +## Non-Functional Requirements + +**Performance** +- Cache-hit path 
per backbone ≤ 100 ms, excluding the SHA-256 pass inside the sidecar verify (one filename construction + one `Path.exists` + the verify call overhead). The SHA-256 pass itself is bounded by disk read bandwidth: for a 200 MB engine it takes ~1 s on NVMe (measure and document), which is why the NFR-perf-cache-hit test allows p99 ≤ 1.5 s. +- Cold compile is dominated by AZ-298's TensorRT runtime; this task imposes no additional time budget beyond AZ-298's. + +**Compatibility** +- AZ-281 (`EngineFilenameSchema`) and AZ-280 (`Sha256Sidecar`) are the schema and atomic-write helpers; this task introduces NO new third-party dependencies. + +**Reliability** +- Atomic writes via AZ-280 guarantee no half-engine on disk after a process kill. +- Cache-miss recompile is idempotent — running the same compile twice produces identical bytes (TRT engine determinism is owned by AZ-298; this task assumes it). + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Empty cache_root + 3 backbones | All `built`; sidecars present | +| AC-2 | Warm cache + identical request | All `reused`; zero `compile_engine` calls | +| AC-3 | Cache populated for 1 of 3 backbones | 1 reused + 2 built | +| AC-4 | Hardware change (different SM in fake runtime) | All `built`; old engines untouched | +| AC-5 | Tampered sidecar (flip 1 byte) | That engine rebuilds; WARN log | +| AC-6 | Fake runtime raises `EngineBuildError` mid-run | Error propagates; partial state consistent | +| AC-7 | Fake runtime raises `CalibrationCacheError` | Error propagates with diagnostic | +| AC-8 | Inspect filename + sidecar layout | Matches schema; both verify | +| AC-9 | Compile_duration recorded | Set on `built`, None on `reused` | +| AC-10 | Empty backbones | Empty result; zero side effects | +| NFR-perf-cache-hit | Microbench cache-hit path × 100 with 200 MB engine | p99 ≤ 1.5 s (mostly SHA-256 read) | +| NFR-reliability-atomic-write | Kill process mid-`compile_engine` | No half-engine on disk after restart | + +## Constraints + +- The filename schema is
canonical via AZ-281; this task does NOT invent its own (per `coderule.mdc` "follow established project patterns"). +- The atomic-write + sidecar pattern is canonical via AZ-280; this task does NOT use `open(...).write()` or naked `pathlib.Path.write_bytes()`. +- Cache hit is decided by `sidecar.verify` (file SHA-256 matches sidecar value); filename match alone is NOT sufficient (defends against bit-rot or bit-flip). +- The `BackboneSpec` registry is config-driven; adding a new model is a config change, not a code change. +- This task does NOT clean up stale engines (the orchestrator T5 may emit `ManifestCoverageError` on orphan files; cleanup is the operator's call). +- This task introduces no new third-party dependencies. + +## Risks & Mitigation + +**Risk 1: SHA-256 verification of large engines is slow on warm path** +- *Risk*: 200 MB engine × 5 backbones = 1 GB SHA-256 per warm idempotent run; on slow disks, this exceeds C10-PT-01's 1 min budget alone. +- *Mitigation*: AZ-280's `Sha256Sidecar.verify` uses `sendfile` / `mmap` paths where available; benchmark documented in AZ-280. If still too slow, a future task adds an `mtime + size` quick-check fallback (out of scope this cycle). + +**Risk 2: Partial cache after `EngineBuildError` on backbone N** +- *Risk*: Backbones 1..N-1 are `built` and on disk; the N-th fails; backbones N+1..M are never attempted. The cache is "partially valid" — the orchestrator (T5) sees inconsistent state. +- *Mitigation*: T5's coverage check + `ManifestCoverageError` surface this. The compiler does NOT delete the partial state; T5 decides whether to retry, fail, or roll back per the operator's request mode. + +**Risk 3: TensorRT engine determinism not guaranteed across builds** +- *Risk*: Two compiles of the same ONNX + calibration produce different bytes; cache-hit detection via SHA-256 fails post-rebuild. 
+- *Mitigation*: TRT engine determinism is AZ-298's contract obligation; if it fails, this task's cache-hit ratio drops to 0 and operators see WARN logs. AZ-298's tests assert determinism; this task assumes it. + +**Risk 4: Operator manually edits engine file but not sidecar** +- *Risk*: Hand-debugging or manual tuning leaves an engine file whose bytes don't match its sidecar; AC-5 covers detection. +- *Mitigation*: AC-5 + WARN log `c10.engine.sidecar.mismatch` surface the case immediately on next compile run; operators should re-generate via the build command. + +## Runtime Completeness + +- **Named capability**: TRT engine compile + hardware-tied cache reuse per D-C10-6 + D-C10-7 (description.md § 5; epic § Acceptance C10-IT-05; AC-NEW-1). +- **Production code that must exist**: real `EngineCompiler` orchestrating real AZ-298 `compile_engine` + real AZ-280 atomic write/verify + real AZ-281 filename construction; real config-driven `BackboneSpec` registry. +- **Allowed external stubs**: tests MAY use a fake `InferenceRuntime` that returns scripted bytes + a fake `host_info()` for hardware variation; production wiring uses the real AZ-298 runtime + real Sha256Sidecar. +- **Unacceptable substitutes**: a Python-level `pickle` of a "fake engine" object (TRT engines are opaque CUDA blobs; faking them in production breaks AC-NEW-1's takeoff verify); skipping the sidecar (loses bit-rot detection); inventing a new filename scheme inside this task (defeats D-C10-7); `Path.write_bytes()` instead of AZ-280 (no atomicity guarantee). 
diff --git a/_docs/02_tasks/todo/AZ-322_c10_descriptor_batcher.md b/_docs/02_tasks/todo/AZ-322_c10_descriptor_batcher.md new file mode 100644 index 0000000..2c68102 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-322_c10_descriptor_batcher.md @@ -0,0 +1,208 @@ +# C10 Descriptor Batcher — Embed Corpus Tiles via C2 Backbone + Write FAISS + +**Task**: AZ-322_c10_descriptor_batcher +**Name**: C10 Descriptor Batcher +**Description**: Implement `DescriptorBatcher`, the C10-internal phase that walks every tile in C6 for the requested `(bbox, zoom_levels)`, runs them through C2's VPR backbone (via the C7 engine produced by AZ-321) in batches sized for the operator workstation's GPU, collects the resulting fixed-dimension descriptors, and rebuilds the FAISS HNSW index via AZ-303's `DescriptorIndex.rebuild_from_descriptors`. Handles CUDA OOM with halve-and-retry; surfaces per-batch progress via DEBUG logs and a callback. Returns a `DescriptorBatchReport` with `descriptors_generated`, `tiles_consumed`, `oom_retries`, `elapsed_s`. Defines a thin C10-internal `BackboneEmbedder` Protocol with one method `embed_batch(tile_pixels: list[TilePixelHandle]) -> ndarray`; the concrete impl is supplied by E-C2 (AZ-255) later via a thin adapter, OR a direct call into the AZ-321-produced engine if E-C2 ships a public embed API by then. +**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-303_c6_storage_interfaces, AZ-306_c6_faiss_descriptor_index, AZ-321_c10_engine_compiler +**Component**: c10_provisioning (epic AZ-252 / E-C10) +**Tracker**: AZ-322 +**Epic**: AZ-252 (E-C10) + +### Document Dependencies + +- `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md` — `query_by_bbox` (read tile list) and `tile_store.read_tile_pixels` (read tile bytes via mmap handle). +- `_docs/02_document/contracts/c6_tile_cache/descriptor_index.md` — `rebuild_from_descriptors` (atomic write target). 
+- `_docs/02_document/components/11_c10_provisioning/description.md` — § 5 `DescriptorBatchError` handling; § 7 GPU-bound bottleneck. + +## Problem + +Without a real descriptor batcher: + +- AC-NEW-1's takeoff verify has no FAISS index to verify; the airborne C2 VPR step returns empty top-k. +- AC-8.1 collapses partially — even with imagery in C6, the airborne system cannot localize without descriptors. +- The C10-PT-01 cold-build budget (≤ 12 min) is unobservable; the descriptor phase is the dominant cost on Jetson. +- D-C10-3's "every artifact in Manifest" requirement (AC-NEW-1) cannot list `.index` artifacts that don't exist. +- CUDA OOM during build is the most common failure mode operators hit per the description.md § 5; without a structured halve-and-retry, every OOM is a manual restart. +- Per-batch progress is invisible — operators staring at a `c10 build` command for 8+ minutes see nothing without DEBUG logs they don't enable. + +This task delivers the embed-and-write phase. It does NOT compile engines (AZ-321) or write the Manifest (T3) or orchestrate idempotence (T5). + +## Outcome + +- A `DescriptorBatcher` class at `src/gps_denied_onboard/components/c10_provisioning/descriptor_batcher.py`: + - Constructor: `__init__(self, *, backbone_embedder: BackboneEmbedder, tile_metadata_store: TileMetadataStore, tile_store: TileStore, descriptor_index: DescriptorIndex, logger: Logger, clock: Clock, config: C10BatcherConfig)`. + - `C10BatcherConfig` (`@dataclass(frozen=True)`): `initial_batch_size: int = 64`, `max_oom_retries: int = 1`, `progress_callback: Callable[[ProgressEvent], None] | None = None`. + - Public method: `populate_descriptors(corpus_filter: CorpusFilter) -> DescriptorBatchReport`. + - `CorpusFilter` (`@dataclass(frozen=True)`): `bbox: Bbox`, `zoom_levels: tuple[int, ...]`, `sector_class: SectorClassification`. 
+ - `DescriptorBatchReport` (`@dataclass(frozen=True)`): `descriptors_generated: int`, `tiles_consumed: int`, `oom_retries: int`, `elapsed_s: float`, `outcome: enum {success, failure}`, `failure_reason: str | None`. +- A `BackboneEmbedder` Protocol at `src/gps_denied_onboard/components/c10_provisioning/interface.py`: + ```python + @runtime_checkable + class BackboneEmbedder(Protocol): + def embed_batch(self, tiles: list[TilePixelHandle]) -> np.ndarray: ... + def descriptor_dim(self) -> int: ... + ``` +- Method flow: + 1. Call `tile_metadata_store.query_by_bbox(bbox=corpus_filter.bbox, zoom_levels=corpus_filter.zoom_levels, sector_class=corpus_filter.sector_class)` → list of `TileMetadata` rows. If empty → return `DescriptorBatchReport(outcome=failure, failure_reason="no tiles in C6 for the requested scope; run C11 TileDownloader first")` per description.md § 5. + 2. Open every tile via `tile_store.read_tile_pixels(tile_id)` lazily (context manager; release after each batch). + 3. Walk tiles in batches of `current_batch_size` (initially `config.initial_batch_size`): + - Call `backbone_embedder.embed_batch(tile_pixel_handles)` → `np.ndarray` of shape `(batch_size, descriptor_dim)`. + - On `DescriptorBatchError("CUDA OOM")`: + - If `oom_retries < config.max_oom_retries` AND `current_batch_size > 1`: halve `current_batch_size`, increment `oom_retries`, re-run THIS batch with the smaller size. + - Else: raise `DescriptorBatchError` with full context (batch index, tile ids, current batch size). + - Append the descriptors to a running buffer; record `(tile_id, descriptor_row_index)` mapping. + - Emit a `ProgressEvent(tiles_done, tiles_total, current_batch_size, elapsed_s)` via `config.progress_callback` if set. + - Emit DEBUG log every 10% progress (`c10.descriptor.progress`). + 4. After all tiles consumed: + - Construct the descriptor `np.ndarray` of shape `(tiles_consumed, descriptor_dim)`.
+ - Construct the int64 id mapping per AZ-306's documented scheme (`int64(sha256(zoom|lat|lon|source).first8bytes)`). + - Call `descriptor_index.rebuild_from_descriptors(descriptors, ids, hnsw_params)` — this writes the `.index` file atomically via AZ-280. + - Return `DescriptorBatchReport(outcome=success, descriptors_generated=tiles_consumed, ...)`. +- Composition root constructs `DescriptorBatcher` with a `BackboneEmbedder` impl. Initially this is a thin `C7EngineBackboneEmbedder` that wraps `inference_runtime.run_engine(engine_path, batch)`; when E-C2 (AZ-255) ships, an adapter wires C2's public embed surface in (one-line factory swap). +- INFO log on session start (with batch counts); DEBUG on per-10% progress; WARN on every OOM retry; ERROR on terminal `DescriptorBatchError`. + +## Scope + +### Included + +- `DescriptorBatcher` class with the single public method. +- `BackboneEmbedder` Protocol declaration + a default `C7EngineBackboneEmbedder` adapter that wraps the AZ-298 inference runtime + the AZ-321-produced engine path. +- `CorpusFilter`, `DescriptorBatchReport`, `ProgressEvent`, `C10BatcherConfig` DTOs. +- CUDA OOM halve-and-retry logic. +- Atomic FAISS index rebuild via AZ-303/306's Protocol. +- Progress callback + DEBUG log emission. +- Composition-root factory `build_descriptor_batcher`. +- Conformance test for `BackboneEmbedder` Protocol. + +### Excluded + +- The actual C2 VPR backbone — owned by E-C2 (AZ-255). +- TensorRT engine compile — owned by AZ-321 (the engine the embedder runs). +- Manifest writing — owned by T3. +- Tile download — owned by E-C11 (AZ-316). +- HNSW parameter selection — `hnsw_params` is config-driven; the orchestrator T5 supplies them. The batcher does NOT pick them. +- Multi-GPU / batched-across-GPUs — operator workstation is single-GPU per RESTRICT-OPS-2. 
+- Resumability mid-batch — if the process is killed at batch N of M, the next run starts from batch 0; descriptors are only written in one shot via `rebuild_from_descriptors` (atomic). Documented constraint. + +## Acceptance Criteria + +**AC-1: Happy path embeds all tiles and rebuilds index** +Given C6 contains 1000 tiles for the requested bbox + zoom_levels +When `populate_descriptors(filter)` is called +Then `embed_batch` is called `ceil(1000 / 64) = 16` times; the final descriptor array has shape `(1000, descriptor_dim)`; `descriptor_index.rebuild_from_descriptors` is called ONCE with this array; report shows `descriptors_generated=1000, tiles_consumed=1000, oom_retries=0, outcome=success` + +**AC-2: CUDA OOM halves batch size and retries** +Given `embed_batch` raises `DescriptorBatchError("CUDA OOM")` on the first call with batch_size=64 +When the batcher catches the OOM +Then `embed_batch` is called again with batch_size=32 (halved); `oom_retries` becomes 1; if 32 succeeds, the run continues with batch_size=32 for subsequent batches; ONE WARN log `c10.descriptor.oom.retry` + +**AC-3: Persistent OOM after halve-retry exhausted raises** +Given `embed_batch` raises `DescriptorBatchError("CUDA OOM")` at every batch size from 64 down to 1, and `max_oom_retries=1` +When the batcher exhausts retries +Then `DescriptorBatchError` is raised with the final batch_size + tile_ids context; ZERO `rebuild_from_descriptors` calls; ONE ERROR log + +**AC-4: Empty corpus surfaces as failure with explicit hint** +Given C6 has zero tiles for the requested scope +When `populate_descriptors(filter)` is called +Then `outcome=failure`, `failure_reason="no tiles in C6 for the requested scope; run C11 TileDownloader first"`; ZERO `embed_batch` calls; ONE ERROR log directing the operator to run C11 + +**AC-5: Progress callback fires every 10%** +Given a 1000-tile corpus and a callback spy +When `populate_descriptors(filter)` is called +Then the callback fires at 10%, 20%, ..., 100% (10 
times); each event carries `tiles_done`, `tiles_total=1000`, `current_batch_size`, `elapsed_s` + +**AC-6: Descriptor id mapping matches AZ-306's scheme** +Given the same tile (zoom=18, lat=49.5, lon=37.0, source=googlemaps) +When the batcher computes the int64 id +Then the value equals `int.from_bytes(sha256(b"18|49.5|37.0|googlemaps").digest()[:8], "big", signed=True)`; the same call elsewhere produces the same id (deterministic across runs) + +**AC-7: Atomic FAISS rebuild — partial write impossible** +Given the FAISS index already exists from a prior run +When `populate_descriptors` is killed mid-`rebuild_from_descriptors` +Then either the previous-good index OR the new index is on disk; never a half-written `.index`. (AZ-303/306's contract guarantees atomicity; this AC just asserts the batcher does not bypass it.) + +**AC-8: BackboneEmbedder Protocol is conformance-checkable** +Given a concrete `C7EngineBackboneEmbedder` instance +When `isinstance(impl, BackboneEmbedder)` is checked under `runtime_checkable` +Then the result is `True`; a fake omitting `descriptor_dim` returns `False` + +**AC-9: descriptor_dim matches across embed_batch and HNSW params** +Given `backbone_embedder.descriptor_dim() == 512` +When `embed_batch` returns an array +Then the array's last axis is 512; if a future drift produces 768, raise `DescriptorBatchError("descriptor_dim mismatch")` BEFORE writing to FAISS + +**AC-10: Progress + DEBUG logs do not pull the private engine bytes** +Given a session with the C7-engine-backed embedder +When all DEBUG logs are captured +Then engine bytes do NOT appear in any log; only metadata (batch_size, tile_ids, elapsed_s) is logged + +## Non-Functional Requirements + +**Performance** +- Embed throughput is dominated by AZ-321's engine + the embedder; this task adds ≤ 5% overhead (lazy mmap handles + numpy concatenation). +- The 1000-tile corpus should complete in ≤ 5 min on Tier-1 dev workstation (assumes 50ms per batch of 64; envelope only). 
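As a rough illustration of the AC-6 id formula, the sketch below derives a signed int64 id from the first 8 bytes of a SHA-256 digest. The function name `descriptor_id` and the reliance on Python's default float formatting are assumptions made for this example only; AZ-306 remains the owner of the canonical key format.

```python
import hashlib


def descriptor_id(zoom: int, lat: float, lon: float, source: str) -> int:
    """Derive a signed int64 id from the first 8 bytes of a SHA-256 digest."""
    # Assumption: the key is the pipe-joined string shown in AC-6, with floats
    # rendered by Python's default formatting (49.5 -> "49.5", 37.0 -> "37.0").
    key = f"{zoom}|{lat}|{lon}|{source}".encode()
    digest = hashlib.sha256(key).digest()
    # Big-endian, signed, so the value fits the int64 id space used by the index.
    return int.from_bytes(digest[:8], "big", signed=True)
```

Because the id depends only on the key bytes, calling the function twice with the same tile coordinates yields the same value across runs, which is the property AC-6 asks for.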
+ +**Compatibility** +- `numpy` per project pin; `pathlib` stdlib; AZ-303 + AZ-306 + AZ-321 dependencies pinned. +- No new third-party dependencies. + +**Reliability** +- Halve-and-retry is bounded by `max_oom_retries`; default 1 (so 64→32, then either succeeds or raises); higher values trade latency for completion probability. +- The atomic FAISS rebuild relies on AZ-303/306's contract; this task does not fork its own write path. +- `descriptor_dim` mismatch is caught before FAISS write to prevent corrupting an existing valid index. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | 1000-tile corpus + fake embedder | 16 batches; rebuild called once; outcome=success | +| AC-2 | Fake embedder raises OOM at batch_size=64; succeeds at 32 | retry happens; oom_retries=1 | +| AC-3 | Fake embedder always OOMs | DescriptorBatchError raised; no rebuild call | +| AC-4 | Empty corpus | outcome=failure; explicit hint; zero embeds | +| AC-5 | 1000 tiles + spy callback | 10 callback events | +| AC-6 | Compute id for sample tile | Matches sha256 first-8-bytes formula | +| AC-7 | Kill mid-rebuild + restart | No half-index (AZ-306's atomic write) | +| AC-8 | `isinstance` check on impl + partial fake | True / False | +| AC-9 | Embedder returns wrong dim | DescriptorBatchError before FAISS write | +| AC-10 | Capture all DEBUG logs | No engine bytes; only metadata | +| NFR-perf-overhead | 1000-tile bench with no-op embedder | ≤ 5% overhead vs raw embed sum | +| NFR-reliability-bounded-retry | Embedder OOM × 5 with max_oom_retries=1 | Raises after 1 retry, not 5 | + +## Constraints + +- `BackboneEmbedder` Protocol surface is intentionally narrow (2 methods); future C2 wiring adapts via the composition root, not by modifying this task. +- `embed_batch` MUST be called with a list of mmap-backed `TilePixelHandle` (per AZ-303); raw bytes are NOT accepted (would defeat AZ-303's read-only invariant). 
+- The descriptor id formula is canonical via AZ-306; this task does NOT invent its own. +- `rebuild_from_descriptors` is the ONLY write path to the FAISS index in this task; consumers do NOT touch the `.index` file directly. +- Halve-and-retry is bounded; unlimited retries are NOT permitted (would mask GPU regressions). +- This task introduces no new third-party dependencies. + +## Risks & Mitigation + +**Risk 1: BackboneEmbedder Protocol drifts from E-C2's eventual surface** +- *Risk*: When E-C2 (AZ-255) ships, its natural public method might be `embed_query(image: np.ndarray)` not `embed_batch(list[TilePixelHandle])`. +- *Mitigation*: A thin adapter at the C10/C2 boundary translates; the Protocol's two-method surface is small enough that wrapping is trivial. AZ-321 already produces the engine; if E-C2 ships its own public embed API, the C7-backed adapter is replaced via composition root. + +**Risk 2: Halve-and-retry hides a real GPU regression** +- *Risk*: Persistent OOM at batch_size=1 indicates a deeper issue (memory fragmentation, model leak); halving repeatedly down to 1 wastes time. +- *Mitigation*: `max_oom_retries=1` by default — at most one halve. If 32 still OOMs, the run fails fast with full context for operator triage. + +**Risk 3: Descriptor array memory pressure** +- *Risk*: 100k tiles × 512-dim float32 = 200 MB in one numpy array; on small operator workstations this is OK but multiplies for higher-dim backbones (e.g., 1024 → 400 MB). +- *Mitigation*: AZ-306's `rebuild_from_descriptors` accepts a streamed iterator if added later; for now the in-memory approach is documented and bounded by the operator workstation's RAM (RESTRICT-OPS-1 sets a 16 GB floor). + +**Risk 4: Empty corpus is a silent operator mistake** +- *Risk*: Operator forgets to run C11 first; the build silently produces an empty index. +- *Mitigation*: AC-4 + ERROR log + explicit `failure_reason` hint surface immediately; the orchestrator T5 fails the build without writing a Manifest. 
+ +**Risk 5: descriptor_dim mismatch is detected too late** +- *Risk*: All 1000 tiles embed successfully but at the wrong dim; FAISS index is rebuilt with the wrong shape; takeoff verify fails. +- *Mitigation*: AC-9 checks the array's last axis BEFORE the rebuild call; cheap dim check at every batch boundary. + +## Runtime Completeness + +- **Named capability**: descriptor batched generation through C2 backbone over the corpus, FAISS index rebuild, GPU-bound throughput envelope per C10-PT-01 (description.md § 5; epic § Acceptance C10-IT-01). +- **Production code that must exist**: real `DescriptorBatcher` orchestrating real `BackboneEmbedder` (initially `C7EngineBackboneEmbedder` wrapping AZ-298) + real AZ-303/306 `rebuild_from_descriptors`; real OOM halve-and-retry; real progress emission. +- **Allowed external stubs**: tests MAY use a fake `BackboneEmbedder` that returns scripted descriptor arrays + a fake `tile_metadata_store` (already provided by AZ-303 conformance fakes); production wiring uses the real AZ-298 runtime + real C6. +- **Unacceptable substitutes**: a "deterministic descriptor" fake in production (defeats the entire localization pipeline); skipping the OOM retry (every transient OOM becomes a manual restart); writing to FAISS via raw `numpy.tofile` (bypasses AZ-306's atomic write); fabricating descriptor ids that don't match AZ-306's int64 sha256 scheme (breaks AC-6 and the takeoff verify). 
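The halve-and-retry loop described in the method flow of this task could look roughly like the following. This is a minimal sketch under stated assumptions, not the task's implementation: `embed_with_halving`, `BatchOutcome`, and the plain-list descriptors are hypothetical stand-ins (the real batcher works with `TilePixelHandle` and `np.ndarray`), and the local `DescriptorBatchError` only mimics the error type named in the spec.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


class DescriptorBatchError(RuntimeError):
    """Hypothetical stand-in for the C10 error type named in the task."""


@dataclass(frozen=True)
class BatchOutcome:
    descriptors: List[List[float]]
    oom_retries: int
    final_batch_size: int


def embed_with_halving(
    tile_ids: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    initial_batch_size: int = 64,
    max_oom_retries: int = 1,
) -> BatchOutcome:
    batch_size = initial_batch_size
    oom_retries = 0
    descriptors: List[List[float]] = []
    start = 0
    while start < len(tile_ids):
        batch = tile_ids[start:start + batch_size]
        try:
            rows = embed_batch(batch)
        except DescriptorBatchError as exc:
            if "CUDA OOM" not in str(exc):
                raise  # not an OOM: propagate unchanged
            if oom_retries < max_oom_retries and batch_size > 1:
                batch_size //= 2
                oom_retries += 1
                continue  # re-run the same batch start with the smaller size
            # Retry budget exhausted: fail fast with context for triage.
            raise DescriptorBatchError(
                f"CUDA OOM at batch_size={batch_size}, tiles={list(batch)}"
            ) from exc
        descriptors.extend(rows)
        start += len(batch)
    return BatchOutcome(descriptors, oom_retries, batch_size)
```

With the default `max_oom_retries=1`, a single OOM at 64 drops the run to 32 for all remaining batches, and a second OOM raises immediately, which matches the bounded behaviour the Reliability section asks for.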
diff --git a/_docs/02_tasks/todo/AZ-323_c10_manifest_builder.md b/_docs/02_tasks/todo/AZ-323_c10_manifest_builder.md new file mode 100644 index 0000000..236ed0c --- /dev/null +++ b/_docs/02_tasks/todo/AZ-323_c10_manifest_builder.md @@ -0,0 +1,243 @@ +# C10 Manifest Builder — Content-Hash Table + Operator-Key Ed25519 Signing + +**Task**: AZ-323_c10_manifest_builder +**Name**: C10 Manifest Builder +**Description**: Implement `ManifestBuilder`, the C10-internal phase that produces the signed cache Manifest covering EVERY shipped artifact (engines, FAISS index, calibration JSON, all tile hashes from C6) plus the build-identity tuple `(model_ids, calibration_sha256, sorted_tile_hashes, sector_class, bbox, zoom_levels)` whose canonical hash is `manifest_hash` — the D-C10-1 idempotence key. Serializes the Manifest as canonical JSON (sorted keys, no whitespace) at `cache_root/Manifest.json`, computes its own SHA-256 sidecar via AZ-280, and writes a detached Ed25519 signature at `cache_root/Manifest.json.sig` using the operator's signing key from `key_path`. Refuses to sign with a non-operator key when `config.c10.signing_mode = "operator"` (C10-ST-01). Emits the `signing_public_key_fingerprint` into the Manifest itself so verifiers can pin the trust root. +**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema, AZ-303_c6_storage_interfaces +**Component**: c10_provisioning (epic AZ-252 / E-C10) +**Tracker**: AZ-323 +**Epic**: AZ-252 (E-C10) + +### Document Dependencies + +- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — atomic write + sidecar pattern (AZ-280). +- `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md` — `query_by_bbox` returning per-tile sha256 set by AZ-316. +- `_docs/02_document/components/11_c10_provisioning/description.md` — § 1 idempotence, § 5 `ManifestWriteError`, § 7 D-C10-3 sidecar coverage. 
+ +## Problem + +Without a real Manifest builder: + +- D-C10-1 (idempotent re-run via manifest hash) cannot be implemented — T5's "did anything change?" check has no canonical hash to compare. +- D-C10-3 (SHA-256 content-hash gate over every shipped artifact) is unobservable — the takeoff verifier (T4) has nothing to verify against. +- AC-NEW-1 ("no engine deserialization at takeoff before manifest verify") collapses without a signed Manifest at takeoff. +- C10-ST-01 (build refuses dev-key signing in operator mode) cannot be enforced without a signing key check. +- The `signing_public_key_fingerprint` field is the trust anchor for the airborne `ManifestVerifier`; without it, the verifier cannot decide which key is allowed to vouch for a Manifest. +- A Manifest that is huge (100k tile hashes × 80 bytes = 8 MB) but human-inspectable is operator-friendly; without canonical JSON ordering, two builds of the same input produce different bytes and break idempotence. + +This task delivers the Manifest serialization + signing. It does NOT compile engines (AZ-321), embed tiles (AZ-322), or run the takeoff verify (T4). + +## Outcome + +- A `ManifestBuilder` class at `src/gps_denied_onboard/components/c10_provisioning/manifest_builder.py`: + - Constructor: `__init__(self, *, sidecar: Sha256Sidecar, signer: ManifestSigner, tile_metadata_store: TileMetadataStore, logger: Logger, clock: Clock, config: C10ManifestConfig)`. + - `C10ManifestConfig` (`@dataclass(frozen=True)`): `signing_mode: enum {operator, dev}`, `allowed_operator_fingerprints: tuple[str, ...]`, `schema_version: str = "1.0"`. + - Public method: `build_manifest(input: ManifestBuildInput) -> ManifestArtifact`. + - `ManifestBuildInput` (`@dataclass(frozen=True)`): `cache_root: Path`, `bbox: Bbox`, `zoom_levels: tuple[int, ...]`, `sector_class: SectorClassification`, `engine_entries: tuple[EngineCacheEntry, ...]`, `descriptor_index_path: Path`, `calibration_path: Path`, `key_path: Path`. 
+ - `ManifestArtifact` (`@dataclass(frozen=True)`): `manifest_path: Path`, `signature_path: Path`, `manifest_hash: str`, `signing_public_key_fingerprint: str`, `total_artifacts_listed: int`. +- A `ManifestSigner` Protocol at `src/gps_denied_onboard/components/c10_provisioning/interface.py`: + ```python + @runtime_checkable + class ManifestSigner(Protocol): + def load_signing_key(self, key_path: Path) -> SigningKeyHandle: ... + def sign(self, key: SigningKeyHandle, payload_bytes: bytes) -> bytes: ... + def public_key_fingerprint(self, key: SigningKeyHandle) -> str: ... + ``` + Default impl `Ed25519ManifestSigner` uses the `cryptography` library (already pinned via AZ-318 for per-flight keys). +- Method flow: + 1. Load operator signing key: `signer.load_signing_key(input.key_path)` → `SigningKeyHandle`. + 2. Compute `signing_public_key_fingerprint = signer.public_key_fingerprint(key)` (sha256 of the raw 32-byte ed25519 public key, hex). + 3. **Operator-mode gate (C10-ST-01)**: if `config.signing_mode == "operator"` AND `fingerprint not in config.allowed_operator_fingerprints` → raise `ManifestWriteError("signing key fingerprint not in allowed_operator_fingerprints")`; ERROR log with the offending fingerprint. If `config.signing_mode == "dev"` AND fingerprint matches an allowed operator fingerprint → emit WARN `c10.manifest.dev_mode_with_operator_key` (operator key being used in dev mode is suspicious but allowed). + 4. Compute per-artifact hashes: + - For each engine entry: read `entry.engine_sha256_hex` (already computed by AZ-321; do NOT re-hash). + - For descriptor index: call `sidecar.read_sidecar(input.descriptor_index_path)` → expect a 64-char hex digest. + - For calibration JSON: `sha256_hex(open(calibration_path, 'rb').read())` — calibration is small (KB). + - For tiles: call `tile_metadata_store.query_by_bbox(bbox, zoom_levels, sector_class)` → list of `TileMetadata` with `sha256_hex` field (set by AZ-316). Sort by `(zoom, lat, lon, source)` for determinism. 
Compute `tiles_coverage_sha256 = sha256(b"\n".join(f"{t.tile_id}:{t.sha256_hex}".encode() for t in sorted_tiles))`. + 5. Build the canonical Manifest dict: + ``` + { + "schema_version": "1.0", + "build": { + "bbox": {...}, + "zoom_levels": [16, 17, 18], + "sector_class": "stable_rear", + "built_at": "2026-05-10T12:00:00Z", + "manifest_hash": "" + }, + "artifacts": { + "engines": [{"path": "engines/dinov2_vpr_sm87_jp62_trt103_fp16.engine", "sha256": ""}, ...], + "descriptor_index": {"path": "descriptors/corpus.index", "sha256": ""}, + "calibration": {"path": "calibration/int8_calibration.json", "sha256": ""}, + "tiles_coverage": {"sha256": "", "tile_count": } + }, + "signing_public_key_fingerprint": "" + } + ``` + 6. Compute `manifest_hash` as `sha256(canonical_json(build_identity_tuple))` where `build_identity_tuple = sorted({model_ids, calibration_sha256, tiles_coverage_sha256, sector_class, bbox, zoom_levels})`. This is the D-C10-1 idempotence key. Insert into the Manifest dict at `build.manifest_hash` AFTER computation. + 7. Serialize the Manifest dict as canonical JSON bytes: `canonical_json_bytes = orjson.dumps(manifest, option=orjson.OPT_SORT_KEYS | orjson.OPT_INDENT_2 | orjson.OPT_APPEND_NEWLINE)`. `orjson.dumps` already returns `bytes`, so no decode step is needed, and `OPT_APPEND_NEWLINE` supplies the trailing newline. + 8. Atomic-write the JSON via `sidecar.write_with_sidecar(cache_root / "Manifest.json", canonical_json_bytes)` — produces `Manifest.json` + `Manifest.json.sha256` (the latter is the Manifest's OWN sha256, used by T4). + 9. Sign the canonical JSON bytes: `signature_bytes = signer.sign(key, canonical_json_bytes)` (raw Ed25519 signature, 64 bytes). + 10. Atomic-write the signature: `sidecar.atomic_write(cache_root / "Manifest.json.sig", signature_bytes)` (no .sha256 sidecar for the signature itself — signature integrity is verified by Ed25519 over the Manifest bytes). + 11. Return `ManifestArtifact(manifest_path, signature_path, manifest_hash, signing_public_key_fingerprint, total_artifacts_listed)`.
+- INFO log on successful build (`c10.manifest.build.success` with `manifest_hash` + `total_artifacts_listed`); ERROR on `ManifestWriteError`; WARN on dev-mode-with-operator-key. + +## Scope + +### Included + +- `ManifestBuilder` class with the single public method. +- `ManifestSigner` Protocol + `Ed25519ManifestSigner` default impl. +- Canonical JSON serialization (sorted keys, sorted lists where order is content-defining). +- Operator-key gate per `signing_mode` config. +- Per-artifact hash computation (engines, descriptor index, calibration, tiles aggregate). +- Atomic writes via AZ-280 for both `Manifest.json` and `Manifest.json.sig`. +- Composition-root factory `build_manifest_builder`. +- Conformance test for `ManifestSigner` Protocol. + +### Excluded + +- The orchestration of when to build (T5 owns). +- Engine compilation / descriptor generation (AZ-321 / AZ-322). +- Manifest verification (T4 owns). +- Idempotence "should we skip the build?" decision (T5 owns; this task always rebuilds when called). +- ManifestCoverageError (T5 owns; this task lists what it's told, doesn't enumerate cache_root). +- Key generation — operator's long-lived key is provisioned out-of-band; this task only loads + uses. +- Multi-key signing (M-of-N quorum) — single-key per build. +- Compressed Manifest format — JSON for human inspection. 
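Steps 4 and 6 of the method flow (the aggregate tile hash and the build-identity hash) can be sketched with the standard library alone. Treat this as an illustration under assumptions: `TileRow` is a hypothetical stand-in for the C6 `TileMetadata` row, the identity is modelled as a plain dict of named fields, and `json.dumps(..., sort_keys=True)` stands in for the `orjson` call that the task itself mandates.

```python
import hashlib
import json
from typing import Iterable, NamedTuple


class TileRow(NamedTuple):
    # Hypothetical stand-in for the C6 TileMetadata row.
    zoom: int
    lat: float
    lon: float
    source: str
    tile_id: str
    sha256_hex: str


def tiles_coverage_sha256(rows: Iterable[TileRow]) -> str:
    """Aggregate hash over tiles, independent of the order rows arrive in."""
    ordered = sorted(rows, key=lambda t: (t.zoom, t.lat, t.lon, t.source))
    payload = b"\n".join(f"{t.tile_id}:{t.sha256_hex}".encode() for t in ordered)
    return hashlib.sha256(payload).hexdigest()


def manifest_hash(identity: dict) -> str:
    """Hash of the build identity; `built_at` must not be part of `identity`."""
    # Assumption: stdlib json with sorted keys stands in for orjson here.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()
```

Sorting before hashing is what makes AC-8 hold, and keeping `built_at` out of the identity dict is what lets two builds of the same input on different days share one `manifest_hash`.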
+ +## Acceptance Criteria + +**AC-1: Happy path produces Manifest + sig + sidecars** +Given a valid input with 3 engines, 1 descriptor index, 1 calibration JSON, 100 tiles +When `build_manifest(input)` is called +Then `Manifest.json`, `Manifest.json.sha256`, `Manifest.json.sig` are all present at `cache_root/`; the Manifest contains 3 engine entries, 1 descriptor_index entry, 1 calibration entry, 1 tiles_coverage entry; `manifest_hash` is a 64-char lowercase hex string; the returned `ManifestArtifact.total_artifacts_listed == 6` (3 engines + index + calibration + tiles_coverage as one logical artifact; the Manifest itself and its signature are not counted, per AC-12) + +**AC-2: Determinism — same input produces byte-identical Manifest** +Given the same `ManifestBuildInput` run twice on different days (different `built_at`) +When the canonical JSON is compared with `built_at` redacted +Then both runs produce byte-identical bytes — proves canonical JSON ordering works; same `manifest_hash`. (This is the foundation for T5's idempotence check.)
+ +**AC-3: Signature verifies against the public key** +Given the signature file + the operator's public key +When `cryptography.hazmat.primitives.asymmetric.ed25519.Ed25519PublicKey.verify(signature, manifest_bytes)` is called +Then no exception is raised — proves the signing produced a valid Ed25519 signature + +**AC-4: Operator-mode rejects unknown fingerprint** +Given `config.signing_mode = "operator"` and `config.allowed_operator_fingerprints = ("known_fp",)` and a key file whose fingerprint is `"unknown_fp"` +When `build_manifest` is called +Then `ManifestWriteError` is raised with a message naming both fingerprints (the offered one + the allowlist); ZERO files are written; ONE ERROR log + +**AC-5: Operator-mode accepts known fingerprint** +Given `config.signing_mode = "operator"` and the key file's fingerprint IS in the allowlist +When `build_manifest` is called +Then the build succeeds; ZERO WARN logs about dev-mode + +**AC-6: Dev-mode with non-operator key emits no warning** +Given `config.signing_mode = "dev"` and a random dev key (not in allowlist) +When `build_manifest` is called +Then build succeeds; `signing_public_key_fingerprint` is the dev key's; ZERO warnings about operator key in dev mode + +**AC-7: Dev-mode with operator key emits warning** +Given `config.signing_mode = "dev"` and a key whose fingerprint IS in `allowed_operator_fingerprints` +When `build_manifest` is called +Then build succeeds; ONE WARN log `c10.manifest.dev_mode_with_operator_key` with the fingerprint + +**AC-8: Tile coverage hash is sort-order-deterministic** +Given the same 100 tiles loaded in two different SQL row orders (e.g., insertion order vs index scan) +When `tiles_coverage_sha256` is computed +Then both runs produce the same hash — proves the `(zoom, lat, lon, source)` sort is canonical + +**AC-9: ManifestWriteError on key load failure** +Given a `key_path` that does not exist OR contains malformed PEM +When `signer.load_signing_key(key_path)` raises +Then 
`ManifestWriteError("operator signing key load failed: ")` is raised; ZERO files are written; the original `cryptography` exception is chained as `__cause__` for diagnosis + +**AC-10: Atomic write — partial Manifest impossible** +Given the Manifest is being written and the process is killed mid-write +When restarted +Then either the previous-good Manifest OR the new Manifest is at the path; never a half-written JSON. (AZ-280's atomic-write contract.) + +**AC-11: Manifest's own sidecar is consistent** +Given a freshly-written `Manifest.json` +When `sha256_hex(open("Manifest.json", "rb").read())` is computed and compared to `Manifest.json.sha256` +Then the values match — T4's verifier walks all sidecars and this is the entry point + +**AC-12: `total_artifacts_listed` equals dict-counted artifacts** +Given an input with N engines + 1 index + 1 calibration + tiles_coverage +When `ManifestArtifact.total_artifacts_listed` is inspected +Then it equals `N + 3` (engines + index + calibration + tiles_coverage); does NOT count the Manifest itself or the signature + +## Non-Functional Requirements + +**Performance** +- Build wall-clock ≤ 5 s for a 100k-tile corpus on Tier-1 dev workstation: sorting 100k tile hashes + computing one SHA-256 over the concatenated string is ~50 MB of input → ~100 ms; serializing JSON with 100k tile_count is fast (single integer); engine + index + calibration hashes are already computed upstream. Total ≤ 5 s leaves headroom. +- Operator-mode fingerprint check is a single string comparison. + +**Compatibility** +- Uses `orjson` (already pinned via AZ-272 for FDR), `cryptography` (already pinned via AZ-318 for per-flight keys), `hashlib` (stdlib). +- No new third-party dependencies. + +**Reliability** +- Operator-key gate is fail-closed: unknown fingerprint → no Manifest written. +- Atomic writes prevent half-written Manifests on process kill. 
+- Canonical JSON ensures bit-identical Manifests for identical inputs (foundation for D-C10-1 idempotence in T5). + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Build with 3 engines + index + calibration + 100 tiles | All files present; counts match | +| AC-2 | Build twice, redact built_at, compare bytes | Identical | +| AC-3 | Verify signature with public key | No raise | +| AC-4 | Operator mode + unknown fingerprint | ManifestWriteError; no files | +| AC-5 | Operator mode + known fingerprint | Success; no warnings | +| AC-6 | Dev mode + dev key | Success; no warnings | +| AC-7 | Dev mode + operator-allowlisted key | Success; ONE warning | +| AC-8 | Tile rows in different orders | Same `tiles_coverage_sha256` | +| AC-9 | Missing or malformed key file | ManifestWriteError; chained cause | +| AC-10 | Kill mid-write | No half-Manifest | +| AC-11 | Verify Manifest's own sidecar | Hashes match | +| AC-12 | Inspect total_artifacts_listed | Counts engines+index+calibration+tiles_coverage | +| NFR-perf | 100k-tile bench | ≤ 5 s wall clock | +| NFR-reliability-fail-closed | Operator mode + unknown fp | Fail-closed; nothing written | + +## Constraints + +- Canonical JSON via `orjson` with `OPT_SORT_KEYS`; this task does NOT use a different JSON library. +- Atomic writes via AZ-280 for BOTH `Manifest.json` and `Manifest.json.sig`; no naked `Path.write_bytes()`. +- `manifest_hash` excludes `built_at` (it's a build-identity hash, not a Manifest-bytes hash). +- The Manifest's own SHA-256 sidecar (Manifest.json.sha256) IS the Manifest-bytes hash and is used by T4 at takeoff. +- Tile coverage hashing is via aggregate `tiles_coverage_sha256`, NOT per-tile entries in the Manifest (keeps Manifest bounded). +- Signature is detached (separate `.sig` file); embedded signatures are NOT permitted (would require parsing before verifying). +- Ed25519 only; this task does NOT add other algorithms. 
+- Operator-key fingerprint allowlist is config-driven; no hardcoded keys. + +## Risks & Mitigation + +**Risk 1: `built_at` makes Manifests non-deterministic for the same input** +- *Risk*: Idempotence check in T5 compares `manifest_hash` only, but if T5 reads the Manifest bytes directly elsewhere it could see different bytes for "same" build. +- *Mitigation*: AC-2 explicitly excludes `built_at` from the `manifest_hash` computation. T5 compares hashes, not bytes. Documented in the Manifest schema. + +**Risk 2: tiles_coverage as aggregate hides which tile changed** +- *Risk*: When verify fails at takeoff (T4), the operator only learns "tiles_coverage hash mismatch", not WHICH tile drifted. +- *Mitigation*: T4's failure path can re-walk per-tile hashes against C6 to identify the offender. The Manifest stays small; debugging detail is computed on-demand. Documented in T4's scope. + +**Risk 3: `cryptography` API breaks between minor versions** +- *Risk*: Ed25519 API changes (unlikely but `cryptography` does ship breaking changes occasionally). +- *Mitigation*: Pin to the same version used by AZ-318. The `Ed25519ManifestSigner` is the only place using the API; a one-place adapter swap on upgrade. + +**Risk 4: Operator key file format ambiguity** +- *Risk*: Operators might supply a key in PKCS8, OpenSSH, or raw 32-byte format. +- *Mitigation*: `Ed25519ManifestSigner.load_signing_key` accepts PEM-encoded PKCS8 only (matches AZ-318's convention); other formats raise `ManifestWriteError` with explicit format hint. + +**Risk 5: Dev key accidentally signs an operator-mode build** +- *Risk*: Operator runs build with `signing_mode = "operator"` but supplies a dev key by mistake. +- *Mitigation*: AC-4 covers; the gate is fail-closed and logs the offending fingerprint so the operator can correct. 
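The operator-mode gate that Risk 5 and AC-4 through AC-7 describe reduces to a small decision function. The sketch below is illustrative only: `check_signing_gate` and its return convention (a warning event name or `None`) are assumptions for the example, and the local `ManifestWriteError` merely mimics the error type named in the spec. The fingerprint is taken as the hex SHA-256 of the raw 32-byte public key, as stated in step 2 of the method flow.

```python
import hashlib
from typing import Optional, Sequence


class ManifestWriteError(RuntimeError):
    """Hypothetical stand-in for the C10 error type named in the task."""


def public_key_fingerprint(raw_public_key: bytes) -> str:
    """Hex SHA-256 of the raw 32-byte Ed25519 public key."""
    if len(raw_public_key) != 32:
        raise ValueError("expected a raw 32-byte Ed25519 public key")
    return hashlib.sha256(raw_public_key).hexdigest()


def check_signing_gate(
    signing_mode: str, fingerprint: str, allowed: Sequence[str]
) -> Optional[str]:
    """Return a WARN event name, None for a clean pass, or raise (fail-closed)."""
    is_allowed = fingerprint in allowed
    if signing_mode == "operator" and not is_allowed:
        # Name both the offered fingerprint and the allowlist, as AC-4 requires.
        raise ManifestWriteError(
            f"signing key fingerprint {fingerprint} not in "
            f"allowed_operator_fingerprints {tuple(allowed)}"
        )
    if signing_mode == "dev" and is_allowed:
        return "c10.manifest.dev_mode_with_operator_key"
    return None
```

Running the gate before any hashing or file I/O keeps the "ZERO files are written" guarantee of AC-4 easy to uphold, since nothing has touched `cache_root` when the error is raised.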
+ +## Runtime Completeness + +- **Named capability**: signed Manifest production with content-hash table covering every shipped artifact, D-C10-1 idempotence key (`manifest_hash`), C10-ST-01 operator-mode gate (epic § Acceptance C10-IT-01, C10-IT-02, C10-ST-01). +- **Production code that must exist**: real `ManifestBuilder` orchestrating real `Ed25519ManifestSigner` (cryptography library) + real AZ-280 atomic writes + real C6 `query_by_bbox` to gather tile hashes; real config-driven fingerprint allowlist. +- **Allowed external stubs**: tests MAY use a fake `ManifestSigner` with a known keypair generated in-test + a fake `tile_metadata_store` (AZ-303 conformance fakes); production wiring uses `cryptography.hazmat`. +- **Unacceptable substitutes**: HMAC instead of Ed25519 (different trust model — symmetric vs asymmetric); embedding the signature in the JSON (forces parsing before verifying at takeoff); Python-only `pickle` of the Manifest (not human-inspectable, not canonical-byte stable); skipping the operator-fingerprint allowlist when `signing_mode = "operator"` (defeats C10-ST-01); serializing without `orjson.OPT_SORT_KEYS` (breaks AC-2 determinism and breaks T5's idempotence). diff --git a/_docs/02_tasks/todo/AZ-324_c10_manifest_verifier.md b/_docs/02_tasks/todo/AZ-324_c10_manifest_verifier.md new file mode 100644 index 0000000..9579967 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-324_c10_manifest_verifier.md @@ -0,0 +1,243 @@ +# C10 ManifestVerifier — Takeoff Content-Hash Gate + Trusted-Key Pinning + +**Task**: AZ-324_c10_manifest_verifier +**Name**: C10 ManifestVerifier +**Description**: Implement `ManifestVerifier` (per the contract `_docs/02_document/contracts/c10_provisioning/manifest_verifier.md`), the read-only validator that AC-NEW-1 places between F2 takeoff and any engine deserialization.
Loads `Manifest.json`, verifies its sidecar SHA-256 matches the Manifest bytes, parses the Ed25519 detached signature at `Manifest.json.sig`, verifies it against the caller-supplied `trusted_public_keys` tuple, parses the Manifest schema (rejecting absolute paths and schema violations), and walks every per-artifact entry re-hashing it via AZ-280's sidecar pattern. Returns a `VerificationResult` with `outcome ∈ {PASS, FAIL}`, the union of all `VerifyFailReason` values that fired, the populated `per_artifact_checks` list, and `elapsed_ms`. Fail-closed: any deviation in signature, schema, key trust, or hashes yields `FAIL` with detailed reasons. Never raises on a verify failure — only on environment errors (Manifest.json missing → `MANIFEST_NOT_FOUND` is still `FAIL`, not raise). +**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-280_sha256_sidecar, AZ-281_engine_filename_schema +**Component**: c10_provisioning (epic AZ-252 / E-C10) +**Tracker**: AZ-324 +**Epic**: AZ-252 (E-C10) + +### Document Dependencies + +- `_docs/02_document/contracts/c10_provisioning/manifest_verifier.md` — produced by this task (frozen Protocol + DTO shape, invariants, test cases). +- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — sidecar verify pattern (AZ-280). +- `_docs/02_document/components/11_c10_provisioning/description.md` — § 5 `ContentHashMismatchError` handling, § 7 D-C10-3 sidecar coverage. + +## Problem + +Without a real verifier: + +- AC-NEW-1 ("no engine deserialization at takeoff before manifest verify") collapses — F2 has nothing to gate on. +- D-C10-3 (SHA-256 content-hash gate over every shipped artifact) is unobservable at takeoff. +- C10-IT-02 (rejects tampered or wrong-key Manifests) cannot be implemented. +- A built but unverified Manifest is no better than no Manifest — operators cannot trust it without an actual check. 
+- Without a contract, C5 takeoff arming and C12 operator tooling cannot couple to C10 — every consumer would re-implement an ad-hoc check. +- The "fail-closed" property is a hard requirement; partial verifies that report PASS on first match would compromise the entire trust chain. + +This task delivers the verifier + its frozen contract. It does NOT compile engines (AZ-321), build the Manifest (AZ-323), or own the takeoff-arming policy (E-C5). + +## Outcome + +- A `ManifestVerifier` class implementation at `src/gps_denied_onboard/components/c10_provisioning/manifest_verifier.py` matching the Protocol in the contract. +- Constructor: `__init__(self, *, sidecar: Sha256Sidecar, logger: Logger, clock: Clock, tile_metadata_store: TileMetadataStore | None = None)`. + - When `tile_metadata_store is None`, the verifier operates in airborne mode: trusts the recorded `tiles_coverage_sha256` after the signature passes (per MV-INV-5). + - When `tile_metadata_store is not None`, the verifier operates in operator mode: re-derives `tiles_coverage_sha256` from C6 and reports `TILES_COVERAGE_MISMATCH` on drift. +- The frozen contract at `_docs/02_document/contracts/c10_provisioning/manifest_verifier.md` (already written; this task brings the implementation up to it). +- Method `verify_manifest(manifest_path, trusted_public_keys) -> VerificationResult` flow: + 1. Start `time.monotonic()` for `elapsed_ms`. + 2. Initialize empty `fail_reasons: list[VerifyFailReason]`, `fail_details: list[str]`, `per_artifact_checks: list[ArtifactCheck]`. + 3. **Step A — Manifest exists & sidecar matches**: + - If `manifest_path` does not exist: append `MANIFEST_NOT_FOUND`; return `FAIL` (no further work; per MV-INV-1). + - Read `Manifest.json` bytes. + - If `manifest_path.with_suffix(".json.sha256")` does not exist: append `SCHEMA_VIOLATION` ("missing manifest sidecar"); return `FAIL`. 
+ - If `sha256(manifest_bytes) != sidecar_value`: append `MANIFEST_SELF_HASH_MISMATCH`; return `FAIL` (do NOT consult signature per MV-INV-3). + 4. **Step B — Signature verifies against a trusted key**: + - If `signature_path = manifest_path.with_suffix(".json.sig")` does not exist: append `SIGNATURE_NOT_FOUND`; `signing_public_key_fingerprint = None`; return `FAIL`. + - Parse Ed25519 signature bytes (must be exactly 64 bytes; otherwise `SIGNATURE_INVALID`). + - Try each public key in `trusted_public_keys`: + - Compute `fingerprint = sha256(pub.public_bytes_raw()).hex()`. + - Try `pub.verify(signature_bytes, manifest_bytes)`. + - On success: signature is valid; `signing_public_key_fingerprint = fingerprint`; break. + - If no trusted key verified: + - If at least one key raised `InvalidSignature` (signature doesn't match this key's bytes): the signature could still match an untrusted key. Try parsing the Manifest's `signing_public_key_fingerprint` field (if schema parses) and report whichever is more diagnostic — `UNTRUSTED_PUBLIC_KEY` if the Manifest names a known-but-untrusted key, `SIGNATURE_INVALID` otherwise. + - Append the reason; return `FAIL` (do NOT proceed to per-artifact hashing per MV-INV-2). + - If `trusted_public_keys` is empty: append `UNTRUSTED_PUBLIC_KEY`; return `FAIL`. + 5. **Step C — Schema parse**: + - `orjson.loads(manifest_bytes)` → dict. + - Validate required keys: `schema_version`, `build` (with sub-keys `bbox`, `zoom_levels`, `sector_class`, `built_at`, `manifest_hash`), `artifacts` (with `engines`, `descriptor_index`, `calibration`, `tiles_coverage`), `signing_public_key_fingerprint`. + - Validate types: `engines` is list of `{path: str, sha256: str}`; `descriptor_index`, `calibration` are `{path: str, sha256: str}`; `tiles_coverage` is `{sha256: str, tile_count: int}`. + - Validate path-relative-only: every `path` value must be relative (no leading `/`, no `..` segments). 
Append `SCHEMA_VIOLATION` per offending field; if any, return `FAIL`. + 6. **Step D — Per-artifact hash walk** (only reached if Steps A–C all passed): + - For each engine, descriptor_index, calibration entry: + - Compute `actual_path = manifest_path.parent / entry.path`. + - If file missing: append `ArtifactCheck(entry.path, entry.sha256, None, matched=False)`; append `ARTIFACT_MISSING` to `fail_reasons` once if not already there. + - Else: stream-read the file, compute SHA-256 (use AZ-280's helper that takes a path). + - If hash matches: `matched=True`. + - Else: `matched=False`; append `ARTIFACT_HASH_MISMATCH` once. + - For tiles_coverage: + - If `tile_metadata_store is None` (airborne mode): trust the recorded `tiles_coverage.sha256` since the Manifest signature already binds it. Append `ArtifactCheck("tiles_coverage", recorded_sha256, recorded_sha256, matched=True)` for completeness. + - Else (operator mode): re-derive `tiles_coverage_sha256` by `tile_metadata_store.query_by_bbox(...)` over the `build.bbox` + `zoom_levels` + `sector_class`, sort by `(zoom, lat, lon, source)`, hash. If mismatch → `TILES_COVERAGE_MISMATCH`. + - Walk ALL entries even on first failure (per MV-TC-9). + 7. Set `outcome = PASS` iff `fail_reasons` is empty; else `FAIL`. + 8. Set `elapsed_ms = int((time.monotonic() - start) * 1000)`. + 9. Return `VerificationResult(...)`. +- INFO log on PASS (`c10.manifest.verify.pass` with elapsed_ms + fingerprint); WARN on FAIL with `fail_reasons` + counts of mismatched artifacts. +- Composition root factory `build_manifest_verifier(config, *, with_tile_store: bool) -> ManifestVerifier` — `with_tile_store=True` for operator mode, `False` for airborne C5. + +## Scope + +### Included + +- `ManifestVerifier` class implementing the Protocol from the contract. +- The contract document (frozen at v1.0.0). +- Schema validation against the v1.0 shape produced by AZ-323. +- Signature verification against a tuple of trusted public keys. 
+- Per-artifact stream-hash walk with multiple-failure accumulation. +- Airborne vs operator mode for tiles_coverage handling. +- Composition-root factory. +- Conformance test for the contract Protocol. + +### Excluded + +- Manifest building / signing (AZ-323 owns). +- Trusted-key distribution / loading from disk — caller passes `Ed25519PublicKey` instances. +- Cache repair on FAIL — caller (E-C5 takeoff arming, E-C12 operator) decides next action. +- Coverage check for orphan files in `cache_root` (AZ-325 owns `ManifestCoverageError`). +- Logging Manifest contents (Manifests are not secret but verbose; only fingerprints + counts are logged). +- C13 FDR emission — caller's responsibility (per MV-INV-6). +- Non-Ed25519 signatures. + +## Acceptance Criteria + +**AC-1: PASS on a valid Manifest with all artifacts present and matching** +Given a freshly-built Manifest + sig + sidecar from AZ-323 and `trusted_public_keys = (signing_pub,)` +When `verify_manifest(manifest_path, trusted_public_keys)` is called +Then `outcome=PASS`, `fail_reasons` is empty, `per_artifact_checks` has every entry `matched=True`, `signing_public_key_fingerprint` is the signing key's fingerprint, `elapsed_ms > 0` + +**AC-2: FAIL on missing Manifest with no further work** +Given `manifest_path` does not exist +When verify runs +Then `outcome=FAIL`, `fail_reasons=(MANIFEST_NOT_FOUND,)`, `per_artifact_checks` is empty (no work performed), `signing_public_key_fingerprint=None` + +**AC-3: FAIL on missing signature with diagnostic** +Given Manifest.json exists + sidecar matches but Manifest.json.sig is absent +When verify runs +Then `fail_reasons=(SIGNATURE_NOT_FOUND,)`, `per_artifact_checks` is empty, no per-artifact disk reads happen (defence-in-depth) + +**AC-4: FAIL on tampered Manifest body** +Given Manifest.json is mutated by 1 byte after signing +When verify runs +Then either `MANIFEST_SELF_HASH_MISMATCH` (sidecar caught it first) OR `SIGNATURE_INVALID` (if sidecar was also re-computed by 
attacker); per-artifact walk does NOT happen + +**AC-5: FAIL on untrusted public key** +Given the Manifest is signed with a key NOT in `trusted_public_keys` +When verify runs +Then `fail_reasons=(UNTRUSTED_PUBLIC_KEY,)`, `signing_public_key_fingerprint` is populated (so operators see WHICH untrusted key signed it), per-artifact walk does NOT happen + +**AC-6: FAIL on schema violation lists offending field** +Given a Manifest missing the `signing_public_key_fingerprint` key +When verify runs +Then `fail_reasons=(SCHEMA_VIOLATION,)`, `fail_details` contains a string naming `signing_public_key_fingerprint` + +**AC-7: FAIL on absolute path in artifact entry** +Given an engine entry has `path: "/etc/passwd"` +When verify runs +Then `fail_reasons=(SCHEMA_VIOLATION,)`, `fail_details` names the offending field; per-artifact walk does NOT consult `/etc/passwd` + +**AC-8: FAIL with multiple reasons accumulated** +Given one engine is missing on disk AND one engine's bytes drifted AND a third engine matches +When verify runs +Then `fail_reasons` contains BOTH `ARTIFACT_MISSING` and `ARTIFACT_HASH_MISMATCH` (in deterministic order: traversal order); `per_artifact_checks` has all 3 entries with correct `matched` values; the third entry has `matched=True` + +**AC-9: Operator mode re-derives tiles_coverage** +Given `tile_metadata_store` is supplied AND C6's tiles for the build's bbox/zoom now have a different aggregate hash (e.g., a tile was re-downloaded) +When verify runs +Then `fail_reasons=(TILES_COVERAGE_MISMATCH,)`; the recorded vs computed hashes are in `fail_details` + +**AC-10: Airborne mode trusts tiles_coverage post-signature** +Given `tile_metadata_store=None` +When verify runs +Then `tiles_coverage` `ArtifactCheck` shows `matched=True` (recorded == "actual" because we don't re-derive); the airborne F2 path is fast (≤ 100 ms per NFR) + +**AC-11: Conformance — `isinstance` returns True** +Given the implementation +When `isinstance(impl, ManifestVerifier)` is checked 
under runtime_checkable +Then `True` + +**AC-12: `elapsed_ms` recorded on every outcome** +Given any of the above ACs +When inspecting the result +Then `elapsed_ms >= 0` and is reasonable (smaller for early-exit failures, larger for full per-artifact walks) + +**AC-13: Empty `trusted_public_keys` always fails closed** +Given `trusted_public_keys = ()` +When verify runs +Then `fail_reasons=(UNTRUSTED_PUBLIC_KEY,)` regardless of Manifest validity; per-artifact walk does NOT happen + +## Non-Functional Requirements + +**Performance** +- Airborne F2 verify (no per-tile re-derivation, ~5 artifact entries): wall-clock ≤ 100 ms on Jetson Orin (signature verify + 5 stream-SHA-256s of bounded files). +- Operator-mode verify with 100k tiles re-derivation: ≤ 5 s (matches AZ-323's NFR). +- Stream-hash files via 64 KB chunks; do NOT load engine binaries (~200 MB) entirely into memory. + +**Compatibility** +- `cryptography` (already pinned via AZ-318), `orjson` (already pinned), `hashlib` (stdlib). +- No new third-party dependencies. + +**Reliability** +- Fail-closed: empty trusted keys → FAIL; missing files → FAIL; any drift → FAIL. +- No partial PASS; the `outcome=PASS` branch is taken only when `fail_reasons` is empty. +- Defensive against directory traversal: relative paths only (AC-7). 
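The stream-hashing requirement in the Performance NFR can be sketched as follows. This is an illustration only, not the AZ-280 helper itself; the function name `stream_sha256` and the `CHUNK_SIZE` constant are hypothetical.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 64 * 1024  # 64 KB blocks, as the NFR asks


def stream_sha256(path: Path, chunk_size: int = CHUNK_SIZE) -> str:
    """Return the hex SHA-256 of a file without loading it whole into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # read() returns b"" at end of file, which ends the loop.
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Peak memory stays near one chunk regardless of file size, which is the property the 200 MB engine memory-profile test is meant to check.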
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Built Manifest from AZ-323 fixture | PASS; all matched | +| AC-2 | Missing Manifest.json | FAIL; MANIFEST_NOT_FOUND only | +| AC-3 | Missing signature | FAIL; SIGNATURE_NOT_FOUND; no disk reads | +| AC-4 | Mutated Manifest body | FAIL; either MANIFEST_SELF_HASH_MISMATCH or SIGNATURE_INVALID | +| AC-5 | Wrong-key signing | FAIL; UNTRUSTED_PUBLIC_KEY; fingerprint populated | +| AC-6 | Missing required field | FAIL; SCHEMA_VIOLATION + field name | +| AC-7 | Absolute path in artifact | FAIL; SCHEMA_VIOLATION; no path traversal | +| AC-8 | 1 missing + 1 drifted + 1 OK | Two failure reasons; per_artifact_checks complete | +| AC-9 | Operator mode + drifted tile | TILES_COVERAGE_MISMATCH | +| AC-10 | Airborne mode | tiles_coverage matched=True | +| AC-11 | Conformance check | True | +| AC-12 | Inspect elapsed_ms | All non-negative; ordered as expected | +| AC-13 | Empty trusted keys | FAIL; UNTRUSTED | +| NFR-perf-airborne | 5 artifact bench, no tile re-walk | p99 ≤ 100 ms | +| NFR-perf-operator | 100k-tile re-walk | ≤ 5 s | +| NFR-reliability-stream-hash | 200 MB engine + memory profile | Peak < 10 MB extra | + +## Constraints + +- Stream SHA-256 over files via `hashlib.sha256().update(chunk)` in 64 KB blocks; do NOT `Path.read_bytes()` on engines (memory blowup per NFR). +- Path interpretation is relative-only; absolute paths are SCHEMA_VIOLATION (AC-7). +- The verifier is read-only (per MV-INV-6); no disk writes, no network, no FDR. +- `fail_reasons` is a tuple (immutable, ordered, deterministic). +- Signature checks happen before per-artifact walks (per MV-INV-2). +- Manifest sidecar check happens before signature (per MV-INV-3). +- Multiple failures accumulate; do not short-circuit on first per-artifact failure (per MV-TC-9 / AC-8). 
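The accumulation constraint (walk every entry, record each failure reason once, return an ordered tuple) can be sketched roughly as below. The `ArtifactCheck` field names, the string reason codes standing in for `VerifyFailReason`, and the `walk_artifacts` helper are assumptions made for illustration; the frozen contract defines the real shapes.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class ArtifactCheck:
    # Field names are hypothetical; the task text only shows positional use.
    path: str
    expected_sha256: str
    actual_sha256: Optional[str]
    matched: bool


def _sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(64 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def walk_artifacts(root: Path, entries: list[tuple[str, str]]):
    """Check every (relative_path, expected_sha256) pair; never stop early."""
    checks: list[ArtifactCheck] = []
    reasons: list[str] = []
    for rel_path, expected in entries:
        target = root / rel_path
        if not target.is_file():
            checks.append(ArtifactCheck(rel_path, expected, None, False))
            reason = "ARTIFACT_MISSING"
        else:
            actual = _sha256_of(target)
            matched = actual == expected
            checks.append(ArtifactCheck(rel_path, expected, actual, matched))
            if matched:
                continue
            reason = "ARTIFACT_HASH_MISMATCH"
        # Each reason is recorded once, in first-seen (traversal) order.
        if reason not in reasons:
            reasons.append(reason)
    return tuple(reasons), checks
```

Returning a tuple keeps the reasons immutable and their order deterministic, while the full `checks` list gives operators the per-entry detail.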
+ +## Risks & Mitigation + +**Risk 1: Trusted-key list accidentally empty in production wiring** +- *Risk*: Composition root mis-configures; airborne C5 ends up with an empty key list and arming silently fails forever. +- *Mitigation*: AC-13 + ERROR log on `UNTRUSTED_PUBLIC_KEY` with key-list-length=0 makes the misconfiguration loud at first arm attempt. + +**Risk 2: Per-artifact walk dominates airborne arm latency** +- *Risk*: 5 engines × 200 MB stream-hash on slow microSD → 30 s arm latency. +- *Mitigation*: NFR-perf-airborne benchmark documents the envelope; if the Jetson microSD I/O is the bottleneck, a follow-up task adds an "incremental verify" path that trusts unchanged artifacts since last reboot. Out of scope this cycle. + +**Risk 3: Tampered sidecar matches tampered body (attacker drops both sidecar + body)** +- *Risk*: AC-4's first failure case (sidecar mismatch) is bypassed by an attacker who recomputes the sidecar. +- *Mitigation*: Signature check (Step B) catches this — the signature is over the Manifest body; recomputing the sidecar does NOT also recompute the signature. The Ed25519 secret key is operator-only. + +**Risk 4: Path traversal via relative `..` segments** +- *Risk*: A relative path like `../../etc/passwd` passes the "no leading /" check but escapes cache_root. +- *Mitigation*: AC-7 + `..` segment rejection covers it; explicit check `if ".." in Path(entry.path).parts: SCHEMA_VIOLATION`. + +**Risk 5: Operator-mode tile re-walk on Jetson is too slow** +- *Risk*: An airborne-mode verifier mistakenly gets a `tile_metadata_store` (composition root mistake) and re-walks 100k tiles, blowing the arm latency budget. +- *Mitigation*: The composition root factory `build_manifest_verifier(config, *, with_tile_store: bool)` is the explicit toggle; airborne wiring passes `with_tile_store=False`. AC-10 tests airborne mode latency. 
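The relative-path rule from Risk 4 and AC-7 can be sketched as a small predicate. It assumes Manifest paths use forward slashes, so `PurePosixPath` is used to keep the result independent of the host OS; the name `is_safe_relative_path` is hypothetical.

```python
from pathlib import PurePosixPath


def is_safe_relative_path(raw: str) -> bool:
    """True only for non-empty relative paths with no `..` segment."""
    if not raw:
        return False
    path = PurePosixPath(raw)
    if path.is_absolute():
        return False
    # pathlib does not collapse `..`, so any such segment is visible in parts.
    return ".." not in path.parts
```

A verifier would report `SCHEMA_VIOLATION` for any entry where this returns `False`, before touching the filesystem.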
+ +## Runtime Completeness + +- **Named capability**: takeoff content-hash gate per AC-NEW-1 + D-C10-3 + C10-IT-02 (epic § Acceptance C10-IT-01..02; description.md § 5 `ContentHashMismatchError`). +- **Production code that must exist**: real `ManifestVerifier` orchestrating real `cryptography` Ed25519 verify + real `hashlib` stream-SHA-256 + real `orjson` schema parse; real `tile_metadata_store` re-derivation in operator mode. +- **Allowed external stubs**: tests MAY use a fake key generated in-test, fake Manifest fixtures from AZ-323's test fixtures; production wiring uses real keys from operator key store. +- **Unacceptable substitutes**: skipping Step A's sidecar check (loses bit-rot detection); skipping Step B before walking artifacts (defeats MV-INV-2 defence-in-depth); short-circuiting on first per-artifact failure (operators need full diagnostic per MV-TC-9); HMAC instead of Ed25519 (different trust model); accepting absolute paths in entries (path traversal vulnerability per AC-7); raising on missing files instead of `outcome=FAIL` (breaks the contract's read-only / never-raise-on-verify-failure invariant). diff --git a/_docs/02_tasks/todo/AZ-325_c10_cache_provisioner.md b/_docs/02_tasks/todo/AZ-325_c10_cache_provisioner.md new file mode 100644 index 0000000..cecc0b9 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-325_c10_cache_provisioner.md @@ -0,0 +1,233 @@ +# C10 CacheProvisioner — Idempotent Orchestrator + ManifestCoverageError + +**Task**: AZ-325_c10_cache_provisioner +**Name**: C10 CacheProvisioner +**Description**: Implement `CacheProvisioner` (per the contract `_docs/02_document/contracts/c10_provisioning/cache_provisioner.md`), the public top-level orchestrator that composes AZ-321 (EngineCompiler), AZ-322 (DescriptorBatcher), and AZ-323 (ManifestBuilder) into a single idempotent F1 build pipeline. Acquires a `cache_root/.c10.lock` filesystem lockfile to enforce CP-INV-4. 
Computes the build-identity hash from the same canonical inputs AZ-323 hashes (model_ids + calibration_sha256 + tiles_coverage_sha256 + sector_class + bbox + zoom_levels) and compares to the existing `Manifest.json`'s `manifest_hash`; on match → `outcome=IDEMPOTENT_NO_OP`. On mismatch (or no prior Manifest) → run engine compile → descriptor population → Manifest build, then walk `cache_root` to confirm every file is listed in the new Manifest's `artifacts` section, raising `ManifestCoverageError` on orphans (with rollback to prior-good Manifest). Empty corpus → `BuildReport(outcome=FAILURE, failure_reason="run C11 TileDownloader first")` per description.md § 5. +**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-303_c6_storage_interfaces, AZ-321_c10_engine_compiler, AZ-322_c10_descriptor_batcher, AZ-323_c10_manifest_builder +**Component**: c10_provisioning (epic AZ-252 / E-C10) +**Tracker**: AZ-325 +**Epic**: AZ-252 (E-C10) + +### Document Dependencies + +- `_docs/02_document/contracts/c10_provisioning/cache_provisioner.md` — produced by this task (frozen Protocol + DTO shape, invariants, test cases). +- `_docs/02_document/components/11_c10_provisioning/description.md` — § 1 idempotence, § 5 error handling, § 7 lockfile race-condition mitigation. + +## Problem + +Without a real orchestrator: + +- D-C10-1 (idempotent re-run via manifest hash) cannot be enforced — every operator invocation re-compiles every engine, blowing the C10-PT-01 ≤ 1 min warm target. +- D-C10-3 (`ManifestCoverageError` on orphan files / no smuggled artifacts) is unobservable — partial-build leftovers and out-of-band file drops at takeoff time go undetected. +- C10-IT-03 (idempotent re-run — same hash, no recompile) cannot be implemented. +- C10-IT-04 (`ManifestCoverageError` on orphan files) cannot be implemented. +- The race-condition mitigation per description.md § 7 (filesystem lockfile) has no producer. 
+- C12 OperatorTooling (E-C12) has no surface to call — its `c10 build` CLI command is a one-liner only after this task ships. +- The "missing tiles in C6" failure path (description.md § 5) has no surface — operators would see a stack trace from AZ-322 instead of a clear `failure_reason` directing them to C11. + +This task delivers the orchestrator + its frozen contract. It does NOT compile engines (AZ-321), embed tiles (AZ-322), build Manifests (AZ-323), or verify at takeoff (AZ-324). + +## Outcome + +- A `CacheProvisioner` class implementation at `src/gps_denied_onboard/components/c10_provisioning/provisioner.py` matching the Protocol in the contract. +- Constructor: `__init__(self, *, engine_compiler: EngineCompiler, descriptor_batcher: DescriptorBatcher, manifest_builder: ManifestBuilder, tile_metadata_store: TileMetadataStore, lock_factory: FileLockFactory, logger: Logger, clock: Clock, config: C10ProvisionerConfig)`. +- `C10ProvisionerConfig` (`@dataclass(frozen=True)`): `coverage_strict: bool = True`, `lock_timeout_s: float = 5.0`, `manifest_filename: str = "Manifest.json"`. +- Method `build_cache_artifacts(request: BuildRequest) -> BuildReport` flow: + 1. **Lock acquisition** (CP-INV-4): + - Path: `request.cache_root / ".c10.lock"`. + - Acquire via `lock_factory.try_lock(path, timeout_s=config.lock_timeout_s)` — non-blocking with a short timeout to surface concurrent invocations as `BuildLockHeldError`. + 2. **Tile gathering**: call `tile_metadata_store.query_by_bbox(bbox, zoom_levels, sector_class)`. + - If empty → return `BuildReport(outcome=FAILURE, failure_reason="no tiles in C6 for the requested scope; run C11 TileDownloader first", engines_built=0, ...)`. ERROR log; release lock. + 3. **Build-identity hash for idempotence check**: + - Compute `request_hash = sha256(canonical_json(model_ids + calibration_sha256 + tiles_coverage_sha256 + sector_class + bbox + zoom_levels))`. 
The `model_ids` come from the configured backbone list; `calibration_sha256` from streaming the calibration_path; `tiles_coverage_sha256` from sorting the tile rows by `(zoom, lat, lon, source)` and hashing per AZ-323's algorithm. + - Read existing `Manifest.json` if present; parse only the `build.manifest_hash` field (don't run full verification — that's AZ-324's job). If `existing.manifest_hash == request_hash` → return `BuildReport(outcome=IDEMPOTENT_NO_OP, manifest_hash=existing.manifest_hash, manifest_path=existing_path, engines_built=0, engines_reused=0, descriptors_generated=0, elapsed_s, failure_reason=None)`. INFO log; release lock. + 4. **Active build path**: + - Snapshot prior-good Manifest (rename to `Manifest.json.prev` if present) for rollback. + - Compose engine compile request from configured backbones; call `engine_compiler.compile_engines_for_corpus(...)` → `engine_entries`. + - Compose descriptor populate request (filter, callback hooked to logger); call `descriptor_batcher.populate_descriptors(...)` → `DescriptorBatchReport`. If `outcome=failure` → restore prior Manifest, release lock, return `BuildReport(outcome=FAILURE, failure_reason=batch.failure_reason, ...)`. + - Compose Manifest build input from engine entries + descriptor index path + calibration + key_path; call `manifest_builder.build_manifest(...)` → `ManifestArtifact`. + 5. **Coverage check** (CP-INV-3 / D-C10-3): + - Walk `cache_root` recursively (`pathlib.Path.rglob`); collect every regular file path EXCLUDING `Manifest.json`, `Manifest.json.sha256`, `Manifest.json.sig`, `Manifest.json.prev`, `.c10.lock`, and any `.sha256` sidecar (sidecars are implicit per the AZ-280 pattern, paired with their primary). + - Build expected set: every `path` in `manifest.artifacts.engines + descriptor_index + calibration` (resolved relative to `cache_root`). + - `orphans = walked - expected`. 
+ - If `orphans` non-empty AND `config.coverage_strict`: + - Restore prior Manifest from `Manifest.json.prev` (delete current Manifest; rename prev back). If no prev existed, leave the new Manifest in place but raise. + - Raise `ManifestCoverageError(f"orphan files in cache_root: {sorted(orphans)}")`. ERROR log. + - If `orphans` non-empty AND NOT `coverage_strict`: WARN log with the orphan list; continue. + 6. **Cleanup**: delete `Manifest.json.prev` if present; release lock. + 7. Return `BuildReport(outcome=SUCCESS, engines_built, engines_reused, descriptors_generated, manifest_hash, manifest_path, failure_reason=None, elapsed_s)`. +- Method `compile_engines_for_corpus(request)` is a thin passthrough to `engine_compiler.compile_engines_for_corpus(request)` (per CP-TC-11; lets operators run engine-only re-compiles for D-C10-6 hardware-change scenarios without redoing descriptors). +- A `FileLockFactory` Protocol + a default `Filelock`-library-backed impl (use `filelock` package, already pinned via shared helpers if present; if not, add to deps with a single pinned version). +- INFO logs on lock acquired / released, build start/end, idempotent no-op; ERROR on coverage error / build failure; WARN on non-strict coverage drift. + +## Scope + +### Included + +- `CacheProvisioner` class implementing the Protocol from the contract. +- The contract document (frozen at v1.0.0). +- Filesystem lockfile (FileLockFactory Protocol + filelock-backed default impl). +- Idempotence check (parse existing Manifest's `manifest_hash` only; no full verify). +- Coverage walk + `ManifestCoverageError` with rollback to prior Manifest. +- Empty-corpus handling with explicit hint to run C11. +- `compile_engines_for_corpus` passthrough. +- Composition-root factory `build_cache_provisioner(config) -> CacheProvisioner`. +- Conformance test for the contract Protocol. + +### Excluded + +- The internal phases (AZ-321, AZ-322, AZ-323). +- Manifest verification at takeoff (AZ-324). 
+- Operator CLI / tooling (E-C12). +- C13 FDR emissions (build is offline). +- Resumable mid-build state (out of scope; restart from scratch). +- GC of stale engines (operator action). +- Multi-cache rotation. + +## Acceptance Criteria + +**AC-1: Cold build composes phases and writes Manifest** +Given an empty cache_root and C6 populated with tiles for the requested scope +When `build_cache_artifacts(request)` is called +Then `outcome=SUCCESS`; `engines_built > 0`; `descriptors_generated > 0`; `Manifest.json` + `Manifest.json.sig` + `Manifest.json.sha256` exist; `BuildReport.manifest_hash` matches the on-disk Manifest's `build.manifest_hash`; `elapsed_s` is positive + +**AC-2: Warm idempotent re-run skips everything** +Given a prior successful build at the same cache_root with the same identity tuple +When `build_cache_artifacts` is called with an identical request +Then `outcome=IDEMPOTENT_NO_OP`; `engines_built=0, engines_reused=0, descriptors_generated=0`; ZERO calls to `engine_compiler.compile_engines_for_corpus` (verifiable via spy); ZERO calls to `descriptor_batcher.populate_descriptors`; ZERO calls to `manifest_builder.build_manifest`; the on-disk Manifest is byte-identical (mtime unchanged) + +**AC-3: Different bbox triggers full rebuild and atomic replacement** +Given a prior Manifest at the cache_root for bbox A +When `build_cache_artifacts` is called with bbox B (B ≠ A) +Then `outcome=SUCCESS`; the new Manifest replaces the old (atomic via AZ-280); old `Manifest.json.prev` is cleaned up after coverage passes; `manifest_hash` differs from the prior + +**AC-4: Empty corpus surfaces failure with operator hint** +Given C6 has zero tiles for the requested scope +When `build_cache_artifacts` is called +Then `outcome=FAILURE`; `failure_reason` contains "C11 TileDownloader"; ZERO compile / embed / Manifest calls; lock IS released (no leaked lockfile) + +**AC-5: Concurrent invocation raises `BuildLockHeldError`** +Given another invocation holds `.c10.lock` +When 
a second `build_cache_artifacts` runs +Then `BuildLockHeldError` is raised within `lock_timeout_s`; the existing build is unaffected; the existing lockfile is NOT deleted + +**AC-6: ManifestCoverageError rolls back to prior Manifest** +Given a prior-good Manifest exists; a build is run; before the coverage walk, an orphan file `cache_root/leftover.bin` is dropped (simulated) +When the coverage walk runs in strict mode +Then `ManifestCoverageError(...)` is raised naming the orphan; `Manifest.json` on disk is the prior-good one (prev was restored); ERROR log + +**AC-7: Coverage non-strict mode warns but continues** +Given `coverage_strict=False` and an orphan +When the build completes +Then `outcome=SUCCESS`; ONE WARN log naming the orphan; the new Manifest is on disk + +**AC-8: Lock released on every exit path** +Given any of: success / failure / IDEMPOTENT_NO_OP / `ManifestCoverageError` / `EngineBuildError` propagation +When `build_cache_artifacts` returns or raises +Then `cache_root/.c10.lock` is removed (or unlocked if the implementation uses fcntl); a subsequent call succeeds (no leftover lock) + +**AC-9: Hard errors propagate without state corruption** +Given `engine_compiler.compile_engines_for_corpus` raises `EngineBuildError` +When `build_cache_artifacts` runs +Then the error propagates; on-disk Manifest is the prior-good one (prev restored); lock is released; partial engines that AZ-321 wrote ARE on disk (not deleted — operators may want them for diagnostic) + +**AC-10: `compile_engines_for_corpus` passthrough** +Given a request configured for engine-only re-compile +When `compile_engines_for_corpus(req)` is called directly +Then `engine_compiler.compile_engines_for_corpus(req)` is invoked once with the same request; the return value is forwarded as a tuple; no lock is acquired (this is a thin diagnostic-mode call) + +**AC-11: Conformance — `isinstance` returns True** +Given the implementation +When `isinstance(impl, CacheProvisioner)` is checked under 
runtime_checkable +Then `True` + +**AC-12: Cold build benchmark within C10-PT-01 envelope** +Given Tier-1 dev workstation with NVIDIA GPU + a 1000-tile corpus + 3 backbones +When a cold build runs +Then wall-clock ≤ 12 min (CP-TC-12 / NFR C10-PT-01); WARN log if exceeded (so operators see the regression in CI) + +**AC-13: Warm idempotent benchmark within C10-PT-01 envelope** +Given a populated cache and identical request +When `build_cache_artifacts` runs +Then wall-clock ≤ 1 min (CP-TC-13 / NFR C10-PT-01); the bound work is the build-identity hash computation, which is dominated by `tiles_coverage_sha256` over 1000 tiles (~5 ms hashing) + +## Non-Functional Requirements + +**Performance** +- Cold path is bound by AZ-321 + AZ-322 (per their NFRs); this orchestrator adds ≤ 5 s coordination overhead. +- Warm path: build-identity hash + Manifest read + idempotence compare ≤ 5 s on Tier-1 dev workstation (1000-tile corpus). +- Coverage walk: O(N files); ≤ 1 s for ≤ 10k files in cache_root. + +**Compatibility** +- `filelock` library — pin via `requirements.txt`. (Verify already present from a prior task's deps; if not, add. Same version across all C10 tasks.) +- `orjson` (already pinned via AZ-272), `hashlib` (stdlib), `pathlib` (stdlib). + +**Reliability** +- CP-INV-2: failed build never leaves the cache in a worse state than at start. +- Lock release on every exit path (try/finally). +- Atomic Manifest replacement (rename prev → current rollback semantics); coverage error rolls back automatically. +- No silent failures: every error path logs at ERROR level with diagnostic. 
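The "lock release on every exit path" requirement can be sketched with a context manager. This is a simplified stand-in, not the `filelock`-backed `FileLockFactory` the task calls for: it uses exclusive file creation, has no timeout, and (unlike an OS-level lock) would leave a stale file after SIGKILL. The `build_lock` helper is hypothetical.

```python
import os
from contextlib import contextmanager
from pathlib import Path


class BuildLockHeldError(RuntimeError):
    """Raised when another invocation already holds the lock."""


@contextmanager
def build_lock(cache_root: Path):
    lock_path = cache_root / ".c10.lock"
    try:
        # O_EXCL makes creation fail if the lockfile already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        # The existing lockfile belongs to someone else: do not delete it.
        raise BuildLockHeldError(f"lock held: {lock_path}") from exc
    try:
        yield lock_path
    finally:
        # Runs on return, on raise, and on early exit from the with-block.
        os.close(fd)
        lock_path.unlink()
```

Wrapping the whole of `build_cache_artifacts` in one such block means the success, failure, no-op, and exception paths all share a single release point.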
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Cold build with fakes for phases | All phases called once; SUCCESS | +| AC-2 | Warm re-run with identical request | IDEMPOTENT_NO_OP; zero phase calls | +| AC-3 | Different bbox after prior build | SUCCESS; atomic replace; old Manifest gone | +| AC-4 | Empty C6 query | FAILURE; hint string; lock released | +| AC-5 | Pre-acquire lock externally; run | BuildLockHeldError | +| AC-6 | Inject orphan file before coverage walk | ManifestCoverageError; prior Manifest restored | +| AC-7 | Same as AC-6 with `coverage_strict=False` | SUCCESS; WARN log | +| AC-8 | Each error path | Lock released after each | +| AC-9 | engine_compiler raises | Error propagates; rollback; lock released | +| AC-10 | Direct call to compile_engines_for_corpus | Single passthrough; no lock | +| AC-11 | Conformance | True | +| AC-12 | Cold build bench (skipped on CI; manual) | ≤ 12 min | +| AC-13 | Warm bench | ≤ 1 min | +| NFR-perf-coverage-walk | 10k files in cache_root | ≤ 1 s | + +## Constraints + +- The orchestrator does NOT touch `satellite-provider` (CP-INV-6); all I/O is local. +- Lockfile is mandatory; bypassing the lock for testing is a config flag, NOT a separate code path. +- Idempotence check parses ONLY `build.manifest_hash` from the existing Manifest; full verification is AZ-324's job (separate code path). +- `Manifest.json.prev` is the rollback target; never two prevs deep (rebuilds are not stack-able). +- Coverage walk EXCLUDES the lockfile, the Manifest itself, its sidecar, its signature, and any `.prev` rollback file. +- The orchestrator never modifies engines compiled by AZ-321 (atomic on disk) — it only touches the Manifest + .prev/.lock files. +- Operator key handling delegates entirely to AZ-323 (CP-INV-7). +- This task introduces at most ONE new third-party dependency (`filelock`) — verify against existing deps first. 
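The coverage-walk exclusion rule in the Constraints above could look roughly like this. It is a sketch under stated assumptions: `find_orphans` is a hypothetical helper, and the sidecar and signature filenames (`Manifest.json.sha256`, `Manifest.json.sig`) are placeholders, since the spec names only the lockfile, the Manifest, and the `.prev` file.

```python
from pathlib import Path
from typing import Iterable, List

# Files the walk must ignore. The last two names are assumed, not from the spec.
EXCLUDED_NAMES = frozenset({
    ".c10.lock",
    "Manifest.json",
    "Manifest.json.sha256",  # assumed sidecar name
    "Manifest.json.sig",     # assumed signature name
})


def find_orphans(cache_root: Path, manifest_paths: Iterable[str]) -> List[str]:
    """Return files under cache_root that the Manifest does not list."""
    listed = set(manifest_paths)
    orphans = []
    for path in cache_root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(cache_root).as_posix()
        # Skip bookkeeping files, any .prev rollback file, and listed artifacts.
        if rel in EXCLUDED_NAMES or rel.endswith(".prev") or rel in listed:
            continue
        orphans.append(rel)
    return sorted(orphans)
```

In strict mode a non-empty result would raise `ManifestCoverageError`; with `coverage_strict=False` it would be logged once at WARN.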
+ +## Risks & Mitigation + +**Risk 1: Stale lockfile after process kill** +- *Risk*: A SIGKILL'd build leaves `.c10.lock` on disk; subsequent runs always raise `BuildLockHeldError`. +- *Mitigation*: Use `filelock` library which uses fcntl flock (auto-released on process exit). On platforms without fcntl, document the manual cleanup step. AC-5 + AC-8 cover normal lock release; the SIGKILL case is an OS-level guarantee from filelock. + +**Risk 2: Coverage walk slow on huge cache_root** +- *Risk*: 100k files in cache_root → coverage walk could take seconds. +- *Mitigation*: NFR-perf-coverage-walk benchmark; if exceeded, switch to streaming compare with a sorted Manifest path list. Out of scope for the initial impl. + +**Risk 3: Idempotence check trusts prior Manifest's hash without verifying signature** +- *Risk*: A tampered Manifest could lie about its `manifest_hash`, fooling the orchestrator into IDEMPOTENT_NO_OP and skipping a needed rebuild. +- *Mitigation*: This is acceptable because AZ-324's `ManifestVerifier` runs at takeoff — a tampered Manifest fails verify and prevents arming. The orchestrator's role is to AVOID rebuilds when nothing changed; trusting `manifest_hash` is a performance optimization, not a security check. Documented in CP-INV-1. + +**Risk 4: Empty `coverage_strict=False` becomes the de-facto default** +- *Risk*: Operators set `coverage_strict=False` to ship faster, defeating D-C10-3. +- *Mitigation*: Default is True; the config flag is documented as "for forensic builds only"; CI runs always assert strict. + +**Risk 5: Rollback corrupts state on partial coverage walk failure** +- *Risk*: If `Manifest.json.prev` rename fails (e.g., disk full), the cache is left in an in-between state. +- *Mitigation*: Use AZ-280's atomic rename helper; if the rename itself fails, surface a distinct `ManifestRollbackError` (subclass of `ManifestCoverageError`) so operators see the disk-level cause. Documented but not a blocker for v1.0.0. 
+ +**Risk 6: Lock acquisition races with operator's manual file ops** +- *Risk*: Operator manually edits a file in cache_root while a build is running. +- *Mitigation*: Coverage walk happens at the end of build; if operator drops a file mid-build, AC-6 catches it. The lockfile prevents two CONCURRENT builds, not operator-vs-build interference. Documented. + +## Runtime Completeness + +- **Named capability**: top-level F1 cache build with D-C10-1 idempotence + D-C10-3 ManifestCoverageError + lockfile race-condition mitigation (epic § Acceptance C10-IT-01, C10-IT-03, C10-IT-04; description.md § 1, § 5, § 7). +- **Production code that must exist**: real `CacheProvisioner` orchestrating real AZ-321/322/323 + real `filelock`-backed lock + real coverage walk + real rollback. +- **Allowed external stubs**: tests MAY use spy/fake versions of AZ-321/322/323 (already produced by their conformance tests) + an in-process `FileLockFactory` for deterministic concurrency tests. +- **Unacceptable substitutes**: skipping the lockfile (defeats CP-INV-4); skipping the coverage walk (defeats D-C10-3); a "soft" idempotence that re-builds anyway (defeats D-C10-1 and the C10-PT-01 1-min warm target); calling AZ-324's `ManifestVerifier` for the idempotence check (over-kill — full verify on every operator invocation triples warm-path cost); deleting partial engines on failure (operators rely on them for diagnostic per AC-9). diff --git a/_docs/02_tasks/todo/AZ-326_c12_cli_app.md b/_docs/02_tasks/todo/AZ-326_c12_cli_app.md new file mode 100644 index 0000000..42ecba0 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-326_c12_cli_app.md @@ -0,0 +1,188 @@ +# C12 CLI App — Typer Entry Point + Subcommand Routing + Operator Helpers + +**Task**: AZ-326_c12_cli_app +**Name**: C12 CLI App +**Description**: Implement the operator-tooling CLI shell that operators run on the workstation. 
Wires Typer (per the Click/Typer project pin) into `operator_tool/__main__.py`, registers six subcommands (`download`, `build-cache`, `upload-pending`, `reloc-confirm`, `verify-ready`, `set-sector`), wires the E-CC-LOG (AZ-266) logger to a workstation-side structured-JSON log file (`~/.azaion/onboard/c12-tooling.log`), and ships the two trivial operator-side helpers from description.md § 2 — `set_sector_classification(area, sector_class)` (persists per-area classification to a local JSON file under the operator workstation's home directory) and `apply_freshness_threshold(sector_class) -> int (months)` (a pure-data lookup that maps the sector classification enum to the AC-NEW-6 months freshness budget). Each subcommand is a thin shell that resolves its service collaborator (`build_cache`, `companion_bringup`, `post_landing_upload`, `operator_reloc_service` — all owned by sibling tasks AZ-NNN T2..T5) from the composition root and delegates to it; on success returns 0; on a known error type maps to a documented non-zero exit code with a one-line operator-friendly message + remediation hint pulled from the underlying error's `remediation` attribute. The CLI app does NOT own any workflow logic itself — only command registration, argument parsing, logger wiring, exit-code mapping, and the two simple operator helpers. +**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module +**Component**: c12_operator_tooling (epic AZ-253 / E-C12) +**Tracker**: AZ-326 +**Epic**: AZ-253 (E-C12) + +### Document Dependencies + +- `_docs/02_document/components/13_c12_operator_tooling/description.md` — § 2 (`set_sector_classification`, `apply_freshness_threshold` from `CacheBuildWorkflow`), § 5 (logging strategy table), § 7 (CLI-only this cycle, GUI deferred). +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — INFO/WARN/ERROR log shapes for operator events. 
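The pure-data freshness lookup named in the description could be as small as the sketch below. The month values (1 and 12) are the ones this task's Outcome section documents; making the enum a `str` subclass is an assumption made here so that values serialise directly to JSON.

```python
from enum import Enum


class SectorClassification(str, Enum):
    # str-valued so members serialise cleanly to JSON (an assumption of this sketch).
    active_conflict = "active_conflict"
    stable_rear = "stable_rear"


# AC-NEW-6 freshness budget in months, per sector classification.
FRESHNESS_TABLE = {
    SectorClassification.active_conflict: 1,
    SectorClassification.stable_rear: 12,
}


def freshness_threshold_months(sector_class: SectorClassification) -> int:
    """Look up the freshness budget; raises KeyError for an unknown class."""
    return FRESHNESS_TABLE[sector_class]
```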
+ +## Problem + +Without a real CLI shell: + +- F1 (pre-flight cache build) and F10 (post-landing upload) have no operator entry point — every workflow function in this epic is unreachable from the workstation. +- AC-NEW-6 (freshness pipeline) collapses partially — sibling tasks have no canonical place to call `apply_freshness_threshold(sector_class)` so each invents its own table or hard-codes months. +- Sector classification (active-conflict vs stable-rear) per description.md § 1 has no persistent surface; operator restarts lose all classifications. +- Logging from C12 is silent — without the wiring of E-CC-LOG to the workstation-side log file, every operator action is invisible during incident review. +- Sibling tasks T2..T5 have no consumer; their service classes ship but no end-to-end CLI flow exercises them. +- Exit codes are inconsistent across subcommands — operators script `operator-tool` runs and need `$?` to mean something specific per failure category. + +This task delivers the CLI shell + the two trivial operator helpers. It does NOT own `build_cache`, `verify_companion_ready`, `trigger_post_landing_upload`, or `OperatorReLocService` — those are sibling tasks invoked through the CLI. + +## Outcome + +- A Typer-based CLI app at `src/operator_tool/`: + - `src/operator_tool/__main__.py` — module entry point: `from operator_tool.cli import app; app()`. + - `src/operator_tool/cli.py` — Typer `app = typer.Typer(name="operator-tool", help="GPS-denied onboard pre-flight tooling (operator workstation)")`. Registers six subcommands via `@app.command(...)`. Each subcommand opens a logging context, calls into its service collaborator, catches the documented exception family for that command, maps to the documented exit code, and `raise typer.Exit(code=N)`. + - `src/operator_tool/sector_classification_store.py` — `SectorClassificationStore` class: + - Constructor: `__init__(self, *, store_path: Path, logger: Logger)`. 
+ - `set_classification(area: AreaIdentifier, sector_class: SectorClassification) -> None` — persists `{area_id: sector_class}` mapping to `store_path` (default: `~/.azaion/onboard/sector-classifications.json`) using atomic write (`tempfile + os.replace`). + - `get_classification(area: AreaIdentifier) -> SectorClassification | None` — reads the JSON file; returns the classification for the given area or `None` if not set. + - `list_classifications() -> dict[AreaIdentifier, SectorClassification]` — returns all current classifications. + - File format: `{"area_id": "active_conflict" | "stable_rear", ...}`. + - INFO log on every `set_classification` call (`kind="c12.sector.classification.set"`). + - `src/operator_tool/freshness_table.py` — `freshness_threshold_months(sector_class: SectorClassification) -> int`: + - Pure data: `active_conflict → 1 month`; `stable_rear → 12 months`. Documented inline as the AC-NEW-6 freshness budget per description.md § 1 + Plan-phase intent. + - Module-level constant: `FRESHNESS_TABLE: dict[SectorClassification, int]`. + - `src/operator_tool/exit_codes.py` — module-level constants: `EXIT_OK = 0`, `EXIT_GENERIC_ERROR = 1`, `EXIT_USAGE = 2`, `EXIT_COMPANION_UNREACHABLE = 10`, `EXIT_CONTENT_HASH_MISMATCH = 11`, `EXIT_DOWNLOAD_FAILURE = 20`, `EXIT_BUILD_FAILURE = 21`, `EXIT_FLIGHT_STATE_NOT_CONFIRMED = 30`, `EXIT_UPLOAD_FAILURE = 31`, `EXIT_GCS_LINK_ERROR = 40`, `EXIT_LOCK_HELD = 50`. Sibling tasks may extend with documented additions. +- A composition root entry at `src/gps_denied_onboard/runtime_root/c12_factory.py`: + - `build_operator_tool(config: Config) -> OperatorToolServices` — pure factory that constructs the `SectorClassificationStore` + a logger configured to write to `~/.azaion/onboard/c12-tooling.log`. Returns a frozen dataclass aggregating the operator-tool service handles. Sibling tasks T2..T5 each add their service to this dataclass without renaming or moving it. 
+- Subcommand surface (each subcommand body lives in `cli.py`; service implementations live in sibling task files): + - `download` — delegates to `tile_downloader.fetch(...)` (AZ-316). Maps `SatelliteProviderError → EXIT_DOWNLOAD_FAILURE`. + - `build-cache` — delegates to `build_cache_orchestrator.build_cache(...)` (sibling T3). Maps `CacheBuildError → EXIT_DOWNLOAD_FAILURE | EXIT_BUILD_FAILURE` (per `failure_phase`); `BuildLockHeldError → EXIT_LOCK_HELD`. + - `upload-pending` — delegates to `post_landing_upload.trigger_post_landing_upload(...)` (sibling T4). Maps `FlightStateNotConfirmedError → EXIT_FLIGHT_STATE_NOT_CONFIRMED`; `UploadGateBlockedError → EXIT_UPLOAD_FAILURE`. + - `reloc-confirm` — delegates to `operator_reloc_service.request_reloc(...)` (sibling T5). Maps `GcsLinkError → EXIT_GCS_LINK_ERROR`. + - `verify-ready` — delegates to `companion_bringup.verify_companion_ready(...)` (sibling T2). Maps `CompanionUnreachableError → EXIT_COMPANION_UNREACHABLE`; `ContentHashMismatchError → EXIT_CONTENT_HASH_MISMATCH`. + - `set-sector` — delegates to `SectorClassificationStore.set_classification(...)`. +- Each subcommand's `--help` includes a one-line summary + the AC IDs it supports (e.g. `build-cache: orchestrate F1 (AC-8.3, AC-NEW-1)`). +- Logging is wired at app startup: a single rotating file handler at `~/.azaion/onboard/c12-tooling.log`, structured JSON formatter from E-CC-LOG (AZ-266). Console (stderr) handler at WARN level for operator visibility. +- `pyproject.toml` registers `operator-tool` as a console script entry point pointing at `operator_tool.__main__:main`. The `main` function in `__main__.py` calls `app()`. + +## Scope + +### Included + +- `operator_tool` package layout (`__init__.py`, `__main__.py`, `cli.py`, `sector_classification_store.py`, `freshness_table.py`, `exit_codes.py`). +- The composition-root factory `build_operator_tool`. 
+- Six subcommand registrations + per-subcommand `--help` text + per-subcommand exception → exit-code mapping. +- `SectorClassificationStore` with atomic-write JSON persistence. +- `freshness_threshold_months` pure-data lookup. +- The exit-code constants module. +- Logger wiring for the workstation-side log file (rotating file handler + structured JSON via AZ-266). +- Console-script entry-point declaration in `pyproject.toml`. +- Unit tests covering: subcommand registration, exception → exit-code mapping (using fakes for service collaborators), `SectorClassificationStore` round-trip (set, get, atomic write resilience), `freshness_threshold_months` for both enum values, console script invocability via `subprocess.run`. + +### Excluded + +- The actual workflows for `build-cache`, `upload-pending`, `reloc-confirm`, `verify-ready` — owned by sibling tasks T2..T5. +- The download workflow body — owned by AZ-316. +- The MAVLink encoding for `reloc-confirm` — owned by sibling T5. +- A GUI surface — Plan-phase carryforward, deferred per description.md § 7. +- Anything that runs on the airborne companion (this entire package is operator-workstation-only per ADR-004). +- Per-subcommand integration tests against real `satellite-provider` — those live in C12-AT-01 (test decompose). 
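The exception-to-exit-code mapping described in the Outcome can be sketched without Typer. The exit-code values below are the ones listed for `exit_codes.py`; the exception classes are empty placeholders standing in for the sibling tasks' real types, and `to_exit` is a hypothetical helper name.

```python
EXIT_GENERIC_ERROR = 1
EXIT_COMPANION_UNREACHABLE = 10
EXIT_CONTENT_HASH_MISMATCH = 11
EXIT_DOWNLOAD_FAILURE = 20
EXIT_FLIGHT_STATE_NOT_CONFIRMED = 30
EXIT_UPLOAD_FAILURE = 31
EXIT_GCS_LINK_ERROR = 40
EXIT_LOCK_HELD = 50


# Placeholder exception types; the real ones are owned by sibling tasks.
class SatelliteProviderError(Exception): ...
class BuildLockHeldError(Exception): ...
class FlightStateNotConfirmedError(Exception): ...
class UploadGateBlockedError(Exception): ...
class GcsLinkError(Exception): ...
class CompanionUnreachableError(Exception): ...
class ContentHashMismatchError(Exception): ...


EXIT_CODE_BY_ERROR = {
    SatelliteProviderError: EXIT_DOWNLOAD_FAILURE,
    BuildLockHeldError: EXIT_LOCK_HELD,
    FlightStateNotConfirmedError: EXIT_FLIGHT_STATE_NOT_CONFIRMED,
    UploadGateBlockedError: EXIT_UPLOAD_FAILURE,
    GcsLinkError: EXIT_GCS_LINK_ERROR,
    CompanionUnreachableError: EXIT_COMPANION_UNREACHABLE,
    ContentHashMismatchError: EXIT_CONTENT_HASH_MISMATCH,
}


def to_exit(exc: Exception) -> tuple:
    """Return (exit_code, one-line operator message) for a raised exception."""
    code = EXIT_GENERIC_ERROR
    for exc_type, mapped in EXIT_CODE_BY_ERROR.items():
        if isinstance(exc, exc_type):
            code = mapped
            break
    message = f"{type(exc).__name__}: {exc}"
    hint = getattr(exc, "remediation", "")  # remediation is optional on the error
    if hint:
        message += f" | hint: {hint}"
    return code, message
```

A subcommand body would then print the message to stderr and `raise typer.Exit(code=code)`, which keeps the mapping testable without invoking the CLI.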
+ +## Acceptance Criteria + +**AC-1: All six subcommands register and appear in `--help`** +Given the `operator-tool` console script is installed +When the operator runs `operator-tool --help` +Then the listed subcommands include exactly `download`, `build-cache`, `upload-pending`, `reloc-confirm`, `verify-ready`, `set-sector`; no extras + +**AC-2: Successful subcommand exits 0** +Given a subcommand whose service collaborator returns successfully +When the subcommand is invoked through the CLI +Then the process exit code is 0; no error message is printed to stderr; an INFO log entry is written + +**AC-3: Each documented exception maps to its documented exit code** +Given a service collaborator raises one of the documented exception types in this task's outcome list +When the subcommand is invoked +Then the process exit code matches the constant in `exit_codes.py`; a one-line operator-friendly message is printed to stderr; an ERROR log entry is written with the exception type and the remediation hint + +**AC-4: `SectorClassificationStore` round-trips via atomic write** +Given an empty store +When `set_classification(area="Derkachi", sector_class=SectorClassification.active_conflict)` is called and a fresh `SectorClassificationStore` is constructed pointing at the same path +Then `get_classification("Derkachi")` returns `SectorClassification.active_conflict`; the on-disk JSON file matches the expected shape; the file's parent directory was created if missing + +**AC-5: `SectorClassificationStore` set is atomic under crash** +Given an existing JSON file with one classification +When the process is killed (`SIGKILL`) mid-write of a second classification (simulated by patching `os.replace` to raise after the tempfile has been written) +Then the original JSON file remains intact and parseable; no `*.tmp` lingers + +**AC-6: `freshness_threshold_months` returns the documented values** +Given the two enum values of `SectorClassification` +When
`freshness_threshold_months(...)` is called for each +Then `active_conflict → 1`, `stable_rear → 12` + +**AC-7: Logging writes structured JSON to the workstation log file** +Given a fresh CLI invocation with `~/.azaion/onboard/` empty +When any subcommand runs to completion +Then a `c12-tooling.log` file exists at `~/.azaion/onboard/`; its lines parse as JSON; each line carries `timestamp`, `level`, `kind`, plus subcommand-specific fields per AZ-266's record schema + +**AC-8: Console-script entry point is installed and runnable** +Given the package is installed via `pip install -e .` +When the shell runs `operator-tool --help` +Then the help text is printed; the exit code is 0; the binary resolves through the entry-point declared in `pyproject.toml` + +**AC-9: Subcommand `--help` references the relevant AC IDs** +Given any subcommand +When that subcommand's `--help` is run +Then the help text body includes the AC IDs the subcommand supports (e.g. `build-cache` mentions `AC-8.3, AC-NEW-1`); operators reading `--help` can cross-reference to `acceptance_criteria.md` + +**AC-10: `set-sector` is idempotent for the same input** +Given `set-sector --area Derkachi --class active_conflict` was just run +When the same command is run again +Then the on-disk JSON data file is byte-identical (timestamps may differ in the log, never in the data file); the operator sees the same exit code 0 and the same INFO log line + +## Non-Functional Requirements + +**Performance** +- CLI cold start (`operator-tool --help`) ≤ 500 ms on a developer laptop. The Typer app must avoid eager-importing heavy dependencies (httpx, pymavlink, paramiko) — sibling tasks expose lazy-import accessors used by their respective subcommands, not at module load time. + +**Compatibility** +- Click/Typer per the project pin (no version override). +- The structured JSON log format matches AZ-266's record schema exactly; this task adds no new top-level field.
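The atomic-write behaviour that AC-4 and AC-5 require of `SectorClassificationStore` might look like the sketch below. `atomic_write_json` is a hypothetical helper name; the `tempfile.NamedTemporaryFile` + `os.replace` pairing follows this task's stated convention, while the `fsync` call and sorted keys are additions assumed here for durability and byte-stable output.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(store_path: Path, data: dict) -> None:
    """Write data to store_path so readers see either the old or the new file."""
    # mode applies to the final directory only; parents use the default mode.
    store_path.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
    tmp = tempfile.NamedTemporaryFile(
        mode="w", encoding="utf-8", dir=store_path.parent,
        suffix=".tmp", delete=False,
    )
    try:
        with tmp:
            json.dump(data, tmp, sort_keys=True)  # sorted keys keep output byte-stable
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp.name, store_path)  # atomic swap on the same filesystem
    except BaseException:
        # Remove the partial temp file so no *.tmp lingers after a failure.
        try:
            os.unlink(tmp.name)
        except FileNotFoundError:
            pass
        raise
```

Because the swap is the last step, a failure at any earlier point leaves the original file untouched, which is the property AC-5 checks.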
+ +**Reliability** +- The `SectorClassificationStore` write path is atomic across process kill (per AC-5). +- `~/.azaion/onboard/` is created with mode `0o700` if it does not exist. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | `operator-tool --help` output | All 6 subcommands listed | +| AC-2 | Subcommand with success-returning fake service | Exit 0, INFO log, no stderr | +| AC-3 | Subcommand with raising fake (each documented exception family) | Exit code matches `exit_codes.py`; ERROR log; one-line stderr | +| AC-4 | Round-trip `SectorClassificationStore` set → read | Matches input | +| AC-5 | Patched `os.replace` to raise mid-write | Original file intact, no `*.tmp` lingers | +| AC-6 | `freshness_threshold_months` for both enums | `active_conflict → 1`, `stable_rear → 12` | +| AC-7 | Subcommand run, then read log file | Each line parses as JSON; required fields present | +| AC-8 | `subprocess.run(["operator-tool", "--help"])` after `pip install -e .` | Exit 0, help text printed | +| AC-9 | Per-subcommand `--help` text | Includes documented AC IDs | +| AC-10 | Repeated `set-sector` for same area/class | On-disk JSON byte-identical | +| NFR-perf-cold-start | Microbench `operator-tool --help` × 10 | p99 ≤ 500 ms | + +## Constraints + +- This task introduces NO new third-party dependencies — Click/Typer is already pinned by the project per description.md § 5. +- Heavy dependencies (httpx, pymavlink, paramiko) MUST NOT be eager-imported in `cli.py` or `__main__.py`; they live behind the sibling tasks' service classes that are lazy-resolved. +- The CLI is operator-workstation-only — `operator_tool` MUST NOT be importable from any airborne entry point. Verified at the SBOM-diff level by E-BOOT (the CI gate already enforces no `operator_tool` symbol in `production-binary`). +- Atomic writes use `tempfile.NamedTemporaryFile(dir=store_path.parent) + os.replace`. 
Naked `Path.write_text()` is NOT acceptable per `coderule.mdc` "follow established project patterns" (see AZ-280's atomic-write pattern for the established convention; this task uses the simpler stdlib version since there is no SHA-256 sidecar requirement here). +- Log file location is fixed at `~/.azaion/onboard/c12-tooling.log` per description.md § 9 — config-overrideable via `config.c12.log_path` for tests but the default MUST match the spec. +- Subcommand naming is the source of truth for operators; renaming a subcommand requires a Plan-cycle change. + +## Risks & Mitigation + +**Risk 1: Heavy imports leak into CLI startup** +- *Risk*: A future sibling task imports a heavy dependency at the wrong scope (module level instead of function level), violating NFR-perf-cold-start. +- *Mitigation*: The NFR-perf-cold-start microbenchmark measures startup and runs in CI. If a regression appears, the offending import is surfaced by `python -X importtime`. + +**Risk 2: Operator runs `set-sector` against a stale store path after upgrade** +- *Risk*: An operator upgrades the operator-tool tarball; the new version changes the default `store_path`; classifications appear lost. +- *Mitigation*: The default path is fixed at `~/.azaion/onboard/sector-classifications.json` and treated as a stable contract. A future cycle that needs to migrate runs an explicit migration; this cycle does NOT change the path. + +**Risk 3: Console script collides with another tool** +- *Risk*: The name `operator-tool` is generic; another package on the operator's workstation could shadow it. +- *Mitigation*: The package is shipped as part of the operator-tooling tarball with its own venv; no global install. README documents the tarball install procedure. + +**Risk 4: Atomic-write corner case — disk full mid-tempfile** +- *Risk*: `tempfile.NamedTemporaryFile.write` could raise `OSError` mid-call; partial tempfile lingers.
+- *Mitigation*: `try/finally` deletes the tempfile path on any exception in the write path; AC-5 covers the kill-mid-replace case; the disk-full case surfaces as `OSError` to the caller and the original file remains intact. diff --git a/_docs/02_tasks/todo/AZ-327_c12_companion_bringup.md b/_docs/02_tasks/todo/AZ-327_c12_companion_bringup.md new file mode 100644 index 0000000..8b25814 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-327_c12_companion_bringup.md @@ -0,0 +1,209 @@ +# C12 Companion Bringup — SSH `verify_companion_ready` + `ReadinessReport` + +**Task**: AZ-327_c12_companion_bringup +**Name**: C12 Companion Bringup +**Description**: Implement `CompanionBringup`, the C12-internal helper that opens an SSH session against the companion (paramiko per project pin), inspects the companion-side filesystem for the four required pre-flight artifacts (Manifest.json, .engine files + AZ-280 sidecars, calibration JSON), runs sidecar verification on the engines via a remote `sha256sum` over the engine path (compared against the sidecar's hex digest), and returns a `ReadinessReport` per description.md § 2 (`manifest_present`, `content_hashes_pass`, `engines_present`, `calibration_present`, `outcome ∈ {ready, not_ready}`, `not_ready_reasons: list[str]`). Owns the two error families: `CompanionUnreachableError` (SSH session-open failure: TCP refused, auth failed, host key mismatch, socket timeout) and `ContentHashMismatchError` (sidecar verification fails on at least one engine — distinct from "engine missing", which is a not-ready signal not an exception). Public surface is one method `verify_companion_ready(companion_address: CompanionAddress) -> ReadinessReport`. SSH user, key file, host-key policy, connect-timeout, and the canonical companion-side cache root come from config (`config.c12.companion_ssh_user`, `config.c12.companion_ssh_keyfile`, `config.c12.companion_host_key_policy`, `config.c12.companion_connect_timeout_s`, `config.c12.companion_cache_root`) per AZ-269. 
The session is used inside a `try/finally` block; the connection is always closed even if one of the four checks raises. INFO log on every call that returns `outcome=ready` (with the four boolean flags + outcome); WARN on degraded readiness (at least one check failed without raising, so `outcome=not_ready`); ERROR on the two error families. +**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module +**Component**: c12_operator_tooling (epic AZ-253 / E-C12) +**Tracker**: AZ-327 +**Epic**: AZ-253 (E-C12) + +### Document Dependencies + +- `_docs/02_document/components/13_c12_operator_tooling/description.md` — § 2 (`verify_companion_ready` interface + `ReadinessReport` DTO shape), § 5 (`CompanionUnreachableError`, `ContentHashMismatchError`), § 7 (filesystem lockfile note — relevant for orchestrator T3 not this task). +- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — sidecar file format (this task verifies remotely; does not import the helper but reuses the schema). +- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — engine filename layout used to enumerate the expected engines list. +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — INFO/WARN/ERROR log shapes.
+- `CompanionUnreachableError` and `ContentHashMismatchError` exist as concept-only types in description.md § 5 with no producer. +- Configuration knobs for SSH credentials, host-key policy, and the canonical cache root have no consumer; AZ-269's loader cannot validate them against a concrete usage. + +This task delivers the bring-up + verification layer. It does NOT orchestrate the `build_cache` flow (sibling T3 does), does NOT invoke C10 (T3 does via SSH after this task confirms readiness), and does NOT perform the takeoff-time content-hash verification (AZ-324 owns the airborne side). + +## Outcome + +- A `CompanionBringup` class at `src/operator_tool/companion_bringup.py`: + - Constructor: `__init__(self, *, ssh_factory: SshSessionFactory, sidecar_verifier: RemoteSidecarVerifier, logger: Logger, config: C12CompanionConfig)`. + - `C12CompanionConfig` (`@dataclass(frozen=True)`): `ssh_user: str`, `ssh_keyfile: Path`, `host_key_policy: enum {strict, known_hosts, reject_new}`, `connect_timeout_s: float = 10.0`, `companion_cache_root: PurePosixPath = PurePosixPath("/var/lib/azaion/c10/cache")`, `manifest_filename: str = "Manifest.json"`, `calibration_filename: str = "camera_calibration.json"`, `expected_engines: tuple[str, ...] = ()` (the orchestrator passes the list per the request; default empty fails AC-2 cleanly). + - Public method: `verify_companion_ready(companion_address: CompanionAddress) -> ReadinessReport`. +- DTOs at `src/operator_tool/_types.py`: + - `CompanionAddress` (`@dataclass(frozen=True)`): `host: str`, `port: int = 22`. + - `ReadinessReport` (`@dataclass(frozen=True)`): `manifest_present: bool`, `content_hashes_pass: bool`, `engines_present: bool`, `calibration_present: bool`, `outcome: enum {ready, not_ready}`, `not_ready_reasons: tuple[str, ...]`, `companion_cache_root: str`, `engines_inspected_count: int`. 
+- Errors at `src/operator_tool/errors.py`: + - `CompanionUnreachableError(Exception)`: attributes `host: str`, `port: int`, `reason: enum {connect_refused, auth_failed, host_key_mismatch, timeout, other}`, `underlying_exception_repr: str`. `remediation` attribute returns a one-line operator-friendly hint per `reason`. + - `ContentHashMismatchError(Exception)`: attributes `engine_path: str`, `expected_sha256_hex: str`, `actual_sha256_hex: str`. `remediation` attribute returns "Re-run the cache build (`operator-tool build-cache --area ...`) to repopulate the affected engine.". +- A `SshSessionFactory` Protocol at `src/operator_tool/ssh_session.py`: + ```python + @runtime_checkable + class SshSession(Protocol): + def run(self, command: str, *, timeout_s: float) -> RemoteCommandResult: ... + def file_exists(self, remote_path: PurePosixPath) -> bool: ... + def list_dir(self, remote_path: PurePosixPath) -> list[str]: ... + def close(self) -> None: ... + + @runtime_checkable + class SshSessionFactory(Protocol): + def open(self, address: CompanionAddress, *, timeout_s: float) -> SshSession: ... + ``` + Concrete implementation `ParamikoSshSessionFactory` wraps `paramiko.SSHClient` with the documented host-key policy mapping (`strict → RejectPolicy`, `known_hosts → AutoAddPolicy gated on `~/.ssh/known_hosts` presence`, `reject_new → RejectPolicy with explicit allowlist`). +- A `RemoteSidecarVerifier` helper at `src/operator_tool/remote_sidecar_verifier.py`: + - `verify(session: SshSession, engine_path: PurePosixPath) -> RemoteSidecarResult` — runs `sha256sum ` over the SSH session, parses the first 64 hex chars, reads the sidecar file at `.sha256` via `session.run("cat ...")`, parses its 64 hex chars, compares case-insensitively. Returns `RemoteSidecarResult(matches: bool, expected_hex: str, actual_hex: str)`. +- Method flow for `verify_companion_ready`: + 1. Open SSH session via `ssh_factory.open(companion_address, timeout_s=config.connect_timeout_s)`. 
On any paramiko/socket exception → catch and raise `CompanionUnreachableError` mapping the underlying type to a `reason` enum value. Always wrap subsequent steps in `try/finally` that closes the session. + 2. Check 1 — `manifest_present`: `session.file_exists(companion_cache_root / manifest_filename)`. + 3. Check 2 — `engines_present`: `session.list_dir(companion_cache_root / "engines")` → set of filenames; compare against `config.expected_engines`. If `config.expected_engines` is empty → `engines_present = False`, `not_ready_reasons += ["expected_engines list empty in caller-supplied config"]`. Else `engines_present = expected_engines.issubset(listed_engines)`; if not, append `"engines_missing: "`. + 4. Check 3 — `content_hashes_pass`: for each engine in the intersection of `expected_engines` and `listed_engines`, call `sidecar_verifier.verify(session, companion_cache_root / "engines" / engine)`. If ANY result `matches == False` → raise `ContentHashMismatchError` with the first failing path. If all match → `content_hashes_pass = True`. Records `engines_inspected_count` regardless. + 5. Check 4 — `calibration_present`: `session.file_exists(companion_cache_root / calibration_filename)`. + 6. Compute `outcome`: `ready` iff all four booleans are `True`; `not_ready` otherwise. + 7. Emit log: INFO `kind="c12.companion.ready"` with the four flags + outcome on success; WARN `kind="c12.companion.degraded"` if any check failed without raising (i.e. `outcome=not_ready` due to a missing artifact, not a hash mismatch). + 8. Return the `ReadinessReport`. +- Composition-root factory at `src/gps_denied_onboard/runtime_root/c12_factory.py` extends T1's `OperatorToolServices` dataclass with a `companion_bringup: CompanionBringup` field. The factory `build_companion_bringup(config) -> CompanionBringup` constructs the paramiko-backed session factory + remote sidecar verifier + logger. + +## Scope + +### Included + +- `CompanionBringup` class with the single public method. 
+- The 2 DTOs (`CompanionAddress`, `ReadinessReport`) plus the `outcome` and `reason` enum types. +- The 2 error types (`CompanionUnreachableError`, `ContentHashMismatchError`) with `remediation` attributes. +- `SshSessionFactory` + `SshSession` Protocols. +- `ParamikoSshSessionFactory` + `ParamikoSshSession` concrete implementations. +- `RemoteSidecarVerifier` helper. +- Composition-root factory. +- Config schema extension on AZ-269's loader (`config.c12.companion_*` block). +- `verify-ready` subcommand wiring delegated to T1's CLI shell — this task ships the service class; T1's `cli.py` resolves it from the composition root. +- Conformance unit tests using a fake `SshSessionFactory` (no paramiko in unit tests) covering all 10 acceptance criteria. + +### Excluded + +- The `build_cache` orchestration that consumes `verify_companion_ready` (sibling T3). +- The actual SSH-invocation of C10 on the companion (sibling T3). +- The takeoff-time content-hash verification on the airborne side (AZ-324). +- Engine compilation (AZ-321), descriptor generation (AZ-322), Manifest writing (AZ-323) — C10 owns all of these, and they run before this task is invoked. +- A SOCKS proxy or jump-host SSH path — direct SSH only this cycle. +- Telemetry exfiltration of operator workstation key material — host key + private key never appear in log output (only fingerprint hash if at all).
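The four-check method flow for `verify_companion_ready` described in the Outcome section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the task's implementation: `FakeSession` is a hypothetical scripted stand-in whose `run` returns plain stdout text (the spec's `RemoteCommandResult` shape is not shown here), the sidecar is assumed to live at the engine path plus `.sha256`, the default `calibration.json` filename and the `"manifest missing"` / `"calibration missing"` reason strings are placeholders, and `content_hashes_pass` is treated as vacuously true when no engine was inspected.

```python
from __future__ import annotations

import re
import shlex
from dataclasses import dataclass
from pathlib import PurePosixPath

_HEX64 = re.compile(r"[0-9a-fA-F]{64}")


def parse_digest(text: str) -> str:
    """Return the leading 64 hex chars of `sha256sum` or sidecar output, lowercased."""
    match = _HEX64.match(text.strip())
    if match is None:
        raise ValueError("output does not start with a SHA-256 hex digest")
    return match.group(0).lower()


class ContentHashMismatchError(Exception):
    def __init__(self, engine_path: str, expected_sha256_hex: str, actual_sha256_hex: str) -> None:
        super().__init__(f"sidecar digest mismatch for {engine_path}")
        self.engine_path = engine_path
        self.expected_sha256_hex = expected_sha256_hex
        self.actual_sha256_hex = actual_sha256_hex


@dataclass(frozen=True)
class ReadinessReport:
    manifest_present: bool
    engines_present: bool
    content_hashes_pass: bool
    calibration_present: bool
    engines_inspected_count: int
    outcome: str
    not_ready_reasons: tuple[str, ...]


class FakeSession:
    """Hypothetical scripted session: `files` maps remote path -> text content,
    `digests` maps engine path -> the digest a real `sha256sum` would print."""

    def __init__(self, files: dict[str, str], digests: dict[str, str]) -> None:
        self.files, self.digests, self.close_calls = files, digests, 0

    def run(self, command: str, *, timeout_s: float) -> str:
        tool, path = shlex.split(command)
        return f"{self.digests[path]}  {path}\n" if tool == "sha256sum" else self.files[path]

    def file_exists(self, remote_path: PurePosixPath) -> bool:
        return str(remote_path) in self.files

    def list_dir(self, remote_path: PurePosixPath) -> list[str]:
        prefix = f"{remote_path}/"
        return [p[len(prefix):] for p in self.files if p.startswith(prefix)]

    def close(self) -> None:
        self.close_calls += 1


def verify_companion_ready(session, root: PurePosixPath, expected_engines: tuple[str, ...],
                           manifest_filename: str = "Manifest.json",
                           calibration_filename: str = "calibration.json") -> ReadinessReport:
    reasons: list[str] = []
    try:
        manifest_present = session.file_exists(root / manifest_filename)        # check 1
        if not manifest_present:
            reasons.append("manifest missing")
        listed = set(session.list_dir(root / "engines"))                        # check 2
        wanted = set(expected_engines)
        engines_present = bool(wanted) and wanted.issubset(listed)
        if not wanted:
            reasons.append("expected_engines list empty in caller-supplied config")
        elif not engines_present:
            reasons.append("engines_missing: " + ", ".join(sorted(wanted - listed)))
        inspected = 0
        for name in sorted(wanted & listed):                                    # check 3
            quoted = shlex.quote(str(root / "engines" / name))
            actual = parse_digest(session.run(f"sha256sum {quoted}", timeout_s=60.0))
            expected = parse_digest(session.run(f"cat {quoted}.sha256", timeout_s=60.0))
            inspected += 1
            if actual != expected:  # both lowercased, so the comparison is case-insensitive
                raise ContentHashMismatchError(str(root / "engines" / name), expected, actual)
        calibration_present = session.file_exists(root / calibration_filename)  # check 4
        if not calibration_present:
            reasons.append("calibration missing")
        ready = manifest_present and engines_present and calibration_present
        return ReadinessReport(manifest_present, engines_present, True, calibration_present,
                               inspected, "ready" if ready else "not_ready", tuple(reasons))
    finally:
        session.close()  # runs on normal return and on every exception path
```

Only engines in the intersection of expected and listed names reach the digest comparison, which matches the AC-9 requirement that a missing engine never triggers a sidecar verify call.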
+ +## Acceptance Criteria + +**AC-1: All four artifacts present + sidecars verify → `outcome=ready`** +Given the companion's SSH is reachable, `Manifest.json` exists, all `expected_engines` exist, all sidecars verify, and the calibration file exists +When `verify_companion_ready(address)` is called +Then `ReadinessReport(manifest_present=True, content_hashes_pass=True, engines_present=True, calibration_present=True, outcome=ready, not_ready_reasons=())` is returned; ONE INFO log `kind="c12.companion.ready"` is emitted + +**AC-2: Missing engine → `outcome=not_ready`** +Given `expected_engines=("dinov2_vpr_sm87_jp62_trt103_fp16.engine", "lightglue_sm87_jp62_trt103_fp16.engine")` and only the first exists on the companion +When `verify_companion_ready(address)` is called +Then `engines_present=False`; `not_ready_reasons` contains `"engines_missing: lightglue_sm87_jp62_trt103_fp16.engine"`; `outcome=not_ready`; ONE WARN log `kind="c12.companion.degraded"`; NO `ContentHashMismatchError` is raised + +**AC-3: Sidecar mismatch → `ContentHashMismatchError`** +Given an engine file is present but its sidecar's hex digest does not match the engine's actual SHA-256 +When `verify_companion_ready(address)` is called +Then `ContentHashMismatchError` is raised with `engine_path`, `expected_sha256_hex`, `actual_sha256_hex` populated; the SSH session is closed (`session.close()` is called in `finally`); ONE ERROR log `kind="c12.companion.hash.mismatch"` is emitted + +**AC-4: SSH connection refused → `CompanionUnreachableError(reason=connect_refused)`** +Given the companion address is unreachable (TCP RST or no listener) +When `verify_companion_ready(address)` is called +Then `CompanionUnreachableError(reason=connect_refused, underlying_exception_repr="...")` is raised; the underlying paramiko/socket exception's repr is captured; ONE ERROR log `kind="c12.companion.unreachable"`; `remediation` attribute returns "Check companion power, USB/Ethernet cable, and 
`config.c12.companion_address`." + +**AC-5: SSH auth failure → `CompanionUnreachableError(reason=auth_failed)`** +Given the companion is reachable but the SSH key is wrong or revoked +When `verify_companion_ready(address)` is called +Then `CompanionUnreachableError(reason=auth_failed, ...)` is raised; ERROR log `kind="c12.companion.unreachable"` with `reason="auth_failed"`; `remediation` attribute returns "Verify `config.c12.companion_ssh_keyfile` matches the public key in `~/.ssh/authorized_keys` on the companion." + +**AC-6: Host key mismatch with `host_key_policy=strict` → `CompanionUnreachableError(reason=host_key_mismatch)`** +Given the companion's host key has changed and `config.c12.companion_host_key_policy = strict` +When `verify_companion_ready(address)` is called +Then `CompanionUnreachableError(reason=host_key_mismatch, ...)` is raised; ERROR log; `remediation` returns "Inspect `~/.ssh/known_hosts`; if the companion was reflashed, remove its old entry; otherwise treat as a security incident." + +**AC-7: SSH session is always closed** +Given any of the four checks raises an unexpected exception (e.g. 
SFTP returns `OSError`) +When `verify_companion_ready(address)` is called +Then the exception propagates to the caller; `session.close()` was called exactly once before propagation (verifiable via spy on the fake `SshSession`); no socket descriptor leaks + +**AC-8: Connect timeout → `CompanionUnreachableError(reason=timeout)`** +Given the companion address routes but never responds to TCP SYN within `config.c12.companion_connect_timeout_s` +When `verify_companion_ready(address)` is called +Then `CompanionUnreachableError(reason=timeout, ...)` is raised within `connect_timeout_s + 1.0 s` (allowing test jitter); ERROR log includes the configured timeout value + +**AC-9: `engines_inspected_count` reflects what was actually checked** +Given a mix of present + missing engines (2 of 3 expected exist) +When `verify_companion_ready(address)` is called +Then `engines_inspected_count == 2`; the missing engine appears in `not_ready_reasons` but does NOT trigger a sidecar verify call (verifiable via spy) + +**AC-10: `host_key_policy=reject_new` blocks first connection to a previously unseen host** +Given `config.c12.companion_host_key_policy = reject_new` and the companion is not in `~/.ssh/known_hosts` +When `verify_companion_ready(address)` is called +Then `CompanionUnreachableError(reason=host_key_mismatch, ...)` is raised; ERROR log; `remediation` returns "Add the companion to `~/.ssh/known_hosts` first via a manual `ssh-keyscan`, then retry." + +## Non-Functional Requirements + +**Performance** +- A successful `verify_companion_ready` call against a local-network companion (≤ 1 ms RTT) with 5 engines completes in ≤ 5 s wall-clock (dominated by 5 × `sha256sum` over engines totaling ~1 GB on the companion's NVMe). +- Connection-open phase ≤ 2 s p99 in normal conditions; the `connect_timeout_s` config caps the worst case at the configured value. + +**Compatibility** +- paramiko per the project pin; no version override. 
+- Host-key policies map to paramiko's `MissingHostKeyPolicy` subclasses; if paramiko changes the API in a future minor version, this task's policy mapping is the only place to update. + +**Reliability** +- The session is closed in `finally` on every code path (AC-7 covers). +- `sha256sum` invocation has a per-engine timeout (default 60 s, config-overrideable) so a hung companion does not hold the operator's CLI indefinitely. +- The four checks are sequential, not parallel, to keep the SSH session simple and ordering deterministic for log correlation. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Fake `SshSessionFactory` returning a fake session where all four checks succeed | `ReadinessReport(outcome=ready)` + INFO log | +| AC-2 | Fake session with one missing engine | `outcome=not_ready`, `not_ready_reasons` lists the missing engine, no hash check on the missing one | +| AC-3 | Fake session where sidecar verifier returns `matches=False` | `ContentHashMismatchError` with populated attributes, session closed, ERROR log | +| AC-4 | `SshSessionFactory.open` raises `ConnectionRefusedError` | `CompanionUnreachableError(reason=connect_refused)`, ERROR log | +| AC-5 | `SshSessionFactory.open` raises `paramiko.AuthenticationException` | `CompanionUnreachableError(reason=auth_failed)`, ERROR log | +| AC-6 | `SshSessionFactory.open` raises `paramiko.BadHostKeyException` with `policy=strict` | `CompanionUnreachableError(reason=host_key_mismatch)`, ERROR log | +| AC-7 | Fake session whose `file_exists` raises `OSError` mid-flow | `OSError` propagates; `session.close()` called exactly once | +| AC-8 | `SshSessionFactory.open` raises `socket.timeout` after `connect_timeout_s` | `CompanionUnreachableError(reason=timeout)`, log includes timeout value | +| AC-9 | Fake session with mixed-presence engines, sidecar-verifier spy | `engines_inspected_count == count_of_present_expected`, sidecar verifier not called 
for missing engines | +| AC-10 | `host_key_policy=reject_new` + unknown host | `CompanionUnreachableError(reason=host_key_mismatch)` with `reject_new`-specific remediation text | +| NFR-perf-cold-call | Microbench against in-process fake session × 100 | p99 ≤ 50 ms for the orchestration overhead (excludes real SSH) | + +## Constraints + +- paramiko is the only allowed SSH library — no `subprocess.run("ssh ...")` shell-out (security: shell injection surface; reliability: no parsed output). +- `SshSessionFactory` is a Protocol, NOT a concrete class — `ParamikoSshSessionFactory` is one implementation, allowing tests to inject fakes without monkey-patching paramiko. +- The `RemoteSidecarVerifier` does NOT pull the engine bytes back to the operator workstation — it runs `sha256sum` on the companion and parses the output. This avoids a multi-GB transfer per readiness check. +- The error families (`CompanionUnreachableError`, `ContentHashMismatchError`) are the canonical types; sibling tasks (T3 build_cache) MUST consume these and not redefine them. +- The host-key policy `auto_add_unknown` is intentionally NOT a supported value — silently accepting new host keys defeats the security model. The supported set is `strict | known_hosts | reject_new`; `known_hosts` requires the entry to already exist; `reject_new` is functionally identical to `strict` but with a clearer error message. +- This task does NOT cache SSH sessions — every `verify_companion_ready` call opens and closes a fresh session. Caching would complicate the failure model for marginal performance gain (the bottleneck is the per-engine `sha256sum` runs, not session establishment). + +## Risks & Mitigation + +**Risk 1: paramiko version drift breaks the host-key-policy mapping** +- *Risk*: A future paramiko minor release renames or removes `MissingHostKeyPolicy` subclasses; this task's mapping breaks silently in tests that don't exercise paramiko itself.
+- *Mitigation*: A single integration test (marked `@pytest.mark.requires_paramiko`) constructs `ParamikoSshSessionFactory` with each policy value and asserts the resulting paramiko policy class name. Catches version drift on dependency upgrades. + +**Risk 2: `sha256sum` is missing or behaves differently on the companion image** +- *Risk*: The companion is JetPack-based; if it ships without `coreutils`'s `sha256sum`, this task's verifier breaks at runtime. +- *Mitigation*: A composition-root health check at startup runs `sha256sum --version` over the SSH session and surfaces a clear `CompanionUnreachableError(reason=other, underlying_exception_repr="sha256sum not found")` if absent. JetPack base images include `coreutils` per ADR-005. + +**Risk 3: Operator's `~/.ssh/known_hosts` has stale entries from prior bench runs** +- *Risk*: A reflashed companion exhibits AC-10 / AC-6 failures legitimately, but operators see the cryptic paramiko traceback if remediation hints are unclear. +- *Mitigation*: AC-6 / AC-10 require the `remediation` attribute on `CompanionUnreachableError` to mention `~/.ssh/known_hosts` explicitly. The CLI subcommand `verify-ready` (in T1) prints the remediation hint to stderr. + +**Risk 4: Long-running `sha256sum` hangs the operator's CLI** +- *Risk*: A degraded companion NVMe causes `sha256sum` on a 200 MB engine to take minutes; the operator sees a hung command. +- *Mitigation*: `RemoteSidecarVerifier` enforces a per-engine timeout (default 60 s, config-overrideable). On timeout, the verifier raises `ContentHashMismatchError(actual_sha256_hex="")` so the operator sees a clear failure and can investigate the disk. + +## Runtime Completeness + +- **Named capability**: pre-flight companion artifact verification per AC-NEW-1 + description.md § 2 `verify_companion_ready`. 
+- **Production code that must exist**: real `CompanionBringup` orchestrating real `ParamikoSshSessionFactory` + real `RemoteSidecarVerifier` (with real `sha256sum` over SSH); real config-driven SSH credentials + host-key policy + cache root. +- **Allowed external stubs**: tests MAY use a fake `SshSessionFactory` returning a fake `SshSession` whose `run`, `file_exists`, `list_dir` are scripted; production wiring uses paramiko + the real companion. +- **Unacceptable substitutes**: shelling out to `ssh ...` via `subprocess.run` (security + reliability); reading sidecars by pulling engine bytes back to the workstation (multi-GB per readiness check); `auto_add_unknown` host-key policy (security defeat); a "skip-verify" config flag (defeats AC-NEW-1). diff --git a/_docs/02_tasks/todo/AZ-328_c12_build_cache_orchestrator.md b/_docs/02_tasks/todo/AZ-328_c12_build_cache_orchestrator.md new file mode 100644 index 0000000..c0c27be --- /dev/null +++ b/_docs/02_tasks/todo/AZ-328_c12_build_cache_orchestrator.md @@ -0,0 +1,220 @@ +# C12 Build-Cache Orchestrator — F1 Sequencing + Actionable `CacheBuildReport` + +**Task**: AZ-328_c12_build_cache_orchestrator +**Name**: C12 Build-Cache Orchestrator +**Description**: Implement `BuildCacheOrchestrator`, the public top-level F1 (pre-flight cache build) workflow. 
`build_cache(request: BuildCacheRequest) -> CacheBuildReport` does the following sequenced work, with strict ordering: (1) acquire a filesystem lockfile at `/.c12.lock` per description.md § 7 (prevents concurrent F1 runs from stomping each other); (2) call `tile_downloader.fetch(...)` (AZ-316) on the operator workstation with `area`, `sector_class`, `freshness_threshold_months`, `satellite_provider_url`, `api_key`; (3) on download `failure` outcome → wrap as `CacheBuildError(failure_phase=download, ...)` and return `CacheBuildReport(outcome=failure, failure_phase=download, download_report=..., build_report=None)` WITHOUT invoking C10; (4) on download `success` → call `companion_bringup.verify_companion_ready(...)` (AZ-327) — if `not_ready` → wrap and return `CacheBuildReport(outcome=failure, failure_phase=download, ...)` because the artifacts the C11 step pushed to the companion did not survive the verification (the boundary case here is that the operator workstation may have ingested tiles into local C6 but the companion's pre-existing artifacts are stale); (5) SSH-invoke `C10.CacheProvisioner.build_cache_artifacts` (AZ-325) on the companion via the `RemoteCacheProvisionerInvoker` helper, streaming the C10 stdout/stderr lines back as DEBUG logs and parsing the final `BuildReport` JSON document the C10 process emits on stdout; (6) aggregate into `CacheBuildReport`; (7) release the lockfile in `finally`. Wraps any underlying error from C11/C10/C7/C6 as `CacheBuildError` with a `remediation` attribute populated per `failure_phase` (download phase → retry hint, key rotation hint; build phase → cache cleanup hint, GPU OOM mitigation hint). Surfaces a clear non-zero exit code via T1's `cli.py` mapping. Owns the operator-facing C12-IT-02 acceptance test contract (build_cache orchestrates C11 then C10; download failure aborts before C10; mixed reports surface in `CacheBuildReport`). 
+**Complexity**: 5 points +**Dependencies**: AZ-326_c12_cli_app, AZ-327_c12_companion_bringup, AZ-316_c11_tile_downloader, AZ-325_c10_cache_provisioner, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module +**Component**: c12_operator_tooling (epic AZ-253 / E-C12) +**Tracker**: AZ-328 +**Epic**: AZ-253 (E-C12) + +### Document Dependencies + +- `_docs/02_document/contracts/c11_tilemanager/tile_downloader.md` — consumed: `fetch` API + `DownloadBatchReport` shape. +- `_docs/02_document/contracts/c10_provisioning/cache_provisioner.md` — consumed: `build_cache_artifacts` API + `BuildReport` shape (this task invokes the contract over SSH; the contract values are passed back as a JSON document). +- `_docs/02_document/components/13_c12_operator_tooling/description.md` — § 1 (Coordinator), § 2 (`build_cache`, `CacheBuildReport`), § 5 (`CacheBuildError`), § 7 (lockfile), § 8 (depends on C10 + C11). +- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — INFO/WARN/ERROR + DEBUG log shapes (DEBUG is used for streamed C10 progress). +- `_docs/_process_leftovers/2026-05-09_satellite-provider-design-tasks.md` — the parent-suite `satellite-provider` URL + auth surface this task wires through (informational, no direct dep). + +## Problem + +Without a real `BuildCacheOrchestrator`: + +- F1 has no head — operators cannot build a flight-ready cache; AC-8.3 (imagery pre-loaded onto companion before flight) collapses; AC-NEW-1 (cold-start TTFF) cannot be exercised. +- The download-vs-build phase distinction has no enforcement — without strict ordering, a build phase may start before the C6 cache has tiles, causing the C10 `DescriptorBatcher` to return `failure_reason="no tiles in C6 ..."` instead of the operator getting the actionable C11 download error first. +- Operators have no failure-phase signal — a `CacheBuildError` without `failure_phase` forces the operator to read tracebacks to determine whether to retry the download or rebuild the engines. 
+- C12-IT-02 (build_cache orchestrates C11 then C10; download failure aborts before C10) has no implementation. +- Concurrent operator runs of `build-cache` against the same area would race on C6 + on the companion's C10 cache root, producing inconsistent state. description.md § 7's lockfile mitigation has no producer. +- The CLI's `build-cache` subcommand has nothing to delegate to. +- C10's `BuildReport` is produced on the companion process; without a remote invoker that captures and parses its output, the operator workstation cannot aggregate it into `CacheBuildReport`. + +This task delivers the F1 orchestrator + the remote C10 invoker + the lockfile + the unified `CacheBuildReport` aggregation. It does NOT own download (AZ-316), engine compile (AZ-321), descriptor generation (AZ-322), Manifest writing (AZ-323), takeoff verification (AZ-324), or the C10 orchestrator itself (AZ-325) — it composes them. + +## Outcome + +- A `BuildCacheOrchestrator` class at `src/operator_tool/build_cache.py`: + - Constructor: `__init__(self, *, tile_downloader: TileDownloader, companion_bringup: CompanionBringup, remote_c10_invoker: RemoteCacheProvisionerInvoker, freshness_table: FreshnessTable, lock_factory: FileLockFactory, logger: Logger, clock: Clock, config: C12BuildCacheConfig)`. + - `C12BuildCacheConfig` (`@dataclass(frozen=True)`): `cache_staging_root: Path`, `lock_filename: str = ".c12.lock"`, `lock_timeout_s: float = 5.0`, `companion_cache_root: PurePosixPath`. + - Public method: `build_cache(request: BuildCacheRequest) -> CacheBuildReport`. +- DTOs at `src/operator_tool/_types.py`: + - `BuildCacheRequest` (`@dataclass(frozen=True)`): `area: AreaIdentifier`, `bbox: Bbox`, `sector_class: SectorClassification`, `calibration_path: Path`, `satellite_provider_url: str`, `api_key: SecretStr`, `companion_address: CompanionAddress`, `expected_engines: tuple[str, ...]`. 
+ - `CacheBuildReport` (`@dataclass(frozen=True)`): `download_report: DownloadBatchReport | None`, `build_report: BuildReport | None`, `outcome: enum {success, failure, idempotent_no_op}`, `failure_phase: enum {none, download, build}`, `failure_reason: str | None`, `wall_clock_s: float`. +- Errors at `src/operator_tool/errors.py`: + - `CacheBuildError(Exception)`: attributes `failure_phase: enum {download, build}`, `wrapped_exception_repr: str`, `remediation: str`. The `remediation` attribute is populated at construction time per `failure_phase` (download → "Re-run with same args; check `satellite_provider_url` and `api_key`."; build → "Inspect companion `~/.azaion/onboard/c10-build.log`; consider `rm -rf /engines/` to force a clean rebuild."). + - `BuildLockHeldError(CacheBuildError)`: subclass for the lock-held case with `remediation` = "Another `build-cache` is in progress; wait or kill the holding process and remove ``." +- A `RemoteCacheProvisionerInvoker` at `src/operator_tool/remote_c10_invoker.py`: + - Constructor: `__init__(self, *, ssh_factory: SshSessionFactory, logger: Logger)`. + - `invoke(session: SshSession, request: RemoteBuildRequest) -> BuildReport` — runs the C10 build entry point on the companion via `session.run("azaion-onboard c10 build --json-output --request ", ...)`, streams stdout line-by-line as DEBUG logs (`kind="c10.remote.progress"`), parses the final line as `BuildReport` JSON. The C10 entry point on the companion is the canonical CLI that AZ-325's `CacheProvisioner` ships (E-BOOT scaffolding established `azaion-onboard` as the airborne-image CLI; C10's build mode is `azaion-onboard c10 build`). +- A `FileLockFactory` Protocol at `src/operator_tool/file_lock.py`: + ```python + @runtime_checkable + class FileLock(Protocol): + def __enter__(self) -> "FileLock": ... + def __exit__(self, exc_type, exc, tb) -> None: ... 
+ + @runtime_checkable + class FileLockFactory(Protocol): + def try_lock(self, path: Path, *, timeout_s: float) -> FileLock: ... + ``` + Concrete: `FilelockFileLockFactory` wrapping the `filelock` library per the project pin (already used by E-C13 per epics.md C13 section). NOT a custom implementation. +- Method flow for `build_cache`: + 1. Compute `lock_path = config.cache_staging_root / config.lock_filename`. Ensure `config.cache_staging_root` exists (mkdir parents=True). + 2. Compute `freshness_threshold_months = freshness_table.threshold(request.sector_class)` (uses T1's helper). + 3. Acquire lock: `with lock_factory.try_lock(lock_path, timeout_s=config.lock_timeout_s) as lock:` — on timeout, raise `BuildLockHeldError(failure_phase=download, ...)`. + 4. Record `start_t = clock.monotonic()`. + 5. INFO log `kind="c12.build_cache.start"` with the request (api_key REDACTED). + 6. **Download phase**: `download_report = tile_downloader.fetch(DownloadRequest(area=request.area, bbox=request.bbox, freshness_threshold_months=freshness_threshold_months, url=request.satellite_provider_url, api_key=request.api_key))`. Catch `SatelliteProviderError`, `RateLimitedError`, `ResolutionRejectionError`, `CacheBudgetExceededError`, `TileManagerError` → wrap as `CacheBuildError(failure_phase=download, ...)`. If `download_report.outcome == failure` → return `CacheBuildReport(outcome=failure, failure_phase=download, download_report=..., build_report=None, failure_reason=download_report.failure_reason, wall_clock_s=...)`. + 7. **Verify-ready phase**: `readiness = companion_bringup.verify_companion_ready(request.companion_address)`. Catch `CompanionUnreachableError`, `ContentHashMismatchError` → wrap as `CacheBuildError(failure_phase=download, ...)` (the C11 download succeeded but the companion is not in a state to consume the new tiles; failure_phase is `download` because the operator's next action is to re-run the same `build-cache` command, not to clean the build). 
If `readiness.outcome == not_ready` → return `CacheBuildReport(outcome=failure, failure_phase=download, ..., failure_reason="companion not ready: " + ", ".join(readiness.not_ready_reasons))`. + 8. **Build phase**: open SSH session via `ssh_factory.open(request.companion_address, ...)`; call `remote_c10_invoker.invoke(session, RemoteBuildRequest(bbox=request.bbox, sector_class=request.sector_class, calibration_path=request.calibration_path, expected_engines=request.expected_engines, companion_cache_root=config.companion_cache_root))`; catch `EngineBuildError`, `CalibrationCacheError`, `ManifestSignatureError`, `ManifestCoverageError`, `BuildLockHeldError` (C10's lock, distinct from C12's) → wrap as `CacheBuildError(failure_phase=build, ...)`. + 9. Aggregate: `build_report` from step 8. If `build_report.outcome == IDEMPOTENT_NO_OP` → return `CacheBuildReport(outcome=idempotent_no_op, failure_phase=none, download_report=..., build_report=..., failure_reason=None, wall_clock_s=...)`. Else if `build_report.outcome == FAILURE` → return `CacheBuildReport(outcome=failure, failure_phase=build, ..., failure_reason=build_report.failure_reason, ...)`. + 10. INFO log `kind="c12.build_cache.success"` with the aggregated counts (tiles_downloaded, engines_built, engines_reused, descriptors_generated). + 11. Return `CacheBuildReport(outcome=success, failure_phase=none, download_report=..., build_report=..., failure_reason=None, wall_clock_s=...)`. + 12. Lockfile released by `__exit__` of the `with` block. +- Composition-root factory at `src/gps_denied_onboard/runtime_root/c12_factory.py` extends T1's `OperatorToolServices` dataclass with a `build_cache_orchestrator: BuildCacheOrchestrator` field. The factory `build_build_cache_orchestrator(config, services) -> BuildCacheOrchestrator` constructs the lock factory, the remote C10 invoker, and pulls T1's `freshness_table` + T2's `companion_bringup` from the existing services dataclass. 
+- T1's `cli.py` `build-cache` subcommand resolves `services.build_cache_orchestrator` and calls `.build_cache(request)`. It maps `CacheBuildError(failure_phase=download) → exit 20`; `CacheBuildError(failure_phase=build) → exit 21`; `BuildLockHeldError → exit 50`. + +## Scope + +### Included + +- `BuildCacheOrchestrator` class with the single public method. +- The 2 DTOs (`BuildCacheRequest`, `CacheBuildReport`) plus the `outcome` and `failure_phase` enums. +- The 2 error types (`CacheBuildError` with `remediation`, `BuildLockHeldError`). +- `RemoteCacheProvisionerInvoker` over SSH (using the `SshSessionFactory` Protocol from T2). +- `FileLockFactory` + `FileLock` Protocols + `FilelockFileLockFactory` concrete using the `filelock` library. +- Composition-root factory. +- Wiring of T1's `build-cache` subcommand to this service. +- Conformance unit tests using fakes for `TileDownloader`, `CompanionBringup`, `RemoteCacheProvisionerInvoker`, `FileLockFactory` covering all 10 acceptance criteria. + +### Excluded + +- Anything internal to C11 download (AZ-316). +- Anything internal to C10 build (AZ-321..325). +- Anything internal to companion-side verification (AZ-327). +- The takeoff-time verification (AZ-324, airborne). +- Telemetry of build progress to a dashboard — DEBUG-log streaming only this cycle. +- Resumable downloads — AZ-316's idempotence handles partial downloads; this task does not retry on its own. +- Parallel multi-area builds — one area per `build_cache` call.
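The strict download → verify-ready → build ordering from the `build_cache` method flow can be illustrated with a small sketch. It assumes simplified stand-ins: the three collaborators are plain callables returning dicts (the real `DownloadBatchReport`, `ReadinessReport` and `BuildReport` types are richer), the lock is any context-manager factory, and `make_fakes` is a hypothetical test helper that records the call order. Exception wrapping into `CacheBuildError` and logging are left out to keep the phase logic visible.

```python
from __future__ import annotations

from contextlib import AbstractContextManager, contextmanager
from dataclasses import dataclass
from typing import Any, Callable, Iterator, Optional


@dataclass(frozen=True)
class CacheBuildReport:
    outcome: str                    # success | failure | idempotent_no_op
    failure_phase: str              # none | download | build
    failure_reason: Optional[str]
    download_report: Optional[dict]
    build_report: Optional[dict]


def build_cache(lock: Callable[[], AbstractContextManager[Any]],
                fetch: Callable[[], dict],
                verify_ready: Callable[[], dict],
                invoke_c10: Callable[[], dict]) -> CacheBuildReport:
    with lock():  # released by __exit__ on every return and exception path
        download = fetch()
        if download["outcome"] == "failure":
            return CacheBuildReport("failure", "download", download.get("failure_reason"),
                                    download, None)
        readiness = verify_ready()
        if readiness["outcome"] == "not_ready":
            reason = "companion not ready: " + ", ".join(readiness["not_ready_reasons"])
            return CacheBuildReport("failure", "download", reason, download, None)
        build = invoke_c10()
        if build["outcome"] == "IDEMPOTENT_NO_OP":
            return CacheBuildReport("idempotent_no_op", "none", None, download, build)
        if build["outcome"] == "FAILURE":
            return CacheBuildReport("failure", "build", build.get("failure_reason"),
                                    download, build)
        return CacheBuildReport("success", "none", None, download, build)


def make_fakes(trace: list[str], readiness: dict, build: dict):
    """Hypothetical fakes that append their name to `trace` when called."""

    @contextmanager
    def lock() -> Iterator[None]:
        trace.append("lock acquire")
        try:
            yield
        finally:
            trace.append("lock release")

    def step(name: str, result: dict) -> Callable[[], dict]:
        def call() -> dict:
            trace.append(name)
            return result
        return call

    return (lock, step("fetch", {"outcome": "success"}),
            step("verify_ready", readiness), step("invoke_c10", build))
```

Because every early `return` sits inside the `with` block, the lock release is recorded even when the flow stops after the verify-ready phase, which is the property AC-3 and AC-6 ask the spies to confirm.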
+ +## Acceptance Criteria + +**AC-1: Happy path — download → verify-ready → build → `success`** +Given a fresh empty C6 + a clean companion + valid `BuildCacheRequest` + fakes that all return `success` +When `build_cache(request)` is called +Then the call sequence is `lock acquire → tile_downloader.fetch → companion_bringup.verify_companion_ready → remote_c10_invoker.invoke → lock release` (verifiable via spy on each fake); `CacheBuildReport(outcome=success, failure_phase=none, download_report=..., build_report=..., failure_reason=None)` is returned; ONE INFO log `kind="c12.build_cache.start"`; ONE INFO log `kind="c12.build_cache.success"` + +**AC-2: Download failure aborts before C10** +Given a fake `tile_downloader.fetch` that raises `SatelliteProviderError("503 Service Unavailable")` +When `build_cache(request)` is called +Then `CacheBuildReport(outcome=failure, failure_phase=download, download_report=None, build_report=None, failure_reason="503 Service Unavailable")` is returned (NOT raised); `companion_bringup.verify_companion_ready` is NEVER called; `remote_c10_invoker.invoke` is NEVER called; ONE ERROR log `kind="c12.build_cache.download.failed"`; lockfile is released + +**AC-3: Verify-ready failure (`not_ready`) aborts before C10** +Given `tile_downloader.fetch` returns `success`, then `companion_bringup.verify_companion_ready` returns `ReadinessReport(outcome=not_ready, not_ready_reasons=("manifest missing",))` +When `build_cache(request)` is called +Then `CacheBuildReport(outcome=failure, failure_phase=download, ..., failure_reason="companion not ready: manifest missing")` is returned; `remote_c10_invoker.invoke` is NEVER called; ONE ERROR log `kind="c12.build_cache.companion.not_ready"`; lockfile released + +**AC-4: Build failure surfaces `failure_phase=build`** +Given download + verify-ready return `success`/`ready`, then `remote_c10_invoker.invoke` raises `EngineBuildError("CUDA OOM on backbone dinov2_vpr")` +When `build_cache(request)` is called +Then 
`CacheBuildReport(outcome=failure, failure_phase=build, download_report=..., build_report=None, failure_reason="CUDA OOM on backbone dinov2_vpr")` is returned; ONE ERROR log `kind="c12.build_cache.build.failed"`; `CacheBuildError(failure_phase=build)`'s `remediation` attribute mentions cache cleanup; lockfile released + +**AC-5: Lockfile prevents concurrent F1 runs** +Given a `FileLockFactory` whose `try_lock` raises `LockTimeout` after 5 s (simulated) +When `build_cache(request)` is called +Then `BuildLockHeldError(failure_phase=download, ...)` is raised; the `tile_downloader`, `companion_bringup`, `remote_c10_invoker` are NEVER called; ONE ERROR log `kind="c12.build_cache.lock.held"` + +**AC-6: Lockfile released in `finally` even on exception** +Given any of the four service collaborators raises an unexpected exception (`KeyboardInterrupt`, `RuntimeError`) +When `build_cache(request)` is called +Then the exception propagates to the caller; the lockfile's `__exit__` was called exactly once (verifiable via spy on the fake `FileLock`); the next `build_cache` call against the same lock path acquires the lock immediately + +**AC-7: Idempotent no-op surfaces as `outcome=idempotent_no_op`** +Given `remote_c10_invoker.invoke` returns `BuildReport(outcome=IDEMPOTENT_NO_OP, ...)` (D-C10-1 hit per AZ-325) +When `build_cache(request)` is called +Then `CacheBuildReport(outcome=idempotent_no_op, failure_phase=none, ..., failure_reason=None)` is returned; ONE INFO log `kind="c12.build_cache.idempotent"`; CLI exit code is 0 (success-equivalent for idempotent re-runs) + +**AC-8: `remediation` populated per `failure_phase`** +Given any `CacheBuildError` raised by the orchestrator +When the caller inspects `error.remediation` +Then for `failure_phase=download` the text mentions "Re-run with same args" + key/url checks; for `failure_phase=build` the text mentions cache cleanup + GPU diagnostics; for `BuildLockHeldError` the text mentions the lock path and how to clear it + +**AC-9: 
api_key is REDACTED in all log output** +Given a `BuildCacheRequest` with `api_key=SecretStr("super-secret-token")` +When any log line is emitted by the orchestrator +Then no log line contains the literal token; `api_key` field appears as `"REDACTED"` or is omitted entirely + +**AC-10: Aggregated `CacheBuildReport` carries both sub-reports on success** +Given a happy-path run +When the caller inspects the returned `CacheBuildReport` +Then `download_report` is a populated `DownloadBatchReport` from C11; `build_report` is a populated `BuildReport` from C10; `wall_clock_s` is a positive float; both sub-reports' fields are accessible (no truncation) + +## Non-Functional Requirements + +**Performance** +- The orchestrator's own overhead (lock acquire + verify-ready dispatch + result aggregation) is ≤ 1 s wall-clock; the dominant time is `tile_downloader.fetch` (minutes) + `remote_c10_invoker.invoke` (minutes), both owned upstream. +- Lock acquisition timeout default `5.0 s`; configurable for tests. + +**Compatibility** +- `filelock` library per the project pin (used by E-C13 already). No new third-party dependencies. +- The `SshSessionFactory` Protocol is shared with T2 — the orchestrator MUST receive the same factory T2 uses (single composition-root construction). + +**Reliability** +- Strict ordering: download → verify-ready → build. AC-2, AC-3 enforce. +- Lockfile released in all paths (AC-6). +- `api_key` never logged (AC-9). +- The remote C10 invocation streams DEBUG logs but does NOT buffer the full stdout in memory — uses line-iteration so even multi-hour builds don't blow memory. 
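The line-iteration requirement for the remote C10 invocation can be sketched as below. This is an illustrative assumption about how the invoker might consume stdout, not the task's implementation: `stream_build_report` is a hypothetical helper, the input is any iterable of text lines (for example a channel file object), and `BuildReportParseError` is shown as a plain exception although the Risks section describes it as a `CacheBuildError(failure_phase=build)` subclass. Note that in this sketch the final JSON line is also emitted as a DEBUG record.

```python
from __future__ import annotations

import json
import logging
from typing import Iterable


class BuildReportParseError(Exception):
    """Raised when the C10 output does not end with a JSON object."""


def stream_build_report(lines: Iterable[str], logger: logging.Logger) -> dict:
    """Log each stdout line at DEBUG and parse the last non-empty line as JSON.

    Only the most recent line is retained, so memory use stays constant
    regardless of how long the remote build runs.
    """
    last_line = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        logger.debug("%s", line, extra={"kind": "c10.remote.progress"})
        last_line = line
    if last_line is None:
        raise BuildReportParseError("C10 produced no output")
    try:
        report = json.loads(last_line)
    except json.JSONDecodeError as exc:
        raise BuildReportParseError(f"final line is not JSON: {last_line[:200]!r}") from exc
    if not isinstance(report, dict):
        raise BuildReportParseError("final line is JSON but not an object")
    return report
```

Passing a generator (rather than a pre-read list) keeps the behaviour lazy end to end; the `kind` value is attached through the standard `extra` mechanism of `logging`.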
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Happy path with all fakes returning success | Sequenced calls + `success` report + INFO logs | +| AC-2 | Fake `tile_downloader.fetch` raises `SatelliteProviderError` | `failure_phase=download`, no C10 call, lock released | +| AC-3 | Fake `verify_companion_ready` returns `not_ready` | `failure_phase=download`, no C10 call, lock released | +| AC-4 | Fake `remote_c10_invoker.invoke` raises `EngineBuildError` | `failure_phase=build`, ERROR log, remediation mentions cleanup | +| AC-5 | Fake `FileLockFactory.try_lock` raises `LockTimeout` | `BuildLockHeldError`, no service calls, ERROR log | +| AC-6 | Fake `tile_downloader.fetch` raises `KeyboardInterrupt` | `KeyboardInterrupt` propagates, `FileLock.__exit__` called once | +| AC-7 | Fake C10 returns `IDEMPOTENT_NO_OP` | `outcome=idempotent_no_op`, INFO log | +| AC-8 | Construct each error type, inspect `remediation` | Matches documented text per phase | +| AC-9 | Capture log output with `api_key="super-secret-token"` | Token not present in any log line | +| AC-10 | Happy-path inspect returned report | Both sub-reports present, fields accessible | +| NFR-perf-overhead | Microbench orchestrator-only path with all-fake collaborators × 100 | p99 ≤ 50 ms (excludes real network/SSH) | + +## Constraints + +- Strict phase ordering is non-negotiable: download → verify-ready → build. Any reordering breaks AC-2/AC-3 and causes operators to chase phantom errors. +- `failure_phase` is a closed set `{none, download, build}` — adding a new value requires Plan-cycle approval (operators script against these values). +- The lockfile lives in the operator workstation's cache staging area, NOT on the companion. Companion-side concurrent protection is C10's responsibility (CP-INV-4 in AZ-325). +- `api_key` field uses `pydantic.SecretStr` (or equivalent) and MUST NOT be `repr()`-logged anywhere in the orchestrator. 
+- The remote C10 invocation goes through the same `SshSessionFactory` as T2 — do NOT instantiate a second SSH client. Single composition root. +- `filelock` library — do NOT roll a custom file-locking primitive. Cross-platform correctness is hard. + +## Risks & Mitigation + +**Risk 1: Operator runs `build-cache` while a previous `build-cache` is still in progress** +- *Risk*: Two concurrent runs would race on the C6 spatial index + the companion's C10 cache root, producing inconsistent state. +- *Mitigation*: AC-5 + AC-6 — the lockfile is acquired with a 5-s timeout; the second invocation gets `BuildLockHeldError` with a clear remediation hint. + +**Risk 2: Mid-build SSH session drops (operator disconnects USB)** +- *Risk*: The C10 build is hours long; an SSH disconnect surfaces as `paramiko.SSHException` in the middle of `remote_c10_invoker.invoke`. +- *Mitigation*: The exception propagates as `CacheBuildError(failure_phase=build, wrapped_exception_repr="...")`; `remediation` mentions reconnecting and re-running (D-C10-1 makes the next run cheap if the build was past the engine-compile phase). The lockfile is released so the retry is unblocked. + +**Risk 3: C10's stdout stream is malformed or truncated** +- *Risk*: The companion's C10 process crashes mid-output; `RemoteCacheProvisionerInvoker` cannot find a valid `BuildReport` JSON line. +- *Mitigation*: `RemoteCacheProvisionerInvoker.invoke` raises `BuildReportParseError` (a `CacheBuildError(failure_phase=build)` subclass) with the captured stdout/stderr tail. Operator diagnoses via the companion's `c10-build.log`. + +**Risk 4: `freshness_threshold_months` lookup fails** +- *Risk*: A future cycle adds a `SectorClassification` enum value without updating `freshness_table` (T1). +- *Mitigation*: T1's `freshness_threshold_months` raises `KeyError` for unknown values; this orchestrator surfaces it as `CacheBuildError(failure_phase=download, ...)` with `remediation` mentioning the missing classification. 
Tests on T1 cover this. + +**Risk 5: api_key leaks into a DEBUG log via the C10 stdout stream** +- *Risk*: A mis-configured C10 prints the api_key in its own log; the orchestrator's DEBUG-streaming forwards it. +- *Mitigation*: AC-9 asserts the orchestrator does NOT emit the api_key; `RemoteCacheProvisionerInvoker.invoke` filters incoming stdout lines through a redactor that replaces the literal api_key value with `` before logging. Defence-in-depth — C10 SHOULD NOT log it either, but this guards against a regression. + +## Runtime Completeness + +- **Named capability**: F1 pre-flight cache build orchestration per description.md § 1, § 2 (`build_cache`), § 8. +- **Production code that must exist**: real `BuildCacheOrchestrator` composing real `TileDownloader` (AZ-316) + real `CompanionBringup` (AZ-327) + real `RemoteCacheProvisionerInvoker` (this task) over real `paramiko` SSH (T2 owns the factory) + real `filelock` lockfile + real C10 build entry on the companion (AZ-325 ships the entry point). +- **Allowed external stubs**: tests MAY use fakes for all four service collaborators + the lock factory; production wiring uses real C11/C10 + real SSH + real filelock. +- **Unacceptable substitutes**: in-process fake C10 in production (description.md § 1 says C10 runs companion-side over USB/Eth — running in-process defeats the architecture); a custom file-locking primitive (correctness is non-trivial, use `filelock`); skipping verify-ready in production (defeats AC-NEW-1 takeoff verify); silently swallowing C10 errors instead of surfacing as `CacheBuildError(failure_phase=build)`. 
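The strict phase ordering and the release-in-all-paths guarantee (AC-2, AC-6) amount to running the phases inside a lock context manager. The sketch below shows that shape using only the standard library. `SpyLock` and `run_phases` are hypothetical names, and `SpyLock` merely stands in for the spy on the fake `FileLock` that AC-6 describes; production code would use the real `filelock` library as the Constraints section requires.

```python
from typing import Callable, List, Optional


class SpyLock:
    """Stand-in for a file lock that counts how often it is released."""

    def __init__(self) -> None:
        self.exits = 0

    def __enter__(self) -> "SpyLock":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.exits += 1
        return False  # do not swallow the exception; let it propagate


def run_phases(lock: SpyLock, phases: List[Callable[[], None]]) -> None:
    """Run phases in order under the lock; the first failure stops the rest."""
    with lock:
        for phase in phases:
            phase()


calls: List[str] = []


def download() -> None:
    calls.append("download")
    raise RuntimeError("provider unreachable")  # simulated download failure


def verify_ready() -> None:
    calls.append("verify_ready")


def build() -> None:
    calls.append("build")


lock = SpyLock()
propagated: Optional[BaseException] = None
try:
    run_phases(lock, [download, verify_ready, build])
except RuntimeError as exc:
    propagated = exc
```

Because the `with` statement calls `__exit__` on every exit path, the later phases are never reached and the lock is released exactly once, which is what the AC-6 spy assertion checks.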
diff --git a/_docs/02_tasks/todo/AZ-329_c12_post_landing_upload.md b/_docs/02_tasks/todo/AZ-329_c12_post_landing_upload.md new file mode 100644 index 0000000..560715e --- /dev/null +++ b/_docs/02_tasks/todo/AZ-329_c12_post_landing_upload.md @@ -0,0 +1,216 @@ +# C12 Post-Landing Upload — `trigger_post_landing_upload` + FDR ON_GROUND Confirmation + +**Task**: AZ-329_c12_post_landing_upload +**Name**: C12 Post-Landing Upload +**Description**: Implement `PostLandingUploadOrchestrator`, the C12 post-flight (F10) workflow that gates `C11.TileUploader.upload_pending_tiles` (AZ-319) on a confirmed-ON_GROUND signal from the post-flight FDR. `trigger_post_landing_upload(request: PostLandingUploadRequest) -> UploadBatchReport` does the following: (1) locate the FDR segments for the given `flight_id` under `config.c12.fdr_root` (segment layout: `//segment_.fdr` per the C13 conventions); (2) iterate the segments from newest to oldest, parsing records via AZ-272's `FdrRecord.parse(...)`; (3) collect all `state.tick` records carrying a `flight_state` payload field (or a dedicated `flight_state.tick` kind if the schema names it that way — defer to AZ-272's contract); (4) walking the collected records backwards from the most recent (chronologically), count contiguous `ON_GROUND` records and compute the contiguous ON_GROUND duration as `(latest_record.ts − first_consecutive_on_ground_record.ts)` seconds; (5) compare against `config.c12.upload_min_on_ground_s` (default 30 s per description.md C12-IT-03); (6) on confirmed ≥ threshold → construct a `FlightStateSignal(state=ON_GROUND, since_ts=)` and call `tile_uploader.upload_pending_tiles(flight_state=...)`; (7) on any refusal mode → raise `FlightStateNotConfirmedError(not_confirmed_reason=...)` with one of the four documented reason strings (`"never_landed"`, `"insufficient_duration: s < s"`, `"flight_id_not_found"`, `"fdr_unreadable: "`). 
Owns AC-8.4's defense-in-depth check on the operator-tooling side — the airborne C11 ALSO blocks via `UploadGateBlockedError` per AZ-319; this task is the operator-side gate that prevents the upload command from even being issued. Returns C11's `UploadBatchReport` unchanged on success. Logs every decision (INFO on confirmed; ERROR on each refusal mode) including the inferred contiguous ON_GROUND duration in seconds. +**Complexity**: 3 points +**Dependencies**: AZ-326_c12_cli_app, AZ-319_c11_tile_uploader, AZ-272_fdr_record_schema, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module +**Component**: c12_operator_tooling (epic AZ-253 / E-C12) +**Tracker**: AZ-329 +**Epic**: AZ-253 (E-C12) + +### Document Dependencies + +- `_docs/02_document/contracts/c11_tilemanager/tile_uploader.md` — consumed: `upload_pending_tiles` API + `UploadBatchReport` shape + `FlightStateSignal` DTO. +- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — consumed: `parse(buf: bytes) -> FdrRecord` + the `state.tick` / `flight_state.tick` kind shape (defer to the contract for the exact `kind` name and `flight_state` field). +- `_docs/02_document/components/13_c12_operator_tooling/description.md` — § 2 (`trigger_post_landing_upload` interface, `FlightStateNotConfirmedError`). +- `_docs/02_document/components/13_c12_operator_tooling/tests.md` — C12-IT-03 specifies the 30-s ON_GROUND threshold. +- `_docs/02_document/components/14_c13_fdr/description.md` — § 1 segment file layout (informational). + +## Problem + +Without a real `PostLandingUploadOrchestrator`: + +- F10 has no head — operators cannot trigger post-landing tile upload; AC-8.4 (mid-flight tile upload trigger, post-landing) collapses; the pending-upload journal in C6 grows unboundedly across flights. 
+- The operator-side ON_GROUND gate (defense-in-depth on top of C11's airborne gate) does not exist — operators can manually invoke `C11.TileUploader.upload_pending_tiles` with a fabricated `FlightStateSignal`, defeating the AC-NEW-7 / AC-8.4 architectural intent that mid-flight tiles only upload when the aircraft has landed. +- C12-IT-03 (`trigger_post_landing_upload` requires ≥ 30 s confirmed ON_GROUND in FDR) has no implementation. +- `FlightStateNotConfirmedError` is concept-only in description.md § 5 with no producer. +- The CLI's `upload-pending` subcommand has nothing to delegate to. +- An incomplete flight log (FDR ends with `IN_FLIGHT` because the aircraft crashed or never landed) silently passes through to C11 if there's no operator-side gate; the airborne gate is the last line of defense and may itself be unavailable on the operator workstation. + +This task delivers the operator-side gate. It does NOT own the actual upload (AZ-319), the FDR record schema (AZ-272), or the FDR write side (AZ-291..296) — it composes them. + +## Outcome + +- A `PostLandingUploadOrchestrator` class at `src/operator_tool/post_landing_upload.py`: + - Constructor: `__init__(self, *, tile_uploader: TileUploader, fdr_segment_reader: FdrSegmentReader, logger: Logger, clock: Clock, config: C12PostLandingConfig)`. + - `C12PostLandingConfig` (`@dataclass(frozen=True)`): `fdr_root: Path`, `upload_min_on_ground_s: float = 30.0`, `flight_state_record_kind: str = "state.tick"`, `flight_state_payload_field: str = "flight_state"`. + - Public method: `trigger_post_landing_upload(request: PostLandingUploadRequest) -> UploadBatchReport`. +- DTOs at `src/operator_tool/_types.py`: + - `PostLandingUploadRequest` (`@dataclass(frozen=True)`): `flight_id: str`. + - Reuses C11's `UploadBatchReport`. 
+- Errors at `src/operator_tool/errors.py`: + - `FlightStateNotConfirmedError(Exception)`: attributes `flight_id: str`, `not_confirmed_reason: str` (one of the four documented strings), `inferred_on_ground_duration_s: float | None` (populated when the reason is `insufficient_duration`), `remediation: str` (per-reason hint, e.g. for `flight_id_not_found`: "Verify the flight ID matches the FDR directory name; check `//`."). +- An `FdrSegmentReader` Protocol + `LocalFdrSegmentReader` concrete at `src/operator_tool/fdr_segment_reader.py`: + - `Protocol`: `iter_records_for_flight(flight_id: str, *, kind_filter: str | None = None) -> Iterator[FdrRecord]` — yields records ordered by `ts` ASCENDING; the orchestrator reverses on its own. `kind_filter` if non-None restricts to that record kind for efficiency. + - `LocalFdrSegmentReader.iter_records_for_flight(...)` — opens `//segment_*.fdr` files in numerical order, reads each as a stream of length-prefixed `FdrRecord` blobs (per AZ-272's serialisation), parses via `FdrRecord.parse(...)`, optionally filters by `kind`, yields one record at a time. Files are mmap'd or buffered-iterated so the operator workstation does not load multi-GB segments fully into memory. + - On any I/O or parse error → raises `FdrUnreadableError(reason: str)` (a sibling helper exception caught by the orchestrator and rewrapped as `FlightStateNotConfirmedError("fdr_unreadable: ...")`). +- Method flow for `trigger_post_landing_upload`: + 1. `flight_dir = config.fdr_root / request.flight_id`. If `not flight_dir.exists()` → raise `FlightStateNotConfirmedError(flight_id, "flight_id_not_found", remediation="Verify // exists; check `config.c12.fdr_root`.")`. + 2. Collect all `flight_state` records: `records = list(fdr_segment_reader.iter_records_for_flight(request.flight_id, kind_filter=config.flight_state_record_kind))`. Catch `FdrUnreadableError` → raise `FlightStateNotConfirmedError(flight_id, f"fdr_unreadable: {e!r}", ...)`. + 3. 
If `not records` → raise `FlightStateNotConfirmedError(flight_id, "never_landed", remediation="No flight state records in FDR for this flight; check the flight produced state.tick records.")` (treat absence of any state record as never-landed since we have no positive ON_GROUND signal). + 4. Walk `records` backward from the last (most recent `ts`): + - `latest = records[-1]`. + - If `latest.payload[config.flight_state_payload_field] != "ON_GROUND"` → raise `FlightStateNotConfirmedError(flight_id, "never_landed", remediation="Most recent flight_state in FDR is not ON_GROUND; the flight may have ended in IN_FLIGHT (e.g. crash, log truncation).")`. + - Walk backward through `records[:-1]` while `record.payload[...] == "ON_GROUND"`; the first non-`ON_GROUND` (or the start of the list) bounds the contiguous ON_GROUND run. + - `since = first_contiguous_on_ground_record.ts`; `duration_s = (parse_iso(latest.ts) - parse_iso(since)).total_seconds()`. + 5. If `duration_s < config.upload_min_on_ground_s` → raise `FlightStateNotConfirmedError(flight_id, f"insufficient_duration: {duration_s:.1f}s < {config.upload_min_on_ground_s:.1f}s", inferred_on_ground_duration_s=duration_s, remediation="Wait for the aircraft to be confirmed ON_GROUND for the required duration, then re-run.")`. + 6. INFO log `kind="c12.upload.confirmed_on_ground"` with `flight_id`, `inferred_on_ground_duration_s`. + 7. Construct `flight_state = FlightStateSignal(state=ON_GROUND, since_ts=since)` (the DTO comes from C11 per AZ-319's contract). + 8. Call `report = tile_uploader.upload_pending_tiles(flight_state=flight_state)`. Propagate `UploadGateBlockedError` (defense-in-depth on the airborne side; this should never happen if step 6 confirmed; if it does, log ERROR and re-raise as-is). + 9. INFO log `kind="c12.upload.complete"` with `tiles_acked`, `tiles_rejected` from `report`. + 10. Return `report` unchanged. 
+- Composition-root factory at `src/gps_denied_onboard/runtime_root/c12_factory.py` extends T1's `OperatorToolServices` dataclass with `post_landing_upload_orchestrator: PostLandingUploadOrchestrator`. The factory `build_post_landing_upload_orchestrator(config, services) -> PostLandingUploadOrchestrator` constructs the `LocalFdrSegmentReader` over `config.c12.fdr_root` and pulls C11's `tile_uploader` from the wider service registry. +- T1's `cli.py` `upload-pending` subcommand resolves `services.post_landing_upload_orchestrator` and calls `.trigger_post_landing_upload(...)`. Maps `FlightStateNotConfirmedError → exit 30`; `UploadGateBlockedError → exit 31`. + +## Scope + +### Included + +- `PostLandingUploadOrchestrator` class with the single public method. +- `PostLandingUploadRequest` DTO. +- `FlightStateNotConfirmedError` with the four documented `not_confirmed_reason` strings + per-reason `remediation`. +- `FdrSegmentReader` Protocol. +- `LocalFdrSegmentReader` concrete reading on-disk FDR segments. +- `FdrUnreadableError` helper exception (caught and rewrapped at the orchestrator boundary). +- Composition-root factory. +- Wiring of T1's `upload-pending` subcommand to this service. +- Conformance unit tests using a fake `FdrSegmentReader` returning scripted record sequences for acceptance criteria AC-1 through AC-9. +- Two end-to-end integration tests using real FDR segment fixtures (one ending with confirmed ON_GROUND for 60 s, one ending with IN_FLIGHT) — these are the C12-IT-03 fixtures (AC-10, AC-11). + +### Excluded + +- The actual upload HTTP machinery (AZ-319). +- The FDR record schema or serialiser (AZ-272). +- The FDR write side / segment rotation (AZ-291..296). +- A "force-upload" override flag to bypass the gate — explicitly NOT supported (defeats the operator-side gate's purpose). +- Reading mid-flight tile snapshots from FDR — the upload itself reads tiles from C6 per AZ-319. +- Cross-flight aggregation — one `flight_id` per call. 
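Steps 4 and 5 of the method flow in the Outcome section (walk backward from the most recent record, bound the contiguous ON_GROUND run, compute its duration) can be sketched as follows. This is an illustration under assumptions, not the task's implementation: `Rec` is a hypothetical stand-in for AZ-272's `FdrRecord`, `trailing_on_ground` is an invented helper name, and timestamps are assumed to be ISO 8601 strings with an explicit `+00:00` offset.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Rec:
    """Hypothetical stand-in for FdrRecord: ISO timestamp plus payload."""

    ts: str
    payload: dict


def trailing_on_ground(
    records: Sequence[Rec], field: str = "flight_state"
) -> Optional[Tuple[str, float]]:
    """Return (since_ts, duration_s) for the LAST contiguous ON_GROUND run.

    Returns None when there are no records or the newest one is not
    ON_GROUND (both map to the "never_landed" refusal in the spec).
    """
    if not records or records[-1].payload[field] != "ON_GROUND":
        return None
    first = len(records) - 1
    # Extend the run backward while the previous record is also ON_GROUND.
    while first > 0 and records[first - 1].payload[field] == "ON_GROUND":
        first -= 1
    since = records[first].ts
    latest = datetime.fromisoformat(records[-1].ts)
    duration_s = (latest - datetime.fromisoformat(since)).total_seconds()
    return since, duration_s


def rec(t0: datetime, offset_s: int, state: str) -> Rec:
    return Rec((t0 + timedelta(seconds=offset_s)).isoformat(), {"flight_state": state})


# Aborted go-around (AC-8): brief touchdown, airborne again, final landing.
t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
go_around: List[Rec] = [
    rec(t0, 0, "IN_FLIGHT"),
    rec(t0, 10, "ON_GROUND"),
    rec(t0, 20, "IN_FLIGHT"),
] + [rec(t0, s, "ON_GROUND") for s in range(100, 161)]

result = trailing_on_ground(go_around)
```

Only the final run counts, so the early touchdown at 10 s does not contribute; the caller would then compare the duration against `upload_min_on_ground_s` with `>=`.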
+ +## Acceptance Criteria + +**AC-1: ≥ 30 s confirmed ON_GROUND → upload invoked** +Given a fake `FdrSegmentReader` returning 60 records, all of them with `flight_state=ON_GROUND`, spanning 60 s of timestamps +When `trigger_post_landing_upload(request)` is called +Then `tile_uploader.upload_pending_tiles` is called exactly once with `flight_state.state=ON_GROUND` and `flight_state.since_ts` equal to the first contiguous ON_GROUND record's ts; the returned `UploadBatchReport` is the one C11 produced; ONE INFO log `kind="c12.upload.confirmed_on_ground"` with `inferred_on_ground_duration_s ≈ 60.0`; ONE INFO log `kind="c12.upload.complete"` + +**AC-2: Insufficient duration → `FlightStateNotConfirmedError("insufficient_duration: ...")`** +Given the FDR ends with 15 s contiguous ON_GROUND records (less than the 30 s threshold) +When `trigger_post_landing_upload(request)` is called +Then `FlightStateNotConfirmedError(not_confirmed_reason="insufficient_duration: 15.0s < 30.0s", inferred_on_ground_duration_s≈15.0)` is raised; `tile_uploader.upload_pending_tiles` is NEVER called; ONE ERROR log `kind="c12.upload.refused.insufficient_duration"` + +**AC-3: Never-landed (last record is IN_FLIGHT) → `FlightStateNotConfirmedError("never_landed")`** +Given the FDR's most recent `state.tick` record has `flight_state=IN_FLIGHT` +When `trigger_post_landing_upload(request)` is called +Then `FlightStateNotConfirmedError(not_confirmed_reason="never_landed", inferred_on_ground_duration_s=None)` is raised; uploader NOT called; ONE ERROR log `kind="c12.upload.refused.never_landed"` + +**AC-4: `flight_id` not found in FDR → `FlightStateNotConfirmedError("flight_id_not_found")`** +Given `//` does not exist +When `trigger_post_landing_upload(request)` is called +Then `FlightStateNotConfirmedError(not_confirmed_reason="flight_id_not_found")` is raised; uploader NOT called; ONE ERROR log `kind="c12.upload.refused.flight_id_not_found"` + +**AC-5: FDR unreadable → 
`FlightStateNotConfirmedError("fdr_unreadable: ")`** +Given the FDR segments exist but parsing raises `OSError("input/output error")` mid-stream +When `trigger_post_landing_upload(request)` is called +Then `FlightStateNotConfirmedError(not_confirmed_reason=re.compile(r"^fdr_unreadable: .*OSError.*"))` is raised; uploader NOT called; ONE ERROR log `kind="c12.upload.refused.fdr_unreadable"` including the inner repr + +**AC-6: Threshold is configurable** +Given `config.c12.upload_min_on_ground_s = 5.0` (override) and the FDR ends with 6 s contiguous ON_GROUND records +When `trigger_post_landing_upload(request)` is called +Then the call succeeds (uploader invoked); the threshold is read from config, NOT a hardcoded literal + +**AC-7: Returns C11's `UploadBatchReport` unchanged** +Given a successful upload returning `UploadBatchReport(tiles_acked=42, tiles_rejected=3, ...)` +When the caller inspects the return value of `trigger_post_landing_upload` +Then it is byte-for-byte the `UploadBatchReport` C11 returned (same dataclass instance via passthrough); no field is added, removed, or renamed + +**AC-8: Contiguous ON_GROUND counting starts from the most recent record only** +Given the FDR contains a sequence `IN_FLIGHT, ON_GROUND, IN_FLIGHT, ON_GROUND × 60s` (an aborted go-around landing) +When `trigger_post_landing_upload(request)` is called +Then the contiguous ON_GROUND block counted is the LAST one (60 s), not the earlier ON_GROUND record; the upload is invoked since 60 s ≥ 30 s + +**AC-9: Empty `flight_state` records → `never_landed`** +Given `iter_records_for_flight(...)` yields zero records (no `state.tick` records ever emitted) +When `trigger_post_landing_upload(request)` is called +Then `FlightStateNotConfirmedError(not_confirmed_reason="never_landed")` is raised (treated as "we have no positive ON_GROUND signal") + +**AC-10: Real FDR fixture C12-IT-03(a) (60 s confirmed) → upload invoked** +Given the C12-IT-03 fixture FDR with confirmed ON_GROUND for 60 s +When 
`trigger_post_landing_upload(request)` is called against the LocalFdrSegmentReader on the fixture +Then the upload is invoked; the returned `UploadBatchReport` matches the fixture's expected counts + +**AC-11: Real FDR fixture C12-IT-03(b) (IN_FLIGHT, incomplete log) → refused** +Given the C12-IT-03 fixture FDR ending with IN_FLIGHT (truncated) +When `trigger_post_landing_upload(request)` is called against the LocalFdrSegmentReader on the fixture +Then `FlightStateNotConfirmedError(not_confirmed_reason="never_landed")` is raised; the upload is NOT invoked + +## Non-Functional Requirements + +**Performance** +- For an 8-hour flight (≤ 64 GB FDR per AC-NEW-3) the orchestrator's read of `state.tick` records completes in ≤ 30 s wall-clock on a developer laptop with NVMe (the records are sparse — `state.tick` is one of many record kinds; the `kind_filter` argument lets the reader skip non-state records cheaply). +- Memory peak ≤ 200 MB even with multi-GB FDR segments — `LocalFdrSegmentReader` is a streaming generator, NOT a list-in-memory. + +**Compatibility** +- AZ-272's `FdrRecord.parse` API is the only parser path; this task does NOT re-implement record parsing. +- C11's `FlightStateSignal` DTO is consumed unchanged; this task does NOT redefine it. + +**Reliability** +- Catches and rewraps the four refusal modes deterministically — operators can script against the four documented `not_confirmed_reason` prefix strings. +- Streaming I/O on FDR segments — multi-GB segments do not blow memory. +- The threshold default (30.0 s) matches description.md C12-IT-03 exactly. 
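The streaming requirement above (a generator, not a list in memory) can be sketched for a length-prefixed record stream. The framing used here, a 4-byte big-endian length before each blob, is purely an assumption for illustration; AZ-272 owns the real serialisation. `iter_blobs` and `frame` are hypothetical names, and `ValueError` stands in for the `FdrUnreadableError` the spec describes.

```python
import io
import struct
from typing import BinaryIO, Iterator


def iter_blobs(stream: BinaryIO) -> Iterator[bytes]:
    """Yield one record blob at a time from a length-prefixed stream.

    Assumed framing: 4-byte big-endian unsigned length, then the blob.
    Only one blob is held in memory at any moment.
    """
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of segment
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        (size,) = struct.unpack(">I", header)
        blob = stream.read(size)
        if len(blob) < size:
            raise ValueError("truncated record body")
        yield blob


def frame(blob: bytes) -> bytes:
    """Encode one blob with the assumed 4-byte length prefix."""
    return struct.pack(">I", len(blob)) + blob


# In-memory segment used only to demonstrate the reader; a real reader
# would open the on-disk segment file in binary mode instead.
segment = io.BytesIO(frame(b"alpha") + frame(b"") + frame(b"gamma"))
blobs = list(iter_blobs(segment))
```

In the real reader each blob would be handed to `FdrRecord.parse(...)` and filtered by `kind` before being yielded, so a truncated tail surfaces as an error instead of a silently short log.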
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Fake reader with 60 ON_GROUND records spanning 60 s | Uploader called once, INFO logs, returns `UploadBatchReport` | +| AC-2 | Fake reader with 15 s ON_GROUND tail | `FlightStateNotConfirmedError("insufficient_duration: 15.0s < 30.0s")` | +| AC-3 | Fake reader whose last record is IN_FLIGHT | `FlightStateNotConfirmedError("never_landed")` | +| AC-4 | Path doesn't exist | `FlightStateNotConfirmedError("flight_id_not_found")` | +| AC-5 | Fake reader raises `FdrUnreadableError("OSError(...)")` | `FlightStateNotConfirmedError(re.match("^fdr_unreadable: .*"))` | +| AC-6 | Override `upload_min_on_ground_s=5.0` + 6 s ON_GROUND | Upload invoked | +| AC-7 | Successful upload, inspect return | Same `UploadBatchReport` instance/fields | +| AC-8 | Sequence with go-around (IN_FLIGHT in middle) | Contiguous count is the LAST run only | +| AC-9 | Empty `iter_records_for_flight` | `FlightStateNotConfirmedError("never_landed")` | +| AC-10 | C12-IT-03(a) fixture | Upload invoked | +| AC-11 | C12-IT-03(b) fixture | `FlightStateNotConfirmedError("never_landed")` | +| NFR-perf-streaming | Microbench `LocalFdrSegmentReader` over 1 GB synthetic segment | Memory peak ≤ 200 MB; parse rate ≥ 100 MB/s | + +## Constraints + +- The four `not_confirmed_reason` strings (`"never_landed"`, `"insufficient_duration: ..."`, `"flight_id_not_found"`, `"fdr_unreadable: ..."`) are a closed contract — adding a new value requires Plan-cycle approval (operators script against these prefixes). +- The threshold default 30.0 s matches description.md C12-IT-03 EXACTLY; changing it requires a spec amendment, not just a config change. +- The "contiguous ON_GROUND from most recent only" semantic (AC-8) is non-negotiable — counting the union of all ON_GROUND windows would defeat the gate by allowing an aborted-go-around aircraft to qualify based on the brief earlier landing. 
+- A "force-upload" override is explicitly NOT supported — operators who legitimately need to upload after a non-conforming flight must use a separate forensic path (out of scope this cycle). +- `LocalFdrSegmentReader` MUST stream; loading a multi-GB segment fully into memory is an NFR violation (NFR-perf-streaming). +- C11's `FlightStateSignal` DTO is the source of truth for the gate signal — this task does NOT define a parallel C12-internal `FlightStateSignal`. +- The threshold is a `float`; comparison uses `>=` (so exactly 30.0 s qualifies). + +## Risks & Mitigation + +**Risk 1: AZ-272's record schema names the field something other than `flight_state`** +- *Risk*: AZ-272's contract may use `state` or `flight.state` instead of `flight_state`; this task hardcodes the field name in `config.c12.flight_state_payload_field`. +- *Mitigation*: The field name is a config knob (default `"flight_state"`); during integration with AZ-272, the default is updated to match AZ-272's actual contract. Tests use the default; integration tests against real FDR fixtures catch a mismatch immediately. + +**Risk 2: The aircraft logs ON_GROUND briefly during taxi before takeoff** +- *Risk*: The flight starts ON_GROUND, transitions to IN_FLIGHT, lands ON_GROUND again. The "contiguous from most recent" semantic correctly handles this — but if the FDR is truncated mid-flight, the most recent record might be from the taxi phase, falsely suggesting a landed flight. +- *Mitigation*: The truncation case is captured by AC-3 / AC-11 — a truncated log ending in IN_FLIGHT correctly refuses. A truncated log ending in the early ON_GROUND taxi phase is indistinguishable from a real landing, but this is an FDR integrity concern out of scope; in practice the FDR writes are continuous. + +**Risk 3: FDR segment file naming convention drift** +- *Risk*: C13 (AZ-291..296) may name segments differently than `segment_.fdr`. 
+- *Mitigation*: The naming pattern is captured in `LocalFdrSegmentReader` with a `glob_pattern` constructor parameter (default `segment_*.fdr`); update the default if AZ-291 picks a different name. Tests cover both patterns. + +**Risk 4: `parse_iso` timezone handling** +- *Risk*: Two records with the same wall-clock time but different timezones produce a wrong duration calculation. +- *Mitigation*: AZ-272's contract specifies all timestamps are ISO 8601 UTC microseconds; this task asserts UTC at parse time and raises `FdrUnreadableError("non-UTC timestamp in record")` otherwise. Defense-in-depth. + +**Risk 5: A future cycle adds a third flight state value (e.g. `EMERGENCY`)** +- *Risk*: The contiguous-counting code treats anything other than `ON_GROUND` as breaking the run; a new `EMERGENCY` value during landing rollout could shorten the inferred duration spuriously. +- *Mitigation*: Acceptable for this cycle — emergency states should not allow upload anyway. A future cycle that introduces such states must update this task's logic explicitly via a Plan-cycle change. + +## Runtime Completeness + +- **Named capability**: post-flight ON_GROUND-gated upload trigger per description.md § 2 (`trigger_post_landing_upload`) + AC-8.4 + C12-IT-03. +- **Production code that must exist**: real `PostLandingUploadOrchestrator` consuming real `TileUploader` (AZ-319) + real `LocalFdrSegmentReader` reading real on-disk FDR segments + real `FdrRecord.parse` (AZ-272). +- **Allowed external stubs**: tests MAY use fakes for `FdrSegmentReader` and `TileUploader`; the C12-IT-03 integration tests use real FDR fixture files + a fake `TileUploader` that records the call (no real network). 
+- **Unacceptable substitutes**: in-memory FDR (defeats the streaming guarantee NFR); a "force-upload" override (defeats the gate); shelling out to `cat ` instead of using `FdrRecord.parse` (no schema validation, no forward-compat); reading the FDR via the producer-side ring buffer (wrong API; ring buffer is for live producers, not post-flight reads). diff --git a/_docs/02_tasks/todo/AZ-330_c12_operator_reloc_service.md b/_docs/02_tasks/todo/AZ-330_c12_operator_reloc_service.md new file mode 100644 index 0000000..832e902 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-330_c12_operator_reloc_service.md @@ -0,0 +1,205 @@ +# C12 OperatorReLocService — AC-3.4 Re-localization Request via GCS Link + +**Task**: AZ-330_c12_operator_reloc_service +**Name**: C12 OperatorReLocService +**Description**: Implement `OperatorReLocService`, the C12 operator-side of AC-3.4 (operator-relocalization on visual loss; the SUT requests a position hint from the operator after losing satellite anchoring; the operator confirms a candidate; the system re-anchors). Owns: (a) the `ReLocHint` DTO (`approximate_position_wgs84: LatLonAlt`, `confidence_radius_m: float`, `reason: str`) per description.md § 2; (b) the `OperatorCommandTransport` Protocol that E-C8 (a future task in AZ-261) will implement against pymavlink for the actual GCS-link MAVLink encoding + transmission; (c) the `request_reloc(reloc_hint: ReLocHint) -> None` public method that validates the hint at the C12 boundary, calls `transport.send_reloc_hint(...)`, catches the transport's `GcsLinkError` and re-raises with C12-specific context (operator action label, monotonic timestamp, hint summary as a redacted log line), emits an FDR record `kind="c12.reloc.requested"` via the AZ-273 FDR client so the post-flight log carries the operator's action chronologically, and writes an INFO log on success / ERROR log on failure. 
Best-effort semantics per description.md § 7 — if the GCS link is degraded the operator may need to re-issue manually; this task does NOT auto-retry. Publishes the Protocol contract at `_docs/02_document/contracts/c12_operator_tooling/operator_command_transport.md` so a future E-C8 task implements the same shape against pymavlink without re-negotiating fields. The pattern matches AZ-322's `BackboneEmbedder` Protocol (C10 owns the Protocol; C2 implements it later). +**Complexity**: 3 points +**Dependencies**: AZ-326_c12_cli_app, AZ-273_fdr_client_ringbuf, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module +**Component**: c12_operator_tooling (epic AZ-253 / E-C12) +**Tracker**: AZ-330 +**Epic**: AZ-253 (E-C12) + +### Document Dependencies + +- `_docs/02_document/contracts/c12_operator_tooling/operator_command_transport.md` — produced by this task (frozen Protocol + DTO shape, invariants, test cases for E-C8 to implement against). +- `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` — consumed: the `c12.reloc.requested` record envelope. +- `_docs/02_document/components/13_c12_operator_tooling/description.md` — § 2 (`OperatorReLocService` interface, `ReLocHint` DTO), § 5 (`GcsLinkError` best-effort), § 7 (best-effort semantics; operator may re-issue). +- `_docs/02_document/components/13_c12_operator_tooling/tests.md` — C12-IT-01 (operator re-loc workflow returns SUT to satellite-anchored ≤ 30 s). + +## Problem + +Without a real `OperatorReLocService`: + +- AC-3.4 (operator-relocalization on visual loss) collapses on the operator side — the airborne SUT can publish a re-loc request via FDR + GCS STATUSTEXT (per the C12-IT-01 description), but the operator workstation has no surface to send the operator's confirmed candidate back to the companion. +- The C12 ↔ C8 contract is undefined — without a frozen Protocol document, E-C8's MAVLink implementation might use a wire shape C12 cannot construct. 
+- `GcsLinkError` is concept-only in description.md § 5 with no producer. +- The CLI's `reloc-confirm` subcommand has nothing to delegate to. +- The post-flight FDR has no record of operator re-loc actions, breaking the C12-IT-01 assertion that "the confirmation event lands in FDR". +- The `ReLocHint` DTO is not defined anywhere; sibling code that wants to construct one has no canonical type. + +This task delivers the C12 service surface + the Protocol contract + the FDR side effect. It does NOT own MAVLink encoding (E-C8 will), does NOT own the GCS-link transport layer (E-C8), and does NOT own the airborne side that consumes the inbound MAVLink message (E-C8 / E-C5 / E-C2 chain). + +## Outcome + +- An `OperatorReLocService` class at `src/operator_tool/operator_reloc_service.py`: + - Constructor: `__init__(self, *, transport: OperatorCommandTransport, fdr_client: FdrClient, logger: Logger, clock: Clock)`. + - Public method: `request_reloc(reloc_hint: ReLocHint) -> None`. +- DTOs at `src/operator_tool/_types.py`: + - `LatLonAlt` (`@dataclass(frozen=True)`): `latitude_deg: float`, `longitude_deg: float`, `altitude_m: float`. Range checks at construction (`-90 ≤ lat ≤ 90`, `-180 < lon ≤ 180`, no altitude bound). NOTE: if the project already has a shared `LatLonAlt` (likely under `shared_helpers` since C4/C5/C6/C8/C10 all use WGS84) — REUSE it; do NOT redefine. The check is: if `_docs/02_document/contracts/shared_helpers/wgs_converter.md` defines `LatLonAlt`, import from there. Otherwise add to `_types.py`. + - `ReLocHint` (`@dataclass(frozen=True)`): `approximate_position_wgs84: LatLonAlt`, `confidence_radius_m: float`, `reason: str`. Validates `confidence_radius_m > 0` at construction (`__post_init__`); validates `reason` is non-empty. +- An `OperatorCommandTransport` Protocol at `src/operator_tool/operator_command_transport.py`: + ```python + @runtime_checkable + class OperatorCommandTransport(Protocol): + def send_reloc_hint(self, hint: ReLocHint) -> None: ... 
+ ``` + This task ships the Protocol; E-C8 ships the concrete `MavlinkOperatorCommandTransport` (a future task referenced as a forward dep). +- Errors at `src/operator_tool/errors.py`: + - `GcsLinkError(Exception)`: attributes `reason: str` (operator-friendly), `wrapped_exception_repr: str | None`, `remediation: str = "Check GCS link signal strength; re-issue the re-loc command when the link recovers."`. Note: this exception is RAISED by the transport (E-C8's concrete impl); C12 catches and re-raises with added context (the original `GcsLinkError` is preserved as `__cause__`). +- Method flow for `request_reloc`: + 1. Validate `reloc_hint.confidence_radius_m > 0` (already validated at DTO construction; defense-in-depth check here). + 2. Validate `reloc_hint.reason` is non-empty. + 3. Compute a redacted hint summary for logging: `{lat: LAT_5_DECIMALS, lon: LON_5_DECIMALS, radius_m: CONFIDENCE_RADIUS_M, reason: "REASON_FIRST_200_CHARS"}` (uppercase names are placeholders) — altitude included unredacted; lat/lon to 5 decimals only (~1 m granularity, sufficient for log forensics, avoids logging full operator-confidential GPS). + 4. Try block: + - `transport.send_reloc_hint(reloc_hint)`. + - INFO log `kind="c12.reloc.sent"` with the redacted summary. + - `fdr_client.enqueue(FdrRecord(kind="c12.reloc.requested", payload={"hint": FULL_HINT, "outcome": "sent", "ts_monotonic": clock.monotonic()}))`, where `FULL_HINT` is the unredacted `ReLocHint`. + 5. Except `GcsLinkError as e`: + - ERROR log `kind="c12.reloc.failed"` with the redacted summary + `e.reason`. + - `fdr_client.enqueue(FdrRecord(kind="c12.reloc.requested", payload={"hint": FULL_HINT, "outcome": "failed", "failure_reason": e.reason, "ts_monotonic": clock.monotonic()}))` — the FDR record carries BOTH the attempt and the failure so the post-flight log shows the operator tried. + - Re-raise `GcsLinkError(reason=f"C12 reloc-confirm: {e.reason}", wrapped_exception_repr=repr(e), remediation=e.remediation)` using `raise ... from e`, so the C12 prefix lands in `reason` and the original is preserved as `__cause__`.
+- The Protocol contract published at `_docs/02_document/contracts/c12_operator_tooling/operator_command_transport.md` per `templates/api-contract.md`. Includes Shape, Invariants, Non-Goals, Versioning Rules, and at least 3 Test Cases that E-C8's implementer can run against `MavlinkOperatorCommandTransport`. +- Composition-root factory at `src/gps_denied_onboard/runtime_root/c12_factory.py` extends T1's `OperatorToolServices` dataclass with `operator_reloc_service: OperatorReLocService`. The factory `build_operator_reloc_service(config, services) -> OperatorReLocService` constructs the service; the `OperatorCommandTransport` is resolved from a wider service registry that includes E-C8's `MavlinkOperatorCommandTransport` (or a fake `LoggingOnlyOperatorCommandTransport` until E-C8 is implemented — fake declared in tests, NOT in production wiring). +- T1's `cli.py` `reloc-confirm` subcommand resolves `services.operator_reloc_service` and calls `.request_reloc(...)`. The CLI subcommand parses CLI flags `--lat`, `--lon`, `--alt`, `--radius`, `--reason` into a `ReLocHint`. Maps `GcsLinkError → exit 40`; `ValueError → exit 2 (usage)`. + +## Scope + +### Included + +- `OperatorReLocService` class with the single public method. +- `LatLonAlt` and `ReLocHint` DTOs (or import from `shared_helpers` if WgsConverter already defined `LatLonAlt`). +- `OperatorCommandTransport` Protocol. +- `GcsLinkError` error type with `reason`, `wrapped_exception_repr`, `remediation`. +- The Protocol contract document at `_docs/02_document/contracts/c12_operator_tooling/operator_command_transport.md`. +- FDR record emission via `fdr_client.enqueue` (both success and failure cases). +- Composition-root factory. +- Wiring of T1's `reloc-confirm` subcommand to this service. +- Conformance unit tests using a fake `OperatorCommandTransport` covering all 5 acceptance criteria. + +### Excluded + +- The MAVLink encoding (E-C8 owns). +- The actual GCS-link transmission (E-C8 owns). 
+- The airborne side that decodes the inbound MAVLink message and feeds it to the C5 state estimator (E-C5 / E-C8 chain). +- A retry loop or backoff (best-effort per description.md § 7; operator re-issues manually). +- A "broadcast" mode that sends to multiple companions (out of scope; one companion per operator session). +- An ack mechanism (the airborne side may publish a re-loc-applied event via FDR + STATUSTEXT, but this task does NOT wait for or process such an ack). + +## Acceptance Criteria + +**AC-1: Successful send → transport called once + INFO log + FDR record** +Given a fake `OperatorCommandTransport.send_reloc_hint` that returns successfully and a valid `ReLocHint` +When `request_reloc(hint)` is called +Then the transport is called exactly once with the hint (verifiable via spy); ONE INFO log `kind="c12.reloc.sent"` emitted with `reason`, `confidence_radius_m`, `position_lat` (5 decimals), `position_lon` (5 decimals); ONE FDR record `kind="c12.reloc.requested"` enqueued with `payload.outcome == "sent"` and the full hint + +**AC-2: Transport raises `GcsLinkError` → re-raise + ERROR log + FDR record carries failure** +Given a fake `OperatorCommandTransport.send_reloc_hint` that raises `GcsLinkError(reason="link signal lost", wrapped_exception_repr="SerialTimeout(...)")` +When `request_reloc(hint)` is called +Then `GcsLinkError(reason="C12 reloc-confirm: link signal lost", ...)` is raised; the original `GcsLinkError` is preserved as `__cause__`; ONE ERROR log `kind="c12.reloc.failed"` with the redacted summary + `e.reason`; ONE FDR record `kind="c12.reloc.requested"` enqueued with `payload.outcome == "failed"` and `payload.failure_reason == "link signal lost"` + +**AC-3: `confidence_radius_m ≤ 0` → `ValueError` at DTO construction** +Given attempting to construct `ReLocHint(approximate_position_wgs84=..., confidence_radius_m=0.0, reason="...")` or with a negative value +When the constructor is invoked +Then `ValueError("confidence_radius_m must be > 0; 
got 0.0")` is raised; the transport is NEVER called; no log or FDR record is emitted (the DTO never reached the service) + +**AC-4: `reason` is preserved verbatim through the transport call** +Given a `ReLocHint(reason="lost track at waypoint 3, terrain features ambiguous due to seasonal foliage change")` +When `request_reloc(hint)` is called +Then the transport's `send_reloc_hint` receives the hint with `reason` byte-for-byte equal to the input (no truncation, no normalization); the FDR record's `payload.hint.reason` is the same; the INFO log truncates the displayed reason to 200 chars (display-only) but the underlying transport call is unmodified + +**AC-5: Protocol contract document exists with the exact method signature** +Given the published contract at `_docs/02_document/contracts/c12_operator_tooling/operator_command_transport.md` +When E-C8's implementer reads the contract to build `MavlinkOperatorCommandTransport` +Then the contract specifies the exact Protocol shape (`def send_reloc_hint(self, hint: ReLocHint) -> None`), the `ReLocHint` field shape, the documented `GcsLinkError` raise behaviour, the Versioning Rules, and at least 3 Test Cases + +**AC-6: Empty `reason` → `ValueError` at DTO construction** +Given attempting to construct `ReLocHint(reason="")` +When the constructor is invoked +Then `ValueError("reason must be non-empty")` is raised + +**AC-7: Latitude / longitude out-of-range → `ValueError` at LatLonAlt construction** +Given attempting to construct `LatLonAlt(latitude_deg=91.0, ...)` or `longitude_deg=181.0` +When the constructor is invoked +Then `ValueError` is raised with the offending value in the message + +**AC-8: FDR enqueue is non-blocking even if the FDR client is overrun** +Given a fake `FdrClient` whose `enqueue` returns `EnqueueResult.OVERRUN` (per AZ-273) +When `request_reloc(hint)` is called +Then `request_reloc` does NOT raise; the transport call is unaffected; the operator's re-loc request still reaches the companion; the 
OVERRUN result is observable in the test (via the spy) but the operator action proceeds (FDR is best-effort logging) + +**AC-9: Position is logged at 5 decimals (not full precision)** +Given a `LatLonAlt(latitude_deg=49.99876543, longitude_deg=36.12345678, altitude_m=...)` +When `request_reloc(hint)` is called +Then the INFO log line shows `position_lat: 49.99877` and `position_lon: 36.12346` (rounded to 5 decimals); the underlying transport receives the full-precision value (no rounding before transport) + +**AC-10: Composition-root factory does not eager-construct the transport** +Given the operator-tool starts up (T1's `cli.py` lazily resolves services) +When the operator does NOT use the `reloc-confirm` subcommand in this session +Then `OperatorCommandTransport` is NEVER instantiated (verifiable via spy on the factory); pymavlink is NEVER imported (NFR-perf-cold-start from T1 holds) + +## Non-Functional Requirements + +**Performance** +- `request_reloc` overhead in this task (validation + log + FDR enqueue) ≤ 1 ms wall-clock; the transport call is the dominant time. +- FDR `enqueue` is non-blocking per AZ-273. + +**Compatibility** +- Reuses `LatLonAlt` from `shared_helpers/wgs_converter.md` if it exists there (per the cross-cutting rule); otherwise defines it locally and a future cycle migrates. +- The Protocol contract is forward-compatible: adding a new method to `OperatorCommandTransport` is allowed; renaming or removing `send_reloc_hint` is a breaking change requiring a new Protocol version. + +**Reliability** +- Transport failure does NOT corrupt the FDR — the FDR record is enqueued with `outcome=failed` so the post-flight log shows the operator tried. +- DTO validation rejects malformed input at the boundary (AC-3, AC-6, AC-7). 
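
The construction-time validation (AC-3, AC-6, AC-7) and the 5-decimal log redaction (AC-9) could look roughly like the following. This is a minimal sketch, not the project's implementation: `redacted_summary` is a hypothetical helper name, the `altitude_m` summary key is an assumption, and the exact wording of the latitude/longitude error messages is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatLonAlt:
    latitude_deg: float
    longitude_deg: float
    altitude_m: float  # no bound is enforced on altitude, per the spec

    def __post_init__(self) -> None:
        # Range checks from the spec: -90 <= lat <= 90, -180 < lon <= 180.
        if not -90.0 <= self.latitude_deg <= 90.0:
            raise ValueError(f"latitude_deg must be in [-90, 90]; got {self.latitude_deg}")
        if not -180.0 < self.longitude_deg <= 180.0:
            raise ValueError(f"longitude_deg must be in (-180, 180]; got {self.longitude_deg}")


@dataclass(frozen=True)
class ReLocHint:
    approximate_position_wgs84: LatLonAlt
    confidence_radius_m: float
    reason: str

    def __post_init__(self) -> None:
        # "not x > 0" also rejects NaN, which "x <= 0" would let through.
        if not self.confidence_radius_m > 0:
            raise ValueError(f"confidence_radius_m must be > 0; got {self.confidence_radius_m}")
        if not self.reason:
            raise ValueError("reason must be non-empty")


def redacted_summary(hint: ReLocHint, max_reason_chars: int = 200) -> dict:
    """Hypothetical helper: build the display-only summary for the live log."""
    pos = hint.approximate_position_wgs84
    return {
        "position_lat": round(pos.latitude_deg, 5),
        "position_lon": round(pos.longitude_deg, 5),
        "altitude_m": pos.altitude_m,  # assumed key; altitude is not redacted
        "confidence_radius_m": hint.confidence_radius_m,
        "reason": hint.reason[:max_reason_chars],
    }
```

The summary is a separate dictionary, so the frozen `ReLocHint` passed to the transport and to the FDR keeps its full-precision coordinates and its untruncated `reason`.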
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Fake transport returning success + valid hint | Transport called once, INFO log, FDR record `outcome="sent"` | +| AC-2 | Fake transport raising `GcsLinkError` | Re-raise with C12 prefix, ERROR log, FDR record `outcome="failed"` | +| AC-3 | `ReLocHint(confidence_radius_m=0.0)` and `=-1.0` | `ValueError` at construction | +| AC-4 | Long reason string (300 chars) | Transport receives full string; log truncates to 200 chars | +| AC-5 | File exists check + parse | Contract file exists; required sections present | +| AC-6 | `ReLocHint(reason="")` | `ValueError` at construction | +| AC-7 | `LatLonAlt(latitude_deg=91.0)`, `longitude_deg=181.0` | `ValueError` with offending value | +| AC-8 | Fake `FdrClient.enqueue` returns OVERRUN | `request_reloc` succeeds; transport unaffected | +| AC-9 | Hint with high-precision lat/lon | Log shows 5-decimal rounding; transport sees full precision | +| AC-10 | Lazy resolution test — never call `reloc-confirm` | `OperatorCommandTransport` not constructed; pymavlink not imported | + +## Constraints + +- The Protocol shape (`def send_reloc_hint(self, hint: ReLocHint) -> None`) is the C12 ↔ C8 contract — E-C8 MUST implement this exact signature against pymavlink. Renaming the method requires a Plan-cycle change. +- `ReLocHint.confidence_radius_m > 0` and `reason != ""` are non-negotiable invariants — caller must validate before constructing the DTO; construction-time check is defense-in-depth. +- Best-effort semantics: this task does NOT auto-retry on `GcsLinkError`. Operators are responsible for re-issuing. +- `LatLonAlt` reuse: if `shared_helpers/wgs_converter.md` defines `LatLonAlt`, this task imports it; if not, defines it in `_types.py` and a future cycle migrates to shared. DO NOT silently duplicate. 
+- The FDR record's `payload.hint` carries the FULL `ReLocHint` (no redaction) — operators inspecting the post-flight log need the exact action they took. The redaction is for the live log file only. +- pymavlink MUST NOT be imported in this task's modules — that's E-C8's concern. The transport is consumed via the Protocol. + +## Risks & Mitigation + +**Risk 1: E-C8's concrete transport diverges from the Protocol** +- *Risk*: A future E-C8 task implements `MavlinkOperatorCommandTransport` with a different signature (e.g. an extra `companion_id` parameter), breaking the C12 ↔ C8 boundary. +- *Mitigation*: AC-5 + the published contract document fix the Protocol shape; E-C8's unit tests assert the implementation satisfies the Protocol via `runtime_checkable`. Catches divergence at E-C8's PR review. + +**Risk 2: GCS link is degraded but the operator wants to re-issue rapidly** +- *Risk*: The operator hits `reloc-confirm` repeatedly during a degraded link; each call hits `GcsLinkError`; FDR fills with `outcome=failed` records. +- *Mitigation*: Acceptable — the FDR is bounded at 64 GB (AZ-291..296 enforce); a flood of `c12.reloc.requested` records is a bounded-size anomaly, not unbounded. A future cycle MAY add operator-side rate-limiting, but this cycle's best-effort semantics align with description.md § 7. + +**Risk 3: Operator's `reason` field contains sensitive info (e.g. tactical context)** +- *Risk*: The full `reason` lands in FDR (which is post-flight retrievable); the truncated 200-char log is also persisted. +- *Mitigation*: The 5-decimal lat/lon rounding in the log is the only redaction this task applies; full hint persistence in FDR is an intentional product decision per description.md § 5 (post-flight forensics needs the full action). Operators who need to redact must use shorter `reason` strings. + +**Risk 4: The Protocol contract becomes stale as E-C8 evolves** +- *Risk*: E-C8's implementation needs additional context (e.g. 
a `correlation_id` for ack matching). +- *Mitigation*: The Protocol's Versioning Rules (in the contract) document how to extend (add new method, OR add optional kwarg with default); breaking changes require a new Protocol version. E-C8 negotiates via the contract document, not by editing this task. + +## Runtime Completeness + +- **Named capability**: AC-3.4 operator-relocalization — operator-side request channel per description.md § 2 (`OperatorReLocService.request_reloc`) + § 5 (`GcsLinkError` semantics) + § 7 (best-effort). +- **Production code that must exist**: real `OperatorReLocService` with real `OperatorCommandTransport` (E-C8's `MavlinkOperatorCommandTransport`) injected; real `FdrClient` (AZ-273) emitting the `c12.reloc.requested` record. +- **Allowed external stubs**: tests MAY use a fake `OperatorCommandTransport`; production wiring uses E-C8's pymavlink-backed concrete impl. A `LoggingOnlyOperatorCommandTransport` MAY be used in development environments where no companion is wired (CLI-driven smoke test) — declared in tests / dev composition root, NOT in the production composition root. +- **Unacceptable substitutes**: a no-op transport in production (defeats AC-3.4); shelling out to `mavlink_router` (security + reliability); an internal queue stub instead of the real GCS-link transport (defeats the C12-IT-01 30-s round-trip assertion); skipping FDR record emission (defeats post-flight forensics). + +## Contract + +This task produces/implements the contract at `_docs/02_document/contracts/c12_operator_tooling/operator_command_transport.md`. Consumers (specifically the future E-C8 task implementing `MavlinkOperatorCommandTransport`) MUST read that file — not this task spec — to discover the interface. 
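
The `request_reloc` flow described in this task's Outcome section can be sketched as below. This is an illustration under stated assumptions, not the shipped service: `ReLocHint` is a trimmed stand-in with flat latitude and longitude (the spec nests a `LatLonAlt`), `FdrRecord` is an assumed minimal envelope, and the log message format is invented. The sketch places the success log and FDR enqueue in an `else` branch rather than inside the `try`, which narrows the `except` to the transport call only.

```python
import logging
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ReLocHint:
    """Trimmed stand-in; the spec nests a LatLonAlt instead of flat lat/lon."""
    latitude_deg: float
    longitude_deg: float
    confidence_radius_m: float
    reason: str


@dataclass(frozen=True)
class FdrRecord:
    """Assumed minimal envelope; the real shape lives in the FDR schema contract."""
    kind: str
    payload: dict


class GcsLinkError(Exception):
    def __init__(self, reason: str, wrapped_exception_repr: Optional[str] = None,
                 remediation: str = "Check GCS link signal strength; "
                 "re-issue the re-loc command when the link recovers.") -> None:
        super().__init__(reason)
        self.reason = reason
        self.wrapped_exception_repr = wrapped_exception_repr
        self.remediation = remediation


class OperatorReLocService:
    def __init__(self, *, transport: Any, fdr_client: Any,
                 logger: logging.Logger, clock: Any) -> None:
        self._transport = transport
        self._fdr_client = fdr_client
        self._logger = logger
        self._clock = clock

    def _enqueue(self, hint: ReLocHint, **extra: Any) -> None:
        payload = {"hint": asdict(hint), "ts_monotonic": self._clock.monotonic(), **extra}
        # The enqueue result (for example an overrun marker) is deliberately ignored:
        # FDR logging is best-effort and must not block the operator action.
        self._fdr_client.enqueue(FdrRecord(kind="c12.reloc.requested", payload=payload))

    def request_reloc(self, reloc_hint: ReLocHint) -> None:
        # Defense-in-depth checks; the real DTO already validates at construction.
        if not reloc_hint.confidence_radius_m > 0:
            raise ValueError("confidence_radius_m must be > 0")
        if not reloc_hint.reason:
            raise ValueError("reason must be non-empty")
        summary = {
            "lat": round(reloc_hint.latitude_deg, 5),
            "lon": round(reloc_hint.longitude_deg, 5),
            "radius_m": reloc_hint.confidence_radius_m,
            "reason": reloc_hint.reason[:200],
        }
        try:
            self._transport.send_reloc_hint(reloc_hint)
        except GcsLinkError as e:
            self._logger.error("c12.reloc.failed %s reason=%s", summary, e.reason)
            self._enqueue(reloc_hint, outcome="failed", failure_reason=e.reason)
            # "from e" keeps the transport's exception reachable as __cause__.
            raise GcsLinkError(reason=f"C12 reloc-confirm: {e.reason}",
                               wrapped_exception_repr=repr(e),
                               remediation=e.remediation) from e
        else:
            self._logger.info("c12.reloc.sent %s", summary)
            self._enqueue(reloc_hint, outcome="sent")
```

Any object with a `send_reloc_hint` method can stand in for the transport in tests, which is what lets AC-1 and AC-2 run without pymavlink.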
diff --git a/_docs/02_tasks/todo/AZ-331_c1_vio_strategy_protocol.md b/_docs/02_tasks/todo/AZ-331_c1_vio_strategy_protocol.md new file mode 100644 index 0000000..1ca7fc4 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-331_c1_vio_strategy_protocol.md @@ -0,0 +1,182 @@ +# C1 VioStrategy Protocol + Composition-Root Selection + +**Task**: AZ-331_c1_vio_strategy_protocol +**Name**: C1 VioStrategy Protocol +**Description**: Define the `VioStrategy` Protocol, its DTOs (`WarmStartPose`, `VioOutput`, `VioHealth`, `FeatureQuality`, `VioState` enum), the runtime error taxonomy (`VioError` family), and the composition-root selection switch that wires exactly one of `Okvis2Strategy` / `VinsMonoStrategy` / `KltRansacStrategy` at startup based on ADR-001 (config) and ADR-002 (`BUILD_OKVIS2` / `BUILD_VINS_MONO` / `BUILD_KLT_RANSAC` flags). This is the foundational shared-API task for E-C1 — every other E-C1 strategy task implements this Protocol; C5 state estimator tasks (AZ-260) consume `VioOutput`; C13 FDR writer tasks (AZ-248) consume `VioHealth`; the warm-start + F8 reboot recovery wiring task (this same epic) invokes `reset_to_warm_start`. +**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-270_compose_root, AZ-272_fdr_record_schema, AZ-276_imu_preintegrator, AZ-277_se3_utils +**Component**: c1_vio (epic AZ-254 / E-C1) +**Tracker**: AZ-331 +**Epic**: AZ-254 (E-C1) + +### Document Dependencies + +- `_docs/02_document/contracts/c1_vio/vio_strategy_protocol.md` — frozen public interface this task produces. +- `_docs/02_document/contracts/shared_helpers/imu_preintegrator.md` — referenced for the IMU substrate every strategy consumes (AZ-276). +- `_docs/02_document/contracts/shared_helpers/se3_utils.md` — `SE3` math primitives used by `WarmStartPose` and `VioOutput.relative_pose_T` (AZ-277). 
+- `_docs/02_document/contracts/shared_config/composition_root_protocol.md` — `Config` extension for the new `config.vio.strategy` enum. +- `_docs/02_document/components/01_c1_vio/description.md` — § 1 high-level overview and § 2 internal interfaces (the Protocol's source of truth). + +## Problem + +Three different concrete VIO backends (OKVIS2 production-default, VINS-Mono research-only, KLT/RANSAC mandatory simple-baseline per ADR-002 engine rule) plus three external consumers (C5 state estimator, C13 FDR writer, runtime_root composition) all need a single, frozen interface to per-frame VIO output. Without it: + +- Each consumer would import a concrete OKVIS2 / VINS-Mono / KLT-RANSAC class directly, hard-coding the runtime choice and breaking ADR-001's runtime selectability. +- `BUILD_OKVIS2=OFF` (Tier-0 workstation without OKVIS2 native libs installed) would not import because consumers depend on OKVIS2-specific symbols. +- The composition root would have to know per-component which strategy is acceptable; today only ADR-001 (config) + ADR-002 (`BUILD_*` flags) decide. +- Error handling would diverge per backend; `VioFatalError` would have a different shape per implementation, making the AC-5.2 fallback path fragile (3 s no-estimate timer expects a uniform error). +- C5 fusion would have to handle a different `VioOutput` shape per backend; the iSAM2 graph's `BetweenFactorPose3` insertion path assumes a fixed 6×6 covariance shape. +- The honest-covariance contract — strategies MUST NOT tighten covariance during a degradation event — would have no canonical surface to enforce; AC-1.4 drift becomes silent. + +This task delivers the typed boundary every consumer reads against and every strategy conforms to. It writes no per-frame VIO logic — the concrete backends are separate tasks in this epic. 
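
To make the "typed boundary" concrete, here is a minimal sketch of a structural Protocol plus a single error family. The four method names and the three error subtypes come from this spec; the parameter names and the fake classes are assumptions made for illustration only. The sketch also shows a known limit of `typing.runtime_checkable`: `isinstance` verifies that the methods exist, not that their signatures match.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class VioStrategy(Protocol):
    # Parameter names below are illustrative assumptions, not the frozen contract.
    def process_frame(self, frame: Any, imu_window: Any) -> Any: ...
    def reset_to_warm_start(self, warm_start_pose: Any) -> None: ...
    def health_snapshot(self) -> Any: ...
    def current_strategy_label(self) -> str: ...


class VioError(Exception):
    """Common base so consumers can catch the whole family."""


class VioInitializingError(VioError): ...
class VioDegradedError(VioError): ...
class VioFatalError(VioError): ...


class FullFake:
    """Hypothetical test double implementing all four methods."""
    def process_frame(self, frame: Any, imu_window: Any) -> Any:
        raise VioInitializingError("not converged yet")
    def reset_to_warm_start(self, warm_start_pose: Any) -> None:
        return None
    def health_snapshot(self) -> Any:
        return None
    def current_strategy_label(self) -> str:
        return "klt_ransac"


class MissingMethodFake:
    """Omits health_snapshot, so the structural check fails."""
    def process_frame(self, frame: Any, imu_window: Any) -> Any: ...
    def reset_to_warm_start(self, warm_start_pose: Any) -> None: ...
    def current_strategy_label(self) -> str:
        return "klt_ransac"


class WrongSignatureFake(FullFake):
    """process_frame takes no arguments, yet isinstance still passes."""
    def process_frame(self) -> Any:  # type: ignore[override]
        return None


def is_vio_family(exc: BaseException) -> bool:
    # Consumers catch only the family; unrelated exceptions pass through.
    return isinstance(exc, VioError)
```

Because no fake inherits from `VioStrategy`, conformance is purely structural, so a backend can be added without touching a shared base class.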
+ +## Outcome + +- A `VioStrategy` Protocol (PEP 544 `typing.Protocol`, decorated with `@runtime_checkable`) is exported from `src/gps_denied_onboard/components/c1_vio/interface.py` and re-exported from the component's `__init__.py`. +- The DTOs `WarmStartPose`, `VioOutput`, `VioHealth`, `FeatureQuality` are stdlib `@dataclass(frozen=True)`; `VioState` is a `str`-Enum. `VioOutput` and `VioHealth` are placed in `src/gps_denied_onboard/_types/nav.py` for cross-component access (C5, C13); `WarmStartPose`, `FeatureQuality`, `VioState`, and the `VioStrategy` Protocol live in the c1_vio component. +- The runtime error taxonomy is a single hierarchy under `gps_denied_onboard.components.c1_vio.errors`: `VioError` ← {`VioInitializingError`, `VioDegradedError`, `VioFatalError`}. Every strategy raises only these; consumers catch only these. +- The composition root has a `build_vio_strategy(config: Config, *, fdr_client: FdrClient) -> VioStrategy` factory function at `src/gps_denied_onboard/runtime_root/vio_factory.py` that selects the strategy by `config.vio.strategy` (`okvis2 | vins_mono | klt_ransac`) and respects compile-time `BUILD_*` gating: requesting a strategy whose `BUILD_*` flag is OFF raises `StrategyNotAvailableError` at composition time (NOT at first frame). +- Concrete strategy modules are lazy-imported inside the factory under `if BUILD_OKVIS2: from c1_vio.okvis2 import Okvis2Strategy` so a Tier-0 workstation build that lacks OKVIS2 native libs still composes successfully when only KLT/RANSAC is requested. +- Every strategy's `current_strategy_label()` returns the lowercase label matching the config value (`"okvis2"`, `"vins_mono"`, `"klt_ransac"`); this is the FDR-stamped label for AC-NEW-3 audit.
+- A `ConfigSchemaError` extension to AZ-269's config loader for the new `config.vio.strategy` enum + `config.vio.lost_frame_threshold` (default 9; consecutive-LOST frames before `VioFatalError` is raised) + `config.vio.warm_start_max_frames` (default 5; convergence budget after `reset_to_warm_start`). +- A frozen contract file at `_docs/02_document/contracts/c1_vio/vio_strategy_protocol.md` (already drafted alongside this task) carries the full shape; consumers read that file, not this task spec. + +## Scope + +### Included + +- `VioStrategy` Protocol with the four methods from `_docs/02_document/components/01_c1_vio/description.md` § 2: `process_frame`, `reset_to_warm_start`, `health_snapshot`, `current_strategy_label`. +- DTO dataclasses for `WarmStartPose`, `VioOutput`, `VioHealth`, `FeatureQuality`. All `frozen=True`. The `VioState` enum. +- Error hierarchy under `c1_vio.errors`: every error type the Protocol promises; all derived from a common `VioError` so consumers can catch the family. +- Placement of `VioOutput` and `VioHealth` in `src/gps_denied_onboard/_types/nav.py` (per `module-layout.md` cross-component DTO rule); the rest of the surface lives in `components/c1_vio/`. +- `build_vio_strategy(config, *, fdr_client) -> VioStrategy` composition-root factory in `src/gps_denied_onboard/runtime_root/vio_factory.py`. Imports the concrete strategy lazily — guarded by `if BUILD_OKVIS2: from c1_vio.okvis2 import Okvis2Strategy` so an OFF flag does not force a native-lib import. +- A `StrategyNotAvailableError` raised by the factory when the requested strategy is not built into this binary, with a message naming the missing `BUILD_*` flag. +- Config schema extension to AZ-269's loader for the three new fields (`config.vio.strategy`, `config.vio.lost_frame_threshold`, `config.vio.warm_start_max_frames`) with enum validation at load time. 
+- The contract file at `_docs/02_document/contracts/c1_vio/vio_strategy_protocol.md` filled per `decompose/templates/api-contract.md` with Shape, Invariants, Non-Goals, Versioning Rules, and at least three Test Cases (already drafted as part of this task). +- Type-only unit tests that verify each strategy module's class actually conforms to the Protocol via `runtime_checkable` + `isinstance` (catches drift at CI time, not deployment). + +### Excluded + +- `Okvis2Strategy` implementation — separate task in this epic. +- `VinsMonoStrategy` implementation — separate task in this epic. +- `KltRansacStrategy` implementation — separate task in this epic. +- Warm-start hint persistence (write to disk after takeoff, read after F8 reboot) — separate task in this epic; this task only defines the in-memory `WarmStartPose` DTO and the `reset_to_warm_start` Protocol method. +- IMU preintegration — owned by AZ-276 (`helpers.imu_preintegrator`); strategies feed `ImuWindow` to the helper, they do not implement preintegration here. +- C5 fusion wiring — owned by E-C5 (AZ-260). +- C13 FDR consumer wiring of `VioHealth` — owned by E-C13 (AZ-248). + +## Acceptance Criteria + +**AC-1: Protocol is conformance-checkable** +Given a class that implements all four Protocol methods +When `isinstance(impl, VioStrategy)` is evaluated under `runtime_checkable` +Then the result is `True`; for a class that omits any method, the result is `False`. A `runtime_checkable` `isinstance` check verifies only that the methods exist, so a wrong signature is not detected by this check and needs a separate one (for example an `inspect.signature` comparison or static type checking) + +**AC-2: Frozen DTOs reject mutation** +Given a constructed `VioOutput(...)`, `VioHealth(...)`, `WarmStartPose(...)`, or `FeatureQuality(...)` instance +When the test attempts to reassign any field +Then `dataclasses.FrozenInstanceError` is raised; the original value is preserved + +**AC-3: Error hierarchy catchable as a single family** +Given any of the three documented error subtypes +When the consumer wraps a strategy call in `try: ...
except c1_vio.errors.VioError` +Then every documented subtype is caught; an unrelated `Exception` (e.g., `ValueError`) is NOT caught (the Protocol's error envelope does not leak into general exception handling) + +**AC-4: Composition-root factory honours config** +Given `config.vio.strategy = "okvis2"` and `BUILD_OKVIS2=ON` +When `build_vio_strategy(config, fdr_client=fake_client)` is called +Then an `Okvis2Strategy` instance is returned and `instance.current_strategy_label() == "okvis2"` + +**AC-5: Composition-root factory honours BUILD flag gate** +Given `config.vio.strategy = "vins_mono"` and `BUILD_VINS_MONO=OFF` +When `build_vio_strategy(config, fdr_client=fake_client)` is called +Then `StrategyNotAvailableError` is raised at composition time with a message naming `"vins_mono"` AND the missing `BUILD_VINS_MONO` flag; no module-level import of `vins_mono` symbols has occurred (verifiable via `sys.modules` having no `gps_denied_onboard.components.c1_vio.vins_mono` entry) + +**AC-6: Unknown strategy label rejected at config load** +Given `config.vio.strategy = "openvslam"` (not in the enum) +When the config is loaded via AZ-269's loader +Then `ConfigSchemaError` is raised at load time with a message listing the valid values (`okvis2 | vins_mono | klt_ransac`); `build_vio_strategy` is never reached + +**AC-7: `current_strategy_label()` matches config value exactly** +Given any selectable strategy +When `instance.current_strategy_label()` is called +Then the returned string is one of `"okvis2"`, `"vins_mono"`, `"klt_ransac"` and equals `config.vio.strategy`; AC-NEW-3 audit relies on this exact-match property + +**AC-8: Contract file matches Protocol shape** +Given the contract file at `_docs/02_document/contracts/c1_vio/vio_strategy_protocol.md` +When a contract-test parses the Shape section's method/field tables and compares against the runtime Protocol via introspection +Then every method, every field, every error type is present and consistent in both + +**AC-9: 
`VioOutput.frame_id` echo invariant typed** +Given the DTO definitions +When inspecting `VioOutput.frame_id`'s type and the Protocol's docstring +Then `frame_id` is typed `str` and the docstring states "MUST equal `NavCameraFrame.frame_id` from the input frame"; AC-9 is enforced behaviourally by the strategy implementations and contract-tested in Step 9 + +## Non-Functional Requirements + +**Compatibility** +- The Protocol is `typing.Protocol` (PEP 544 structural typing) so existing components that import a concrete VIO class today (none yet — this is greenfield) can be retrofitted without inheritance changes. +- All error types subclass `Exception` (not `BaseException`) so `except Exception:` in upstream layers continues to work as expected. + +**Performance** +- The factory `build_vio_strategy` returns within 200 ms (it imports + constructs one strategy; native-library bring-up time inside OKVIS2 / VINS-Mono is amortised against this; the 200 ms is a sanity bound for the factory dispatch logic itself, not for OKVIS2 backend init). +- DTO construction (`VioOutput`, `VioHealth`, `WarmStartPose`) is dataclass-frozen; per-instance overhead is the bare-cost dataclass `__init__`. + +**Reliability** +- The Protocol is the boundary of acceptable runtime errors. Strategies MUST NOT raise other types into consumers; if a third-party library (OpenCV, OKVIS2, VINS-Mono, GTSAM) raises something else, the strategy catches and rewraps into `VioError` family. +- Versioning: any breaking change to the Protocol or its DTOs MUST bump the contract file's `Version` and notify every consumer task listed in the contract header. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | `runtime_checkable` Protocol vs. a fully-implementing fake; vs. 
a fake missing one method | `isinstance` returns True for full, False for partial | +| AC-2 | Mutation attempt on each frozen DTO | `FrozenInstanceError` raised; original value preserved | +| AC-3 | Raise each of the three error subtypes; catch as `c1_vio.errors.VioError` | All caught; an unrelated `ValueError` is NOT caught by the same handler | +| AC-4 | `build_vio_strategy` with `okvis2` + flag ON → fake `Okvis2Strategy` | Returned instance is `Okvis2Strategy`; `current_strategy_label()` == `"okvis2"` | +| AC-5 | `build_vio_strategy` with `vins_mono` + flag OFF | `StrategyNotAvailableError`; `sys.modules` does NOT contain `c1_vio.vins_mono` | +| AC-6 | Config load with invalid `vio.strategy` value | `ConfigSchemaError`; valid values listed in message | +| AC-7 | `current_strategy_label()` for each strategy | Matches the config value used to construct it | +| AC-8 | Contract introspection vs. Protocol introspection | Shape parity test passes | +| AC-9 | Inspect `VioOutput.frame_id` type and docstring | `str` type; echo-invariant noted in docstring | +| NFR-perf-factory | Microbench `build_vio_strategy` × 1000 (with each strategy mocked) | p99 ≤ 200 ms (dominated by lazy import on first call; subsequent calls << 1 ms) | +| NFR-reliability-error-family | All three subtypes inherit from `c1_vio.errors.VioError` | Verified via `issubclass` for each | + +## Constraints + +- The Protocol uses `typing.Protocol` from stdlib; no third-party Protocol library is introduced. +- DTO dataclasses use stdlib `dataclasses` with `frozen=True`; no `pydantic` or `attrs` dependency. +- Lazy import of concrete strategies is mandatory. The factory's `if BUILD_OKVIS2: from c1_vio.okvis2 import Okvis2Strategy` block is not optional — it is the mechanism by which Tier-0 workstation builds compose without OKVIS2 native libs installed. +- The contract file at `_docs/02_document/contracts/c1_vio/vio_strategy_protocol.md` is the source of truth. 
If the Protocol shape changes here without the contract updating, that is a Spec-Gap finding (High) per code-review skill Phase 2. +- This task does NOT add new third-party dependencies — `typing.Protocol`, `dataclasses`, `enum` are stdlib. +- `VioOutput` and `VioHealth` MUST be placed in `src/gps_denied_onboard/_types/nav.py` per `module-layout.md` (cross-component DTOs live under `_types`); placing them inside `components/c1_vio/` would force C5 and C13 to import a Layer-3 module from another Layer-3 module, violating the layering table. + +## Risks & Mitigation + +**Risk 1: Protocol drift between contract and code** +- *Risk*: Strategy implementations diverge from the contract over time; consumers cannot tell which is canonical. +- *Mitigation*: AC-8 contract-introspection test runs in CI; any drift fails the test before merge. The contract file's `## Test Cases` section names this exact test. + +**Risk 2: Lazy-import gating is bypassed by a transitively-imported module** +- *Risk*: A consumer imports `c1_vio` (the package) and the package's `__init__.py` eagerly imports a concrete strategy, triggering the OKVIS2 native-lib import even when `BUILD_OKVIS2=OFF`. +- *Mitigation*: The package `__init__.py` re-exports ONLY the Protocol and DTOs and errors — it does NOT import any concrete strategy. AC-5 verifies via `sys.modules` that no strategy module is loaded during a Tier-0 factory call. + +**Risk 3: `VioDegradedError` is misused as a hard error** +- *Risk*: A strategy implementation interprets degraded operation as a raise condition and stops emitting `VioOutput`; C5 fusion thinks VIO is dead and falls back to FC IMU prematurely, missing AC-1.3 drift bound during recoverable degradations. +- *Mitigation*: Contract file Invariants section explicitly states degraded operation returns `VioOutput` with inflated covariance + `VioHealth.state = DEGRADED`; the error type exists only for the rare degraded-to-fatal transition. 
Strategy task specs (separate tasks in this epic) cite this in their Constraints. + +**Risk 4: Error hierarchy widens silently** +- *Risk*: A future strategy adds a fourth error type without updating the contract or the family base class. +- *Mitigation*: The contract file lists the canonical three. Implementations MUST raise only members of `c1_vio.errors.VioError`; a strategy raising a non-family error is a Spec-Gap finding (High) at code-review time. AC-3's catch-as-family test catches the obvious case. + +## Runtime Completeness + +- **Named capability**: typed Protocol + DTOs + error envelope + composition-root selection (architecture / E-C1 / ADR-001 + ADR-002 + ADR-009). +- **Production code that must exist**: real Protocol declaration, real frozen DTOs, real error hierarchy, real composition-root factory with lazy-import gating, real config-loader extension for the strategy enum + thresholds. +- **Allowed external stubs**: tests MAY substitute fake strategy classes that conform to the Protocol; production wiring uses the real strategies from this epic's other tasks. +- **Unacceptable substitutes**: ABCs instead of `typing.Protocol` (would force inheritance changes downstream), `pydantic.BaseModel` instead of `@dataclass(frozen=True)` (would add a runtime validation layer this task does not need), eager imports of concrete strategies in `__init__.py` (would defeat `BUILD_*` gating), or a `strategy: str` config field without an enum (would lose the load-time validation in AC-6). + +## Contract + +This task produces/implements the contract at `_docs/02_document/contracts/c1_vio/vio_strategy_protocol.md`. +Consumers MUST read that file — not this task spec — to discover the interface. 
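
Illustrative sketch (non-normative; the contract file remains the source of truth). The Constraints above require a stdlib `typing.Protocol`, frozen stdlib dataclasses, a single error family, and lazy import of concrete strategies. The block below shows one way these pieces could fit together. Assumptions: the `build_flags` mapping, the `_STRATEGIES` table, the reduced `VioOutput` field set, and the decision to keep `StrategyNotAvailableError` outside the `VioError` family are hypothetical and are not taken from the contract.

```python
import importlib
from dataclasses import dataclass
from typing import Any, Mapping, Protocol, runtime_checkable


class VioError(Exception):
    """Family base for the three canonical VIO error subtypes."""


class VioInitializingError(VioError):
    pass


class VioDegradedError(VioError):
    pass


class VioFatalError(VioError):
    pass


class StrategyNotAvailableError(Exception):
    # Assumption: kept outside the VioError family; the contract file decides.
    def __init__(self, strategy: str, missing_flag: str) -> None:
        super().__init__(f"{strategy!r} requires {missing_flag}=ON")
        self.strategy = strategy
        self.missing_flag = missing_flag


@dataclass(frozen=True)
class VioOutput:
    frame_id: str  # echoed unchanged from the input frame
    # Remaining fields (pose, covariance, bias, ...) are omitted in this sketch.


@runtime_checkable
class VioStrategy(Protocol):
    def process_frame(self, frame: Any, imu: Any, calibration: Any) -> VioOutput: ...
    def reset_to_warm_start(self, hint: Any) -> None: ...
    def health_snapshot(self) -> Any: ...
    def current_strategy_label(self) -> str: ...


# label -> (build flag, module imported only on demand, class name)
_STRATEGIES = {
    "okvis2": ("BUILD_OKVIS2", "c1_vio.okvis2", "Okvis2Strategy"),
    "vins_mono": ("BUILD_VINS_MONO", "c1_vio.vins_mono", "VinsMonoStrategy"),
}


def build_vio_strategy(label: str, build_flags: Mapping[str, bool], **deps: Any) -> VioStrategy:
    flag, module_name, class_name = _STRATEGIES[label]
    if not build_flags.get(flag, False):
        # Fail before any import so the native library is never touched.
        raise StrategyNotAvailableError(label, missing_flag=flag)
    module = importlib.import_module(module_name)  # lazy import of the concrete strategy
    return getattr(module, class_name)(**deps)
```

Because the flag check runs before `importlib.import_module`, a Tier-0 call with the flag OFF leaves `sys.modules` untouched, which is the property AC-5 inspects.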
diff --git a/_docs/02_tasks/todo/AZ-332_c1_okvis2_strategy.md b/_docs/02_tasks/todo/AZ-332_c1_okvis2_strategy.md new file mode 100644 index 0000000..d68a075 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-332_c1_okvis2_strategy.md @@ -0,0 +1,202 @@ +# C1 OKVIS2 Strategy — Production-Default VIO + +**Task**: AZ-332_c1_okvis2_strategy +**Name**: C1 OKVIS2 Strategy +**Description**: Implement `Okvis2Strategy`, the production-default `VioStrategy` for E-C1. The class is a Python facade over the OKVIS2 C++ tightly-coupled keyframe-based VIO core (sliding window of K=10–20 keyframes per D-C5-3) accessed via a pybind11 wrapper around `cpp/okvis2/`. The strategy owns the per-flight OKVIS2 estimator instance, feeds it nav-camera frames + IMU samples (via the AZ-276 `ImuPreintegrator` helper for the GTSAM `CombinedImuFactor` substrate that C5 also reads), and emits `VioOutput` with honest 6×6 covariance per AC-1.4 and per-frame `VioHealth`. Per `_docs/02_document/components/01_c1_vio/description.md` § 5: per-frame cost is dominated by feature extraction + matching, sliding-window optimisation is `O(F·log K)`; per-frame p95 latency must stay ≤ 80 ms on Tier-2 with C2 backbone running concurrently (C1-PT-01). Build-time gated by `BUILD_OKVIS2`. +**Complexity**: 5 points +**Dependencies**: AZ-331_c1_vio_strategy_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-276_imu_preintegrator, AZ-277_se3_utils, AZ-272_fdr_record_schema, AZ-273_fdr_client_ringbuf +**Component**: c1_vio (epic AZ-254 / E-C1) +**Tracker**: AZ-332 +**Epic**: AZ-254 (E-C1) + +### Document Dependencies + +- `_docs/02_document/contracts/c1_vio/vio_strategy_protocol.md` — the Protocol this task implements; produced by AZ-331. +- `_docs/02_document/contracts/shared_helpers/imu_preintegrator.md` — IMU substrate (AZ-276); consumer of the GTSAM `CombinedImuFactor` per-keyframe. +- `_docs/02_document/contracts/shared_helpers/se3_utils.md` — SE(3) ↔ pose-matrix conversion utilities (AZ-277). 
+- `_docs/02_document/components/01_c1_vio/description.md` — § 5 implementation details + § 6 helpers + § 7 caveats (Okvis2 latency spike behaviour under thermal throttle). + +## Problem + +Without a production-default `Okvis2Strategy`: + +- The default airborne binary cannot operate — only the KLT/RANSAC simple-baseline (mandatory engine-rule path) would be available, and C1-PT-01 / AC-2.2 frame-to-frame MRE bounds were specified against OKVIS2. +- The honest 6×6 covariance contract (AC-1.4 / AC-NEW-4) loses its production producer; KLT/RANSAC's covariance is a documented degraded fallback, not the primary signal C5's iSAM2 graph fuses. +- D-CROSS-LATENCY-1's hybrid covariance auto-degrade decision in C4 has no `VioHealth` source-of-truth at production-quality numbers. +- The architecture's "tightly-coupled VIO with sliding-window optimisation" claim becomes documentation-only. +- Mode-B FT-P-04 / FT-P-05 suite-level scenarios cannot run against the production stack; FT-P-04 expects ≥ 95 % tracked-frame ratio on the Derkachi normal segment. + +This task delivers the canonical production VIO. The other two strategies (VINS-Mono research-only, KLT/RANSAC simple-baseline) are separate tasks; the contract task (AZ-331) defines the boundary all three share. + +## Outcome + +- An `Okvis2Strategy` class at `src/gps_denied_onboard/components/c1_vio/okvis2.py` conforming to the `VioStrategy` Protocol from AZ-331; `current_strategy_label() == "okvis2"`. +- A pybind11 wrapper at `src/gps_denied_onboard/components/c1_vio/_native/okvis2_binding.cpp` exposing the OKVIS2 C++ estimator (`okvis::ThreadedKFVio` or equivalent in the pinned upstream HEAD) to Python. The wrapper is built by CMake under `cpp/okvis2/` (build-time gated by `BUILD_OKVIS2`); the resulting `.so` is imported lazily inside `okvis2.py`. 
+- Constructor `__init__(self, *, calibration: CameraCalibration, preintegrator: ImuPreintegrator, fdr_client: FdrClient, logger: Logger, config: Okvis2Config)` — all dependencies constructor-injected per ADR-009. `Okvis2Config` (`@dataclass(frozen=True)`) carries the OKVIS2-specific knobs (sliding-window size K ∈ [10, 20], keyframe-decision parallax threshold, RANSAC inlier ratio, max optimisation iterations) loaded from `config.vio.okvis2.*` via AZ-269. +- `process_frame(frame, imu, calibration) -> VioOutput`: + 1. Append IMU samples to the injected `ImuPreintegrator` (strict-monotonic guarded; `ImuPreintegrationError` rewraps to `VioFatalError`). + 2. Feed the nav-camera frame to OKVIS2 via the pybind11 `add_frame` wrapper. + 3. If OKVIS2 emits a new estimator update, extract the relative pose (SE(3) via `helpers.se3_utils`), the 6×6 covariance from OKVIS2's internal Hessian (or marginalised block per upstream API), the latest IMU bias, and the feature-quality summary (tracked / new / lost / mean parallax / per-frame MRE). + 4. Build and return `VioOutput` with `frame_id` echoed. + 5. Emit per-frame DEBUG log (off by default) with backbone identity + elapsed milliseconds; emit WARN log when degraded covariance is detected (per `health_snapshot` heuristic); emit ERROR log on `VioFatalError`. +- `reset_to_warm_start(hint)`: tears down the current OKVIS2 estimator instance (releases C++ resources), constructs a fresh estimator, seeds the IMU bias from `hint.bias`, seeds the initial body-to-world pose from `hint.body_T_world`, and seeds the velocity from `hint.velocity_b`. The next `config.vio.warm_start_max_frames` frames are allowed to converge before the strategy reports `state == TRACKING` (AC-5.1). Calling `reset_to_warm_start` is idempotent across consecutive calls (the second call re-resets cleanly). 
+- `health_snapshot()` returns `VioHealth(state, consecutive_lost, bias_norm)` derived from OKVIS2's internal tracker state: `INIT` until enough keyframes are accumulated, `TRACKING` while the optimisation converges, `DEGRADED` when feature count drops below `config.vio.okvis2.degraded_feature_threshold` or covariance Frobenius norm exceeds 2× steady-state, `LOST` after `config.vio.lost_frame_threshold` consecutive frames without a successful update. +- The honest-covariance invariant (Protocol Invariant) is enforced behaviourally: the strategy MUST NOT shrink the reported covariance during a `DEGRADED` window (the OKVIS2 estimator's covariance is read directly; no smoothing or floor is applied that would mask degradation). +- Error envelope is closed: every OKVIS2 / pybind11 / Eigen exception is caught inside `process_frame` / `reset_to_warm_start` and rewrapped into the `VioError` family (`VioInitializingError` while INIT, `VioFatalError` on backend-init failure or sustained LOST). +- All FDR records emitted via the injected `FdrClient` use the `kind="vio.health"` schema from AZ-272; per-frame DEBUG goes to stdout/journald only (per description.md § 9 logging strategy). + +## Scope + +### Included + +- `Okvis2Strategy` class implementation + the `Okvis2Config` dataclass + the `_native/okvis2_binding.cpp` pybind11 wrapper. +- CMake target under `cpp/okvis2/` that links the OKVIS2 upstream pin (BSD-3-Clause) and produces the binding `.so`. Build flag `BUILD_OKVIS2`. +- The full `process_frame` / `reset_to_warm_start` / `health_snapshot` / `current_strategy_label` surface conforming to AZ-331's Protocol. +- IMU substrate via the constructor-injected `ImuPreintegrator` (AZ-276); this strategy never imports GTSAM directly. +- Honest-covariance reading from OKVIS2's internal estimator state (no client-side smoothing). +- Lazy import of the `_native` binding inside `okvis2.py` so a Tier-0 build with `BUILD_OKVIS2=OFF` does not force the OKVIS2 native lib to be present. 
+- Per-frame DEBUG log gated by `config.vio.per_frame_debug_log` (default OFF). +- WARN / ERROR / INFO logging per description.md § 9. +- Health-state transitions emitted as FDR records via the `kind="vio.health"` schema. +- Composition-root wiring (entry to the AZ-331 `build_vio_strategy` factory's `okvis2` branch). +- Standalone microbench script `python -m gps_denied_onboard.components.c1_vio.bench.okvis2 ` for C1-PT-01 latency measurements (referenced by Step 9 / E-BBT perf tests, not implemented as the test itself here — only the benchable surface). + +### Excluded + +- VINS-Mono strategy — separate task in this epic. +- KLT/RANSAC simple-baseline strategy — separate task in this epic. +- Warm-start hint persistence (write at takeoff, read at F8 reboot) — separate task in this epic; this strategy only consumes a constructed `WarmStartPose`. +- C5 fusion of `VioOutput` — owned by E-C5 (AZ-260). +- C13 FDR writer-thread / segment rotation — owned by E-C13 (AZ-248); this strategy only emits via the producer-side `FdrClient`. +- IMU preintegration mathematics — owned by AZ-276. +- The C1-IT-01..06 / C1-PT-01 tests themselves — deferred to Step 9 (E-BBT) per greenfield flow Step 6 rule. +- Honest-covariance contract test that sweeps all three strategies — that's a Step 9 / E-BBT cross-strategy test (epic child issue #7), not part of this single-strategy task. +- OKVIS2 upstream-source modifications — upstream HEAD is pinned per Plan-phase; deviations require an explicit ADR. +- Multi-camera OKVIS2 — out of scope (single nav-camera per RESTRICT-UAV-3). 
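
Illustrative sketch (non-normative). The `health_snapshot()` rules in the Outcome section, together with the "one FDR record per transition" rule of AC-10, amount to a small state machine. The block below is a minimal sketch under stated assumptions: `HealthTracker`, the `emit` callback, the default `degraded_feature_threshold` value, and the externally supplied `steady_state_cov_norm` are hypothetical; only the threshold names and the defaults 5 and 9 come from this spec. It also treats `LOST` as terminal until a reset, which the spec does not state explicitly.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class VioState(Enum):
    INIT = "INIT"
    TRACKING = "TRACKING"
    DEGRADED = "DEGRADED"
    LOST = "LOST"


@dataclass(frozen=True)
class HealthThresholds:
    warm_start_max_frames: int = 5        # config.vio.warm_start_max_frames
    lost_frame_threshold: int = 9         # config.vio.lost_frame_threshold
    degraded_feature_threshold: int = 40  # hypothetical default value
    cov_ratio_limit: float = 2.0          # "2x steady-state" rule


class HealthTracker:
    def __init__(self, thresholds: HealthThresholds, steady_state_cov_norm: float,
                 emit: Callable[[Dict[str, str]], None]) -> None:
        self._t = thresholds
        self._steady = steady_state_cov_norm
        self._emit = emit
        self.state = VioState.INIT
        self.consecutive_lost = 0
        self._good_frames = 0

    def update(self, *, updated: bool, feature_count: int, cov_norm: float) -> VioState:
        if self.state is VioState.LOST:
            return self.state  # assumption: terminal until reset_to_warm_start
        if updated:
            self.consecutive_lost = 0
            self._good_frames += 1
        else:
            self.consecutive_lost += 1

        if self.consecutive_lost >= self._t.lost_frame_threshold:
            new = VioState.LOST
        elif self._good_frames < self._t.warm_start_max_frames:
            new = VioState.INIT
        elif (not updated
              or feature_count < self._t.degraded_feature_threshold
              or cov_norm > self._t.cov_ratio_limit * self._steady):
            new = VioState.DEGRADED
        else:
            new = VioState.TRACKING

        if new is not self.state:
            self.state = new
            # Emit only on transitions, never on steady-state frames.
            self._emit({"kind": "vio.health", "state": new.value})
        return new
```

A real strategy would derive `updated`, `feature_count`, and `cov_norm` from the native estimator and route `emit` to the injected `FdrClient`.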
+ +## Acceptance Criteria + +**AC-1: `current_strategy_label()` returns `"okvis2"`** +Given an `Okvis2Strategy` constructed via the AZ-331 factory with `config.vio.strategy = "okvis2"` +When `current_strategy_label()` is called +Then the returned string is exactly `"okvis2"` + +**AC-2: `process_frame` returns `VioOutput` with `frame_id` echoed** +Given a `NavCameraFrame` with `frame_id = "uuid-abc"` and a populated `ImuWindow` +When `process_frame(frame, imu, calibration)` is called and reaches a successful estimator update +Then the returned `VioOutput.frame_id == "uuid-abc"`; `pose_covariance_6x6` is symmetric and positive-definite; `imu_bias` is non-`None` + +**AC-3: `process_frame` rewraps every backend exception into `VioError`** +Given a malformed input that triggers an OKVIS2 / pybind11 / Eigen exception inside the backend +When `process_frame` is called +Then the raised exception is one of `VioInitializingError` / `VioDegradedError` / `VioFatalError`; the original exception is chained via `raise ... 
from`; no raw `RuntimeError` / `ValueError` from the backend leaks to the caller + +**AC-4: `reset_to_warm_start` clears state and seeds the hint** +Given a strategy with N processed frames and a non-default IMU bias +When `reset_to_warm_start(hint)` is called with a known `hint.bias` and `hint.body_T_world` +Then the next `process_frame` call's `VioOutput.imu_bias` reflects `hint.bias` (within numerical tolerance) and the resulting `relative_pose_T` is consistent with starting from `hint.body_T_world`; calling `reset_to_warm_start` a second time without intervening frames does not raise + +**AC-5: `health_snapshot()` reports `INIT` initially and `TRACKING` after warm-up** +Given a freshly-constructed strategy +When `health_snapshot()` is called before any `process_frame` invocation +Then `state == INIT`; after `config.vio.warm_start_max_frames` (default 5) successful `process_frame` calls on a normal-segment fixture, the next `health_snapshot()` returns `state == TRACKING` + +**AC-6: `health_snapshot()` reports `DEGRADED` on feature loss** +Given a strategy in TRACKING state +When `process_frame` is fed a frame with feature count below `config.vio.okvis2.degraded_feature_threshold` +Then the returned `VioOutput.pose_covariance_6x6` Frobenius norm is strictly greater than the prior frame's; the next `health_snapshot()` returns `state == DEGRADED`; the strategy MUST emit a `VioOutput` (not raise) so C5 can down-weight rather than fall back + +**AC-7: Sustained loss raises `VioFatalError`** +Given a strategy in DEGRADED state +When `config.vio.lost_frame_threshold` (default 9) consecutive frames fail to update the estimator +Then the next `process_frame` call raises `VioFatalError`; subsequent `health_snapshot()` returns `state == LOST`; the AC-5.2 fallback path (FC IMU-only after 3 s) is the consumer's responsibility + +**AC-8: `BUILD_OKVIS2=OFF` does not import OKVIS2 native libs** +Given the binary is built with `BUILD_OKVIS2=OFF` +When 
`gps_denied_onboard.components.c1_vio` is imported (NOT the `okvis2` submodule directly) +Then `sys.modules` does NOT contain `gps_denied_onboard.components.c1_vio.okvis2` or any `_native.okvis2_binding` entry; AZ-331's factory raises `StrategyNotAvailableError("okvis2", missing_flag="BUILD_OKVIS2")` if `okvis2` is requested + +**AC-9: Honest covariance — no shrinkage during DEGRADED** +Given a controlled-degradation 60 s synthetic input (same source as the deferred C1-IT-01 test fixture) +When `process_frame` runs through the degradation event +Then `||pose_covariance_6x6||_F` is monotonically non-decreasing from the moment `health_snapshot().state` first transitions to `DEGRADED` until either `TRACKING` is restored or `LOST` is reached; this is enforced by reading OKVIS2's internal covariance directly without any client-side floor or smoother + +**AC-10: FDR `vio.health` records emitted on every state transition** +Given the strategy is configured with a real `FdrClient` (or test double) +When `health_snapshot().state` transitions (`INIT → TRACKING`, `TRACKING → DEGRADED`, `DEGRADED → LOST`, etc.) +Then exactly one FDR record with `kind="vio.health"` and the new state is emitted via the `FdrClient.emit` API; no records are emitted on steady-state frames + +## Non-Functional Requirements + +**Performance** +- `process_frame` p95 ≤ 80 ms on Tier-2 with C2 backbone running concurrently (C1-PT-01 / NFT-PERF-01 component partition); failure threshold 120 ms. +- `process_frame` p50 ≤ 25 ms on Tier-2 (description.md C1-PT-01). +- Throughput ≥ 3 Hz sustained; failure threshold < 2.5 Hz. +- CPU ≤ 30 % of one core; memory ≤ 1.5 GB resident (description.md § 6 + epic NFR). + +**Compatibility** +- OKVIS2 upstream HEAD pinned per Plan-phase. No upstream-source modifications. +- pybind11 version matches the OKVIS2 / VINS-Mono / GTSAM build (description.md § 5 dependency table). +- Eigen version matches OKVIS2 / GTSAM pin. 
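
Illustrative sketch (non-normative). The Performance thresholds above (p95 ≤ 80 ms, p50 ≤ 25 ms, throughput ≥ 3 Hz) can be checked from a list of per-frame latencies. The block below is a minimal sketch, assuming the nearest-rank percentile definition; this spec does not say which percentile method the microbench uses, and `summarise` and its result keys are hypothetical names.

```python
import math
from typing import Dict, Sequence

P95_TARGET_MS = 80.0
P50_TARGET_MS = 25.0
MIN_THROUGHPUT_HZ = 3.0


def nearest_rank_percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarise(latencies_ms: Sequence[float], wall_clock_s: float) -> Dict[str, object]:
    p50 = nearest_rank_percentile(latencies_ms, 50)
    p95 = nearest_rank_percentile(latencies_ms, 95)
    throughput_hz = len(latencies_ms) / wall_clock_s
    return {
        "p50_ms": p50,
        "p95_ms": p95,
        "throughput_hz": throughput_hz,
        "passed": (p95 <= P95_TARGET_MS
                   and p50 <= P50_TARGET_MS
                   and throughput_hz >= MIN_THROUGHPUT_HZ),
    }
```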
+ +**Reliability** +- The error envelope is closed at the `VioError` family. No raw OKVIS2 / pybind11 / Eigen exceptions cross the Python boundary. +- `process_frame` is idempotent w.r.t. state when it raises: a raised exception leaves the estimator in a recoverable state; the next valid frame integrates as if the bad one never came. +- The strategy is single-threaded by Protocol contract; the composition root binds one instance to the camera ingest thread. + +**Concurrency** +- One `Okvis2Strategy` instance per camera ingest thread; concurrent calls to `process_frame` on the same instance are undefined behaviour (matches Protocol invariant). +- The injected `ImuPreintegrator` is also single-threaded; the same composition-root binding rule applies. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | `current_strategy_label()` after factory build with `okvis2` config | Returns `"okvis2"` | +| AC-2 | `process_frame` with a fixture frame + IMU window | `VioOutput.frame_id` echoed; covariance SPD; `imu_bias` non-None | +| AC-3 | Inject a malformed frame that triggers a backend exception (mocked binding) | `VioError`-family exception raised; original chained via `__cause__` | +| AC-4 | `reset_to_warm_start` then `process_frame` × N | Bias reflects hint; second `reset_to_warm_start` does not raise | +| AC-5 | Cold construct → `health_snapshot` × N | `INIT` initially; `TRACKING` after `warm_start_max_frames` | +| AC-6 | Feed degraded fixture | Covariance Frobenius norm strictly increases; `health_snapshot` returns `DEGRADED`; `VioOutput` IS emitted (not raised) | +| AC-7 | Fed `lost_frame_threshold` consecutive failed frames | `VioFatalError` on the next `process_frame`; `health_snapshot` returns `LOST` | +| AC-8 | `BUILD_OKVIS2=OFF` import + factory call | Module not in `sys.modules`; factory raises `StrategyNotAvailableError` | +| AC-9 | 60 s controlled-degradation synthetic | Covariance Frobenius norm 
monotonically non-decreasing during DEGRADED window | +| AC-10 | Real / fake `FdrClient` spy through state transitions | Exactly one `vio.health` record per transition; no spam on steady-state | +| NFR-perf | C1-PT-01 microbench against the Derkachi normal segment fixture (Tier-2) | p95 ≤ 80 ms; p50 ≤ 25 ms; throughput ≥ 3 Hz | +| NFR-reliability-error-envelope | Raise each backend exception type via mock binding; assert no leakage | All caught and rewrapped to `VioError` family | + +## Constraints + +- This task implements (does NOT define) the AZ-331 Protocol; any signature mismatch is a Spec-Gap finding (High) per code-review skill Phase 2. +- The pybind11 binding lives under `_native/` per `module-layout.md`; the `.so` import path is CMake-known and lazy-imported inside `okvis2.py`. +- OKVIS2 native source lives under `cpp/okvis2/` (parallel to `src/`, NOT nested inside the Python package), per `module-layout.md` rule #4. +- The strategy MUST consume IMU via the AZ-276 `ImuPreintegrator` helper; constructing a second IMU integration path is forbidden (defeats the "single source of IMU truth" invariant). +- This task introduces no new third-party dependencies beyond OKVIS2 + pybind11 + Eigen (already pinned). +- Per-frame DEBUG logging defaults OFF (would flood at 3 Hz); enabled only via `config.vio.per_frame_debug_log`. +- The strategy MUST NOT apply a covariance floor or smoother on the read path — honest covariance is the safety floor for AC-NEW-4; smoothing is a Risks-and-Mitigation discussion only. +- The `Okvis2Config` schema extension to AZ-269 is owned by this task; the field set is documented above. + +## Risks & Mitigation + +**Risk 1: OKVIS2 latency spike on thermally-throttled Jetson breaks AC-4.1** +- *Risk*: description.md § 7 notes OKVIS2's sliding-window optimisation can spike to 80–120 ms on a thermally-throttled Jetson; the C1-PT-01 p95 ≤ 80 ms budget is the wire boundary. 
+- *Mitigation*: D-CROSS-LATENCY-1 hybrid auto-degrades **C4** covariance recovery (not C1) under thermal throttle, freeing budget. This task does NOT implement thermal-aware behaviour — it just measures and reports latency; the C4 task owns the degradation decision. AC-9 covers the honest-covariance side; AC-NFR-perf measures the latency. + +**Risk 2: pybind11 type marshalling overhead dominates the per-frame budget** +- *Risk*: Marshalling a 5472×3648×3 uint8 frame across the Python ↔ C++ boundary on every `process_frame` could add 10s of ms. +- *Mitigation*: The pybind11 binding accepts the frame as a `numpy.ndarray` with `py::array::c_style | py::array::forcecast` so the data buffer is shared (zero-copy on `c_style`-aligned input). The composition root binds the camera ingest path to emit `c_style` buffers (handled in `frame_source/LiveCameraFrameSource`, AZ-265 cycle-1 deliverable). If the zero-copy path is broken, AC-NFR-perf microbench shows it immediately. + +**Risk 3: OKVIS2 internal covariance is reported in a frame-convention C5 does not expect** +- *Risk*: OKVIS2 reports covariance in its own body-frame; C5 expects body-to-world. A frame-convention bug would silently produce wrong covariance to iSAM2. +- *Mitigation*: The strategy uses `helpers.se3_utils` (AZ-277) to convert OKVIS2's frame to the canonical body-to-world convention; the conversion is unit-tested at the helper level and asserted by AC-2 (covariance SPD) + the deferred C1-IT-02 (cross-strategy invariants test). + +**Risk 4: OKVIS2 BSD-3-Clause license attribution missed** +- *Risk*: Failing to include OKVIS2's license notice in the airborne binary's NOTICE file violates BSD-3-Clause. +- *Mitigation*: The CMake target under `cpp/okvis2/` includes the upstream LICENSE file in the build artifact's NOTICE bundle; CI's SBOM step (existing infra) verifies presence. Tracked in the project NOTICE generation pipeline (out of scope here). 
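
Illustrative sketch (non-normative). AC-3 and the Reliability NFR require that every backend exception is rewrapped into the `VioError` family with the original chained via `raise ... from`. The block below is a minimal sketch of that pattern. Assumptions: `call_backend` and its `initializing` argument are hypothetical, and mapping every non-INIT backend failure to `VioFatalError` is a simplification of the rule stated in the Outcome section.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class VioError(Exception):
    """Family base; callers catch this and nothing narrower from the backend."""


class VioInitializingError(VioError):
    pass


class VioFatalError(VioError):
    pass


def call_backend(backend_call: Callable[[], T], *, initializing: bool) -> T:
    """Run one native-binding call and close the error envelope around it."""
    try:
        return backend_call()
    except VioError:
        raise  # already inside the family; do not double-wrap
    except Exception as exc:
        family = VioInitializingError if initializing else VioFatalError
        # `from exc` keeps the backend exception reachable via __cause__.
        raise family(f"backend failure: {exc!r}") from exc
```

A test double for the pybind11 binding can raise arbitrary exception types through `backend_call` to assert that nothing outside the family leaks.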
+ +## Runtime Completeness + +- **Named capability**: OKVIS2 tightly-coupled keyframe-based VIO + sliding-window optimisation + honest 6×6 covariance via OKVIS2's internal Hessian (architecture / E-C1 / `solution.md` "Strategy: Okvis2 production-default" / D-C5-3). +- **Production code that must exist**: real `Okvis2Strategy` class implementing the AZ-331 Protocol; real pybind11 binding to `cpp/okvis2/` (real OKVIS2 upstream, not a mock); real per-frame OKVIS2 estimator update; real covariance read from OKVIS2's internal Hessian; real bias propagation through the AZ-276 `ImuPreintegrator`. +- **Allowed external stubs**: tests MAY use a fake pybind11 binding that returns scripted `VioOutput` payloads (AC-3 / AC-6 / AC-7 use this for backend-exception injection); production wiring uses the real OKVIS2 upstream pinned by Plan-phase. +- **Unacceptable substitutes**: a Python-level "OKVIS2" wrapper that re-implements the optimisation loop in pure Python (would defeat C1-PT-01 ≤ 80 ms p95); a covariance floor or smoother on the read path (would break AC-9 honest-covariance contract); skipping the AZ-276 `ImuPreintegrator` and integrating IMU samples internally (would break the single-IMU-truth invariant); using a pre-built deterministic-fallback `VioOutput` while OKVIS2 is "compiled out" (would silently break C5 fusion at deploy time without the BUILD-flag gate firing first). diff --git a/_docs/02_tasks/todo/AZ-333_c1_vins_mono_strategy.md b/_docs/02_tasks/todo/AZ-333_c1_vins_mono_strategy.md new file mode 100644 index 0000000..0010b7b --- /dev/null +++ b/_docs/02_tasks/todo/AZ-333_c1_vins_mono_strategy.md @@ -0,0 +1,198 @@ +# C1 VINS-Mono Strategy — Research-Only Comparative VIO + +**Task**: AZ-333_c1_vins_mono_strategy +**Name**: C1 VINS-Mono Strategy +**Description**: Implement `VinsMonoStrategy`, the research-only `VioStrategy` that participates in the IT-12 comparative-study build only. 
The class is a Python facade over the VINS-Mono C++ tightly-coupled VIO core (sliding-window optimizer with separate IMU pre-integration thread) accessed via a pybind11 wrapper around `cpp/vins_mono/`. Build-time gated by `BUILD_VINS_MONO`; not present in any deployment-bound binary (airborne / operator-tooling / replay-cli all OFF; only research is ON per `module-layout.md` Build-Time Exclusion Map). MRE p95 < 1 px frame-to-frame is **not** required of VINS-Mono per the C1 component's `tests.md` (only Okvis2 + KltRansac are bound by AC-2.2); VINS-Mono is exempt as research-only. +**Complexity**: 5 points +**Dependencies**: AZ-331_c1_vio_strategy_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-276_imu_preintegrator, AZ-277_se3_utils, AZ-272_fdr_record_schema, AZ-273_fdr_client_ringbuf +**Component**: c1_vio (epic AZ-254 / E-C1) +**Tracker**: AZ-333 +**Epic**: AZ-254 (E-C1) + +### Document Dependencies + +- `_docs/02_document/contracts/c1_vio/vio_strategy_protocol.md` — the Protocol this task implements; produced by AZ-331. +- `_docs/02_document/contracts/shared_helpers/imu_preintegrator.md` — IMU substrate (AZ-276). +- `_docs/02_document/contracts/shared_helpers/se3_utils.md` — SE(3) ↔ pose-matrix utilities (AZ-277). +- `_docs/02_document/components/01_c1_vio/description.md` — § 5 implementation details + § 6 helpers; § 7 caveats note VINS-Mono is research-only. +- `_docs/02_document/components/01_c1_vio/tests.md` — § Component-Internal Tests note "VinsMono is research-only and exempt from MRE bound (only IT-12 comparative-study coverage)" (C1-IT-04). + +## Problem + +Without `VinsMonoStrategy`: + +- The IT-12 comparative-study scenario (Mode B FT-P-04 / FT-P-05) cannot run all three VIO backends side-by-side; the research binary's purpose collapses to OKVIS2 + KltRansac only.
+- Independent verification that OKVIS2 outperforms a comparable open-source tightly-coupled VIO on the Derkachi fixture has no producer; future architectural decisions to swap the production-default VIO have no comparative basis. +- The composition root's three-strategy switch is asymmetric — adding a fourth strategy in a future cycle would require revisiting the factory pattern instead of simply adding a fourth lazy branch. +- Consumers of `VioOutput` (C5 fusion) would have to be re-validated against a smaller dataset of behaviours; cross-strategy contract tests (deferred to Step 9 / E-BBT) lose a third data point. + +This task delivers the comparative-research third strategy. Production binaries do NOT link it; only the IT-12 research binary loads it via `BUILD_VINS_MONO=ON`. + +## Outcome + +- A `VinsMonoStrategy` class at `src/gps_denied_onboard/components/c1_vio/vins_mono.py` conforming to the `VioStrategy` Protocol from AZ-331; `current_strategy_label() == "vins_mono"`. +- A pybind11 wrapper at `src/gps_denied_onboard/components/c1_vio/_native/vins_mono_binding.cpp` exposing the VINS-Mono C++ estimator (`vins_estimator::Estimator` or equivalent in the pinned upstream HEAD) to Python. The wrapper is built by CMake under `cpp/vins_mono/` (build-time gated by `BUILD_VINS_MONO`); the resulting `.so` is imported lazily inside `vins_mono.py`. +- Constructor `__init__(self, *, calibration: CameraCalibration, preintegrator: ImuPreintegrator, fdr_client: FdrClient, logger: Logger, config: VinsMonoConfig)` — all dependencies constructor-injected per ADR-009. `VinsMonoConfig` (`@dataclass(frozen=True)`) carries the VINS-Mono-specific knobs (sliding-window size, feature tracker thresholds, marginalisation strategy, max optimisation iterations) loaded from `config.vio.vins_mono.*` via AZ-269. +- `process_frame(frame, imu, calibration) -> VioOutput`: + 1. 
Append IMU samples to the injected `ImuPreintegrator` (strict-monotonic guarded; `ImuPreintegrationError` rewraps to `VioFatalError`). + 2. Feed the nav-camera frame to VINS-Mono via the pybind11 `add_image` wrapper. + 3. If VINS-Mono emits a new estimator update, extract the relative pose (SE(3) via `helpers.se3_utils`), the 6×6 covariance from VINS-Mono's marginalised information matrix, the latest IMU bias, and the feature-quality summary. + 4. Build and return `VioOutput` with `frame_id` echoed. +- `reset_to_warm_start(hint)`: tears down the current VINS-Mono estimator instance, constructs a fresh one, seeds the IMU bias and initial pose from `hint`. The next `config.vio.warm_start_max_frames` frames are allowed to converge before the strategy reports `state == TRACKING`. +- `health_snapshot()` returns `VioHealth(state, consecutive_lost, bias_norm)` derived from VINS-Mono's internal initialiser flag and feature-tracker health: `INIT` until the SfM bootstrap succeeds, `TRACKING` while the optimisation converges, `DEGRADED` when feature count drops below `config.vio.vins_mono.degraded_feature_threshold` or the marginalised information matrix's smallest eigenvalue drops below threshold, `LOST` after `config.vio.lost_frame_threshold` consecutive failed updates. +- The honest-covariance invariant is enforced behaviourally as in OKVIS2: VINS-Mono's marginalised covariance is read directly with no client-side floor or smoother. +- Error envelope is closed: every VINS-Mono / pybind11 / Eigen / Ceres exception is caught and rewrapped into the `VioError` family. +- All FDR records emitted via the injected `FdrClient` use the `kind="vio.health"` schema from AZ-272. + +## Scope + +### Included + +- `VinsMonoStrategy` class implementation + the `VinsMonoConfig` dataclass + the `_native/vins_mono_binding.cpp` pybind11 wrapper. +- CMake target under `cpp/vins_mono/` that links the VINS-Mono upstream pin (BSD-3-Clause-style ROS license) and produces the binding `.so`. 
Build flag `BUILD_VINS_MONO`; default OFF for airborne / operator-tooling / replay-cli. +- The full `process_frame` / `reset_to_warm_start` / `health_snapshot` / `current_strategy_label` surface conforming to AZ-331's Protocol. +- IMU substrate via the constructor-injected `ImuPreintegrator` (AZ-276). +- Honest-covariance reading from VINS-Mono's marginalised information matrix. +- Lazy import of the `_native` binding inside `vins_mono.py`. +- Per-frame DEBUG log gated by `config.vio.per_frame_debug_log` (default OFF). +- WARN / ERROR / INFO logging per description.md § 9. +- Health-state transitions emitted as FDR records. +- Composition-root wiring (entry to the AZ-331 `build_vio_strategy` factory's `vins_mono` branch). +- VINS-Mono upstream's ROS dependency (if any) MUST be stripped or vendored — VINS-Mono historically ships as a ROS package; this task uses an upstream pin that has been de-ROSified (e.g., the `vins-mono-no-ros` community port) OR vendors only the `vins_estimator` / `feature_tracker` cores. The decision (which upstream to pin) is recorded as an ADR addendum if not already covered by Plan-phase pin selection. + +### Excluded + +- OKVIS2 strategy — separate task in this epic. +- KLT/RANSAC simple-baseline strategy — separate task in this epic. +- Warm-start hint persistence — separate task in this epic. +- C5 fusion of `VioOutput` — owned by E-C5. +- C13 FDR writer-thread — owned by E-C13. +- IMU preintegration mathematics — owned by AZ-276. +- The C1-IT-01..06 / C1-PT-01 tests themselves — deferred to Step 9 (E-BBT). Note: AC-2.2 MRE bound is exempt for VINS-Mono per `tests.md`. +- The IT-12 comparative-study harness — owned by suite-level test harness (Step 9 / E-BBT or test-spec extension). +- VINS-Mono upstream-source modifications beyond ROS-stripping — bug fixes upstream require a separate ADR. +- Multi-camera VINS-Mono — out of scope. 
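
Illustrative sketch (non-normative). The honest-covariance invariant (AC-9) states that the Frobenius norm of `pose_covariance_6x6` must not shrink inside a `DEGRADED` window. The block below is a minimal sketch of a trace checker for that property. Assumptions: the trace format (a sequence of `(state, covariance)` pairs with the state given as a string) and the function names are hypothetical; a real test would read `VioOutput` and `health_snapshot()` instead.

```python
import math
from typing import Iterable, Optional, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def frobenius_norm(matrix: Matrix) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(value * value for row in matrix for value in row))


def first_shrink_in_degraded(trace: Iterable[Tuple[str, Matrix]]) -> Optional[int]:
    """Return the index of the first frame whose covariance norm shrank
    inside a DEGRADED window, or None if the invariant holds."""
    previous_norm: Optional[float] = None
    for index, (state, covariance) in enumerate(trace):
        if state != "DEGRADED":
            previous_norm = None  # window closed by TRACKING or LOST
            continue
        norm = frobenius_norm(covariance)
        if previous_norm is not None and norm < previous_norm:
            return index
        previous_norm = norm
    return None
```

Resetting `previous_norm` outside `DEGRADED` means a recovery to `TRACKING` followed by a later degradation starts a fresh window, matching the "until either `TRACKING` is restored or `LOST` is reached" wording.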
+ +## Acceptance Criteria + +**AC-1: `current_strategy_label()` returns `"vins_mono"`** +Given a `VinsMonoStrategy` constructed via the AZ-331 factory with `config.vio.strategy = "vins_mono"` +When `current_strategy_label()` is called +Then the returned string is exactly `"vins_mono"` + +**AC-2: `process_frame` returns `VioOutput` with `frame_id` echoed** +Given a `NavCameraFrame` with `frame_id = "uuid-xyz"` and a populated `ImuWindow` +When `process_frame(frame, imu, calibration)` is called and reaches a successful estimator update +Then the returned `VioOutput.frame_id == "uuid-xyz"`; `pose_covariance_6x6` is symmetric and positive-definite; `imu_bias` is non-`None` + +**AC-3: `process_frame` rewraps every backend exception into `VioError`** +Given a malformed input that triggers a VINS-Mono / pybind11 / Eigen / Ceres exception inside the backend +When `process_frame` is called +Then the raised exception is one of `VioInitializingError` / `VioDegradedError` / `VioFatalError`; the original exception is chained via `raise ... 
from`; no raw backend exception leaks + +**AC-4: `reset_to_warm_start` clears state and seeds the hint** +Given a strategy with N processed frames +When `reset_to_warm_start(hint)` is called with a known `hint.bias` and `hint.body_T_world` +Then the next `process_frame` call's `VioOutput.imu_bias` reflects `hint.bias` (within numerical tolerance); calling `reset_to_warm_start` a second time without intervening frames does not raise + +**AC-5: `health_snapshot()` reports `INIT` until SfM bootstrap completes** +Given a freshly-constructed strategy +When `health_snapshot()` is called before VINS-Mono's SfM bootstrap has succeeded +Then `state == INIT`; once bootstrap completes (typically 10–20 frames per VINS-Mono behaviour), the next `health_snapshot()` returns `state == TRACKING` + +**AC-6: `health_snapshot()` reports `DEGRADED` on feature loss** +Given a strategy in TRACKING state +When `process_frame` is fed a frame with feature count below `config.vio.vins_mono.degraded_feature_threshold` or with the marginalised information matrix's smallest eigenvalue below threshold +Then the returned `VioOutput.pose_covariance_6x6` Frobenius norm is strictly greater than the prior frame's; the next `health_snapshot()` returns `state == DEGRADED`; the strategy MUST emit a `VioOutput` (not raise) + +**AC-7: Sustained loss raises `VioFatalError`** +Given a strategy in DEGRADED state +When `config.vio.lost_frame_threshold` consecutive frames fail to update +Then the next `process_frame` call raises `VioFatalError`; subsequent `health_snapshot()` returns `state == LOST` + +**AC-8: `BUILD_VINS_MONO=OFF` does not import VINS-Mono native libs** +Given the airborne / operator-tooling / replay-cli binary built with `BUILD_VINS_MONO=OFF` +When `gps_denied_onboard.components.c1_vio` is imported +Then `sys.modules` does NOT contain `gps_denied_onboard.components.c1_vio.vins_mono` or any `_native.vins_mono_binding` entry; AZ-331's factory raises `StrategyNotAvailableError("vins_mono", 
missing_flag="BUILD_VINS_MONO")` if `vins_mono` is requested + +**AC-9: Honest covariance — no shrinkage during DEGRADED** +Given a controlled-degradation 60 s synthetic input +When `process_frame` runs through the degradation event +Then `||pose_covariance_6x6||_F` is monotonically non-decreasing from the moment `health_snapshot().state` first transitions to `DEGRADED` until either `TRACKING` is restored or `LOST` is reached + +**AC-10: FDR `vio.health` records emitted on every state transition** +Given the strategy is configured with a real `FdrClient` (or test double) +When `health_snapshot().state` transitions +Then exactly one FDR record with `kind="vio.health"` and the new state is emitted; no records on steady-state frames + +## Non-Functional Requirements + +**Performance** +- Per-frame latency budget: VINS-Mono is research-only and is NOT bound by C1-PT-01's ≤ 80 ms p95 target. Document VINS-Mono's actual p95 in the Step 9 / E-BBT comparative-study report (no hard threshold). +- Throughput: best-effort; expected to operate at 3 Hz on Tier-2 in the research binary but no failure threshold this cycle. +- CPU / memory: best-effort within the research binary's overall budget (research binary is not deployed; resource limits are looser). + +**Compatibility** +- VINS-Mono upstream HEAD (de-ROSified port) pinned per Plan-phase. Upstream-source modifications beyond the ROS-strip require an explicit ADR addendum. +- pybind11 / Eigen / Ceres versions match the OKVIS2 build to avoid ABI conflicts inside the same research binary. + +**Reliability** +- Error envelope closed at the `VioError` family; no raw VINS-Mono / Ceres / Eigen exceptions cross the Python boundary. +- Single-threaded by Protocol contract; one instance per camera ingest thread inside the research binary. +- AC-2.2 MRE bound is **exempt** per `tests.md` C1-IT-04 — VINS-Mono is research-only; no per-frame MRE assertion in this task's tests. 
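A minimal sketch of the closed error envelope described in AC-3 and the Reliability requirement above. The `VioError` family names come from this spec, but their class bodies, the `BackendCrash` stand-in, and the `call_backend` helper are hypothetical and not part of the AZ-331 contract. Mapping every foreign exception to `VioFatalError` is also an assumption; the spec only requires that the result is one of the three subclasses.

```python
class VioError(Exception):
    """Base of the closed error family (class body is an assumption)."""


class VioInitializingError(VioError):
    pass


class VioDegradedError(VioError):
    pass


class VioFatalError(VioError):
    pass


class BackendCrash(RuntimeError):
    """Hypothetical stand-in for a raw pybind11 / Eigen / Ceres exception."""


def call_backend(fn, *args):
    """Run a backend call and rewrap anything outside the VioError family."""
    try:
        return fn(*args)
    except VioError:
        raise  # already inside the envelope, pass through unchanged
    except Exception as exc:
        # `raise ... from` keeps the original exception on __cause__.
        raise VioFatalError(f"backend failure: {exc}") from exc


def failing_update(frame_id):
    """Simulated backend call that always fails."""
    raise BackendCrash(f"estimator update failed on {frame_id}")


def rewrapped_cause(frame_id):
    """Return (wrapper type name, chained cause type name) for inspection."""
    try:
        call_backend(failing_update, frame_id)
    except VioError as err:
        return type(err).__name__, type(err.__cause__).__name__
    return None
```

A unit test for AC-3 can then assert on `__cause__` rather than on message text, which keeps the test independent of backend wording.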
+ +**Concurrency** +- One `VinsMonoStrategy` instance per research-binary camera ingest thread. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | `current_strategy_label()` after factory build with `vins_mono` config | Returns `"vins_mono"` | +| AC-2 | `process_frame` with a fixture frame + IMU window | `VioOutput.frame_id` echoed; covariance SPD; `imu_bias` non-None | +| AC-3 | Inject a malformed frame that triggers a backend exception (mocked binding) | `VioError`-family exception raised; original chained via `__cause__` | +| AC-4 | `reset_to_warm_start` then `process_frame` × N | Bias reflects hint; second `reset_to_warm_start` does not raise | +| AC-5 | Cold construct → process N frames | `INIT` until SfM bootstrap completes; then `TRACKING` | +| AC-6 | Feed degraded fixture | Covariance Frobenius norm strictly increases; `health_snapshot` returns `DEGRADED`; `VioOutput` IS emitted | +| AC-7 | `lost_frame_threshold` consecutive failed frames | `VioFatalError` on next `process_frame`; `health_snapshot` returns `LOST` | +| AC-8 | `BUILD_VINS_MONO=OFF` import + factory call | Module not in `sys.modules`; factory raises `StrategyNotAvailableError` | +| AC-9 | 60 s controlled-degradation synthetic | Covariance Frobenius norm monotonically non-decreasing during DEGRADED window | +| AC-10 | Real / fake `FdrClient` spy through state transitions | Exactly one `vio.health` record per transition | +| NFR-reliability-error-envelope | Raise each backend exception type via mock | All caught and rewrapped to `VioError` family | +| NFR-perf-document | Microbench `process_frame` on Derkachi fixture (research binary) | Document p50/p95 in the Step 9 comparative-study report (no hard threshold) | + +## Constraints + +- This task implements (does NOT define) the AZ-331 Protocol; signature mismatch is a Spec-Gap finding (High) at code-review. 
+- The pybind11 binding lives under `_native/` per `module-layout.md`; lazy-imported inside `vins_mono.py`. +- VINS-Mono native source lives under `cpp/vins_mono/` per `module-layout.md` rule #4. The chosen upstream MUST be ROS-free at the source level (either upstream port or in-tree ROS-strip). +- The strategy MUST consume IMU via the AZ-276 `ImuPreintegrator` helper; constructing a second IMU integration path is forbidden. +- This task introduces no new third-party dependencies beyond VINS-Mono + pybind11 + Eigen + Ceres (the Ceres dependency is unique to VINS-Mono among the three strategies; it is pinned via `cpp/vins_mono/CMakeLists.txt` and excluded from airborne / operator-tooling / replay-cli builds because `BUILD_VINS_MONO=OFF` for those binaries). +- Per-frame DEBUG logging defaults OFF. +- The strategy MUST NOT apply a covariance floor or smoother on the read path. +- AC-2.2 MRE bound is exempt per the C1 component's `tests.md`; the test task in Step 9 / E-BBT will configure C1-IT-04 to exclude VINS-Mono. + +## Risks & Mitigation + +**Risk 1: VINS-Mono upstream ships as a ROS package and ROS deps leak into the research binary** +- *Risk*: A naive vendored VINS-Mono pulls in `roscpp`, `rosbag`, etc., bloating the research binary and creating a build-time mess. +- *Mitigation*: The chosen upstream pin is a de-ROSified community port (or in-tree ROS-strip applied during the CMake build under `cpp/vins_mono/`). If a clean port does not exist at Plan-phase pin time, this task's Plan-phase decision records the chosen approach; CI's research-binary SBOM step asserts no ROS package leaks. + +**Risk 2: Ceres + Eigen ABI conflict with OKVIS2's Eigen pin** +- *Risk*: VINS-Mono uses Ceres (for nonlinear optimisation), which is itself built against Eigen; OKVIS2 also uses Eigen heavily. An Eigen ABI mismatch between the two builds in the same binary produces silent corruption. 
+- *Mitigation*: Both `cpp/okvis2/CMakeLists.txt` and `cpp/vins_mono/CMakeLists.txt` link the same Eigen pin from `cpp/_third_party/eigen/`. The research binary's CMake build is the only place both load simultaneously; CI's research build asserts the linked Eigen version with `ldd`-style introspection. + +**Risk 3: `BUILD_VINS_MONO=ON` accidentally enabled in a deployment binary** +- *Risk*: A misconfigured build flag could ship VINS-Mono to a deployed Jetson, blowing the binary size and adding an attack surface. +- *Mitigation*: `module-layout.md` Build-Time Exclusion Map locks `BUILD_VINS_MONO=OFF` for airborne / operator-tooling / replay-cli; CI's per-binary SBOM diff (`ci/sbom_diff.py`) fails if `vins_mono` appears in any non-research SBOM. The composition root validator additionally raises `ConfigurationError` at startup if `config.vio.strategy="vins_mono"` is requested in the airborne binary. + +**Risk 4: VINS-Mono's loosely-coupled covariance is over-confident vs OKVIS2's tightly-coupled** +- *Risk*: The IT-12 comparative study could mislead an architect into picking VINS-Mono if its covariance under-reports. +- *Mitigation*: AC-9 honest-covariance enforcement applies to VINS-Mono too; the IT-12 report (Step 9 / E-BBT) compares calibrated covariances side-by-side; D-C5-5 captures the architectural decision that production stays on OKVIS2. + +## Runtime Completeness + +- **Named capability**: VINS-Mono loosely-coupled VIO + sliding-window optimisation + marginalised information matrix → 6×6 covariance (architecture / E-C1 / `solution.md` "research-only IT-12 comparative-study" / D-C5-3 sliding window context). +- **Production code that must exist**: real `VinsMonoStrategy` class implementing the AZ-331 Protocol; real pybind11 binding to `cpp/vins_mono/` (real VINS-Mono upstream, de-ROSified); real per-frame estimator update; real covariance read from VINS-Mono's marginalised information matrix; real bias propagation through AZ-276. 
+- **Allowed external stubs**: tests MAY use a fake pybind11 binding that returns scripted `VioOutput` payloads (AC-3 / AC-6 / AC-7); production wiring (research binary only) uses the real VINS-Mono upstream. +- **Unacceptable substitutes**: a pure-Python VINS-Mono re-implementation (would defeat the whole point of comparative study); skipping the AZ-276 `ImuPreintegrator` (would break the single-IMU-truth invariant); a covariance floor on the read path; shipping VINS-Mono in a deployment binary by default. diff --git a/_docs/02_tasks/todo/AZ-334_c1_klt_ransac_strategy.md b/_docs/02_tasks/todo/AZ-334_c1_klt_ransac_strategy.md new file mode 100644 index 0000000..839dff5 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-334_c1_klt_ransac_strategy.md @@ -0,0 +1,220 @@ +# C1 KLT/RANSAC Strategy — Mandatory Simple-Baseline VIO + +**Task**: AZ-334_c1_klt_ransac_strategy +**Name**: C1 KLT/RANSAC Strategy +**Description**: Implement `KltRansacStrategy`, the mandatory simple-baseline `VioStrategy` that satisfies the ADR-002 engine rule (every component MUST ship a simple-baseline strategy alongside its production-default). The class is a pure-Python facade over OpenCV's pyramidal KLT optical-flow + RANSAC essential-matrix path. No C++/pybind11 — OpenCV's Python bindings provide everything needed, keeping the simple-baseline path camera-agnostic, dependency-light, and dead-easy to reason about. Bound by AC-2.1a (≥ 95 % tracked-frame ratio on the Derkachi normal segment) and AC-2.2 (MRE p95 < 1 px frame-to-frame; `tests.md` C1-IT-04 binds KltRansac alongside Okvis2). Build-time gated by `BUILD_KLT_RANSAC=ON` (airborne / research / replay-cli; operator-tooling does not need VIO). 
+**Complexity**: 5 points +**Dependencies**: AZ-331_c1_vio_strategy_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-276_imu_preintegrator, AZ-277_se3_utils, AZ-282_ransac_filter, AZ-272_fdr_record_schema, AZ-273_fdr_client_ringbuf +**Component**: c1_vio (epic AZ-254 / E-C1) +**Tracker**: AZ-334 +**Epic**: AZ-254 (E-C1) + +### Document Dependencies + +- `_docs/02_document/contracts/c1_vio/vio_strategy_protocol.md` — the Protocol this task implements; produced by AZ-331. +- `_docs/02_document/contracts/shared_helpers/imu_preintegrator.md` — IMU substrate (AZ-276); the simple-baseline still consumes the GTSAM `CombinedImuFactor` so C5 fusion sees a consistent `VioOutput` shape. +- `_docs/02_document/contracts/shared_helpers/se3_utils.md` — SE(3) ↔ pose-matrix utilities (AZ-277). +- `_docs/02_document/contracts/shared_helpers/ransac_filter.md` — generic RANSAC inlier filter (AZ-282); reused for the essential-matrix outlier rejection step. +- `_docs/02_document/components/01_c1_vio/description.md` — § 5 implementation details + § 6 helpers + § 7 caveats (sharp turns < 5 % frame overlap cause feature-track loss in all three strategies). +- `_docs/02_document/components/01_c1_vio/tests.md` — C1-IT-03 (≥ 95 % tracked-frame ratio); C1-IT-04 (MRE p95 < 1 px) — both bind this strategy. +- ADR-002 — the engine rule that mandates this baseline. + +## Problem + +Without `KltRansacStrategy`: + +- The ADR-002 engine rule is violated; every airborne binary MUST link a simple-baseline strategy alongside the production-default. +- AC-2.1a (95 % tracked-frame ratio on normal segments — the engine rule's quantitative gate) has no producer; the Derkachi C1-IT-03 fixture cannot validate the rule. +- The deployment binary's defense-in-depth posture collapses: an OKVIS2 backend init failure leaves the runtime with no fallback path. 
+- The simple-baseline serves as an interpretability anchor — when the production-default OKVIS2 produces a surprising covariance, the operator's debugging step is "compare to KLT/RANSAC on the same input"; without it, the comparison has nothing to compare against. +- The dependency-light path (no OKVIS2 / VINS-Mono native libs) becomes important for Tier-0 workstation development; without `KltRansacStrategy`, every developer needs OKVIS2 native libs installed. +- IT-12 comparative-study cannot include the simple-baseline as a third data point. + +This task delivers the engine-rule mandatory baseline. It is the lowest-complexity strategy in this epic by code volume but carries the highest test coverage ratio because the AC-2.1a / AC-2.2 bounds are the gate. + +## Outcome + +- A `KltRansacStrategy` class at `src/gps_denied_onboard/components/c1_vio/klt_ransac.py` conforming to the `VioStrategy` Protocol from AZ-331; `current_strategy_label() == "klt_ransac"`. +- No `_native/` binding — pure Python over OpenCV's `cv2.calcOpticalFlowPyrLK` + `cv2.goodFeaturesToTrack` + `cv2.findEssentialMat` + `cv2.recoverPose` path. The `helpers.ransac_filter` (AZ-282) is reused for outlier rejection of the per-frame correspondences before pose recovery. +- Constructor `__init__(self, *, calibration: CameraCalibration, preintegrator: ImuPreintegrator, ransac_filter: RansacFilter, fdr_client: FdrClient, logger: Logger, config: KltRansacConfig)` — all dependencies constructor-injected per ADR-009. `KltRansacConfig` (`@dataclass(frozen=True)`) carries the KLT-specific knobs (max corner count, KLT pyramid levels, KLT window size, RANSAC inlier ratio, essential-matrix RANSAC threshold, min-features-for-pose) loaded from `config.vio.klt_ransac.*` via AZ-269. +- `process_frame(frame, imu, calibration) -> VioOutput`: + 1. 
Append IMU samples to the injected `ImuPreintegrator` (the IMU contributes to bias accumulation; KLT itself is vision-only, but the `VioOutput.imu_bias` field still gets populated from the helper's current bias estimate). + 2. Convert the input `NavCameraFrame.pixels` to grayscale (OpenCV expects `uint8` single-channel for KLT). + 3. If this is the first frame, run `cv2.goodFeaturesToTrack` to seed the feature track buffer; emit `VioOutput` with `state == INIT` (zero relative pose, conservative covariance). + 4. Otherwise, call `cv2.calcOpticalFlowPyrLK` to track the prior frame's features into this frame; reject `status==0` correspondences. + 5. Run `cv2.findEssentialMat` with `cv2.RANSAC` over the surviving correspondences using `calibration.K`; reject correspondences whose RANSAC mask is 0. + 6. Recover the relative pose via `cv2.recoverPose`; convert to SE(3) via `helpers.se3_utils`. + 7. Estimate per-frame covariance from the inlier-residual scatter (sample covariance of 2D reprojection residuals back-projected to a 6×6 pose-perturbation covariance via the camera Jacobian; standard textbook approach — see Risks for the honest-covariance constraint). + 8. Compute `feature_quality.mre_px` from the inlier residuals; `tracked` / `new` / `lost` counts come from the KLT step. + 9. Build and return `VioOutput` with `frame_id` echoed; emit per-frame DEBUG log if enabled. +- `reset_to_warm_start(hint)`: clears the prior-frame feature track buffer, seeds the IMU bias from `hint.bias` via the preintegrator's `reset_with_bias`, and resets the internal "first frame seen" flag so the next `process_frame` re-seeds features. The hint's `body_T_world` is recorded as the baseline for relative-pose chaining; KLT/RANSAC's per-frame relative pose is interpreted relative to this baseline by C5 fusion. 
+- `health_snapshot()` returns `VioHealth(state, consecutive_lost, bias_norm)`: + - `INIT` for the first 1–2 frames (until KLT has a prior to track from); + - `TRACKING` when inlier count ≥ `config.vio.klt_ransac.min_features_for_pose`; + - `DEGRADED` when inlier count drops below threshold (covariance Frobenius norm grows in proportion); + - `LOST` after `config.vio.lost_frame_threshold` consecutive frames where pose recovery fails (e.g., RANSAC finds no consensus). +- The honest-covariance invariant is enforced behaviourally: the per-frame covariance grows monotonically as inlier count drops; no client-side floor or smoother is applied. +- Error envelope is closed: every OpenCV `cv2.error` is caught inside `process_frame` / `reset_to_warm_start` and rewrapped into the `VioError` family. `VioInitializingError` for the first frame; `VioFatalError` on sustained pose-recovery failure (RANSAC consensus < threshold for `lost_frame_threshold` frames). +- All FDR records emitted via the injected `FdrClient` use the `kind="vio.health"` schema. + +## Scope + +### Included + +- `KltRansacStrategy` class + `KltRansacConfig` dataclass. +- The full `process_frame` / `reset_to_warm_start` / `health_snapshot` / `current_strategy_label` surface conforming to AZ-331's Protocol. +- IMU substrate via the constructor-injected `ImuPreintegrator` (AZ-276); the simple-baseline still calls into the helper for bias accumulation so `VioOutput.imu_bias` is consistent across all three strategies. +- RANSAC outlier rejection via the constructor-injected `RansacFilter` (AZ-282); KLT/RANSAC does not duplicate AZ-282's logic. +- Honest covariance estimation from the residual scatter — the formula and its limitations are documented in the spec; no smoothing on the read path. +- Per-frame DEBUG log gated by `config.vio.per_frame_debug_log` (default OFF). +- WARN / ERROR / INFO logging per description.md § 9. +- Health-state transitions emitted as FDR records. 
+- Composition-root wiring (entry to the AZ-331 `build_vio_strategy` factory's `klt_ransac` branch). +- KLT/RANSAC is the only strategy that builds for ALL of airborne / research / replay-cli; module-layout's Build-Time Exclusion Map shows `BUILD_KLT_RANSAC=ON` for those three binaries (operator-tooling does not need VIO). +- The KLT path MUST be camera-agnostic — no `adti20` / `adti26` specific branches; the calibration arrives via the per-call `CameraCalibration` argument. + +### Excluded + +- OKVIS2 strategy — separate task. +- VINS-Mono strategy — separate task. +- Warm-start hint persistence — separate task in this epic. +- C5 fusion of `VioOutput` — owned by E-C5. +- C13 FDR writer-thread — owned by E-C13. +- IMU preintegration — owned by AZ-276. +- The C1-IT-01..06 / C1-PT-01 tests themselves — deferred to Step 9 / E-BBT (KLT/RANSAC's AC-2.1a + AC-2.2 bindings will live there). +- Non-OpenCV optical-flow algorithms (Farnebäck dense flow, etc.) — out of scope; KLT pyramidal is the canonical simple baseline. +- Bundle-adjustment refinement of the recovered pose — out of scope this cycle; the engine rule's bound is per-frame relative pose, not refined keyframe pose. 
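The `health_snapshot()` rules and the transition-only FDR emission described in the Outcome and Scope sections above can be sketched as a small state tracker. This is an illustration under stated assumptions, not the AZ-331 contract: `HealthTracker`, the `emitted` list standing in for `FdrClient`, the record dictionary shape, and the choice to report `DEGRADED` while pose recovery is failing but still below `lost_frame_threshold` are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class VioState(Enum):
    INIT = "INIT"
    TRACKING = "TRACKING"
    DEGRADED = "DEGRADED"
    LOST = "LOST"


@dataclass
class HealthTracker:
    min_features_for_pose: int
    lost_frame_threshold: int
    state: VioState = VioState.INIT
    consecutive_lost: int = 0
    # Stand-in for the injected FdrClient: records are appended here.
    emitted: List[dict] = field(default_factory=list)

    def update(self, inliers: Optional[int]) -> VioState:
        """Advance the state; inliers=None means pose recovery failed."""
        if inliers is None:
            self.consecutive_lost += 1
            if self.consecutive_lost >= self.lost_frame_threshold:
                new_state = VioState.LOST
            else:
                new_state = VioState.DEGRADED  # assumption, see lead-in
        else:
            self.consecutive_lost = 0
            if inliers >= self.min_features_for_pose:
                new_state = VioState.TRACKING
            else:
                new_state = VioState.DEGRADED
        # Emit exactly one record per transition, none on steady state.
        if new_state is not self.state:
            self.emitted.append(
                {"kind": "vio.health", "state": new_state.value}
            )
            self.state = new_state
        return self.state
```

Feeding inlier counts `20, 20, 5, None, None` with `min_features_for_pose=8` and `lost_frame_threshold=2` walks the tracker through `TRACKING`, `DEGRADED`, and `LOST` while emitting three records, one per transition. Raising `VioFatalError` on the following `process_frame` call (AC-7) is left out of the sketch.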
+ +## Acceptance Criteria + +**AC-1: `current_strategy_label()` returns `"klt_ransac"`** +Given a `KltRansacStrategy` constructed via the AZ-331 factory with `config.vio.strategy = "klt_ransac"` +When `current_strategy_label()` is called +Then the returned string is exactly `"klt_ransac"` + +**AC-2: First frame emits `VioOutput` with `state == INIT` and zero relative pose** +Given a freshly-constructed strategy and the very first `NavCameraFrame` +When `process_frame(frame, imu, calibration)` is called +Then a `VioOutput` is returned with `relative_pose_T` equal to the SE(3) identity (within numerical tolerance) and `pose_covariance_6x6` equal to the configured INIT-state conservative covariance; `health_snapshot().state == INIT` + +**AC-3: Steady-state frame emits `VioOutput` with non-zero relative pose and SPD covariance** +Given a strategy with prior frames processed and a current frame with sufficient inliers +When `process_frame` is called +Then `relative_pose_T` is non-identity (the camera moved between frames in the fixture); `pose_covariance_6x6` is symmetric and positive-definite; `feature_quality.mre_px > 0`; `feature_quality.tracked > 0` + +**AC-4: Pose recovery rewraps `cv2.error` into `VioError`** +Given a frame that triggers `cv2.findEssentialMat` or `cv2.recoverPose` to raise `cv2.error` +When `process_frame` is called +Then `VioFatalError` is raised; the original `cv2.error` is chained via `raise ... 
from`; no raw `cv2.error` leaks to the caller + +**AC-5: `reset_to_warm_start` clears feature buffer and re-seeds bias** +Given a strategy with N processed frames and a non-default IMU bias +When `reset_to_warm_start(hint)` is called with a known `hint.bias` +Then the next `process_frame` call's first behaviour is `cv2.goodFeaturesToTrack` (verifiable via spy on the OpenCV call); `VioOutput.imu_bias` reflects `hint.bias` (within numerical tolerance); calling `reset_to_warm_start` again does not raise + +**AC-6: Inlier loss → `DEGRADED` state with monotonically growing covariance** +Given a strategy in TRACKING state +When a frame is processed where the surviving inlier count is below `config.vio.klt_ransac.min_features_for_pose` +Then `VioOutput.pose_covariance_6x6` Frobenius norm is strictly greater than the prior frame's; `health_snapshot().state == DEGRADED`; `VioOutput` IS emitted (not raised) + +**AC-7: Sustained pose-recovery failure raises `VioFatalError`** +Given a strategy in DEGRADED state +When `config.vio.lost_frame_threshold` consecutive frames fail pose recovery (e.g., RANSAC finds no consensus) +Then the next `process_frame` call raises `VioFatalError`; `health_snapshot().state == LOST` + +**AC-8: `BUILD_KLT_RANSAC=OFF` does not import the strategy module** +Given the operator-tooling binary built with `BUILD_KLT_RANSAC=OFF` +When `gps_denied_onboard.components.c1_vio` is imported +Then `sys.modules` does NOT contain `gps_denied_onboard.components.c1_vio.klt_ransac`; AZ-331's factory raises `StrategyNotAvailableError("klt_ransac", missing_flag="BUILD_KLT_RANSAC")` if `klt_ransac` is requested + +**AC-9: Honest covariance — no shrinkage during DEGRADED** +Given a controlled-degradation 60 s synthetic input +When `process_frame` runs through the degradation event +Then `||pose_covariance_6x6||_F` is monotonically non-decreasing from the moment `health_snapshot().state` first transitions to `DEGRADED` until `TRACKING` is restored or `LOST` is reached; 
the covariance estimator does NOT apply any client-side floor or smoother + +**AC-10: FDR `vio.health` records emitted on every state transition** +Given the strategy is configured with a real `FdrClient` (or test double) +When `health_snapshot().state` transitions +Then exactly one FDR record with `kind="vio.health"` and the new state is emitted; no records on steady-state frames + +**AC-11: Camera-agnostic path** +Given two `CameraCalibration` instances representing different deployed cameras (test fixture `adti26` and a synthetic alternate calibration) +When the same `process_frame` code path is exercised against both calibrations +Then the strategy produces sensible `VioOutput` for both without any calibration-specific branch in the source code (verifiable via static-grep CI gate: no `adti20` / `adti26` literals in `klt_ransac.py`) + +## Non-Functional Requirements + +**Performance** +- `process_frame` p95 ≤ 80 ms on Tier-2 (budget shared with OKVIS2 — KLT/RANSAC is typically faster but the budget is the wire boundary). Failure threshold 120 ms. +- Throughput ≥ 3 Hz sustained; failure threshold < 2.5 Hz. +- CPU ≤ 30 % of one core (OpenCV's `cv2.calcOpticalFlowPyrLK` is multi-threaded internally; bound at 30 % per ADR-002 budget partition). +- Memory ≤ 1.5 GB resident. +- AC-2.1a: ≥ 95 % tracked-frame ratio on Derkachi normal segment (C1-IT-03; deferred to Step 9 / E-BBT for the actual test, but this strategy MUST be capable of meeting it on the named fixture). +- AC-2.2: MRE p95 < 1 px frame-to-frame (C1-IT-04; this strategy IS bound by it per `tests.md`). + +**Compatibility** +- OpenCV ≥ 4.12.0 (CVE-2025-53644 mitigation per architecture § 2 dependency table). +- No additional third-party dependencies — OpenCV + numpy only (numpy already pinned). + +**Reliability** +- Error envelope closed at the `VioError` family; no raw OpenCV / numpy exceptions cross the API surface. +- Single-threaded by Protocol contract; one instance per camera ingest thread. 
+- Pure-Python — no native-lib install requirement; works on Tier-0 workstation with `pip install opencv-python` only. + +**Concurrency** +- One `KltRansacStrategy` instance per camera ingest thread. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | `current_strategy_label()` after factory build with `klt_ransac` config | Returns `"klt_ransac"` | +| AC-2 | First-frame `process_frame` on a fixture | `VioOutput` returned; `relative_pose_T` ≈ identity; `state == INIT` | +| AC-3 | Process N frames; inspect `VioOutput` shape | `relative_pose_T` non-identity; `pose_covariance_6x6` SPD; `feature_quality.mre_px > 0` | +| AC-4 | Inject a frame that triggers `cv2.error` (mock OpenCV call) | `VioFatalError` raised; `__cause__` is the `cv2.error` | +| AC-5 | `reset_to_warm_start` then `process_frame` | First call to OpenCV is `goodFeaturesToTrack`; `imu_bias` reflects hint | +| AC-6 | Feed degraded fixture (low inlier count) | Covariance Frobenius norm strictly increases; `state == DEGRADED`; `VioOutput` IS emitted | +| AC-7 | `lost_frame_threshold` consecutive failed-pose frames | `VioFatalError` on next `process_frame`; `state == LOST` | +| AC-8 | `BUILD_KLT_RANSAC=OFF` import + factory call | Module not in `sys.modules`; factory raises `StrategyNotAvailableError` | +| AC-9 | 60 s controlled-degradation synthetic | Covariance monotonically non-decreasing during DEGRADED | +| AC-10 | Real / fake `FdrClient` spy through state transitions | Exactly one `vio.health` record per transition | +| AC-11 | Static-grep gate + run with two calibrations | No `adti20` / `adti26` literals in source; both calibrations produce sensible output | +| NFR-perf | C1-PT-01 microbench against Derkachi normal segment (Tier-2) | p95 ≤ 80 ms; throughput ≥ 3 Hz | +| NFR-reliability-error-envelope | Raise `cv2.error` from each OpenCV call point | All caught and rewrapped to `VioError` family | + +## Constraints + +- This task 
implements (does NOT define) the AZ-331 Protocol. +- KLT/RANSAC is pure Python — NO `_native/` binding under this strategy. +- The strategy MUST consume IMU via the AZ-276 `ImuPreintegrator` for bias propagation, even though KLT itself is vision-only (keeps `VioOutput.imu_bias` consistent across all three strategies). +- The strategy MUST consume RANSAC via the AZ-282 `RansacFilter` for the inlier-rejection step (cross-cutting helper; do not duplicate locally). +- OpenCV ≥ 4.12.0 is the only third-party dependency added by this task (already pinned at the project level). +- No covariance floor / smoother on the read path — the residual-scatter covariance estimator is the canonical formula; document its limitations in the Risks section. +- Per-frame DEBUG defaults OFF. +- Camera-agnostic — no calibration-specific branches in source. CI grep gate enforces. +- The `KltRansacConfig` schema extension to AZ-269 is owned by this task. + +## Risks & Mitigation + +**Risk 1: Residual-scatter covariance under-reports during high-overlap straight flight** +- *Risk*: The standard residual-scatter formula assumes residual noise is uncorrelated with pose perturbation; in long straight-flight segments the assumption holds, but in low-parallax scenarios the formula can under-report covariance — not an "honest-covariance violation" in the AC-9 sense (that test catches monotonicity), but a quantitative under-report that C5 fusion will over-trust. +- *Mitigation*: D-CROSS-LATENCY-1 + AC-NEW-4 statistical headroom carry the residual risk; the C1-IT-12 comparative-study report (Step 9 / E-BBT) cross-validates KLT's covariance against OKVIS2's tightly-coupled output. The strategy spec documents the limitation; the deployed binary uses OKVIS2 by default and KLT only as a fallback / engine-rule baseline. 
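The monotonicity property that AC-9 checks, and which Risk 1 notes does not catch quantitative under-reporting, can be sketched as a pure-Python test helper. The function names, the plain nested-list matrix representation, and the string state labels are assumptions for illustration; a real test would more likely operate on NumPy arrays and the `VioHealth` state enum.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def scaled_identity(scale, size=6):
    """Build a size x size diagonal covariance for test fixtures."""
    return [
        [scale if r == c else 0.0 for c in range(size)]
        for r in range(size)
    ]


def first_shrinkage(covariances, states):
    """Return the index of the first frame inside a DEGRADED window whose
    covariance norm is smaller than the previous DEGRADED frame's norm,
    or None if the norm never shrinks."""
    previous = None
    for index, (cov, state) in enumerate(zip(covariances, states)):
        if state != "DEGRADED":
            previous = None  # window closes on TRACKING or LOST
            continue
        norm = frobenius_norm(cov)
        if previous is not None and norm < previous:
            return index
        previous = norm
    return None
```

The helper only compares frames within a `DEGRADED` window. The strict increase at the moment of entering `DEGRADED` is a separate assertion (AC-6), and neither check says anything about whether the absolute covariance magnitude is calibrated.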
+ +**Risk 2: KLT loses track on the first frame after take-off (no prior frame to track from)** +- *Risk*: AC-2 covers the INIT-state behaviour, but a misconfigured deployment that calls `process_frame` once and then crashes would leave C5 with no `VioOutput`; the AC-5.2 fallback is the right path but the diagnostic is harder. +- *Mitigation*: Per-frame DEBUG log (when enabled) records the INIT-state transition; the FDR `vio.health` record at INIT → TRACKING is emitted (AC-10) regardless of DEBUG state, so post-flight inspection always shows the warm-up. + +**Risk 3: `cv2.findEssentialMat` is sensitive to RANSAC inlier-threshold tuning** +- *Risk*: The `cv2.findEssentialMat` RANSAC threshold is a point-to-epipolar-line distance in the units of the input points (pixels when raw pixel coordinates are passed together with `calibration.K`; normalised image units if the points were pre-normalised); a threshold supplied in the wrong units makes the strategy either reject every correspondence or accept every outlier. +- *Mitigation*: `config.vio.klt_ransac.essential_matrix_ransac_threshold` is documented + tested with a sensitivity sweep in the deferred Step 9 test. AZ-282 (`RansacFilter`) provides a generic RANSAC entry point that this strategy uses for the AZ-282-managed correspondence-rejection step (a separate stage from `cv2.findEssentialMat`'s internal RANSAC). + +**Risk 4: Sharp turns < 5 % frame overlap (RESTRICT-UAV-3) cause feature-track loss** +- *Risk*: The architecture's RESTRICT-UAV-3 calls out this constraint; KLT/RANSAC will lose tracks faster than OKVIS2 in this regime. +- *Mitigation*: The strategy reports `DEGRADED` (AC-6) immediately when inlier count drops; F6 satellite re-localisation (E-C2 / E-C3 / E-C4 path) is the recovery; no work for this strategy beyond honest reporting. + +## Runtime Completeness + +- **Named capability**: KLT pyramidal optical-flow + RANSAC essential-matrix simple-baseline VIO; the ADR-002 engine-rule mandatory baseline (architecture / E-C1 / `solution.md` "KltRansac mandatory simple-baseline"). 
+- **Production code that must exist**: real `KltRansacStrategy` class implementing the AZ-331 Protocol; real OpenCV ≥ 4.12.0 calls (`cv2.calcOpticalFlowPyrLK`, `cv2.goodFeaturesToTrack`, `cv2.findEssentialMat`, `cv2.recoverPose`); real residual-scatter covariance from the inlier residuals; real bias propagation through AZ-276; real RANSAC inlier rejection through AZ-282. +- **Allowed external stubs**: tests MAY mock `cv2.error` raises at specific call points (AC-4); production wiring uses real OpenCV. +- **Unacceptable substitutes**: a deterministic-fallback `VioOutput` that bypasses OpenCV (would defeat AC-2.1a's tracked-frame ratio); a covariance floor (would break AC-9); skipping the AZ-276 `ImuPreintegrator` (would break the `VioOutput.imu_bias` consistency invariant across strategies); duplicating RANSAC logic instead of reusing AZ-282 (cross-cutting violation per `coderule.mdc` and decompose Step 2 § 9). diff --git a/_docs/02_tasks/todo/AZ-335_c1_warm_start_recovery.md b/_docs/02_tasks/todo/AZ-335_c1_warm_start_recovery.md new file mode 100644 index 0000000..51f75b2 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-335_c1_warm_start_recovery.md @@ -0,0 +1,191 @@ +# C1 Warm-Start Hint Persistence + F8 Reboot Recovery Wiring + +**Task**: AZ-335_c1_warm_start_recovery +**Name**: C1 Warm-Start + F8 Reboot Recovery +**Description**: Implement the cross-cutting wiring that lets every `VioStrategy` recover from F8 (companion reboot) without fake confidence and lets F2 (takeoff load) seed the strategy with the FC EKF's last valid GPS + IMU-extrapolated pose. Adds a small `WarmStartHintStore` (atomic JSON sidecar persistence, written after every successful `VioOutput`, read once at process startup before the first `process_frame`), plus the runtime composition glue that captures the hint flow at the appropriate flight-state boundaries. 
The strategy implementations (AZ-332 / AZ-333 / AZ-334) already implement `reset_to_warm_start`; this task delivers the orchestration around them — what lives where on disk between flights, when `reset_to_warm_start` is invoked, and how AC-5.1 (converge within 5 frames) and AC-5.3 (no fake confidence after reboot) are satisfied at the wiring layer rather than per-strategy. +**Complexity**: 3 points +**Dependencies**: AZ-331_c1_vio_strategy_protocol, AZ-332_c1_okvis2_strategy, AZ-333_c1_vins_mono_strategy, AZ-334_c1_klt_ransac_strategy, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-270_compose_root, AZ-280_sha256_sidecar, AZ-272_fdr_record_schema +**Component**: c1_vio (epic AZ-254 / E-C1) +**Tracker**: AZ-335 +**Epic**: AZ-254 (E-C1) + +### Document Dependencies + +- `_docs/02_document/contracts/c1_vio/vio_strategy_protocol.md` — `WarmStartPose` DTO + `reset_to_warm_start` Protocol method (AZ-331). +- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — atomic write + sidecar pattern (AZ-280). +- `_docs/02_document/contracts/shared_helpers/se3_utils.md` — SE(3) ↔ JSON-serialisable form (AZ-277; relied on indirectly via `WarmStartPose` field types). +- `_docs/02_document/components/01_c1_vio/description.md` — § 1 mentions F2 takeoff load and F8 reboot recovery; § 5 notes strategy lives for the duration of a flight; reset on `reset_to_warm_start` for F8 reboot. +- `_docs/02_document/components/01_c1_vio/tests.md` — C1-IT-05 (warm-start convergence within 5 frames; AC-5.1) and C1-IT-06 (F8 reboot recovery; AC-5.3) bind this task's behaviour at the test layer (deferred to Step 9 / E-BBT). + +## Problem + +Without this wiring: + +- AC-5.1 (initialisation from FC EKF's last valid GPS + IMU-extrapolated) has no producer at the runtime layer; each strategy's `reset_to_warm_start` is a stub that no one calls. 
+- AC-5.3 (on F8 reboot, re-init from FC IMU-extrapolated pose without fake confidence) collapses; the companion process restart path lands in a cold-start that takes minutes to converge — outside the AC-NEW-1 30 s budget. +- The composition root would have to grow per-strategy F2/F8 logic, violating the "interface-first composition root" principle (ADR-009): cross-strategy concerns belong in shared wiring, not duplicated across strategy modules. +- The hint file format and on-disk location would drift across operator deployments (every site picking its own path) — making post-flight forensics and operator playbooks inconsistent. +- "No fake confidence" (AC-5.3) — the requirement that post-reboot covariance must NOT be smaller than pre-reboot — has no enforcement point; each strategy implementation could silently emit a tighter post-reboot covariance because it "knows" the hint is good, defeating the safety invariant. + +This task delivers the cross-strategy orchestration: a small persistence helper, the F2 / F8 hooks in the runtime composition root, and the AC-5.3 enforcement that the post-reboot strategy emits inflated covariance until it has independently re-converged. + +## Outcome + +- A `WarmStartHintStore` interface + default implementation at `src/gps_denied_onboard/components/c1_vio/warm_start_store.py`: + - `WarmStartHintStore` Protocol (PEP 544): `save(hint: WarmStartPose) -> None` + `load() -> WarmStartPose | None` + `clear() -> None`. + - `JsonSidecarWarmStartHintStore`: writes a JSON file `{store_dir}/c1_warm_start.json` via the AZ-280 `Sha256Sidecar` atomic-write+sidecar pattern (file + `.sha256`); `load()` verifies the sidecar before returning the hint (corrupted file → `load()` returns `None` and emits a WARN log; the wiring path treats this as "no hint" and falls through to cold-start). + - The store is constructor-injected into the strategy through the composition root; the strategy itself does NOT touch the filesystem. 
+- Runtime composition root extension at `src/gps_denied_onboard/runtime_root/vio_factory.py` (already extended by AZ-331; this task adds two hooks): + 1. **F2 takeoff hook** (`prime_warm_start_from_fc(strategy, fc_adapter, store)`): reads the FC EKF's last valid GPS + IMU-extrapolated pose via the C8 `FcAdapter` interface (consumed via the constructor-injected interface, NOT a direct C8 module import — Layer 3→4 ban respected via the interface-at-producer pattern), constructs a `WarmStartPose`, calls `strategy.reset_to_warm_start(hint)`, and saves the hint to the store. This hook is invoked once at takeoff (operator-side or auto-detected via FC's `flight_state` transition to `IN_AIR`). + 2. **F8 reboot hook** (`prime_warm_start_from_disk(strategy, store)`): at every process startup before the first `process_frame`, calls `store.load()`; if the result is non-None, calls `strategy.reset_to_warm_start(hint)`. If `load()` returns None (cold start; no prior hint or corrupted), no `reset_to_warm_start` is invoked; the strategy emits its INIT-state behaviour for the first `warm_start_max_frames` (AC-5.1 per AZ-331's contract). +- Per-frame save hook (cross-cutting): every emitted `VioOutput` from `process_frame` is converted into a `WarmStartPose` (relative-pose chained against the prior baseline by the runtime root, plus the latest `imu_bias` from the same `VioOutput`) and saved via `store.save(hint)`. Save throughput is bounded — `config.vio.warm_start_save_period_frames` (default 5) limits how often the disk write is incurred (every Nth frame). 
+- AC-5.3 enforcement at the wiring layer: after `prime_warm_start_from_disk` injects a hint, the runtime root sets a small `consecutive_post_reset_frames` counter on the strategy facade (NOT mutating the strategy itself; the counter lives in the wiring); for the first `config.vio.warm_start_max_frames` (default 5) frames after a `reset_to_warm_start`, the runtime root post-processes the emitted `VioOutput` to inflate `pose_covariance_6x6` by a configurable factor (default 2× steady-state) — this guarantees no post-reboot strategy emits a covariance smaller than pre-reboot, regardless of what the strategy itself thinks. The inflation is removed once the counter elapses. +- Config schema extension to AZ-269: `config.vio.warm_start_store_dir` (default `/var/lib/gps_denied_onboard/warm_start/`), `config.vio.warm_start_save_period_frames` (default 5), `config.vio.post_reset_covariance_inflation_factor` (default 2.0). +- INFO log on every successful `prime_warm_start_*` invocation (with the source: `f2_takeoff_fc` / `f8_reboot_disk` / `cold_start_no_hint`); WARN log on hint file corruption; ERROR log on any strategy `reset_to_warm_start` failure. +- FDR record `kind="vio.warm_start"` emitted on every prime invocation, with the source label and the `bias_norm` of the loaded hint (lets post-flight forensics see whether the hint was used and how stale it was). + +## Scope + +### Included + +- `WarmStartHintStore` Protocol + `JsonSidecarWarmStartHintStore` default implementation. +- `prime_warm_start_from_fc(strategy, fc_adapter, store)` runtime composition function. +- `prime_warm_start_from_disk(strategy, store)` runtime composition function. +- Per-frame save hook integration (called by the runtime root after every successful `process_frame` emission). +- Post-reset covariance inflation wrapper at the wiring layer (NOT inside any strategy). +- Config schema extension to AZ-269 for the three new fields. +- INFO / WARN / ERROR logging per description.md § 9. 
+- FDR `kind="vio.warm_start"` record emission via the injected `FdrClient`. +- Atomic write + sidecar verification via AZ-280 (no naked `Path.write_bytes` / `open().write` in this task). +- Unit tests covering hint round-trip, corruption handling, post-reset inflation, F8 cold-start fall-through. + +### Excluded + +- The `VioStrategy` Protocol itself — owned by AZ-331. +- The three strategy implementations of `reset_to_warm_start` — owned by AZ-332 / AZ-333 / AZ-334. +- C8 `FcAdapter` interface — owned by E-C8 (AZ-261); this task consumes the interface, does NOT define it. +- AC-5.1 / AC-5.3 / C1-IT-05 / C1-IT-06 component-internal tests themselves — deferred to Step 9 / E-BBT per greenfield flow Step 6 rule. +- Multi-flight hint history (only the latest hint is persisted; older hints are overwritten by atomic write). +- Operator UI for inspecting hint freshness — out of scope; operator reads the FDR record. +- Hint encryption — the warm-start hint contains pose + bias, not credentials; on-disk encryption is outside the threat model this cycle. 
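A minimal sketch of the store described under Outcome, using only the standard library. Several names are assumptions made for illustration: the `WarmStartPose` fields are hypothetical (the real DTO is owned by AZ-331), and `_atomic_write` stands in for the AZ-280 `Sha256Sidecar` helper, whose API is not shown in this task. Logging, the `calibration_id` check, and the `pre_reboot_covariance_norm` field are left out.

```python
from __future__ import annotations

import hashlib
import json
import os
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Protocol


@dataclass(frozen=True)
class WarmStartPose:
    """Hypothetical stand-in; the real DTO is defined by AZ-331."""
    position_m: tuple[float, float, float]
    quaternion_wxyz: tuple[float, float, float, float]
    imu_bias: tuple[float, ...]


class WarmStartHintStore(Protocol):
    def save(self, hint: WarmStartPose) -> None: ...
    def load(self) -> WarmStartPose | None: ...
    def clear(self) -> None: ...


def _atomic_write(path: Path, data: bytes) -> None:
    """Stand-in for AZ-280: write a temp file, fsync it, then rename over the target."""
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "wb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


class JsonSidecarWarmStartHintStore:
    def __init__(self, store_dir: Path) -> None:
        self._file = Path(store_dir) / "c1_warm_start.json"
        self._sidecar = Path(store_dir) / "c1_warm_start.json.sha256"

    def save(self, hint: WarmStartPose) -> None:
        self._file.parent.mkdir(parents=True, exist_ok=True)
        payload = json.dumps({"version": 1, **asdict(hint)}, sort_keys=True).encode()
        _atomic_write(self._file, payload)
        _atomic_write(self._sidecar, hashlib.sha256(payload).hexdigest().encode())

    def load(self) -> WarmStartPose | None:
        try:
            payload = self._file.read_bytes()
            expected = self._sidecar.read_text().strip()
        except FileNotFoundError:
            return None  # no prior hint: caller falls through to cold-start
        if hashlib.sha256(payload).hexdigest() != expected:
            return None  # corrupted; the file is left in place for inspection
        fields = json.loads(payload)
        fields.pop("version")
        # JSON turns tuples into lists; convert back so equality holds.
        return WarmStartPose(**{name: tuple(value) for name, value in fields.items()})

    def clear(self) -> None:
        self._file.unlink(missing_ok=True)
        self._sidecar.unlink(missing_ok=True)
```

One gap worth noting: the hint and its sidecar are renamed in two separate steps, so a kill between them leaves a mismatched pair and `load()` returns `None` rather than the prior hint. The sketch does not try to close that window; AC-10 relies on AZ-280 for that guarantee.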
+ +## Acceptance Criteria + +**AC-1: `WarmStartHintStore` round-trip** +Given an empty store directory and a constructed `WarmStartPose` instance +When `store.save(hint)` is called and then `store.load()` is called +Then `load()` returns a `WarmStartPose` deep-equal to the original hint; the on-disk file at `{store_dir}/c1_warm_start.json` exists; the sidecar at `{store_dir}/c1_warm_start.json.sha256` exists and verifies + +**AC-2: Corrupted hint file → `load()` returns `None` + WARN log** +Given a `c1_warm_start.json` whose actual sha256 does not match the sidecar +When `store.load()` is called +Then `None` is returned; ONE WARN log `kind="c1.warm_start.corrupted"` with the offending path is emitted; the file is NOT silently deleted (operator may want to forensically inspect) + +**AC-3: Cold-start path — no hint, no reset** +Given an empty store directory at process startup +When `prime_warm_start_from_disk(strategy, store)` is called +Then `store.load()` returns `None`; `strategy.reset_to_warm_start` is NOT invoked (verifiable via spy); ONE INFO log `kind="c1.warm_start.cold_start"` is emitted; the strategy proceeds with its own INIT-state behaviour + +**AC-4: F8 reboot path — hint loaded, `reset_to_warm_start` invoked** +Given a populated store directory with a known hint +When `prime_warm_start_from_disk(strategy, store)` is called +Then `store.load()` returns the hint; `strategy.reset_to_warm_start(hint)` is invoked exactly once with the loaded hint (verifiable via spy); ONE INFO log `kind="c1.warm_start.f8_reboot_disk"` and ONE FDR record `kind="vio.warm_start"` are emitted + +**AC-5: F2 takeoff path — FC adapter queried, hint persisted** +Given a constructed `FcAdapter` (test double) returning a known last-valid-GPS + IMU-extrapolated pose +When `prime_warm_start_from_fc(strategy, fc_adapter, store)` is called +Then a `WarmStartPose` constructed from the FC data is passed to `strategy.reset_to_warm_start`; the same hint is then saved via `store.save`; ONE 
INFO log `kind="c1.warm_start.f2_takeoff_fc"` and ONE FDR record `kind="vio.warm_start"` are emitted + +**AC-6: Per-frame save respects period** +Given `config.vio.warm_start_save_period_frames = 5` and a strategy emitting 12 successful `VioOutput`s +When the per-frame save hook is invoked once per emission +Then `store.save` is called exactly 2 times (after frames 5 and 10; frame 12 is mid-period); the on-disk hint reflects frame 10's `VioOutput`; the next save will occur after frame 15 + +**AC-7: Post-reset covariance inflation — first N frames inflated** +Given `config.vio.warm_start_max_frames = 5` and `config.vio.post_reset_covariance_inflation_factor = 2.0`, after `prime_warm_start_from_disk` invokes `reset_to_warm_start` +When the next 5 `VioOutput`s flow through the runtime root +Then each output's `pose_covariance_6x6` Frobenius norm is exactly 2.0× the strategy's emitted norm; the 6th frame's covariance is the strategy's unmodified emitted norm; the inflation is reflected in the consumer's view (C5 fusion sees the inflated covariance) + +**AC-8: AC-5.3 — post-reboot covariance never below pre-reboot** +Given a saved hint with `||pose_covariance_6x6||_F = X` (the last pre-reboot value, captured by the wiring at save time as a "baseline" alongside the hint) +When `prime_warm_start_from_disk` runs and the strategy emits 5 post-reset frames +Then every post-reset `VioOutput.pose_covariance_6x6` Frobenius norm is ≥ X (after the 2.0× inflation in AC-7); the AC-5.3 "no fake confidence" invariant is enforced at the wiring layer regardless of strategy behaviour + +**AC-9: `store.clear()` removes file + sidecar** +Given a populated store directory +When `store.clear()` is called +Then both `c1_warm_start.json` and `c1_warm_start.json.sha256` are removed; subsequent `store.load()` returns `None`; ONE INFO log `kind="c1.warm_start.cleared"` is emitted + +**AC-10: Atomic write — process kill mid-save leaves no half-written file** +Given a save in progress (mid-write) 
+When the process is killed +Then on next startup `store.load()` either returns the prior valid hint (the temp-file rename was not yet committed) or `None` if no prior hint existed; there is NO scenario where a half-written file is loaded as a "valid" hint (AZ-280 `Sha256Sidecar` atomic write + sidecar verify guarantee this) + +## Non-Functional Requirements + +**Performance** +- `store.save(hint)` p99 ≤ 50 ms on Tier-2 NVMe (a single atomic JSON write of ~1 KB + 64-byte sidecar). At a 3 Hz frame rate with `warm_start_save_period_frames = 5`, one save occurs every 5 / 3 Hz ≈ 1.67 s, so the amortised cost is ≤ 50 ms / 5 frames = 10 ms per frame (a disk-write duty cycle of about 3 %). +- `store.load()` p99 ≤ 20 ms on Tier-2 NVMe (one read + one sha256 verify of ~1 KB). +- Post-reset covariance inflation is a single matrix scalar multiplication per `VioOutput` — sub-microsecond cost; no measurable latency impact on the C1-PT-01 budget. + +**Compatibility** +- JSON schema for the hint file is fixed at v1; future schema changes require a `version` field and the AZ-280 sidecar pattern continues to handle bit-rot detection. +- The store directory MUST be on a writable mount with sufficient space (a few KB suffices); the operator deployment ensures this via the systemd unit. + +**Reliability** +- Atomic write + sidecar verify defends against process kill mid-save and against bit-flip. +- The post-reset covariance inflation is the only safety invariant enforced at the wiring layer; per-strategy honest-covariance behaviour during steady-state is enforced by the strategies themselves (AZ-332 / AZ-333 / AZ-334 each have an AC-9 honest-covariance contract). +- Failure of `prime_warm_start_*` MUST NOT crash the process — a malformed hint or a missing FC adapter response degrades to cold-start with a WARN log; the process continues.
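The save-period throttle (AC-6) and the post-reset covariance inflation (AC-7) are both small counters at the wiring layer. The sketch below is one possible shape, not the task's implementation: `SavePeriodGate`, `PostResetInflator`, and `frobenius_norm` are hypothetical names, and the covariance is modelled as a plain nested list to keep the example dependency-free. Multiplying every entry by a non-negative factor scales the Frobenius norm by the same factor, which is what AC-7 measures.

```python
from __future__ import annotations

import math


def frobenius_norm(matrix: list[list[float]]) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


class SavePeriodGate:
    """Lets every Nth successful frame through to the store (hypothetical helper)."""

    def __init__(self, period_frames: int = 5) -> None:
        if period_frames < 1:
            raise ValueError("period_frames must be >= 1")
        self._period = period_frames
        self._count = 0

    def should_save(self) -> bool:
        self._count += 1
        return self._count % self._period == 0


class PostResetInflator:
    """Scales the covariance for the first N outputs after a reset (hypothetical helper)."""

    def __init__(self, max_frames: int = 5, factor: float = 2.0) -> None:
        self._max_frames = max_frames
        self._factor = factor
        self._remaining = 0  # no inflation until a reset is recorded

    def on_reset(self) -> None:
        self._remaining = self._max_frames

    def apply(self, covariance: list[list[float]]) -> list[list[float]]:
        if self._remaining == 0:
            return covariance  # steady state: pass the strategy's value through
        self._remaining -= 1
        return [[self._factor * v for v in row] for row in covariance]
```

With the defaults, 12 emitted frames trigger saves after frames 5 and 10 only, and the sixth output after a reset is passed through unchanged. The AC-8 floor against the persisted pre-reboot norm is a separate check and is not modelled here.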
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | `save` then `load` round-trip | Loaded hint deep-equal to original; file + sidecar exist | +| AC-2 | Corrupt hint file (flip 1 byte) | `load()` returns `None`; WARN log emitted; file NOT deleted | +| AC-3 | `prime_warm_start_from_disk` with empty store | `reset_to_warm_start` NOT called (spy); INFO log `cold_start` | +| AC-4 | `prime_warm_start_from_disk` with valid hint | `reset_to_warm_start(hint)` called once; INFO log `f8_reboot_disk`; FDR record emitted | +| AC-5 | `prime_warm_start_from_fc` with fake FC adapter | Hint constructed from FC data; `reset_to_warm_start` called; `store.save` called; INFO log `f2_takeoff_fc`; FDR record emitted | +| AC-6 | Per-frame save with period=5, 12 frames | `store.save` called exactly twice (after frames 5 and 10) | +| AC-7 | Post-reset inflation × 5 frames | Each output's covariance Frobenius norm = 2.0× strategy's emitted norm; 6th frame is unmodified | +| AC-8 | Pre-reboot baseline X; post-reboot 5 frames | Every post-reset covariance ≥ X (after inflation) | +| AC-9 | `store.clear()` then `load()` | Both files removed; `load()` returns `None`; INFO log emitted | +| AC-10 | Mock process-kill mid-save | On restart, `load()` returns prior valid hint OR `None`; no half-written file ever loaded | +| NFR-perf-save | Microbench `store.save` × 1000 | p99 ≤ 50 ms on Tier-2 NVMe | +| NFR-perf-load | Microbench `store.load` × 1000 | p99 ≤ 20 ms on Tier-2 NVMe | +| NFR-reliability-no-crash | Inject malformed FC adapter response | `prime_warm_start_from_fc` logs WARN and returns; process does NOT crash | + +## Constraints + +- The persistence path uses AZ-280's `Sha256Sidecar` for atomic write + verify — no naked `Path.write_bytes` / `open().write` (per `coderule.mdc` "follow established project patterns"). 
+- The store interface is a Protocol; the JSON-sidecar implementation is the default but a future operator-managed store (e.g., Redis-backed) could plug in via the same interface. +- The post-reset covariance inflation lives at the wiring layer — NOT inside any strategy. Adding inflation to a strategy is forbidden (would double-inflate when the wiring also inflates). +- The runtime root reads the FC adapter via the constructor-injected `FcAdapter` interface (Layer 3 → Layer 4 interface-at-producer pattern; documented in `module-layout.md` Layering notes); direct import of any C8 concrete adapter is forbidden in this task's source. +- The hint file's JSON schema is owned by this task; its `version` field is `1` and any future change requires a major bump per the standard versioning rule. +- Per-frame save throttling defaults to every 5 frames (0.6 Hz at 3 Hz frame rate); the value is config-driven. +- The post-reset baseline (the pre-reboot Frobenius norm used as the AC-8 floor) is persisted alongside the hint in the JSON file under a `pre_reboot_covariance_norm` field; AC-8's enforcement reads it back at load time. + +## Risks & Mitigation + +**Risk 1: The store directory is not writable in the airborne deployment** +- *Risk*: Read-only root filesystem (a hardening choice some operators make) defeats `store.save`; every flight reverts to cold-start, blowing the AC-NEW-1 budget. +- *Mitigation*: `store.save` failures emit ERROR logs but do NOT crash the process; the operator's deployment playbook (out of scope here) ensures `config.vio.warm_start_store_dir` points at a writable mount. AC-NFR-reliability-no-crash covers the no-crash case. + +**Risk 2: Hint goes stale when the operator changes camera calibration between flights** +- *Risk*: Saved hint is for `adti26` calibration; operator swaps to a new camera; the hint's pose / bias are no longer applicable. 
+- *Mitigation*: The JSON schema includes a `calibration_id` field (the calibration's content hash); `load()` returns `None` if the current `CameraCalibration.id` does not match the saved hint's `calibration_id`; ONE WARN log `kind="c1.warm_start.calibration_mismatch"` is emitted. This forces a clean cold-start when calibration changes — correct behaviour. + +**Risk 3: Per-frame save throughput pressure on slow disks** +- *Risk*: A slow operator-provided storage device makes `store.save` exceed the 50 ms budget at default period; per-frame DEBUG log records the slowness but the sustained pressure could starve other I/O. +- *Mitigation*: The throttle period (`warm_start_save_period_frames`) is config-driven; an operator with slow storage can raise it to 30 (one save per 10 s at 3 Hz). The save itself is sync — no async queue this cycle. + +**Risk 4: Post-reset covariance inflation is too aggressive (or not aggressive enough)** +- *Risk*: 2.0× factor is a heuristic; if a strategy's natural post-reset behaviour is already inflated 3×, the wiring inflates further to 6× — over-cautious. If it's 1.1×, the wiring is barely-honest at 2.2×. +- *Mitigation*: The factor is config-driven (`post_reset_covariance_inflation_factor`); a future cycle's calibration test (Step 9 / E-BBT) will tune it per strategy. The 2.0× default is a safety conservative baseline; the AC-8 floor (post-reset ≥ pre-reboot) is the hard invariant. + +## Runtime Completeness + +- **Named capability**: cross-strategy warm-start hint persistence + F2 takeoff + F8 reboot recovery wiring + AC-5.3 honest-covariance enforcement at the wiring layer (architecture / E-C1 / `solution.md` "F2 takeoff load" + "F8 Companion-reboot recovery" / AC-5.1 + AC-5.3). 
+- **Production code that must exist**: real `JsonSidecarWarmStartHintStore` using AZ-280's atomic write + verify; real `prime_warm_start_from_fc` consuming a real `FcAdapter` interface; real `prime_warm_start_from_disk` invoked at process startup; real per-frame save hook in the runtime composition root; real post-reset covariance inflation wrapper. +- **Allowed external stubs**: tests MAY use a fake `FcAdapter` returning scripted FC data (AC-5); a fake `WarmStartHintStore` for testing the runtime hooks in isolation (AC-3 / AC-4 / AC-7 / AC-8); production wiring uses the real AZ-280 store + the real C8 FC adapter selected at composition root. +- **Unacceptable substitutes**: in-memory store that loses state across process restart (would defeat AC-4 / AC-5.3 entirely); naked `open().write` in place of AZ-280's atomic-write pattern (would lose AC-10 atomicity); per-strategy warm-start logic that bypasses the runtime root (would force every new strategy to reinvent the wiring); a 1.0× inflation factor (would defeat AC-8); reading the FC adapter via direct C8 module import (would violate Layer 3 → Layer 4 ban). 
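As an illustration of the F8 reboot hook named above, the following sketch shows the load, reset, and cold-start fall-through paths. It is written under assumptions: the string return value is added here purely so the chosen path is observable (the task only fixes the `(strategy, store)` parameters), the log messages reuse the `kind` labels from the acceptance criteria as plain text, and FDR record emission is omitted.

```python
from __future__ import annotations

import logging
from typing import Any, Protocol

log = logging.getLogger("c1.warm_start")


class HintSource(Protocol):
    def load(self) -> Any | None: ...


class Resettable(Protocol):
    def reset_to_warm_start(self, hint: Any) -> None: ...


def prime_warm_start_from_disk(strategy: Resettable, store: HintSource) -> str:
    """Run once at process startup, before the first process_frame call."""
    try:
        hint = store.load()
    except Exception:
        # A failing store must not crash the process; degrade to cold-start.
        log.warning("c1.warm_start.load_failed", exc_info=True)
        hint = None

    if hint is None:
        log.info("c1.warm_start.cold_start")
        return "cold_start_no_hint"

    try:
        strategy.reset_to_warm_start(hint)
    except Exception:
        log.error("c1.warm_start.reset_failed", exc_info=True)
        return "cold_start_no_hint"

    log.info("c1.warm_start.f8_reboot_disk")
    return "f8_reboot_disk"
```

Because the hook depends only on two structural protocols, a spy strategy and a scripted store are enough to exercise AC-3 and AC-4 without touching the filesystem.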
diff --git a/_docs/02_tasks/todo/AZ-336_c2_vpr_strategy_protocol.md b/_docs/02_tasks/todo/AZ-336_c2_vpr_strategy_protocol.md new file mode 100644 index 0000000..1d3bade --- /dev/null +++ b/_docs/02_tasks/todo/AZ-336_c2_vpr_strategy_protocol.md @@ -0,0 +1,183 @@ +# C2 VPR Strategy Protocol + Factory + Composition + +**Task**: AZ-336_c2_vpr_strategy_protocol +**Name**: C2 `VprStrategy` Protocol + Factory + Composition +**Description**: Define the public `VprStrategy` Protocol (PEP 544 structural interface), the `BackbonePreprocessor` C2-internal helper Protocol, the C2 DTOs (`VprQuery`, `VprCandidate`, `VprResult`), the error hierarchy (`VprError` family with `VprBackboneError`, `VprPreprocessError`, `IndexUnavailableError`), and the composition-root factory `build_vpr_strategy(config, tile_store, inference_runtime) -> VprStrategy` that selects the concrete backbone at startup based on `config.vpr.strategy` with lazy import + `BUILD_VPR_` flag gating per ADR-002. Includes a pre-flight `descriptor_dim()` ↔ C6 corpus sidecar `descriptor_dim` validation that fires at startup (NOT at first frame). This task delivers the foundational scaffolding every other C2 task depends on; no concrete backbone is implemented here. +**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-270_compose_root, AZ-303_c6_storage_interfaces (for `TileStore` + descriptor_dim sidecar), AZ-297_c7_runtime_protocol (for `InferenceRuntime` interface), AZ-266_log_module +**Component**: c2_vpr (epic AZ-255 / E-C2) +**Tracker**: AZ-336 +**Epic**: AZ-255 (E-C2) + +### Document Dependencies + +- `_docs/02_document/contracts/c2_vpr/vpr_strategy_protocol.md` — the public contract this task implements (Protocol surface + DTOs + error hierarchy + factory signature + invariants + test cases). +- `_docs/02_document/components/02_c2_vpr/description.md` — § 2 `VprStrategy` interface table + DTOs + § 5 error handling + § 6 helpers + § 9 logging. 
+- `_docs/02_document/module-layout.md` — § Per-Component Mapping `c2_vpr` (Public API + Internal + Owns + Imports from); § Build-Time Exclusion Map `BUILD_VPR_` row; § Layering — Layer 3. +- `_docs/02_document/architecture.md` — ADR-001 (Strategy + composition root), ADR-002 (build-time exclusion via CMake `BUILD_*` flags), ADR-009 (interface-first DI, composition root the only place that imports concrete strategies). +- `_docs/02_document/contracts/c6_tile_cache/descriptor_index.md` — descriptor index sidecar format (used by the pre-flight dim-match validation). +- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — `InferenceRuntime` interface consumed by every concrete backbone (referenced in the factory signature; not implemented in this task). + +## Problem + +Without this task, every concrete backbone (AZ-337..AZ-340) and the FAISS wiring (AZ-341) and the downstream consumer C2.5 ReRanker (AZ-256) would each invent their own ad-hoc interface, breaking three architectural invariants: + +- **ADR-001 (Strategy)**: backbones must be swappable at composition time; without a shared Protocol, swapping requires rewriting every consumer. +- **ADR-002 (build-time exclusion)**: each backbone is gated by `BUILD_VPR_`; without the lazy-import factory, any single TRT engine compile failure cascades into a hard import error at runtime, defeating per-binary exclusion. +- **ADR-009 (interface-first DI)**: the composition root must be the single place that knows about concrete strategy classes; consumers (C2.5, runtime root) must hold typed references to the Protocol only. Without the Protocol, every consumer would import the concrete `UltraVprStrategy` directly. + +The `descriptor_dim` validation also matters: without it, a config that points at the UltraVPR strategy (D=512) but the corpus index built for NetVLAD (D=4096) silently produces garbage retrievals — the kind of mismatch that should crash at startup, NOT at the first frame after takeoff. 
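The startup check described in the last paragraph can be sketched in a few lines. This is an illustration under assumptions: `ConfigurationError` is redefined locally as a stand-in for the project's own error type, and `HasDescriptorDim` and `check_descriptor_dim` are hypothetical names chosen for the example. Only the `descriptor_dim()` calls and the error message format come from this task.

```python
from typing import Protocol, runtime_checkable


class ConfigurationError(Exception):
    """Local stand-in for the project's configuration error type."""


@runtime_checkable
class HasDescriptorDim(Protocol):
    """Anything that can report its descriptor width (hypothetical helper)."""

    def descriptor_dim(self) -> int: ...


def check_descriptor_dim(strategy: HasDescriptorDim, tile_store: HasDescriptorDim) -> int:
    """Fail at startup if the backbone and the corpus index disagree on D."""
    strategy_dim = strategy.descriptor_dim()
    corpus_dim = tile_store.descriptor_dim()
    if strategy_dim != corpus_dim:
        raise ConfigurationError(
            f"descriptor_dim mismatch: strategy={strategy_dim}, corpus={corpus_dim}"
        )
    return strategy_dim
```

A strategy reporting 512 against a corpus reporting 4096 raises immediately, so the mismatch surfaces before any frame is processed. Note that `isinstance` against a `runtime_checkable` Protocol only checks that the method exists, not its signature or return type.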
+ +## Outcome + +- `src/gps_denied_onboard/components/c2_vpr/interface.py` defining: + - `VprStrategy` Protocol with `embed_query`, `retrieve_topk`, `descriptor_dim` (PEP 544 structural with `@runtime_checkable`). + - `BackbonePreprocessor` Protocol with `preprocess`, `input_shape`. + - All seven invariants from the contract documented in the Protocol's docstring. +- `src/gps_denied_onboard/components/c2_vpr/__init__.py` re-exporting the Protocol + DTOs (Public API per module-layout `c2_vpr` mapping). +- `src/gps_denied_onboard/_types/vpr.py` defining the three frozen + slotted dataclasses: `VprQuery`, `VprCandidate`, `VprResult`. Added under shared `_types/` because `VprResult` is consumed cross-component (by C2.5 ReRanker). +- `src/gps_denied_onboard/components/c2_vpr/errors.py` defining `VprError`, `VprBackboneError`, `VprPreprocessError`, `IndexUnavailableError`. +- `src/gps_denied_onboard/runtime_root/vpr_factory.py` exporting `build_vpr_strategy(config, tile_store, inference_runtime) -> VprStrategy`. The function: + 1. Reads `config.vpr.strategy` (one of: `ultra_vpr`, `net_vlad`, `mega_loc`, `mix_vpr`, `sela_vpr`, `eigen_places`, `salad`). + 2. Lazy-imports the concrete module via `importlib.import_module(f"gps_denied_onboard.components.c2_vpr.{config.vpr.strategy}")`. ImportError → `ConfigurationError(f"BUILD_VPR_{strategy.upper()} is OFF for this binary; cannot select strategy={strategy}")`. + 3. Constructs the strategy via its module-level `create(config, tile_store, inference_runtime)` factory function (each concrete strategy module exports `create` as its public entry-point — keeps `__init__.py` re-exports minimal). + 4. Pre-flight validation: queries `strategy.descriptor_dim()`; queries `tile_store.descriptor_dim()` (a small read of the FAISS index sidecar); on mismatch raises `ConfigurationError(f"descriptor_dim mismatch: strategy={strategy_dim}, corpus={corpus_dim}")`. + 5. Returns the instance. 
+- Composition-root `compose_root` extension: invoke `build_vpr_strategy` and bind the result to the C2 ingest thread (single-thread invariant per INV-1). +- Config schema extension to AZ-269: `config.vpr.strategy` (enum), `config.vpr.backbone_weights_path` (path), `config.vpr.faiss_index_path` (path); validated at config load. +- INFO log on every successful `build_vpr_strategy`: `kind="c2.vpr.strategy_loaded"` with strategy name + `descriptor_dim` value. ERROR log on `ConfigurationError` (with the specific dim mismatch or missing flag). + +## Scope + +### Included + +- The two Protocols (`VprStrategy`, `BackbonePreprocessor`) + their docstrings encoding all seven invariants from the contract. +- The three DTOs in `_types/vpr.py`. +- The four-class error hierarchy in `c2_vpr/errors.py`. +- The composition-root factory `build_vpr_strategy` with lazy-import + ImportError → `ConfigurationError` mapping + pre-flight `descriptor_dim` validation. +- Config schema extension for `config.vpr.{strategy, backbone_weights_path, faiss_index_path}`. +- Strategy resolution table comment in `vpr_factory.py` matching the contract's table verbatim. +- Unit tests covering: Protocol conformance for a fake strategy, factory rejection on missing flag (lazy-import → ImportError → `ConfigurationError`), factory rejection on dim mismatch, pre-flight INFO log emission, DTO immutability + slot enforcement. +- INFO / ERROR log emission per description.md § 9. + +### Excluded + +- Any concrete backbone implementation — owned by AZ-337 (UltraVPR), AZ-338 (NetVLAD), AZ-339 (MegaLoc + MixVPR), AZ-340 (SelaVPR + EigenPlaces + SALAD). +- FAISS HNSW retrieve wiring — owned by AZ-341. +- The `DescriptorNormaliser` helper — already AZ-283 (E-CC-HELPERS). +- Component-internal tests beyond Protocol-conformance + factory-validation: C2-IT-01 / C2-IT-02 / C2-IT-03 / C2-IT-04 / C2-PT-01 / C2-ST-01 are deferred to Step 9 / E-BBT. 
+- The C7 `InferenceRuntime` interface itself — owned by AZ-297; this task consumes the interface in the factory signature, does NOT define it. +- The C6 `TileStore` interface itself — owned by AZ-303; this task consumes the interface (`tile_store.descriptor_dim()` for pre-flight match) and the `TileStore` Public API at `components/c6_tile_cache/__init__.py`. + +## Acceptance Criteria + +**AC-1: Protocol conformance — fake strategy passes `runtime_checkable`** +Given a `FakeVprStrategy` test double implementing `embed_query`, `retrieve_topk`, `descriptor_dim` +When `isinstance(fake, VprStrategy)` is evaluated +Then the result is `True`; the same evaluation against an object missing any one method returns `False` + +**AC-2: DTO immutability + slots** +Given a constructed `VprQuery`, `VprCandidate`, `VprResult` +When attempting to mutate any field via attribute assignment +Then `FrozenInstanceError` is raised; `__slots__` is non-empty (verified via `cls.__slots__`); the dataclasses use `frozen=True, slots=True` + +**AC-3: Factory rejects missing build flag — ImportError → ConfigurationError** +Given `config.vpr.strategy = "vins_mono"` (a non-existent C2 strategy that simulates a missing build flag) AND a `tile_store` test double AND an `inference_runtime` test double +When `build_vpr_strategy(config, tile_store, inference_runtime)` is called +Then `ConfigurationError` is raised with message containing `"BUILD_VPR_VINS_MONO is OFF"`; ONE ERROR log `kind="c2.vpr.build_flag_off"` is emitted + +**AC-4: Factory rejects descriptor_dim mismatch** +Given `config.vpr.strategy = "ultra_vpr"` (FakeUltraVpr returns `descriptor_dim() = 512`) AND `tile_store.descriptor_dim()` returns 4096 +When `build_vpr_strategy(...)` is called +Then `ConfigurationError` is raised with message containing `"descriptor_dim mismatch: strategy=512, corpus=4096"`; ONE ERROR log `kind="c2.vpr.dim_mismatch"` is emitted; the strategy is NOT bound to the runtime root + +**AC-5: Successful factory load
emits INFO log** +Given `config.vpr.strategy = "ultra_vpr"` AND matching `descriptor_dim` AND a valid lazy-importable `ultra_vpr` test double module +When `build_vpr_strategy(...)` is called +Then a `VprStrategy` instance is returned; ONE INFO log `kind="c2.vpr.strategy_loaded"` is emitted with structured fields `{strategy: "ultra_vpr", descriptor_dim: 512}` + +**AC-6: Strategy resolution table — every entry resolves to its module path** +Given each of the seven valid `config.vpr.strategy` values +When `build_vpr_strategy` is called with each (assuming the module exists as a test double) +Then each call returns a `VprStrategy` instance; the resolved module path matches the contract's strategy resolution table verbatim (`gps_denied_onboard.components.c2_vpr.`) + +**AC-7: Error hierarchy — every concrete error is catchable as `VprError`** +Given test instances of `VprBackboneError`, `VprPreprocessError`, `IndexUnavailableError` +When caught by `except VprError` +Then all three are caught; `isinstance(err, VprError)` is `True` for each + +**AC-8: Public API surface — `__init__.py` re-exports** +Given `from gps_denied_onboard.components.c2_vpr import VprStrategy, VprQuery, VprResult` +When the import is evaluated +Then all three names resolve; `BackbonePreprocessor` is NOT in the Public API (C2-internal only — verified by `BackbonePreprocessor not in c2_vpr.__all__`) + +**AC-9: Strategy bound to single ingest thread by composition root** +Given a `compose_root(config)` invocation that wires C2 +When the resulting strategy is bound +Then the strategy is bound to exactly one ingest thread (verifiable via the runtime root's thread-binding registry); a second binding attempt to the same strategy raises `RuntimeError` + +## Non-Functional Requirements + +**Performance** +- `build_vpr_strategy` p99 ≤ 50 ms — the factory itself is a config read + lazy import + one `descriptor_dim()` call + one `tile_store.descriptor_dim()` call. 
Most of the construction cost lives inside the concrete strategy's `create(...)` function (TRT engine load — owned by AZ-337..AZ-340), NOT in this task. +- Pre-flight validation overhead is bounded by the C6 sidecar read: ≤ 5 ms at p99. + +**Compatibility** +- The `VprStrategy` Protocol is a major API surface; any change to method signatures is a breaking change requiring a coordinated update of every implementation (lockstep — see Versioning in the contract). +- DTO field additions follow the standard "frozen dataclass + new optional field with default" pattern. + +**Reliability** +- Lazy-import via `importlib.import_module` — a build-time-excluded backbone's import never executes (no native library load attempted, no CUDA initialisation, no TRT runtime instantiation). +- Pre-flight `descriptor_dim` validation catches the silent-garbage failure mode (config + corpus mismatch) at startup. +- Single-thread invariant enforced by composition root binding (AC-9); the strategy itself is not responsible for thread safety. 
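
The lazy-import and pre-flight `descriptor_dim` behaviour described in the requirements above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the `_STRATEGY_MODULES` table, the `demo_vpr_*` module names and the `SimpleNamespace` test doubles are hypothetical, and only the missing-module case is mapped to the build-flag message. The names `build_vpr_strategy`, `create(config, tile_store, inference_runtime)`, `ConfigurationError` and the two message formats are taken from the task text.

```python
import importlib
import sys
import types
from types import SimpleNamespace


class ConfigurationError(Exception):
    """Stand-in for the project's configuration error type."""


# Hypothetical resolution table; the real mapping lives in the contract document.
_STRATEGY_MODULES = {
    "ultra_vpr": "demo_vpr_ultra_vpr",
    "vins_mono": "demo_vpr_vins_mono",  # deliberately never registered below
}


def build_vpr_strategy(config, tile_store, inference_runtime):
    name = config.vpr.strategy
    try:
        # Lazy import: a build-excluded backbone module is never executed.
        module = importlib.import_module(_STRATEGY_MODULES[name])
    except ModuleNotFoundError as exc:
        raise ConfigurationError(f"BUILD_VPR_{name.upper()} is OFF") from exc
    strategy = module.create(config, tile_store, inference_runtime)
    got, want = strategy.descriptor_dim(), tile_store.descriptor_dim()
    if got != want:
        # Fail at startup instead of returning garbage at the first frame.
        raise ConfigurationError(
            f"descriptor_dim mismatch: strategy={got}, corpus={want}"
        )
    return strategy


# Register a fake backbone module so the sketch runs without real backbones.
_fake = types.ModuleType("demo_vpr_ultra_vpr")
_fake.create = lambda config, tile_store, inference_runtime: SimpleNamespace(
    descriptor_dim=lambda: 512
)
sys.modules[_fake.__name__] = _fake

config = SimpleNamespace(vpr=SimpleNamespace(strategy="ultra_vpr"))
matching_store = SimpleNamespace(descriptor_dim=lambda: 512)
mismatched_store = SimpleNamespace(descriptor_dim=lambda: 4096)

loaded = build_vpr_strategy(config, matching_store, inference_runtime=None)
```

Calling the same factory with `mismatched_store` raises `ConfigurationError` before any strategy is handed to the caller, which is the behaviour AC-4 asks for.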
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | `runtime_checkable` Protocol conformance | Fake strategy passes; partial fake fails | +| AC-2 | DTO immutability + slots | `FrozenInstanceError` on mutation; `__slots__` non-empty | +| AC-3 | Factory + nonexistent backbone module | `ConfigurationError("BUILD_VPR_ is OFF")`; ERROR log emitted | +| AC-4 | Factory + dim mismatch | `ConfigurationError("descriptor_dim mismatch: strategy=X, corpus=Y")`; ERROR log emitted; no binding | +| AC-5 | Factory + valid load | Strategy instance returned; INFO log emitted with structured fields | +| AC-6 | Each of 7 strategy values | Each resolves to correct module path | +| AC-7 | Error catchability | All three concrete errors caught by `except VprError` | +| AC-8 | Public API re-exports | `VprStrategy`, `VprQuery`, `VprResult` resolve; `BackbonePreprocessor` not in Public API | +| AC-9 | Single-thread binding | First binding succeeds; second on same instance raises `RuntimeError` | +| NFR-perf-factory | Microbench `build_vpr_strategy` × 100 with mock concretes | p99 ≤ 50 ms | +| NFR-perf-validate | Microbench pre-flight `descriptor_dim` check × 100 | p99 ≤ 5 ms | + +## Constraints + +- **No business logic beyond Protocol + factory + DTOs + errors.** The factory's pre-flight `descriptor_dim` check is the ONLY runtime computation this task performs. +- **Lazy import is mandatory** — direct `from gps_denied_onboard.components.c2_vpr.ultra_vpr import UltraVprStrategy` in the factory is forbidden (would defeat ADR-002 build-time exclusion). +- **`@runtime_checkable` MUST be used** — INV-1 isolates the binding-side enforcement of single-thread invariant; runtime_checkable lets composition root assert via `isinstance` without forcing every consumer to import the Protocol. 
+- **DTOs MUST be `frozen=True, slots=True`** — immutability prevents accidental mutation across thread boundaries; slots reduces memory footprint at 3 Hz frame rate × N seconds. +- **Strategy modules export `create(config, tile_store, inference_runtime)` as their entry-point** — keeps the factory's lazy-import surface uniform; per-strategy constructors stay private. +- **`BackbonePreprocessor` is C2-internal** — must NOT be re-exported from `c2_vpr/__init__.py` (would violate description.md § 6 "C2-internal helper, NOT a shared helper"). +- **Config schema field `config.vpr.strategy` is an enum** validated at config load — typo'd values fail before the factory runs. + +## Risks & Mitigation + +**Risk 1: `runtime_checkable` Protocol checks have known performance cost** +- *Risk*: `isinstance(obj, RuntimeCheckableProtocol)` walks the method table; called per-frame at 3 Hz × 7 strategies it could add measurable overhead. +- *Mitigation*: `isinstance` is called ONCE at composition-root binding time (AC-9), NOT per-frame. The per-frame path uses the bound concrete reference. Test asserts the binding-time check is the only `isinstance` call site. + +**Risk 2: Lazy-import error message obscures the real failure mode** +- *Risk*: A native library (e.g., FAISS or TensorRT) failing to load triggers `ImportError` from the lazy import, which the factory currently maps to "BUILD flag OFF" — but the actual cause may be a missing `.so` or version mismatch. +- *Mitigation*: The factory catches `ImportError`, inspects `e.msg`; if the message contains "No module named" → "BUILD flag OFF" (the build-time-excluded case); otherwise re-raises the original ImportError preserving the native-library context. AC-3 covers the build-flag case; a separate test covers the native-library load case. 
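
The Risk 2 mitigation (map only the "module absent" case to the build-flag message, re-raise everything else) might look like the sketch below. The helper name `import_backbone` and its `build_flag` parameter are hypothetical; `ImportError.msg` is a standard attribute, and the `"No module named"` substring test follows the wording of the mitigation rather than a confirmed implementation.

```python
import importlib


class ConfigurationError(Exception):
    """Stand-in for the project's configuration error type."""


def import_backbone(module_path: str, build_flag: str):
    """Import a backbone module, mapping only the missing-module case."""
    try:
        return importlib.import_module(module_path)
    except ImportError as exc:
        # A build-excluded module surfaces as "No module named ...".
        if "No module named" in (exc.msg or ""):
            raise ConfigurationError(f"{build_flag} is OFF") from exc
        # Anything else (missing shared library, version mismatch) is
        # re-raised unchanged so the native-library context is preserved.
        raise
```

One caveat worth noting: a backbone module that exists but imports a missing dependency also produces a "No module named" message. Comparing the standard `exc.name` attribute against `module_path` would be a tighter test, at the cost of departing from the message check described in the mitigation.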
+ +**Risk 3: `descriptor_dim` mismatch is detected too late if the corpus sidecar is corrupted** +- *Risk*: A bit-flipped corpus sidecar reports the wrong `descriptor_dim`; the factory passes the validation but every retrieval returns garbage. +- *Mitigation*: The C6 sidecar uses the AZ-280 `Sha256Sidecar` pattern; corruption is detected at sidecar load time (C6's responsibility, NOT this task's). The factory's contract is "match the sidecar's declared dim"; if the sidecar itself is wrong, that's a C6 bug. + +**Risk 4: `compose_root` thread-binding registry is not yet implemented** +- *Risk*: AC-9 references a "thread-binding registry" that AZ-270 (`compose_root`) may not yet provide. +- *Mitigation*: This task's Public API is the factory; the runtime root is responsible for thread binding. If AZ-270 has not yet implemented the registry, this task delivers AC-1..AC-8 + a stub `bind_to_thread(strategy)` interface that AZ-270 fills in. AC-9 is gated on AZ-270's progress and may move to a follow-up task if the registry isn't ready. **Decision**: keep AC-9 in this task; if AZ-270 lacks the registry by implementation time, AZ-270 is the upstream blocker — escalate via the standard tracker dependency mechanism. + +## Runtime Completeness + +- **Named capability**: cross-strategy `VprStrategy` Protocol + composition-root factory + pre-flight `descriptor_dim` validation + ADR-002 build-time exclusion enforcement (architecture / E-C2 / `solution.md` "Strategy + multiple backbones" / ADR-001 + ADR-002 + ADR-009). +- **Production code that must exist**: real `VprStrategy` Protocol + real DTOs + real error hierarchy + real `build_vpr_strategy` factory with real lazy-import + real ImportError mapping + real `descriptor_dim` validation + real config schema extension. +- **Allowed external stubs**: tests MAY use `FakeVprStrategy`, `FakeBackbonePreprocessor`, `FakeTileStore` returning a fixed `descriptor_dim`, `FakeInferenceRuntime`. 
Production wiring uses real concrete strategies (selected from AZ-337..AZ-340 at composition time) + the real C6 `TileStore` + the real C7 `InferenceRuntime`. +- **Unacceptable substitutes**: direct `from gps_denied_onboard.components.c2_vpr.ultra_vpr import UltraVprStrategy` in the factory (would defeat ADR-002); a `Type[VprStrategy]` registry that pre-imports all 7 backbones (would defeat lazy-import); skipping the `descriptor_dim` pre-flight check (would let dim mismatch crash at first frame instead of startup); using `frozen=False` dataclasses (would let consumers mutate `VprResult` candidates list); making `BackbonePreprocessor` part of the Public API (would let other components import it, violating description.md § 6). diff --git a/_docs/02_tasks/todo/AZ-337_c2_ultra_vpr.md b/_docs/02_tasks/todo/AZ-337_c2_ultra_vpr.md new file mode 100644 index 0000000..f74e8b7 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-337_c2_ultra_vpr.md @@ -0,0 +1,237 @@ +# C2 UltraVPR Primary Backbone + +**Task**: AZ-337_c2_ultra_vpr +**Name**: C2 UltraVPR Primary Backbone (TRT) +**Description**: Implement `UltraVprStrategy`, the production-default `VprStrategy` (per ADR-001 default selection). UltraVPR is the Documentary Lead's PRIMARY backbone selected at config time and ON in airborne / research / replay-cli binaries (per ADR-002 build-time exclusion map). Wraps the upstream UltraVPR research code drop, exposes its forward pass via the C7 `InferenceRuntime` (TensorRT 10.3 primary, ONNX-Runtime fallback), and produces L2-normalised float16 embeddings (D=512 typical) for FAISS HNSW retrieval. Includes the concrete `UltraVprBackbonePreprocessor` (resize / centre-crop / mean-std normalise per UltraVPR's input contract). The strategy MUST satisfy the AC-2.1b recall@10 ≥ 0.95 floor on the Derkachi normal segment and the C2-PT-01 latency budget (`embed_query` p95 ≤ 60 ms). 
+**Complexity**: 5 points +**Dependencies**: AZ-336_c2_vpr_strategy_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-298_c7_tensorrt_runtime, AZ-303_c6_storage_interfaces, AZ-283_descriptor_normaliser, AZ-281_engine_filename_schema, AZ-321_c10_engine_compiler (engine compile path; UltraVPR engine is one of the engines C10 builds), AZ-266_log_module, AZ-272_fdr_record_schema +**Component**: c2_vpr (epic AZ-255 / E-C2) +**Tracker**: AZ-337 +**Epic**: AZ-255 (E-C2) + +### Document Dependencies + +- `_docs/02_document/contracts/c2_vpr/vpr_strategy_protocol.md` — Protocol contract this task implements (every invariant MUST be satisfied). +- `_docs/02_document/components/02_c2_vpr/description.md` — § 1 PRIMARY backbone designation; § 2 interface; § 5 backbone weights ≤ 600 MB GPU; § 7 GPU stream race notes; § 9 logging. +- `_docs/02_document/module-layout.md` — `c2_vpr` Per-Component Mapping (`ultra_vpr.py` Internal); `BUILD_VPR_ULTRA_VPR` row in build-time exclusion map (ON for airborne/research/replay-cli, OFF for operator-tooling). +- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — `InferenceRuntime` interface (engine load, forward pass, output extraction). +- `_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md` — L2 normalisation contract (UltraVPR raw embeddings are NOT L2-normalised; this task MUST normalise). +- `_docs/02_document/contracts/shared_helpers/engine_filename_schema.md` — TensorRT engine filename → metadata extraction. +- `_docs/02_document/components/02_c2_vpr/tests.md` — C2-IT-01 (recall@10 ≥ 0.95 on Derkachi); C2-IT-02 (`VprResult` invariants); C2-PT-01 (`embed_query` p95 ≤ 60 ms; combined ≤ 65 ms; ≤ 600 MB GPU; ≤ 200 MB sys mem). + +## Problem + +UltraVPR is the production-default backbone (description.md § 1 "Documentary Lead PRIMARY backbone"). 
Without this task: + +- The composition root has no concrete strategy to wire when `config.vpr.strategy = "ultra_vpr"` (the default value); the airborne binary cannot start. +- AC-2.1b (recall@10 ≥ 0.95 on Derkachi) — the highest-priority C2 acceptance criterion — has no producer; the suite-level FT-P-19 satellite re-loc test cannot pass. +- AC-4.1 latency budget for VPR is allocated against UltraVPR specifically (60 ms `embed_query`); without the TRT-backed implementation, the budget is unconsumable and the E2E latency target (400 ms p95) cannot be validated. +- The 600 MB GPU memory ceiling for backbone weights is enforced at the implementation layer; without it, no operator can validate the airborne deployment fits the Tier-1 Jetson Orin's GPU memory budget. +- UltraVPR has a non-trivial input preprocessing contract (specific resize target, centre-crop, ImageNet mean/std normalisation, FP16 cast); without `UltraVprBackbonePreprocessor`, every consumer would re-derive the contract → silent recall regression. + +## Outcome + +- `src/gps_denied_onboard/components/c2_vpr/ultra_vpr.py` defining: + - `UltraVprStrategy` class implementing the `VprStrategy` Protocol (AZ-336). + - Constructor signature: `__init__(self, runtime: InferenceRuntime, tile_store: TileStore, weights_path: Path, preprocessor: UltraVprBackbonePreprocessor, normaliser: DescriptorNormaliser, fdr_client: FdrClient)`. + - `embed_query(frame, calibration)`: + 1. `tensor = self._preprocessor.preprocess(frame, calibration)` (returns FP16 NCHW (1, 3, H, W)). + 2. `raw = self._runtime.forward(self._engine_id, {"input": tensor})["embedding"]` (returns FP16 (1, 512)). + 3. `embedding = self._normaliser.l2_normalise(raw[0])` (returns FP16 (512,) with `||embedding||_2 == 1.0 ± 1e-3`). + 4. Return `VprQuery(frame_id, embedding, produced_at=monotonic_ns())`. + 5. Catch RuntimeError / CudaError → wrap in `VprBackboneError`; emit ERROR log + FDR record `kind="vpr.backbone_error"`. 
+ - `retrieve_topk(query, k)`: + 1. `distances, tile_ids = self._tile_store.faiss_topk(query.embedding, k)` (delegates to C6 TileStore Public API). + 2. Build `[VprCandidate(tile_id, distance, descriptor_dim=512) for ...]`. + 3. Return `VprResult(query.frame_id, candidates, retrieved_at=monotonic_ns(), backbone_label="ultra_vpr")`. + 4. On `IndexUnavailableError` (raised by C6 TileStore on stale handle), re-raise unchanged. + - `descriptor_dim() -> int`: returns 512 (the UltraVPR research code drop's published embedding dim; the value is asserted at engine-load time against the engine's output tensor shape; mismatch → `ConfigurationError` at startup, see `create` step 5). + - Module-level `create(config, tile_store, inference_runtime) -> VprStrategy`: + 1. Resolve `weights_path = config.vpr.backbone_weights_path` (a TensorRT engine file produced by C10's engine compiler — AZ-321 — with the AZ-281 self-describing filename schema). + 2. Construct `UltraVprBackbonePreprocessor(input_shape=(384, 384), mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))` (parameters from the upstream UltraVPR config, hard-coded here per CONST.SRP — these are weights-coupled, not config-knobs). + 3. Construct `DescriptorNormaliser` (or fetch from helpers; AZ-283). + 4. Load engine via `inference_runtime.load_engine(weights_path)` — the engine ID is captured for later `forward` calls. + 5. Assert engine output shape == `(1, 512)` FP16; mismatch → `ConfigurationError`. + 6. Construct and return `UltraVprStrategy(...)`. +- `src/gps_denied_onboard/components/c2_vpr/_preprocessor_ultra_vpr.py` (or `_preprocessor.py` shared scaffolding + concrete `UltraVprBackbonePreprocessor`): + - Implements `BackbonePreprocessor` Protocol from AZ-336. + - `preprocess(frame, calibration)`: + 1. Decode `frame.image_bytes` to RGB uint8 ndarray (H_in, W_in, 3) via OpenCV / Pillow. + 2. Centre-crop to a square region of side `min(H_in, W_in)` using calibration's principal point if non-centre (otherwise geometric centre).
Calibration is consumed here for principal-point alignment per the upstream UltraVPR contract; if calibration is absent, fall back to geometric centre with a WARN log. + 3. Resize to `(384, 384)` via OpenCV `INTER_AREA` for downscale, `INTER_CUBIC` for upscale. + 4. Normalise: `(pixel/255.0 - mean) / std` per channel; cast to FP16. + 5. Transpose HWC → CHW; add batch dim → NCHW. + 6. Return ndarray of shape `(1, 3, 384, 384)` dtype float16. + - `input_shape() -> tuple[int, ...]`: returns `(384, 384)`. + - On any preprocessing failure (corrupt image bytes, calibration mismatch), raise `VprPreprocessError` and emit ERROR log + FDR record `kind="vpr.preprocess_error"`. +- Composition-root wiring: `runtime_root.compose_root` includes a path that, when `config.vpr.strategy == "ultra_vpr"`, calls `UltraVprStrategy.create(config, tile_store, inference_runtime)` via the AZ-336 factory. +- Logging per description.md § 9: + - INFO `kind="c2.vpr.ready"` with `{strategy: "ultra_vpr", descriptor_dim: 512, corpus_size: }` after engine load. + - WARN `kind="c2.vpr.top1_distance_above_threshold"` if top-1 distance > `config.vpr.warn_top1_threshold` (default 0.30). + - ERROR `kind="c2.vpr.backbone_error"` and `kind="c2.vpr.preprocess_error"` per error path. + - DEBUG `kind="c2.vpr.frame_distances"` with top-K distances per frame (gated by config; off by default to avoid log volume at 3 Hz). +- FDR records emitted: `kind="vpr.embed_query"` (per frame, with frame_id + backbone_label + bbox of distances), `kind="vpr.backbone_error"` and `kind="vpr.preprocess_error"` (per error). + +## Scope + +### Included + +- `UltraVprStrategy` class implementing the `VprStrategy` Protocol exactly per the AZ-336 contract. +- `UltraVprBackbonePreprocessor` implementing `BackbonePreprocessor` Protocol with the upstream UltraVPR's published preprocessing parameters. +- Module-level `create(config, tile_store, inference_runtime)` factory entry-point. 
+- Engine-output-shape assertion at load time (`(1, 512)` FP16); mismatch → `ConfigurationError`. +- L2-normalisation of every embedding via the AZ-283 `DescriptorNormaliser` helper. +- Composition-root wiring path for `config.vpr.strategy == "ultra_vpr"`. +- Logging per description.md § 9 (INFO ready, WARN top-1-above-threshold, ERROR error paths, DEBUG per-frame distances). +- FDR record emission for embed-query and error paths. +- Unit tests covering all 7 invariants (INV-1..INV-7), the engine-output-shape assertion, the preprocessing contract, the L2-normalisation post-condition, the composition-root wiring path. +- `BUILD_VPR_ULTRA_VPR` CMake flag wiring (per ADR-002): the strategy module is excluded from the operator-tooling binary. + +### Excluded + +- The `VprStrategy` Protocol + `BackbonePreprocessor` Protocol + DTOs + errors + factory — owned by AZ-336. +- The `DescriptorNormaliser` helper — already AZ-283. +- The C7 `InferenceRuntime` (engine load + forward pass) — owned by AZ-298 (TensorRT runtime). +- The C6 `TileStore.faiss_topk` query — owned by AZ-303 / AZ-306; this task consumes the Public API. +- Engine compile (`.onnx` → `.trt`) — owned by AZ-321 (`c10_engine_compiler`); this task consumes the produced `.trt` engine via `config.vpr.backbone_weights_path`. +- Other backbones — AZ-338 (NetVLAD), AZ-339 (MegaLoc + MixVPR), AZ-340 (SelaVPR + EigenPlaces + SALAD). +- FAISS HNSW wiring at the strategy level — `retrieve_topk` delegates to `tile_store.faiss_topk`; the FAISS index lifecycle (mmap, sidecar verify, handle invalidation) is owned by AZ-341. +- Component-internal tests beyond Protocol + invariants + preprocessing-contract: C2-IT-01 (recall@10 acceptance test), C2-IT-03 (poisoned-tile), C2-IT-04 (scale-ratio), C2-PT-01 (latency NFR), C2-ST-01 (stale handle) are deferred to Step 9 / E-BBT. 
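
The arithmetic behind the preprocessing steps listed under Outcome (square centre-crop, then per-channel ImageNet normalisation) can be sketched in plain Python. This is an illustration only: the function names are hypothetical, and the way the principal point positions the crop (square centred on it, then clamped to the image) is an assumption, since the task text does not spell it out. Decoding, resizing and the FP16 cast are left out.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def centre_crop_box(height, width, principal_point=None):
    """Return (top, left, side) of a square crop of side min(height, width).

    Assumption: the square is centred on the principal point (cx, cy) when
    one is given, then clamped so that it stays inside the image.
    """
    side = min(height, width)
    if principal_point is None:
        cx, cy = width / 2.0, height / 2.0  # geometric-centre fallback
    else:
        cx, cy = principal_point
    left = int(round(cx - side / 2.0))
    top = int(round(cy - side / 2.0))
    left = max(0, min(left, width - side))
    top = max(0, min(top, height - side))
    return top, left, side


def normalise_pixel(rgb):
    """Apply (pixel / 255 - mean) / std to one RGB uint8 triple."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )
```

For a uniform-grey input, `normalise_pixel((128, 128, 128))` gives the per-channel reference values that the "Preprocess-mean-std" unit test would compare against.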
+ +## Acceptance Criteria + +**AC-1: Protocol conformance** +Given a constructed `UltraVprStrategy` instance +When `isinstance(strategy, VprStrategy)` is evaluated +Then the result is `True`; the instance has `embed_query`, `retrieve_topk`, `descriptor_dim` + +**AC-2: `embed_query` produces L2-normalised FP16 (512,) embedding** +Given a valid `NavCameraFrame` and `CameraCalibration` +When `strategy.embed_query(frame, calibration)` is called +Then a `VprQuery` is returned with `embedding.shape == (512,)`, `embedding.dtype == np.float16`, `||embedding||_2 == 1.0 ± 1e-3` + +**AC-3: `embed_query` is deterministic (INV-2 + INV-6)** +Given the same frame + calibration +When `embed_query` is called 3 times +Then all three returned `embedding` arrays are bit-exact (ULP-tolerant for FP16); `produced_at` differs across calls but `embedding` and `frame_id` do not + +**AC-4: `retrieve_topk` returns exactly k candidates sorted ascending** +Given a corpus of 100 tiles loaded into C6 TileStore + a constructed `VprQuery` +When `strategy.retrieve_topk(query, k=10)` is called +Then `len(candidates) == 10`; `[c.descriptor_distance for c in candidates]` is non-strictly-ascending; `backbone_label == "ultra_vpr"`; `candidates[0].descriptor_dim == 512` + +**AC-5: `descriptor_dim()` is stable and returns 512** +Given a constructed `UltraVprStrategy` +When `descriptor_dim()` is called 100 times +Then every call returns `512` + +**AC-6: Engine output shape mismatch at load → `ConfigurationError`** +Given a TRT engine whose output tensor shape is `(1, 256)` (not 512) +When `UltraVprStrategy.create(config, tile_store, inference_runtime)` is called +Then `ConfigurationError` is raised with message containing `"engine output shape mismatch: expected (1, 512), got (1, 256)"`; the strategy is NOT instantiated + +**AC-7: `VprBackboneError` on forward-pass failure** +Given an `InferenceRuntime` test double that raises `RuntimeError` from `forward` +When `strategy.embed_query(frame, calibration)`
is called +Then `VprBackboneError` is raised; ONE ERROR log `kind="c2.vpr.backbone_error"` is emitted; ONE FDR record `kind="vpr.backbone_error"` is emitted + +**AC-8: `VprPreprocessError` on corrupt image bytes** +Given a `NavCameraFrame` with malformed `image_bytes` (not decodable) +When `strategy.embed_query(frame, calibration)` is called +Then `VprPreprocessError` is raised; ONE ERROR log `kind="c2.vpr.preprocess_error"` is emitted; ONE FDR record `kind="vpr.preprocess_error"` is emitted + +**AC-9: Calibration absent → centre-crop falls back to geometric centre + WARN log** +Given a frame with `calibration = None` (or `calibration.principal_point` absent) +When `embed_query(frame, calibration)` is called +Then preprocessing succeeds with geometric-centre crop; ONE WARN log `kind="c2.vpr.calibration_missing"` is emitted; the embedding is L2-normalised (AC-2 still holds) + +**AC-10: `IndexUnavailableError` propagated unchanged from `retrieve_topk`** +Given a C6 `TileStore` test double that raises `IndexUnavailableError` from `faiss_topk` +When `strategy.retrieve_topk(query, k=10)` is called +Then `IndexUnavailableError` is raised unchanged (NOT wrapped); no candidates returned + +**AC-11: Composition-root wiring — `config.vpr.strategy = "ultra_vpr"`** +Given `config.vpr.strategy = "ultra_vpr"` AND a valid weights_path AND matching `descriptor_dim` in C6 sidecar +When `compose_root(config)` runs +Then an `UltraVprStrategy` instance is wired into the runtime root; the AZ-336 factory's pre-flight `descriptor_dim` validation passes; ONE INFO log `kind="c2.vpr.ready"` with `{strategy: "ultra_vpr", descriptor_dim: 512, corpus_size: }` is emitted + +**AC-12: WARN log on top-1 distance above threshold** +Given `config.vpr.warn_top1_threshold = 0.30` AND a `VprResult` whose top-1 `descriptor_distance = 0.42` +When `retrieve_topk` returns +Then ONE WARN log `kind="c2.vpr.top1_distance_above_threshold"` with structured field `{distance: 0.42, threshold: 0.30}` is emitted +
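
The AC-2 post-condition (unit L2 norm within 1e-3 after the FP16 cast) and the AC-4 ordering check can be expressed with the standard library alone. The sketch below is illustrative and its function names are hypothetical; it uses the `struct` module's half-precision `e` format to emulate the FP16 cast, whereas the real strategy would work on NumPy `float16` arrays through `DescriptorNormaliser`.

```python
import math
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def l2_normalise_fp16(vector):
    """Scale a vector to unit L2 norm, then quantise each entry to FP16."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [to_fp16(v / norm) for v in vector]


def is_unit_norm(vector, tol=1e-3):
    """AC-2 style check: the L2 norm equals 1.0 within the given tolerance."""
    return abs(math.sqrt(sum(v * v for v in vector)) - 1.0) <= tol


def is_non_strictly_ascending(values):
    """AC-4 style check: every distance is <= the one that follows it."""
    return all(a <= b for a, b in zip(values, values[1:]))
```

The 1e-3 tolerance leaves room for the FP16 rounding step: each entry carries a relative error of at most 2^-11 (about 4.9e-4), so the norm of the quantised vector stays inside the band.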
+## Non-Functional Requirements + +**Performance** (deferred validation to C2-PT-01 / E-BBT; this task delivers the implementation): +- `embed_query` p95 ≤ 60 ms on Tier-1 Jetson Orin with TensorRT 10.3 FP16 — bounded by the TRT engine forward-pass time + preprocessing overhead. The preprocessing path itself MUST be ≤ 5 ms p95 (so the TRT call has ~55 ms budget). +- `retrieve_topk` p95 ≤ 2 ms — bounded by C6 FAISS HNSW; this task contributes only the Python wrapping overhead. +- GPU memory: ≤ 600 MB resident for backbone weights (FP16 engine ~ 100-150 MB; remainder is workspace). +- System memory: ≤ 200 MB for the mmap'd FAISS index handle (C6 owns this; this task consumes). + +**Compatibility** +- The TRT engine file format is owned by C10 / C7; this task consumes the produced `.trt` engine via `config.vpr.backbone_weights_path`. Engine version mismatches surface via the AZ-281 self-describing filename schema; the C7 `load_engine` enforces compatibility. +- The upstream UltraVPR research code drop is pinned per Plan-phase; weight-format changes between drops would require a new engine build (C10) and a re-run of C2-IT-01 to confirm recall@10 still passes. + +**Reliability** +- Strategy is single-threaded by contract (INV-1, AZ-336); composition root binds to one ingest thread. +- L2-normalisation is unconditional (INV-3); raw UltraVPR embeddings are not L2-normalised by the upstream forward pass. +- `VprBackboneError` does not crash the process; downstream C5 falls back to VIO-only with provenance `visual_propagated` (AC-1.4). 
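
The p95 budgets above are percentile statements over many calls, so a microbenchmark has to collect per-call durations and take a percentile rather than an average. A minimal sketch, assuming the nearest-rank percentile definition (the task text does not say which definition C2-PT-01 uses) and hypothetical helper names:

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def measure_ms(fn, runs=100):
    """Call fn repeatedly and return the per-call wall time in milliseconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        fn()
        durations.append((time.perf_counter_ns() - start) / 1e6)
    return durations
```

A budget check would then read `percentile(measure_ms(call), 95) <= 60.0`, where `call` wraps one `embed_query` invocation on the target hardware.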
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | `isinstance(UltraVprStrategy(...), VprStrategy)` | `True` | +| AC-2 | `embed_query` output | shape (512,), dtype float16, L2-norm == 1.0 ± 1e-3 | +| AC-3 | `embed_query` × 3 same frame | bit-exact embeddings (ULP-tolerant FP16) | +| AC-4 | `retrieve_topk` against fixture corpus | `len == 10`, sorted ascending, `backbone_label == "ultra_vpr"`, `descriptor_dim == 512` | +| AC-5 | `descriptor_dim()` × 100 | always 512 | +| AC-6 | TRT engine with wrong output shape | `ConfigurationError` at create time | +| AC-7 | `InferenceRuntime.forward` raises | `VprBackboneError`; ERROR log + FDR record | +| AC-8 | malformed `image_bytes` | `VprPreprocessError`; ERROR log + FDR record | +| AC-9 | `calibration = None` | preprocessing succeeds with geometric centre; WARN log | +| AC-10 | `tile_store.faiss_topk` raises `IndexUnavailableError` | propagated unchanged | +| AC-11 | `compose_root(config="ultra_vpr")` | wired; INFO log with `{strategy, descriptor_dim, corpus_size}` | +| AC-12 | top-1 distance > threshold | WARN log emitted | +| Preprocess-shape | `preprocessor.preprocess(frame)` output | shape `(1, 3, 384, 384)`, dtype float16 | +| Preprocess-mean-std | preprocessing on a uniform-grey image | per-channel `(grey - mean) / std` matches expected to ULP | +| Preprocess-input-shape | `preprocessor.input_shape()` | returns `(384, 384)` | + +## Constraints + +- **The `BackbonePreprocessor` instance for UltraVPR lives next to the strategy, NOT in `helpers/`** — preprocessing parameters are weights-coupled (description.md § 6 "C2-internal helper, NOT a shared helper"). +- **Preprocessing parameters are hard-coded** — `(384, 384)` resize target, `(0.485, 0.456, 0.406)` ImageNet mean, `(0.229, 0.224, 0.225)` ImageNet std. 
These are weights-coupled per the upstream UltraVPR contract; making them config-knobs would let an operator silently break the AC-2.1b recall floor. +- **L2-normalisation is mandatory** even though some downstream code paths are robust to non-normalised embeddings — INV-3 from the contract is non-negotiable. +- **Engine load happens at `create` time, NOT at first frame** — the engine-output-shape assertion (AC-6) MUST fire at startup. +- **The strategy holds the engine ID returned by `inference_runtime.load_engine`, NOT the engine itself** — engine lifecycle is owned by C7. +- **Constructor injection only** — no `import gps_denied_onboard.config` inside the strategy module; config is consumed via the `create` factory. +- **No GPU operations outside `embed_query`** — `create` does the engine load (one-time cost at startup), `embed_query` does the per-frame forward pass; nothing else touches the GPU stream. + +## Risks & Mitigation + +**Risk 1: UltraVPR upstream code drop ships an unsupported ONNX op** +- *Risk*: The TRT 10.3 ONNX importer doesn't support a custom op in UltraVPR's graph; engine compilation fails at C10 stage. +- *Mitigation*: Engine compile is C10's responsibility (AZ-321). This task consumes the produced engine and assumes it's loadable. If C10 cannot build the engine, the strategy cannot be wired — a hard upstream blocker that surfaces during AZ-321 implementation, NOT here. + +**Risk 2: FP16 precision insufficient for AC-2.1b recall@10 ≥ 0.95** +- *Risk*: FP16 quantisation degrades embedding fidelity below the recall floor on the Derkachi corpus. +- *Mitigation*: C2-IT-01 (deferred to Step 9) is the validation gate. If FP16 fails, the operator can fall back to FP32 by rebuilding the engine via C10 with `precision=fp32` — this is a config-time decision, NOT a code change in this task. The strategy treats FP16 vs FP32 as transparent (the engine output dtype is asserted at load time; embedding dtype follows the engine).
+ +**Risk 3: Centre-crop with calibration's principal point introduces non-determinism if calibration changes mid-flight** +- *Risk*: An operator hot-swaps calibration during flight; embeddings shift; recall drops silently. +- *Mitigation*: Calibration changes mid-flight are forbidden by the broader F1 / F2 / F3 lifecycle (calibration is loaded once per flight at takeoff). If a future cycle adds hot-swap support, a separate task adds calibration-versioning to embeddings. + +**Risk 4: Per-frame DEBUG log volume at 3 Hz × 10 distances = 30 entries/sec** +- *Risk*: Default-on DEBUG logging floods journald. +- *Mitigation*: DEBUG `kind="c2.vpr.frame_distances"` is gated by `config.vpr.debug_per_frame_distances` (default `false`); operators enable it only for forensic investigation of a specific flight. + +**Risk 5: WARN-threshold default (0.30) needs calibration** +- *Risk*: The 0.30 default threshold for top-1 distance WARN is a placeholder; production-tuned values come from FT-P-19 telemetry. +- *Mitigation*: `config.vpr.warn_top1_threshold` is config-driven (default 0.30); a follow-up cycle will tune from real flight FDR data. The default is a conservative starting point that surfaces obvious false-positives without flooding logs. + +## Runtime Completeness + +- **Named capability**: production-default `VprStrategy` for top-K retrieval against the C6 FAISS corpus (architecture / E-C2 / `solution.md` "UltraVPR primary backbone" / AC-2.1b + AC-4.1). +- **Production code that must exist**: real `UltraVprStrategy` calling real C7 `InferenceRuntime.forward` with a real TRT-compiled UltraVPR engine; real `UltraVprBackbonePreprocessor` performing real OpenCV resize + ImageNet normalisation + FP16 cast; real L2-normalisation via real `DescriptorNormaliser`; real composition-root wiring in `runtime_root.compose_root` for the `ultra_vpr` strategy choice. 
+- **Allowed external stubs**: tests MAY use `FakeInferenceRuntime` returning pre-computed embeddings (AC-2..AC-7), `FakeTileStore` (AC-4 / AC-10 / AC-11), `FakeFdrClient` (verifying FDR record emission), a synthetic frame fixture for preprocessing tests; production wiring uses the real C7 + C6 + UltraVPR engine. +- **Unacceptable substitutes**: a Python-only NumPy implementation of UltraVPR's forward pass (would not satisfy C2-PT-01 latency at 60 ms p95; would defeat the GPU-bound architectural choice); skipping L2-normalisation (would break INV-3 and downstream cosine-similarity assumptions); making preprocessing parameters config-knobs (would let operators silently break AC-2.1b); engine load at first frame instead of `create` time (would defer the engine-output-shape assertion past startup, defeating fail-fast); per-strategy thread safety (the contract is single-thread; adding locks would mask the composition-root binding bug if it ever broke); a "demo mode" that returns dummy embeddings to bypass the TRT engine. diff --git a/_docs/02_tasks/todo/AZ-338_c2_net_vlad.md b/_docs/02_tasks/todo/AZ-338_c2_net_vlad.md new file mode 100644 index 0000000..eb6a94c --- /dev/null +++ b/_docs/02_tasks/todo/AZ-338_c2_net_vlad.md @@ -0,0 +1,217 @@ +# C2 NetVLAD Mandatory Simple-Baseline + +**Task**: AZ-338_c2_net_vlad +**Name**: C2 NetVLAD Mandatory Simple-Baseline +**Description**: Implement `NetVladStrategy`, the C2 mandatory simple-baseline `VprStrategy` (engine rule: every component MUST ship a comparative baseline alongside its production-default; description.md § 1 designates NetVLAD as the C2 baseline). NetVLAD has a much higher embedding dim than UltraVPR (D=4096 with NetVLAD-VGG16 default; can be reduced to D=512 via PCA-whitening per the upstream NetVLAD code drop) and uses PyTorch FP16 (NOT TensorRT) per the simple-baseline policy: "the baseline runs on the simplest available runtime" so a TRT engine compile bug doesn't simultaneously break baseline AND primary. 
Includes the concrete `NetVladBackbonePreprocessor` (different resize target + normalisation than UltraVPR). MUST satisfy AC-2.1b's relaxed engine-rule floor `recall@10 ≥ 0.85` on Derkachi normal segment. +**Complexity**: 3 points +**Dependencies**: AZ-336_c2_vpr_strategy_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-300_c7_pytorch_baseline, AZ-303_c6_storage_interfaces, AZ-283_descriptor_normaliser, AZ-266_log_module, AZ-272_fdr_record_schema +**Component**: c2_vpr (epic AZ-255 / E-C2) +**Tracker**: AZ-338 +**Epic**: AZ-255 (E-C2) + +### Document Dependencies + +- `_docs/02_document/contracts/c2_vpr/vpr_strategy_protocol.md` — Protocol contract; every invariant MUST be satisfied; INV-3 (L2-normalised) is critical because NetVLAD raw embeddings include intra-cluster residuals that must be globally L2-normalised after the VLAD aggregation. +- `_docs/02_document/components/02_c2_vpr/description.md` — § 1 NetVLAD designated as mandatory simple-baseline; § 5 PyTorch matches simple-baseline track; § 9 logging. +- `_docs/02_document/module-layout.md` — `c2_vpr.net_vlad` Internal entry; `BUILD_VPR_NETVLAD` row; `BUILD_PYTORCH_RUNTIME` row (NetVLAD requires PyTorch runtime ON which is OFF for airborne — NetVLAD is research/replay-only by build-flag combination). +- `_docs/02_document/components/02_c2_vpr/tests.md` — C2-IT-01 engine rule check `recall@10 ≥ 0.85` for NetVLAD on Derkachi normal segment. +- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — `InferenceRuntime` interface; AZ-300 `pytorch_fp16_runtime` is the consumed concrete runtime. +- `_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md` — L2 + intra-normalisation (NetVLAD's published preprocessing chain includes intra-cluster normalisation BEFORE the global L2 normalisation; the `DescriptorNormaliser` helper must support both). 
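
The two-stage normalisation named in the last dependency (intra-cluster L2 first, global L2 second) can be sketched in plain Python. This is an illustration under an explicit assumption: the flattened VLAD descriptor is taken to be laid out cluster by cluster in equal contiguous blocks, which the task text does not confirm. The function names mirror the ones used in this task but the bodies are hypothetical.

```python
import math


def l2_normalise(values):
    """Scale values to unit L2 norm; an all-zero block is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in values))
    return list(values) if norm == 0.0 else [v / norm for v in values]


def intra_cluster_normalise(descriptor, num_clusters):
    """L2-normalise each cluster's residual block independently.

    Assumption: the descriptor is cluster-major, i.e. num_clusters contiguous
    blocks of equal length.
    """
    if len(descriptor) % num_clusters != 0:
        raise ValueError("descriptor length must be a multiple of num_clusters")
    block = len(descriptor) // num_clusters
    out = []
    for c in range(num_clusters):
        out.extend(l2_normalise(descriptor[c * block:(c + 1) * block]))
    return out


def netvlad_postprocess(descriptor, num_clusters=64):
    """Intra-cluster normalisation followed by the global L2 normalisation."""
    return l2_normalise(intra_cluster_normalise(descriptor, num_clusters))
```

With D=4096 and 64 clusters, every non-zero block ends up with norm 1/8 after the global step, so no single cluster dominates the final embedding.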
+ +## Problem + +Without this task: + +- The C2 component has no comparative baseline; the engine rule (every primary backbone has a baseline alongside it for FT-12 comparative-study and for risk reduction if the primary fails) is violated for C2 specifically — the project-wide policy goes unsatisfied for one of its largest backbone surfaces. +- AC-2.1b's relaxed-floor check (`recall@10 ≥ 0.85` for NetVLAD) has no producer; suite-level FT-P-19 cannot validate the engine rule. +- The research binary (which links every backbone for IT-12 comparative studies) cannot ship without a NetVLAD strategy; researchers cannot run the comparative study that informs whether the primary's engine choice is justified. +- A code drop / weights / engine compile bug in UltraVPR has no fallback at the strategy layer; the operator who notices a sudden drop in suite-level satellite re-loc accuracy would have no mechanism to A/B against the baseline. + +## Outcome + +- `src/gps_denied_onboard/components/c2_vpr/net_vlad.py` defining: + - `NetVladStrategy` class implementing the `VprStrategy` Protocol. + - Constructor signature: `__init__(self, runtime: InferenceRuntime, tile_store: TileStore, weights_path: Path, preprocessor: NetVladBackbonePreprocessor, normaliser: DescriptorNormaliser, fdr_client: FdrClient, descriptor_dim: int = 4096)`. + - `embed_query(frame, calibration)`: + 1. `tensor = self._preprocessor.preprocess(frame, calibration)` (returns FP16 NCHW (1, 3, H, W); H=W=480 per the upstream NetVLAD-VGG16 default). + 2. `intermediate = self._runtime.forward(self._engine_id, {"input": tensor})["vlad_descriptor"]` (returns FP16 (1, descriptor_dim) post-VLAD aggregation). + 3. `intra_normalised = self._normaliser.intra_cluster_normalise(intermediate[0], num_clusters=64)` (per NetVLAD's published preprocessing: intra-cluster L2 first). + 4. `embedding = self._normaliser.l2_normalise(intra_normalised)` (then global L2). + 5. 
Return `VprQuery(frame_id, embedding, produced_at=monotonic_ns())`. + 6. Catch RuntimeError → wrap in `VprBackboneError`; emit ERROR log + FDR record. + - `retrieve_topk(query, k)`: identical to UltraVPR — delegates to `tile_store.faiss_topk`; returns `VprResult` with `backbone_label="net_vlad"`. + - `descriptor_dim() -> int`: returns the constructor-passed value (default 4096); asserted at engine-load time against the engine's output tensor shape; mismatch → `RuntimeError`. + - Module-level `create(config, tile_store, inference_runtime) -> VprStrategy`: + 1. Resolve `weights_path = config.vpr.backbone_weights_path` (a PyTorch state_dict file with the `.pth` extension; NetVLAD does NOT use the AZ-281 self-describing TRT filename schema — its own AZ-280 sidecar carries the PCA matrix + cluster centres). + 2. Resolve `descriptor_dim = config.vpr.netvlad_descriptor_dim` (default 4096; can be 512 if PCA-whitened weights are loaded). + 3. Construct `NetVladBackbonePreprocessor(input_shape=(480, 480), mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))`. + 4. Construct `DescriptorNormaliser` with `intra_cluster_normalise` capability. + 5. Load model via `inference_runtime.load_engine(weights_path)` (the PyTorch runtime accepts `.pth` files; AZ-300). + 6. Assert engine output shape == `(1, descriptor_dim)`; mismatch → `ConfigurationError`. + 7. Construct and return `NetVladStrategy(...)`. +- `src/gps_denied_onboard/components/c2_vpr/_preprocessor_net_vlad.py`: + - Implements `BackbonePreprocessor` Protocol. + - `preprocess(frame, calibration)`: + 1. Decode `frame.image_bytes` to RGB uint8 (H_in, W_in, 3). + 2. Centre-crop to a square region (same calibration-aware logic as UltraVPR — copied here, NOT shared, because the calibration handling is part of the preprocessor's contract). + 3. Resize to `(480, 480)` via OpenCV. + 4. Normalise: `(pixel/255.0 - mean) / std`; cast to FP16. + 5. Transpose HWC → CHW; add batch dim. + 6. 
Return ndarray of shape `(1, 3, 480, 480)` dtype float16. + - `input_shape() -> tuple[int, ...]`: returns `(480, 480)`. + - On failure: raise `VprPreprocessError`. +- Composition-root wiring path for `config.vpr.strategy == "net_vlad"`. +- Logging per description.md § 9: INFO `kind="c2.vpr.ready"` with `{strategy: "net_vlad", descriptor_dim: 4096}`; ERROR / WARN identical to UltraVPR. +- FDR records emitted: `kind="vpr.embed_query"`, `kind="vpr.backbone_error"`, `kind="vpr.preprocess_error"`. + +## Scope + +### Included + +- `NetVladStrategy` implementing the Protocol; `NetVladBackbonePreprocessor` implementing `BackbonePreprocessor`. +- Module-level `create(config, tile_store, inference_runtime)` factory entry-point. +- Intra-cluster L2 normalisation BEFORE global L2 normalisation (NetVLAD's published preprocessing chain). +- Composition-root wiring for `config.vpr.strategy == "net_vlad"`. +- Engine output shape assertion at load time. +- Logging + FDR records identical to UltraVPR (the per-backbone label distinguishes the records). +- Unit tests covering all 7 invariants, the dual-stage normalisation, the preprocessing contract, the load-time shape assertion. +- `BUILD_VPR_NETVLAD` CMake flag wiring per ADR-002 (ON for research; OFF for airborne / operator-tooling because PyTorch runtime is excluded; ON-but-effectively-unused for replay-cli unless explicitly selected). + +### Excluded + +- The `VprStrategy` Protocol — owned by AZ-336. +- The `DescriptorNormaliser.l2_normalise` — already AZ-283. **Note**: AZ-283 ships `l2_normalise`; this task may need to extend AZ-283 to add `intra_cluster_normalise(vec, num_clusters)`. **Decision**: extending AZ-283 is in scope here as a small contract addition (the helper ships with `l2_normalise`; adding `intra_cluster_normalise` is a single function). If AZ-283 is already merged when this task starts, the addition is a backward-compatible function add; no breaking change. 
+- The C7 PyTorch runtime — owned by AZ-300; this task consumes the interface. +- Other backbones — owned by AZ-337 (UltraVPR), AZ-339 (MegaLoc + MixVPR), AZ-340 (SelaVPR + EigenPlaces + SALAD). +- FAISS retrieve wiring — owned by AZ-341. +- C2-IT-01's NetVLAD recall@10 ≥ 0.85 acceptance test — deferred to Step 9 / E-BBT. + +## Acceptance Criteria + +**AC-1: Protocol conformance** +Given a constructed `NetVladStrategy` instance +When `isinstance(strategy, VprStrategy)` is evaluated +Then the result is `True` + +**AC-2: `embed_query` produces L2-normalised FP16 (descriptor_dim,) embedding** +Given a valid `NavCameraFrame` and `CameraCalibration` +When `strategy.embed_query(frame, calibration)` is called +Then `embedding.shape == (4096,)` (or the configured `descriptor_dim`), `embedding.dtype == np.float16`, `||embedding||_2 == 1.0 ± 1e-3` + +**AC-3: Dual-stage normalisation — intra-cluster THEN global L2** +Given a fake intermediate VLAD descriptor with non-zero per-cluster sub-vectors +When the embedding pipeline runs +Then `intra_cluster_normalise` is called BEFORE `l2_normalise` (verifiable via spy on the normaliser); the order is NEVER reversed; the output's per-cluster sub-vectors are unit-norm in the intra-cluster sense AND the full vector is unit-norm globally + +**AC-4: `embed_query` is deterministic** +Given the same frame + calibration +When `embed_query` is called 3 times +Then all three returns have bit-exact `embedding` arrays (ULP-tolerant FP16) + +**AC-5: `retrieve_topk` returns exactly k candidates with `backbone_label = "net_vlad"`** +Given a corpus of 100 tiles + a constructed `VprQuery` with D=4096 +When `strategy.retrieve_topk(query, k=10)` is called +Then `len(candidates) == 10`; sorted ascending; `backbone_label == "net_vlad"`; `candidates[0].descriptor_dim == 4096` + +**AC-6: `descriptor_dim()` is config-driven and stable** +Given construction with `descriptor_dim=4096` +When `descriptor_dim()` is called 100 times +Then every call returns 
4096; constructing a second instance with `descriptor_dim=512` (PCA-whitened weights case) returns 512 from that instance's `descriptor_dim()` + +**AC-7: Engine output shape mismatch at load → `ConfigurationError`** +Given a model whose output tensor shape is `(1, 2048)` while `config.vpr.netvlad_descriptor_dim = 4096` +When the module-level `create(...)` factory is called +Then `ConfigurationError` is raised with message containing `"engine output shape mismatch: expected (1, 4096), got (1, 2048)"`; the strategy is NOT instantiated + +**AC-8: `VprBackboneError` on forward-pass failure** +Given an `InferenceRuntime` test double that raises `RuntimeError` from `forward` +When `embed_query` is called +Then `VprBackboneError` is raised; ERROR log + FDR record emitted + +**AC-9: `VprPreprocessError` on corrupt image bytes** +Given a frame with malformed `image_bytes` +When `embed_query` is called +Then `VprPreprocessError` is raised; ERROR log + FDR record emitted + +**AC-10: Composition-root wiring** +Given `config.vpr.strategy = "net_vlad"` AND valid weights AND matching `descriptor_dim` +When `compose_root(config)` runs +Then a `NetVladStrategy` is wired; AZ-336 factory's pre-flight `descriptor_dim` validation passes; INFO log `kind="c2.vpr.ready"` with `{strategy: "net_vlad", descriptor_dim: 4096}` emitted + +**AC-11: Build-flag combination — NetVLAD requires PyTorch runtime** +Given `config.vpr.strategy = "net_vlad"` AND `BUILD_PYTORCH_RUNTIME=OFF` (airborne binary) +When the binary tries to load +Then `ConfigurationError` is raised at composition-root time with message containing `"NetVLAD requires BUILD_PYTORCH_RUNTIME=ON; this binary has BUILD_PYTORCH_RUNTIME=OFF"`; the binary refuses to start (fail-fast) + +## Non-Functional Requirements + +**Performance** +- `embed_query` p95 ≤ 80 ms on Tier-1 Jetson Orin with PyTorch FP16 — looser than UltraVPR's 60 ms because the simple-baseline runs on the simpler runtime; not on the production critical path.
+- `retrieve_topk` p95 ≤ 4 ms — looser than UltraVPR's 2 ms lookup budget (C2-PT-01) because the higher embedding dim (4096 vs 512) makes FAISS lookup roughly 8× more compute-intensive; still sub-frame at 3 Hz. +- GPU memory: ≤ 800 MB resident for backbone weights — looser than UltraVPR's 600 MB because NetVLAD's VGG16 backbone is larger. +- These NFRs are not enforced as engine-rule blockers; they're operator guidance for the research binary's resource budget. + +**Compatibility** +- The PyTorch state_dict format is owned by C7's PyTorch runtime (AZ-300); this task consumes the produced model via `config.vpr.backbone_weights_path`. +- The upstream NetVLAD code drop is pinned per Plan-phase; PCA-whitening parameters change with weights → AZ-280 sidecar carries them. + +**Reliability** +- Strategy is single-threaded by contract (INV-1). +- Dual-stage normalisation order (intra-cluster THEN global L2) is mandatory; reversing the order produces a different embedding subspace and silently breaks AC-2.1b (recall regression). +- `VprBackboneError` does not crash the process; downstream falls back to VIO-only.
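One way a unit test could pin down the mandatory normalisation order is with a `unittest.mock.Mock` spy standing in for the normaliser. The sketch below is an illustration under assumptions: `run_normalisation` is a hypothetical stand-in for the two normalisation steps of `embed_query`, not the real strategy code, and the return values are placeholders.

```python
from unittest.mock import Mock


def run_normalisation(normaliser, vlad, num_clusters=64):
    """Hypothetical stand-in for the normalisation steps of embed_query."""
    intra = normaliser.intra_cluster_normalise(vlad, num_clusters=num_clusters)
    return normaliser.l2_normalise(intra)


# A Mock records every child-method call, in order, in mock_calls.
spy = Mock()
spy.intra_cluster_normalise.return_value = [0.6, 0.8]
spy.l2_normalise.return_value = [0.6, 0.8]

result = run_normalisation(spy, [3.0, 4.0], num_clusters=1)

# Each mock_calls entry unpacks as (name, args, kwargs).
call_names = [name for name, _args, _kwargs in spy.mock_calls]
```

A test would then compare `call_names` against `["intra_cluster_normalise", "l2_normalise"]` and check that each method's `call_count` is 1, which covers both the ordering and the "exactly once each" wording of AC-3.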
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | `isinstance(NetVladStrategy(...), VprStrategy)` | `True` | +| AC-2 | `embed_query` output | shape (4096,), dtype float16, L2-norm == 1.0 ± 1e-3 | +| AC-3 | Spy on normaliser methods | `intra_cluster_normalise` called BEFORE `l2_normalise` exactly once each per `embed_query` | +| AC-4 | `embed_query` × 3 same frame | bit-exact embeddings | +| AC-5 | `retrieve_topk` against fixture corpus | `len == 10`, sorted, `backbone_label == "net_vlad"`, `descriptor_dim == 4096` | +| AC-6 | `descriptor_dim()` × 100 (D=4096 instance) + a second D=512 instance | first instance always 4096; second always 512 | +| AC-7 | Model with wrong output shape | `ConfigurationError` at create time | +| AC-8 | `forward` raises | `VprBackboneError`; ERROR log + FDR | +| AC-9 | malformed `image_bytes` | `VprPreprocessError`; ERROR log + FDR | +| AC-10 | `compose_root(config="net_vlad")` | wired; INFO log with `{strategy: "net_vlad", descriptor_dim: 4096}` | +| AC-11 | airborne binary + `config.vpr.strategy = "net_vlad"` | `ConfigurationError` with PyTorch-OFF message; fail-fast | +| Preprocess-shape | `preprocessor.preprocess(frame)` output | shape `(1, 3, 480, 480)`, dtype float16 | +| Preprocess-input-shape | `preprocessor.input_shape()` | returns `(480, 480)` | + +## Constraints + +- **Dual-stage normalisation order is non-negotiable** — intra-cluster THEN global L2. Reversing is forbidden. +- **NetVLAD uses the PyTorch runtime, NOT TensorRT** — the simple-baseline policy isolates it from TRT engine compile risk. The research binary links both runtimes; airborne binary excludes the PyTorch runtime via `BUILD_PYTORCH_RUNTIME=OFF`, which makes NetVLAD effectively unselectable for airborne (AC-11). +- **Preprocessing parameters are weights-coupled** — `(480, 480)` resize, ImageNet mean/std. Hard-coded; not config-knobs. 
+- **`descriptor_dim` IS config-driven** (unlike UltraVPR which hard-codes 512) because NetVLAD ships in two flavours: full 4096-d and PCA-whitened 512-d. The choice is part of the operator's deployment, not a runtime decision. +- **Constructor injection only**; no `import gps_denied_onboard.config` inside the strategy module. +- **The strategy holds the engine ID, NOT the engine itself** — engine lifecycle is owned by C7. + +## Risks & Mitigation + +**Risk 1: NetVLAD embedding dim of 4096 is 8× larger than UltraVPR's 512; FAISS HNSW lookup is slower** +- *Risk*: `retrieve_topk` may exceed C2-PT-01's 2 ms budget for the lookup stage; the budget was set against UltraVPR's D=512. +- *Mitigation*: `retrieve_topk` p95 ≤ 4 ms is the looser baseline budget (acknowledged in NFRs); for the research binary this is acceptable since NetVLAD is comparison-only. If an operator wants the production-fast path with NetVLAD, they configure PCA-whitening (D=512) at corpus build time (C10). + +**Risk 2: NetVLAD recall@10 ≥ 0.85 floor not achievable with FP16** +- *Risk*: FP16 quantisation degrades the VLAD aggregation precision below the relaxed engine-rule floor. +- *Mitigation*: C2-IT-01's NetVLAD assertion is the validation gate (deferred to Step 9). If FP16 fails, the operator can configure FP32 weights — the strategy does not hard-code dtype; it follows the runtime's loaded model. + +**Risk 3: PyTorch FP16 runtime on Tier-1 Jetson is slower than expected** +- *Risk*: PyTorch FP16 inference on Jetson has known pipeline-stall issues compared to TRT. +- *Mitigation*: NetVLAD is research-only by build-flag combination (AC-11 enforces); the production critical path is UltraVPR. If a future cycle wants NetVLAD on the airborne binary, that's a separate task: convert NetVLAD to ONNX → TRT engine, then update this strategy to use the TRT runtime. 
+ +**Risk 4: Operator picks NetVLAD on airborne binary by mistake** +- *Risk*: A typo in the airborne config that selects `net_vlad` would cause a silent fall-back to VIO-only on every flight if the runtime were missing. +- *Mitigation*: AC-11 makes this fail-fast at composition-root time with a clear error message. Operators learn at startup, not after takeoff. + +**Risk 5: AZ-283 `DescriptorNormaliser` may not yet ship `intra_cluster_normalise`** +- *Risk*: The helper as defined in AZ-283 ships only `l2_normalise`; this task needs `intra_cluster_normalise` too. +- *Mitigation*: As noted in Scope/Excluded, extending AZ-283 to add `intra_cluster_normalise` is a backward-compatible function addition. If AZ-283 has already merged before this task starts, the addition is committed alongside this task with a one-line note in `_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md`. If AZ-283 has not yet merged, coordinate the addition during AZ-283's implementation. Either way, no breaking change to existing consumers. + +## Runtime Completeness + +- **Named capability**: mandatory simple-baseline `VprStrategy` for engine-rule comparative validation against the production-default UltraVPR (architecture / E-C2 / `solution.md` "NetVLAD mandatory simple-baseline" / engine rule + AC-2.1b relaxed floor). +- **Production code that must exist**: real `NetVladStrategy` calling real C7 PyTorch `InferenceRuntime.forward` with a real loaded NetVLAD `.pth` model; real `NetVladBackbonePreprocessor` performing real OpenCV resize + ImageNet normalisation + FP16 cast; real dual-stage normalisation (intra-cluster THEN global L2); real composition-root wiring path. +- **Allowed external stubs**: tests MAY use `FakeInferenceRuntime` returning pre-computed VLAD descriptors; `FakeTileStore`; `FakeFdrClient`; `FakeDescriptorNormaliser` instrumented to verify call order (AC-3); production wiring uses the real C7 PyTorch runtime + real NetVLAD weights + real C6.
+- **Unacceptable substitutes**: a NumPy-only NetVLAD forward pass (would not satisfy NFR-perf budget; would defeat the runtime-isolation strategy of using a different runtime than UltraVPR); skipping intra-cluster normalisation (would silently break AC-2.1b's recall floor); using TensorRT for NetVLAD (would defeat the simple-baseline policy of isolating runtime risk); making preprocessing parameters config-knobs (would let operators silently break the recall floor); selecting NetVLAD in an airborne binary (must fail-fast per AC-11); a single-stage L2-only normalisation (would deviate from NetVLAD's published preprocessing chain; recall regression risk). diff --git a/_docs/02_tasks/todo/AZ-339_c2_megaloc_mixvpr.md b/_docs/02_tasks/todo/AZ-339_c2_megaloc_mixvpr.md new file mode 100644 index 0000000..74780f4 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-339_c2_megaloc_mixvpr.md @@ -0,0 +1,207 @@ +# C2 MegaLoc + MixVPR Secondary Backbones + +**Task**: AZ-339_c2_megaloc_mixvpr +**Name**: C2 MegaLoc + MixVPR Secondary Backbones (Research-only) +**Description**: Implement `MegaLocStrategy` and `MixVprStrategy`, two secondary `VprStrategy` backbones used for IT-12 comparative-study purposes (research binary only). Both run on the C7 TensorRT runtime (same path as UltraVPR; FP16 engines compiled by C10) but are gated OFF for airborne and operator-tooling per ADR-002 — they're available only in the research binary and (selectable) replay-cli. Each strategy ships its own concrete `BackbonePreprocessor` (different resize target and normalisation per upstream code drop). Embeddings: MegaLoc D=2048, MixVPR D=4096. Both produce L2-normalised embeddings; both delegate `retrieve_topk` to the C6 TileStore Public API. Neither is on the production critical path; performance NFRs are looser than UltraVPR. 
+**Complexity**: 5 points +**Dependencies**: AZ-336_c2_vpr_strategy_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-298_c7_tensorrt_runtime, AZ-303_c6_storage_interfaces, AZ-283_descriptor_normaliser, AZ-281_engine_filename_schema, AZ-321_c10_engine_compiler, AZ-266_log_module, AZ-272_fdr_record_schema +**Component**: c2_vpr (epic AZ-255 / E-C2) +**Tracker**: AZ-339 +**Epic**: AZ-255 (E-C2) + +### Document Dependencies + +- `_docs/02_document/contracts/c2_vpr/vpr_strategy_protocol.md` — Protocol contract; both strategies satisfy every invariant. +- `_docs/02_document/components/02_c2_vpr/description.md` — § 1 secondary backbones for IT-12 comparative study; § 5 backbone library list. +- `_docs/02_document/module-layout.md` — `c2_vpr.mega_loc` and `c2_vpr.mix_vpr` Internal entries; `BUILD_VPR_MEGALOC` and `BUILD_VPR_MIXVPR` rows (both OFF for airborne/operator-tooling, ON for research; replay-cli inherits research selection at config time). +- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — `InferenceRuntime` interface (TRT runtime). +- `_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md` — L2 normalisation. + +## Problem + +Without this task: + +- The IT-12 comparative-study cannot enumerate MegaLoc and MixVPR alongside UltraVPR / NetVLAD; researchers cannot quantify whether UltraVPR's PRIMARY designation is justified against the broader VPR-backbone landscape. +- The research binary's link surface is incomplete; the comparative-study CI matrix entry that asserts the research binary contains every secondary backbone fails. +- A future cycle that wants to swap MegaLoc to PRIMARY (e.g., if UltraVPR's upstream code drop becomes unmaintained) would have no migration path — the strategy class would not yet exist. + +## Outcome + +- `src/gps_denied_onboard/components/c2_vpr/mega_loc.py` defining `MegaLocStrategy` (Protocol-conforming) + `create(config, tile_store, inference_runtime)` factory entry-point. 
+ - Constructor signature: `__init__(self, runtime, tile_store, weights_path, preprocessor, normaliser, fdr_client)`. + - `embed_query`: preprocess → TRT forward → L2 normalise → return `VprQuery`. + - `retrieve_topk`: delegate to `tile_store.faiss_topk`; return `VprResult` with `backbone_label="mega_loc"`, `descriptor_dim=2048`. + - `descriptor_dim() -> int`: returns 2048; engine output shape asserted at load. +- `src/gps_denied_onboard/components/c2_vpr/_preprocessor_mega_loc.py` defining `MegaLocBackbonePreprocessor`: + - `input_shape() -> (322, 322)` per upstream MegaLoc default. + - Normalisation: ImageNet mean/std (same as UltraVPR — common upstream convention; not a coupling, both happen to use ImageNet). + - Centre-crop with calibration-aware logic (same pattern as UltraVPR / NetVLAD; copied not shared per description.md § 6). + - Output dtype FP16, NCHW. +- `src/gps_denied_onboard/components/c2_vpr/mix_vpr.py` defining `MixVprStrategy` (mirrors `MegaLocStrategy` structure): + - `backbone_label="mix_vpr"`, `descriptor_dim=4096`. +- `src/gps_denied_onboard/components/c2_vpr/_preprocessor_mix_vpr.py` defining `MixVprBackbonePreprocessor`: + - `input_shape() -> (320, 320)` per upstream MixVPR default. + - Normalisation: ImageNet mean/std. + - Output dtype FP16, NCHW. +- Composition-root wiring paths for `config.vpr.strategy in {"mega_loc", "mix_vpr"}`. +- `BUILD_VPR_MEGALOC` and `BUILD_VPR_MIXVPR` CMake flags wired per ADR-002. +- Logging per description.md § 9 (INFO ready, WARN top-1-above-threshold, ERROR / FDR per error path). +- Engine output shape assertion at load for both strategies. +- Unit tests covering Protocol conformance, L2-normalisation, deterministic embeddings, top-K invariants, error paths — for BOTH strategies. + +## Scope + +### Included + +- Both `MegaLocStrategy` and `MixVprStrategy` classes implementing the Protocol. 
+- Both concrete `BackbonePreprocessor` implementations (one per strategy; preprocessing parameters per upstream code drop). +- Module-level `create` factory functions for both. +- Composition-root wiring for both strategy choices. +- Engine output shape assertion at load for both. +- Logging + FDR records identical pattern to UltraVPR (per-backbone `backbone_label`). +- Unit tests for both strategies covering invariants + error paths. +- `BUILD_VPR_MEGALOC` and `BUILD_VPR_MIXVPR` CMake flag wiring. + +### Excluded + +- The `VprStrategy` Protocol — owned by AZ-336. +- Shared `DescriptorNormaliser` — already AZ-283. +- C7 TensorRT runtime — owned by AZ-298. +- Engine compilation — owned by AZ-321. +- Other backbones — AZ-337 (UltraVPR), AZ-338 (NetVLAD), AZ-340 (SelaVPR + EigenPlaces + SALAD). +- FAISS retrieve wiring — owned by AZ-341. +- Recall@10 acceptance tests for these secondary backbones — deferred to Step 9 / E-BBT (and the floors are looser per the engine rule — these are research-only, not engine-rule-binding). 
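The load-time engine output shape assertion listed under Included could look roughly like the sketch below. `ConfigurationError` is redefined locally so the snippet runs on its own, `assert_engine_output_shape` and `EXPECTED_DIMS` are hypothetical names, and the message format is borrowed from AZ-338's AC-7 rather than specified for these two strategies.

```python
from typing import Tuple


class ConfigurationError(Exception):
    """Local stand-in for the project's ConfigurationError (assumed to exist)."""


# Descriptor dims stated in this task for the two secondary backbones.
EXPECTED_DIMS = {"mega_loc": 2048, "mix_vpr": 4096}


def assert_engine_output_shape(backbone_label: str, actual: Tuple[int, ...]) -> int:
    """Return the descriptor dim if the engine output is (1, D); raise otherwise."""
    expected = (1, EXPECTED_DIMS[backbone_label])
    if tuple(actual) != expected:
        raise ConfigurationError(
            f"engine output shape mismatch: expected {expected}, got {tuple(actual)}"
        )
    return expected[1]


# Matching engine: the check passes and yields the descriptor dim.
ok_dim = assert_engine_output_shape("mega_loc", (1, 2048))

# Mismatched engine: the strategy must not be instantiated.
try:
    assert_engine_output_shape("mix_vpr", (1, 2048))
    mismatch_message = None
except ConfigurationError as exc:
    mismatch_message = str(exc)
```

Running the check inside `create(...)`, before the strategy object is constructed, is what keeps a wrong engine from surfacing only at the first frame.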
+ +## Acceptance Criteria + +**AC-1 (per strategy): Protocol conformance** +Given a constructed `MegaLocStrategy` AND a constructed `MixVprStrategy` +When `isinstance(strategy, VprStrategy)` is evaluated +Then both return `True` + +**AC-2 (per strategy): `embed_query` produces L2-normalised FP16 embedding of correct dim** +Given a valid `NavCameraFrame` and `CameraCalibration` +When `embed_query` is called on each strategy +Then MegaLoc returns `embedding.shape == (2048,)`, MixVPR returns `embedding.shape == (4096,)`; both are `dtype == np.float16`; both have `||embedding||_2 == 1.0 ± 1e-3` + +**AC-3 (per strategy): Deterministic embeddings** +Given the same frame +When `embed_query` is called 3 times +Then bit-exact embeddings (ULP-tolerant FP16) for each strategy + +**AC-4 (per strategy): `retrieve_topk` returns exactly k candidates with correct backbone_label** +Given a corpus of 100 tiles per strategy's descriptor_dim + a constructed `VprQuery` +When `retrieve_topk(query, k=10)` is called on each strategy +Then `len(candidates) == 10`, sorted ascending; `backbone_label == "mega_loc"` for MegaLoc; `backbone_label == "mix_vpr"` for MixVPR; `descriptor_dim` matches + +**AC-5 (per strategy): `descriptor_dim()` is stable** +Given a constructed strategy +When `descriptor_dim()` is called 100 times +Then MegaLoc returns 2048 every call; MixVPR returns 4096 every call + +**AC-6 (per strategy): Engine output shape mismatch → `ConfigurationError`** +Given a TRT engine whose output tensor shape does not match the strategy's expected `descriptor_dim` +When `create(...)` is called +Then `ConfigurationError` is raised; the strategy is NOT instantiated + +**AC-7 (per strategy): `VprBackboneError` on forward-pass failure** +Given an `InferenceRuntime` test double that raises +When `embed_query` is called +Then `VprBackboneError` is raised; ERROR log + FDR record emitted + +**AC-8 (per strategy): `VprPreprocessError` on corrupt image bytes** +Given a frame with malformed 
`image_bytes` +When `embed_query` is called +Then `VprPreprocessError` is raised; ERROR log + FDR record emitted + +**AC-9 (per strategy): Composition-root wiring** +Given `config.vpr.strategy = "mega_loc"` (resp. `"mix_vpr"`) AND valid weights AND matching `descriptor_dim` +When `compose_root(config)` runs +Then the corresponding strategy is wired; AZ-336 factory's pre-flight `descriptor_dim` validation passes; INFO log `kind="c2.vpr.ready"` with `{strategy: "mega_loc", descriptor_dim: 2048}` (resp. `mix_vpr` / 4096) is emitted + +**AC-10 (per strategy): Build-flag exclusion in airborne binary** +Given `config.vpr.strategy = "mega_loc"` (resp. `"mix_vpr"`) AND `BUILD_VPR_MEGALOC=OFF` (resp. `BUILD_VPR_MIXVPR=OFF`) — the airborne case +When the binary tries to load +Then `ConfigurationError` is raised at composition-root time with message containing the missing flag; the binary refuses to start (fail-fast per AZ-336 factory's lazy-import → ImportError → `ConfigurationError` mapping) + +**AC-11 (per strategy): Preprocessing input shape** +Given the strategy's preprocessor instance +When `input_shape()` is called +Then MegaLoc returns `(322, 322)`; MixVPR returns `(320, 320)` + +## Non-Functional Requirements + +**Performance** (looser than UltraVPR — research-only, not on production critical path): +- MegaLoc `embed_query` p95 ≤ 80 ms on Tier-1 Jetson Orin (FP16 TRT). +- MixVPR `embed_query` p95 ≤ 100 ms on Tier-1 Jetson Orin (FP16 TRT) — slightly higher because MixVPR's mix-net is ~30% larger than UltraVPR's backbone. +- `retrieve_topk` p95: MegaLoc ≤ 3 ms, MixVPR ≤ 4 ms (4096-d FAISS HNSW slower than 512-d). +- GPU memory per strategy: MegaLoc ≤ 700 MB; MixVPR ≤ 800 MB resident. +- These NFRs are research-side guidance; not engine-rule blockers. + +**Compatibility** +- Both consume TRT engines produced by AZ-321 with the AZ-281 self-describing filename schema. +- Upstream code drops pinned per Plan-phase; weight-format changes between drops require engine rebuild. 
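AC-10's fail-fast path (lazy import → `ImportError` → `ConfigurationError`) might be sketched as follows. The `load_strategy_module` helper, the `STRATEGY_MODULES` table, the message wording, and the local `ConfigurationError` are hypothetical; the real mapping belongs to the AZ-336 factory, whose code is not shown here. The module paths are derived from this task's file layout, and the sketch assumes a strategy compiled out by its build flag simply has no importable module.

```python
import importlib


class ConfigurationError(Exception):
    """Local stand-in for the project's ConfigurationError (assumed to exist)."""


# strategy name -> (module path derived from this task's file layout, build flag)
STRATEGY_MODULES = {
    "mega_loc": ("gps_denied_onboard.components.c2_vpr.mega_loc", "BUILD_VPR_MEGALOC"),
    "mix_vpr": ("gps_denied_onboard.components.c2_vpr.mix_vpr", "BUILD_VPR_MIXVPR"),
}


def load_strategy_module(strategy: str):
    """Import the strategy module lazily; map a missing module to a config error."""
    try:
        module_path, build_flag = STRATEGY_MODULES[strategy]
    except KeyError:
        raise ConfigurationError(f"unknown VPR strategy: {strategy!r}") from None
    try:
        return importlib.import_module(module_path)
    except ImportError as exc:
        # Fail fast at composition-root time and name the missing flag.
        raise ConfigurationError(
            f"VPR strategy {strategy!r} is not built into this binary; "
            f"requires {build_flag}=ON"
        ) from exc


# In this standalone snippet the package does not exist, which mimics
# a binary built with the flag OFF.
try:
    load_strategy_module("mega_loc")
    startup_error = None
except ConfigurationError as exc:
    startup_error = str(exc)
```

Chaining with `from exc` keeps the original `ImportError` in the traceback, which may help distinguish a flag that is OFF from a module that is present but broken.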
+ +**Reliability** +- Both strategies single-threaded by contract. +- Both use unconditional L2-normalisation (INV-3). +- Errors do not crash the process; downstream falls back to VIO-only. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 (MegaLoc) | `isinstance(MegaLocStrategy(...), VprStrategy)` | `True` | +| AC-1 (MixVPR) | `isinstance(MixVprStrategy(...), VprStrategy)` | `True` | +| AC-2 (MegaLoc) | `embed_query` output | shape (2048,), dtype float16, L2-norm ≈ 1.0 | +| AC-2 (MixVPR) | `embed_query` output | shape (4096,), dtype float16, L2-norm ≈ 1.0 | +| AC-3 (each) | `embed_query` × 3 same frame | bit-exact embeddings (ULP-tolerant) | +| AC-4 (each) | `retrieve_topk` against fixture corpus | `len == 10`, sorted, correct `backbone_label`, correct `descriptor_dim` | +| AC-5 (each) | `descriptor_dim()` × 100 | always returns the correct dim | +| AC-6 (each) | TRT engine with wrong output shape | `ConfigurationError` at create time | +| AC-7 (each) | `forward` raises | `VprBackboneError`; ERROR log + FDR | +| AC-8 (each) | malformed `image_bytes` | `VprPreprocessError`; ERROR log + FDR | +| AC-9 (each) | `compose_root(config=)` | wired; INFO log with correct backbone label and dim | +| AC-10 (each) | airborne binary + strategy chosen | `ConfigurationError` with missing-flag message; fail-fast | +| AC-11 (MegaLoc) | `MegaLocBackbonePreprocessor.input_shape()` | returns `(322, 322)` | +| AC-11 (MixVPR) | `MixVprBackbonePreprocessor.input_shape()` | returns `(320, 320)` | +| Preprocess-shape (each) | `preprocess(frame)` output | NCHW shape `(1, 3, H, W)`, dtype float16 | + +## Constraints + +- **Each strategy ships its own concrete preprocessor** — preprocessing parameters per upstream code drop (description.md § 6 "C2-internal helper, NOT a shared helper"). +- **Preprocessing parameters are weights-coupled** — `(322, 322)` for MegaLoc, `(320, 320)` for MixVPR; ImageNet mean/std for both. 
Hard-coded; not config-knobs. +- **Centre-crop logic is duplicated, NOT shared** — copying preprocessing between strategies is intentional per the contract; sharing would couple weights-versions across strategies and let one strategy's upgrade silently break another's preprocessing. +- **Both use TensorRT runtime** (consistent with UltraVPR's path); the difference between secondary and primary is not the runtime but the build-flag ON/OFF in airborne. +- **No engine compilation in this task** — the `.trt` engine files come from AZ-321; this task consumes them via `config.vpr.backbone_weights_path`. +- **Both strategies hold engine IDs returned by `inference_runtime.load_engine`, NOT engines themselves**. +- **No GPU operations in `__init__` beyond engine load** — same constraint as UltraVPR. + +## Risks & Mitigation + +**Risk 1: MegaLoc and MixVPR upstream code drops use different ONNX op sets that TRT 10.3 partially supports** +- *Risk*: Engine compilation succeeds but with fallback layers that don't run on GPU; `embed_query` p95 inflates. +- *Mitigation*: AZ-321 (engine compile) is responsible for detecting fallback layers and reporting them. This task consumes the produced engine; if NFR-perf budgets are violated, AZ-321 escalates the upstream support gap. + +**Risk 2: Higher embedding dim (4096 for MixVPR) inflates corpus storage requirements** +- *Risk*: A research binary that switches between UltraVPR (D=512) and MixVPR (D=4096) needs to rebuild the FAISS corpus every swap; researchers may forget. +- *Mitigation*: AZ-336 factory's pre-flight `descriptor_dim` validation catches the mismatch at startup with a clear error message. Researchers must rebuild the corpus (C10) before swapping; the helpful error tells them so. + +**Risk 3: MegaLoc / MixVPR are research-only — operators may select them by mistake** +- *Risk*: A typo or copy-pasted research config selects MegaLoc / MixVPR on an airborne binary; cold start fails. 
+- *Mitigation*: AC-10 ensures fail-fast at composition-root with a clear message. Operators learn at startup, not after takeoff. + +**Risk 4: Test fixtures for MegaLoc / MixVPR engines don't exist in CI** +- *Risk*: Without TRT engines for these strategies, the unit tests cannot exercise the full `embed_query` path; they're stubbed via `FakeInferenceRuntime`. +- *Mitigation*: This is fine — Step 9 / E-BBT validates the real engine path against C2-IT-01 and the C2-PT-01 NFR. The unit tests validate Protocol conformance + invariants; they don't need real engines. + +**Risk 5: Preprocessing duplication across strategies invites subtle bugs** +- *Risk*: A bug fix to UltraVPR's centre-crop logic doesn't propagate to MegaLoc / MixVPR. +- *Mitigation*: This is the documented trade-off (description.md § 6). The duplication is intentional. If a bug fix is needed across strategies, each strategy's preprocessor is updated explicitly with a coordinated commit; cross-checking is part of code review. + +## Runtime Completeness + +- **Named capability**: secondary `VprStrategy` implementations for IT-12 comparative-study (architecture / E-C2 / `solution.md` "MegaLoc, MixVPR secondary backbones"). +- **Production code that must exist**: real `MegaLocStrategy` and `MixVprStrategy` classes calling real C7 TRT `InferenceRuntime.forward` with real loaded `.trt` engines; real concrete preprocessors with real OpenCV resize + ImageNet normalisation + FP16 cast; real L2-normalisation; real composition-root wiring paths. +- **Allowed external stubs**: tests MAY use `FakeInferenceRuntime` returning pre-computed embeddings; `FakeTileStore`; `FakeFdrClient`; production wiring uses real C7 + real engines + real C6. 
+- **Unacceptable substitutes**: NumPy-only forward passes (would not satisfy NFR budgets); skipping L2-normalisation (would break INV-3); shared preprocessors across strategies (would defeat description.md § 6 isolation); selecting these strategies in airborne binaries (must fail-fast per AC-10); engine load at first frame (would defer the engine-output-shape assertion past startup); per-strategy thread safety (the contract is single-thread). diff --git a/_docs/02_tasks/todo/AZ-340_c2_selavpr_eigenplaces_salad.md b/_docs/02_tasks/todo/AZ-340_c2_selavpr_eigenplaces_salad.md new file mode 100644 index 0000000..05b07ea --- /dev/null +++ b/_docs/02_tasks/todo/AZ-340_c2_selavpr_eigenplaces_salad.md @@ -0,0 +1,218 @@ +# C2 SelaVPR + EigenPlaces + SALAD Secondary Backbones + +**Task**: AZ-340_c2_selavpr_eigenplaces_salad +**Name**: C2 SelaVPR + EigenPlaces + SALAD Secondary Backbones (Research-only) +**Description**: Implement `SelaVprStrategy`, `EigenPlacesStrategy`, and `SaladStrategy` — three additional secondary `VprStrategy` backbones used for IT-12 comparative-study (research binary only). All run on the C7 TensorRT runtime (FP16 engines compiled by C10) and are gated OFF for airborne / operator-tooling per ADR-002. Each strategy ships its own concrete `BackbonePreprocessor` per upstream code drop. Embeddings: SelaVPR D=512, EigenPlaces D=2048, SALAD D=8448 (the largest in the C2 family — DINOv2-backed). All three produce L2-normalised embeddings; all three delegate `retrieve_topk` to the C6 TileStore Public API. 
+**Complexity**: 5 points +**Dependencies**: AZ-336_c2_vpr_strategy_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-298_c7_tensorrt_runtime, AZ-303_c6_storage_interfaces, AZ-283_descriptor_normaliser, AZ-281_engine_filename_schema, AZ-321_c10_engine_compiler, AZ-266_log_module, AZ-272_fdr_record_schema +**Component**: c2_vpr (epic AZ-255 / E-C2) +**Tracker**: AZ-340 +**Epic**: AZ-255 (E-C2) + +### Document Dependencies + +- `_docs/02_document/contracts/c2_vpr/vpr_strategy_protocol.md` — Protocol contract; all three strategies satisfy every invariant. +- `_docs/02_document/components/02_c2_vpr/description.md` — § 1 secondary backbone designation; § 5 backbone library list (SALAD added per module-layout `BUILD_VPR_SALAD` row). +- `_docs/02_document/module-layout.md` — `c2_vpr.sela_vpr`, `c2_vpr.eigen_places`, `c2_vpr.salad` Internal entries; `BUILD_VPR_SELAVPR`, `BUILD_VPR_EIGENPLACES`, `BUILD_VPR_SALAD` rows (all OFF for airborne/operator-tooling, ON for research, replay-cli inherits research selection at config time). +- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — `InferenceRuntime` interface (TRT runtime). +- `_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md` — L2 normalisation. + +## Problem + +Without this task: + +- The IT-12 comparative-study cannot enumerate SelaVPR, EigenPlaces, or SALAD; researchers cannot compare these three modern backbones (SelaVPR introduced 2024, EigenPlaces a strong baseline since 2023, SALAD a DINOv2-backed 2024 candidate) against UltraVPR / NetVLAD / MegaLoc / MixVPR. +- The research binary's link surface is incomplete; the comparative-study CI matrix entry asserting the research binary contains every documented backbone fails. +- A future cycle that wants to swap one of these to PRIMARY (e.g., SALAD's DINOv2 backbone may eventually outperform UltraVPR; the research data informs that decision) has no migration path. 
+- SALAD specifically uses DINOv2 — a fundamentally different backbone family (vision transformer rather than CNN) — and adding it to the comparative-study is research-strategy critical. + +## Outcome + +- `src/gps_denied_onboard/components/c2_vpr/sela_vpr.py` defining `SelaVprStrategy` (Protocol-conforming) + `create(config, tile_store, inference_runtime)` factory. + - `backbone_label="sela_vpr"`, `descriptor_dim=512`. + - Constructor / `embed_query` / `retrieve_topk` / `descriptor_dim` follow the same pattern as MegaLoc / MixVPR. +- `src/gps_denied_onboard/components/c2_vpr/_preprocessor_sela_vpr.py` defining `SelaVprBackbonePreprocessor`: + - `input_shape() -> (224, 224)` per upstream SelaVPR default. + - Normalisation: ImageNet mean/std. + - Output dtype FP16, NCHW. +- `src/gps_denied_onboard/components/c2_vpr/eigen_places.py` defining `EigenPlacesStrategy`: + - `backbone_label="eigen_places"`, `descriptor_dim=2048`. + - Same pattern as SelaVPR. +- `src/gps_denied_onboard/components/c2_vpr/_preprocessor_eigen_places.py` defining `EigenPlacesBackbonePreprocessor`: + - `input_shape() -> (480, 480)` per upstream EigenPlaces default. + - Normalisation: ImageNet mean/std. +- `src/gps_denied_onboard/components/c2_vpr/salad.py` defining `SaladStrategy`: + - `backbone_label="salad"`, `descriptor_dim=8448`. + - Same pattern as the others; SALAD's DINOv2 backbone produces patch tokens that the SALAD aggregator turns into a single 8448-d descriptor. +- `src/gps_denied_onboard/components/c2_vpr/_preprocessor_salad.py` defining `SaladBackbonePreprocessor`: + - `input_shape() -> (322, 322)` per SALAD's published preprocessing (DINOv2-aligned input). + - Normalisation: ImageNet mean/std (DINOv2's default). +- Composition-root wiring paths for `config.vpr.strategy in {"sela_vpr", "eigen_places", "salad"}`. +- `BUILD_VPR_SELAVPR`, `BUILD_VPR_EIGENPLACES`, `BUILD_VPR_SALAD` CMake flags wired per ADR-002. 
+- Logging + FDR records identical pattern to UltraVPR / MegaLoc / MixVPR (per-backbone `backbone_label` distinguishes records). +- Engine output shape assertion at load for all three. +- Unit tests covering Protocol conformance + invariants + error paths for ALL THREE strategies. + +## Scope + +### Included + +- All three strategy classes (`SelaVprStrategy`, `EigenPlacesStrategy`, `SaladStrategy`) implementing the Protocol. +- All three concrete `BackbonePreprocessor` implementations. +- Module-level `create` factories for all three. +- Composition-root wiring for all three strategy choices. +- Engine output shape assertion at load for all three. +- Logging + FDR records identical pattern to other backbones. +- Unit tests for all three strategies covering invariants + error paths. +- `BUILD_VPR_SELAVPR`, `BUILD_VPR_EIGENPLACES`, `BUILD_VPR_SALAD` CMake flag wiring. + +### Excluded + +- The `VprStrategy` Protocol — owned by AZ-336. +- Shared `DescriptorNormaliser` — already AZ-283. +- C7 TensorRT runtime — owned by AZ-298. +- Engine compilation — owned by AZ-321. +- Other backbones — AZ-337 (UltraVPR), AZ-338 (NetVLAD), AZ-339 (MegaLoc + MixVPR). +- FAISS retrieve wiring — owned by AZ-341. +- Recall@10 acceptance tests for these secondary backbones — deferred to Step 9 / E-BBT (research-only, not engine-rule-binding). 
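The per-strategy constants and the load-time engine output shape assertion listed in the Outcome above (and exercised by AC-6) could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `BackboneSpec`, `SPECS`, and `assert_engine_output_shape` are hypothetical names, `ConfigurationError` is a local stand-in for the project's error type, and the `(1, descriptor_dim)` output layout is an assumption based on the `(1, 8448)` shape mentioned for SALAD.

```python
from dataclasses import dataclass
from typing import Tuple


class ConfigurationError(Exception):
    """Local stand-in for the project's configuration error type."""


@dataclass(frozen=True)
class BackboneSpec:
    """Weights-coupled constants for one secondary backbone (hypothetical container)."""
    backbone_label: str
    descriptor_dim: int
    input_shape: Tuple[int, int]  # (H, W) fed to the engine


# Values copied from the Outcome list of this task.
SPECS = {
    "sela_vpr": BackboneSpec("sela_vpr", 512, (224, 224)),
    "eigen_places": BackboneSpec("eigen_places", 2048, (480, 480)),
    "salad": BackboneSpec("salad", 8448, (322, 322)),
}


def assert_engine_output_shape(spec: BackboneSpec, output_shape: Tuple[int, ...]) -> None:
    """Fail fast at create() time if the engine output is not (1, descriptor_dim)."""
    expected = (1, spec.descriptor_dim)
    if tuple(output_shape) != expected:
        raise ConfigurationError(
            f"{spec.backbone_label}: engine output shape {tuple(output_shape)} "
            f"does not match expected {expected}"
        )
```

Keeping the constants in one frozen record per strategy is only one possible layout; the task text requires the values to be hard-coded per strategy, not that they live in a shared table.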
+ +## Acceptance Criteria + +**AC-1 (per strategy): Protocol conformance** +Given a constructed instance of each strategy +When `isinstance(strategy, VprStrategy)` is evaluated +Then all three return `True` + +**AC-2 (per strategy): `embed_query` produces L2-normalised FP16 embedding of correct dim** +Given a valid `NavCameraFrame` and `CameraCalibration` +When `embed_query` is called on each strategy +Then SelaVPR returns shape (512,); EigenPlaces returns (2048,); SALAD returns (8448,); all `dtype == np.float16`; all have `||embedding||_2 == 1.0 ± 1e-3` + +**AC-3 (per strategy): Deterministic embeddings** +Given the same frame +When `embed_query` is called 3 times on each strategy +Then bit-exact embeddings (ULP-tolerant FP16) for each strategy + +**AC-4 (per strategy): `retrieve_topk` returns exactly k candidates with correct backbone_label** +Given a corpus of 100 tiles per strategy's `descriptor_dim` + a constructed `VprQuery` +When `retrieve_topk(query, k=10)` is called on each strategy +Then `len(candidates) == 10`, sorted ascending; correct `backbone_label` (`"sela_vpr"` / `"eigen_places"` / `"salad"`); correct `descriptor_dim` carried in candidates + +**AC-5 (per strategy): `descriptor_dim()` is stable** +Given a constructed strategy +When `descriptor_dim()` is called 100 times +Then SelaVPR returns 512; EigenPlaces returns 2048; SALAD returns 8448 + +**AC-6 (per strategy): Engine output shape mismatch → `ConfigurationError`** +Given a TRT engine whose output tensor shape does not match the strategy's expected `descriptor_dim` +When `create(...)` is called +Then `ConfigurationError` is raised; the strategy is NOT instantiated + +**AC-7 (per strategy): `VprBackboneError` on forward-pass failure** +Given an `InferenceRuntime` test double that raises +When `embed_query` is called +Then `VprBackboneError` is raised; ERROR log + FDR record emitted + +**AC-8 (per strategy): `VprPreprocessError` on corrupt image bytes** +Given a frame with malformed `image_bytes` 
+When `embed_query` is called +Then `VprPreprocessError` is raised; ERROR log + FDR record emitted + +**AC-9 (per strategy): Composition-root wiring** +Given `config.vpr.strategy = "sela_vpr"` (resp. `"eigen_places"`, `"salad"`) AND valid weights AND matching `descriptor_dim` +When `compose_root(config)` runs +Then the corresponding strategy is wired; AZ-336 factory's pre-flight `descriptor_dim` validation passes; INFO log `kind="c2.vpr.ready"` emitted with correct `{strategy, descriptor_dim}` + +**AC-10 (per strategy): Build-flag exclusion in airborne binary** +Given the strategy is selected AND its `BUILD_VPR_*` flag is OFF +When the binary tries to load +Then `ConfigurationError` is raised with the missing-flag message; fail-fast + +**AC-11 (per strategy): Preprocessing input shape** +Given the strategy's preprocessor instance +When `input_shape()` is called +Then SelaVPR returns `(224, 224)`; EigenPlaces returns `(480, 480)`; SALAD returns `(322, 322)` + +## Non-Functional Requirements + +**Performance** (research-only; looser than UltraVPR): +- SelaVPR `embed_query` p95 ≤ 60 ms (FP16 TRT; 224×224 input is light). +- EigenPlaces `embed_query` p95 ≤ 80 ms (480×480 input + ResNet50-class backbone). +- SALAD `embed_query` p95 ≤ 120 ms (DINOv2-Large backbone is the heaviest in the C2 family). +- `retrieve_topk` p95: SelaVPR ≤ 2 ms, EigenPlaces ≤ 3 ms, SALAD ≤ 6 ms (8448-d FAISS HNSW is significantly slower; this is the cost of DINOv2's large embedding space). +- GPU memory per strategy: SelaVPR ≤ 400 MB, EigenPlaces ≤ 700 MB, SALAD ≤ 1200 MB resident (DINOv2-Large is heavy). +- These NFRs are research-side guidance, not engine-rule blockers. + +**Compatibility** +- All three consume TRT engines produced by AZ-321 with the AZ-281 self-describing filename schema. +- Upstream code drops pinned per Plan-phase; SALAD specifically depends on a pinned DINOv2 weight set. + +**Reliability** +- All three single-threaded by contract. 
+- All three use unconditional L2-normalisation (INV-3). +- Errors do not crash the process; downstream falls back to VIO-only. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 (each) | `isinstance((...), VprStrategy)` | `True` for all three | +| AC-2 (SelaVPR) | `embed_query` output | shape (512,), float16, L2-norm ≈ 1.0 | +| AC-2 (EigenPlaces) | `embed_query` output | shape (2048,), float16, L2-norm ≈ 1.0 | +| AC-2 (SALAD) | `embed_query` output | shape (8448,), float16, L2-norm ≈ 1.0 | +| AC-3 (each) | `embed_query` × 3 same frame | bit-exact embeddings (ULP-tolerant) | +| AC-4 (each) | `retrieve_topk` against fixture corpus | `len == 10`, sorted, correct `backbone_label`, correct `descriptor_dim` | +| AC-5 (each) | `descriptor_dim()` × 100 | always returns the correct dim | +| AC-6 (each) | TRT engine with wrong output shape | `ConfigurationError` at create time | +| AC-7 (each) | `forward` raises | `VprBackboneError`; ERROR log + FDR | +| AC-8 (each) | malformed `image_bytes` | `VprPreprocessError`; ERROR log + FDR | +| AC-9 (each) | `compose_root(config=)` | wired; INFO log with correct backbone label and dim | +| AC-10 (each) | airborne binary + strategy chosen | `ConfigurationError` with missing-flag message; fail-fast | +| AC-11 (SelaVPR) | `input_shape()` | `(224, 224)` | +| AC-11 (EigenPlaces) | `input_shape()` | `(480, 480)` | +| AC-11 (SALAD) | `input_shape()` | `(322, 322)` | +| Preprocess-shape (each) | `preprocess(frame)` output | NCHW shape `(1, 3, H, W)`, dtype float16 | + +## Constraints + +- **Each strategy ships its own concrete preprocessor** — preprocessing parameters per upstream code drop. +- **Preprocessing parameters are weights-coupled** — hard-coded per strategy (SelaVPR 224×224, EigenPlaces 480×480, SALAD 322×322); ImageNet mean/std for all (DINOv2 also uses ImageNet mean/std for its DINOv2 weights). 
+- **Centre-crop logic duplicated, NOT shared** — same trade-off as MegaLoc / MixVPR. +- **All three use TensorRT runtime** (consistent with UltraVPR / MegaLoc / MixVPR). +- **No engine compilation in this task** — `.trt` engine files come from AZ-321. +- **All three hold engine IDs returned by `inference_runtime.load_engine`, NOT engines themselves**. +- **No GPU operations in `__init__` beyond engine load**. +- **SALAD's high embedding dim (8448) is non-negotiable** — it's the architectural output of the SALAD aggregator over DINOv2 patch tokens. Operators who want a smaller SALAD descriptor must apply PCA-whitening at corpus build time (C10), which produces a different `BUILD_VPR_SALAD_PCA` build flag (out of scope here). + +## Risks & Mitigation + +**Risk 1: SALAD's DINOv2 backbone is significantly heavier than other C2 backbones** +- *Risk*: GPU memory + latency budget for SALAD blows the research binary's resource envelope; researchers cannot run multi-strategy comparisons in a single session. +- *Mitigation*: SALAD's NFR-perf budget is documented at 120 ms / 1200 MB GPU — significantly looser than UltraVPR. Researchers run SALAD comparisons in single-strategy sessions. If multi-strategy comparison is required, the operator can disable SALAD via build flag for that specific session. + +**Risk 2: SALAD's 8448-d FAISS lookup is slow** +- *Risk*: FAISS HNSW with D=8448 may exceed budget on Tier-2 hardware. +- *Mitigation*: 6 ms p95 is the documented budget (4× the UltraVPR D=512 lookup); still well under 1 second per frame at 3 Hz. PCA-whitened SALAD (D=512 or D=1024) is the operator-side optimisation if needed; that's a corpus-build-time decision (C10), not a strategy change. + +**Risk 3: SelaVPR / EigenPlaces / SALAD upstream code drops use ONNX ops that TRT 10.3 cannot compile** +- *Risk*: Engine compilation succeeds with fallback layers; latency inflates beyond NFR. +- *Mitigation*: AZ-321 (engine compile) detects fallback layers. 
Each strategy is independently affected; one failure does not block others. + +**Risk 4: SALAD's DINOv2 backbone weights have a non-standard licence** +- *Risk*: DINOv2 weights' licence (CC-BY-NC) may be incompatible with project distribution. +- *Mitigation*: Licence check is operator's responsibility (Plan-phase pinning of upstream); this task implements the strategy assuming licensed weights are available. If licence prevents distribution, the operator does not select SALAD; the strategy class still exists for future use if licence changes. + +**Risk 5: Preprocessing duplication across 7 strategies invites drift** +- *Risk*: A bug in centre-crop logic doesn't propagate across the 7 strategies' preprocessors. +- *Mitigation*: Same trade-off as MegaLoc / MixVPR — duplication is intentional per description.md § 6. Code review catches cross-strategy bug fixes. + +**Risk 6: Test fixtures for these engines don't exist in CI** +- *Risk*: Without TRT engines, full `embed_query` cannot be tested via unit tests. +- *Mitigation*: Step 9 / E-BBT validates the real engine path. Unit tests use `FakeInferenceRuntime` for Protocol conformance + invariants; this is sufficient for the Step 6 task scope. + +## Runtime Completeness + +- **Named capability**: secondary `VprStrategy` implementations (SelaVPR, EigenPlaces, SALAD) for IT-12 comparative-study (architecture / E-C2 / `solution.md` "SelaVPR, EigenPlaces secondary backbones"; SALAD per `module-layout.md` `BUILD_VPR_SALAD` row). +- **Production code that must exist**: real `SelaVprStrategy`, `EigenPlacesStrategy`, `SaladStrategy` classes calling real C7 TRT `InferenceRuntime.forward`; real concrete preprocessors with real OpenCV resize + ImageNet normalisation + FP16 cast; real L2-normalisation; real composition-root wiring paths. +- **Allowed external stubs**: tests MAY use `FakeInferenceRuntime` returning pre-computed embeddings; `FakeTileStore`; `FakeFdrClient`; production wiring uses real C7 + real engines + real C6. 
+- **Unacceptable substitutes**: NumPy-only forward passes (would not satisfy NFR budgets, would defeat GPU-bound design); skipping L2-normalisation (would break INV-3); shared preprocessors across strategies (would defeat description.md § 6 isolation); selecting these strategies in airborne binaries (must fail-fast per AC-10); engine load at first frame; per-strategy thread safety; bypassing the Protocol contract for SALAD's high-dim case (e.g., not validating the (1, 8448) engine output shape). diff --git a/_docs/02_tasks/todo/AZ-341_c2_faiss_retrieve_wiring.md b/_docs/02_tasks/todo/AZ-341_c2_faiss_retrieve_wiring.md new file mode 100644 index 0000000..4edef42 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-341_c2_faiss_retrieve_wiring.md @@ -0,0 +1,197 @@ +# C2 FAISS HNSW Retrieve Wiring + +**Task**: AZ-341_c2_faiss_retrieve_wiring +**Name**: C2 FAISS HNSW Retrieve Wiring (TileStore Bridge + Stale-Handle Defence) +**Description**: Implement the bridge between every C2 `VprStrategy.retrieve_topk` and the C6 `TileStore.faiss_topk(query_embedding, k) -> (distances, tile_ids)` Public API. The bridge handles handle invalidation defence (C2-ST-01: out-of-band corpus replacement caught via mmap inode/sidecar check; raises `IndexUnavailableError` instead of returning stale candidates), per-frame DEBUG logging of top-K distances (gated by config), and the WARN-threshold check on top-1 distance. The wiring lives in a small `c2_vpr/_faiss_bridge.py` module that every concrete `VprStrategy` constructor accepts via injection — keeps each strategy's `retrieve_topk` body to one line (`return self._faiss_bridge.retrieve(query, k, backbone_label="")`). 
+**Complexity**: 3 points +**Dependencies**: AZ-336_c2_vpr_strategy_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-303_c6_storage_interfaces, AZ-305_c6_postgres_filesystem_store, AZ-306_c6_faiss_descriptor_index, AZ-266_log_module, AZ-272_fdr_record_schema +**Component**: c2_vpr (epic AZ-255 / E-C2) +**Tracker**: AZ-341 +**Epic**: AZ-255 (E-C2) + +### Document Dependencies + +- `_docs/02_document/contracts/c2_vpr/vpr_strategy_protocol.md` — `retrieve_topk` Protocol contract; INV-4 (exactly k candidates, sorted ascending), INV-5 (`backbone_label` non-empty), INV-6 (deterministic). +- `_docs/02_document/components/02_c2_vpr/description.md` — § 1 retrieval boundary; § 4 `FAISS HNSW top-K=10 search` query; § 7 race conditions; § 9 logging strategy (WARN top-1 distance threshold). +- `_docs/02_document/components/02_c2_vpr/tests.md` — C2-ST-01 (index handle invalidation safety; the strategy MUST raise `IndexUnavailableError` rather than return stale candidates after out-of-band corpus replacement). +- `_docs/02_document/contracts/c6_tile_cache/descriptor_index.md` — `TileStore.faiss_topk` query surface; sidecar / mmap-inode invariants used by the bridge's stale-handle defence. +- `_docs/02_document/module-layout.md` — `c2_vpr` Imports from `components.c6_tile_cache` (Public API only); Layering Layer 3 → Layer 2 (allowed). + +## Problem + +Without this task: + +- Every concrete `VprStrategy` (UltraVPR, NetVLAD, MegaLoc, MixVPR, SelaVPR, EigenPlaces, SALAD) would re-implement the same `retrieve_topk` body — calling `tile_store.faiss_topk`, building `VprCandidate` list, applying WARN threshold, emitting DEBUG log, handling `IndexUnavailableError`. Seven copies = seven places to drift. +- The C2-ST-01 safety check (out-of-band corpus replacement → `IndexUnavailableError`) would be re-implemented per strategy → seven places to break the safety invariant. Defended-in-depth at the bridge layer is correct. 
+- WARN-threshold tuning (the `config.vpr.warn_top1_threshold` knob) would have to be threaded through every strategy's constructor; centralising in the bridge means one config touchpoint. +- DEBUG per-frame distance logging — useful for forensic investigation of suspicious flights — would either be off everywhere or duplicated everywhere. + +The bridge also gives a single place to enforce INV-4 (exactly k candidates returned) defensively: if C6's `faiss_topk` ever returns fewer or unordered, the bridge catches it and raises `IndexUnavailableError` with diagnostic context. + +## Outcome + +- `src/gps_denied_onboard/components/c2_vpr/_faiss_bridge.py` defining `FaissBridge`: + - Constructor: `__init__(self, tile_store: TileStore, normaliser: DescriptorNormaliser, descriptor_dim: int, warn_top1_threshold: float, debug_log_per_frame_distances: bool, fdr_client: FdrClient)`. + - Method `retrieve(self, query: VprQuery, k: int, backbone_label: str) -> VprResult`: + 1. `distances, tile_ids = self._tile_store.faiss_topk(query.embedding, k)` — propagates `IndexUnavailableError` from C6 unchanged. + 2. **Defensive INV-4 check**: assert `len(distances) == k` AND `distances == sorted(distances)`; on violation raise `IndexUnavailableError(f"corpus returned {len(distances)} candidates (expected {k}) or unordered distances")`. + 3. Build `[VprCandidate(tile_id=tid, descriptor_distance=d, descriptor_dim=self._descriptor_dim) for tid, d in zip(tile_ids, distances)]`. + 4. Construct `result = VprResult(query.frame_id, candidates, retrieved_at=monotonic_ns(), backbone_label=backbone_label)`. + 5. **WARN-threshold check**: if `distances[0] > self._warn_top1_threshold`, emit ONE WARN log `kind="c2.vpr.top1_distance_above_threshold"` with structured `{distance, threshold, backbone_label}`. + 6. **DEBUG per-frame distances**: if `self._debug_log_per_frame_distances`, emit ONE DEBUG log `kind="c2.vpr.frame_distances"` with `{frame_id, top10_distances}`. + 7. Return `result`. 
+- Each concrete `VprStrategy`'s `retrieve_topk` is one line: `return self._faiss_bridge.retrieve(query, k, backbone_label=self.BACKBONE_LABEL)`. +- The bridge is constructor-injected into every strategy; every strategy's `create(...)` factory builds the bridge with strategy-specific `descriptor_dim` and `BACKBONE_LABEL`. +- C2-ST-01 stale-handle defence: the bridge does NOT re-implement the inode / sidecar check itself — that lives in C6's `TileStore.faiss_topk` (per `_docs/02_document/contracts/c6_tile_cache/descriptor_index.md`). The bridge's role is to NOT swallow the `IndexUnavailableError`; propagating it unchanged is the contractual behaviour. Tests assert the propagation. +- Logging per description.md § 9 (WARN top-1, DEBUG per-frame distances). +- FDR record `kind="vpr.retrieve_topk"` emitted per `retrieve` call with `{frame_id, backbone_label, top10_distances, latency_us}` for post-flight forensics. + +## Scope + +### Included + +- `FaissBridge` class with the `retrieve` method exactly as specified above. +- Defensive INV-4 check (exactly k candidates, sorted ascending) at the bridge layer; raises `IndexUnavailableError` on violation. +- `IndexUnavailableError` propagation (NOT wrapping) from C6 `TileStore.faiss_topk`. +- WARN-threshold check on top-1 distance (config-driven `config.vpr.warn_top1_threshold`). +- DEBUG per-frame distances log (config-driven `config.vpr.debug_per_frame_distances`). +- FDR record emission per `retrieve` call. +- Latency measurement (monotonic_ns delta) for the FDR record. +- Unit tests covering: happy path, INV-4 violation handling, IndexUnavailableError propagation, WARN-threshold trigger, DEBUG log gating, FDR record fields, latency capture. +- Strategy-side wire-in: each concrete strategy's `create(...)` updated to construct and inject the bridge; each `retrieve_topk` shrunk to a one-line delegation. + +### Excluded + +- The C6 `TileStore.faiss_topk` query implementation — owned by AZ-303 / AZ-305 / AZ-306. 
The bridge consumes the Public API. +- The C6 mmap inode + sidecar stale-handle check — owned by AZ-306 (FAISS descriptor index). The bridge consumes the resulting `IndexUnavailableError` and propagates it. +- The `VprStrategy` Protocol and DTOs — owned by AZ-336. +- Concrete strategy implementations — owned by AZ-337..AZ-340 (strategy-side wire-in code is part of those tasks; this task delivers the bridge module + the wire-in pattern they follow). +- C2-ST-01 acceptance test (suite-level) — deferred to Step 9 / E-BBT (and partially covered by the C6 stale-handle test owned by AZ-306). + +## Acceptance Criteria + +**AC-1: Happy-path retrieve** +Given a `FaissBridge` constructed with a fake `TileStore` returning `(distances=[0.05, 0.10, 0.15, ...], tile_ids=[t1, t2, t3, ...])` for a `query` AND `k=10` +When `bridge.retrieve(query, k=10, backbone_label="ultra_vpr")` is called +Then a `VprResult` is returned with `len(candidates) == 10`, sorted ascending, `backbone_label == "ultra_vpr"`, `descriptor_dim` matches the bridge's configured value; ONE FDR record `kind="vpr.retrieve_topk"` emitted with the correct fields + +**AC-2: INV-4 violation — wrong count** +Given a fake `TileStore` returning `(distances=[0.05, 0.10], tile_ids=[t1, t2])` for `k=10` (only 2 candidates returned) +When `bridge.retrieve(query, k=10, backbone_label="ultra_vpr")` is called +Then `IndexUnavailableError` is raised with message containing `"corpus returned 2 candidates (expected 10)"`; NO FDR record is emitted (the failure is the corpus, not the retrieval); ERROR log `kind="c2.vpr.invariant_violation"` emitted + +**AC-3: INV-4 violation — unordered distances** +Given a fake `TileStore` returning `(distances=[0.05, 0.20, 0.10, 0.15, ...], tile_ids=[...])` (out of ascending order) +When `bridge.retrieve(query, k=10, backbone_label="ultra_vpr")` is called +Then `IndexUnavailableError` is raised with message containing `"unordered distances"`; ERROR log emitted + +**AC-4: `IndexUnavailableError` 
from C6 propagated unchanged** +Given a fake `TileStore` whose `faiss_topk` raises `IndexUnavailableError("stale handle")` +When `bridge.retrieve(query, k=10, backbone_label="ultra_vpr")` is called +Then `IndexUnavailableError("stale handle")` propagates unchanged (NOT wrapped, NOT re-raised with new message); the bridge does not catch and re-raise + +**AC-5: WARN-threshold trigger** +Given `warn_top1_threshold = 0.30` AND a fake `TileStore` returning `distances[0] = 0.42` +When `bridge.retrieve(...)` is called +Then ONE WARN log `kind="c2.vpr.top1_distance_above_threshold"` with structured `{distance: 0.42, threshold: 0.30, backbone_label: "ultra_vpr"}` is emitted; the `VprResult` is still returned + +**AC-6: WARN-threshold not triggered when top-1 is below threshold** +Given `warn_top1_threshold = 0.30` AND `distances[0] = 0.15` +When `bridge.retrieve(...)` is called +Then NO WARN log is emitted; the `VprResult` is returned + +**AC-7: DEBUG per-frame distances ON** +Given `debug_log_per_frame_distances = True` +When `bridge.retrieve(...)` is called +Then ONE DEBUG log `kind="c2.vpr.frame_distances"` is emitted with `{frame_id, top10_distances: [0.05, 0.10, ...]}` + +**AC-8: DEBUG per-frame distances OFF (default)** +Given `debug_log_per_frame_distances = False` +When `bridge.retrieve(...)` is called +Then NO DEBUG log is emitted + +**AC-9: FDR record carries correct fields** +Given a happy-path retrieve +When `bridge.retrieve(...)` returns +Then ONE FDR record `kind="vpr.retrieve_topk"` is emitted with structured fields `{frame_id, backbone_label, top10_distances, latency_us}`; `latency_us > 0`; `top10_distances == [0.05, 0.10, ...]` + +**AC-10: Strategy-side wire-in — `retrieve_topk` is one line** +Given any concrete `VprStrategy` post-wire-in +When the strategy's source code is inspected +Then the body of `retrieve_topk` is exactly one return statement delegating to `self._faiss_bridge.retrieve(query, k, backbone_label=...)`; no candidate-building or 
distance-checking logic remains in the strategy + +**AC-11: Per-strategy `descriptor_dim` carried through to candidates** +Given two bridges — one for UltraVPR (descriptor_dim=512), one for NetVLAD (descriptor_dim=4096) +When `retrieve` is called on each +Then UltraVPR's candidates have `descriptor_dim == 512`; NetVLAD's candidates have `descriptor_dim == 4096`; the bridge does NOT mix them up + +## Non-Functional Requirements + +**Performance** +- `bridge.retrieve` overhead p95 ≤ 0.5 ms — bounded by the INV-4 sorted check (linear in k=10), the candidate-list construction (10 dataclass instantiations), and the FDR record emission (a single SPSC enqueue). The C6 `faiss_topk` time is excluded from this budget — that's C6's NFR. +- The DEBUG per-frame log adds ≤ 50 µs when ON; default-OFF eliminates the cost. + +**Compatibility** +- The bridge is a thin Python module; no native code, no compile flag. +- The `TileStore.faiss_topk` API surface is the only C6 contract consumed; changes to that surface require a coordinated bridge update. + +**Reliability** +- The defensive INV-4 check provides a backstop against silent C6 bugs (e.g., a future `faiss_topk` regression that returns fewer candidates). +- `IndexUnavailableError` propagation preserves C2-ST-01's "raise rather than return stale" invariant. +- The bridge is stateless across calls (no per-call state mutates the bridge instance); thread safety is per-strategy (single ingest thread per strategy → single bridge use). 
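The p95 overhead budget above could be checked with a small harness along these lines. This is a sketch under assumptions: `p95_latency_us` and `nearest_rank_percentile` are hypothetical helper names, the nearest-rank percentile definition is one of several in common use, and `time.perf_counter_ns` is chosen here for benchmark resolution (the FDR latency field itself is specified to use `monotonic_ns`).

```python
import time
from typing import Callable, List, Sequence


def nearest_rank_percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def p95_latency_us(fn: Callable[[], object], iterations: int = 1000) -> float:
    """Call fn() repeatedly and return the 95th-percentile latency in microseconds."""
    samples: List[float] = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        samples.append((time.perf_counter_ns() - start) / 1_000)
    return nearest_rank_percentile(samples, 95)
```

In a test, `fn` would wrap `bridge.retrieve(...)` against a mock `TileStore`, and the result would be compared against the 500 µs budget.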
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|-------------|-----------------| +| AC-1 | Happy-path retrieve | `VprResult` with 10 candidates, sorted, correct label, FDR record emitted | +| AC-2 | TileStore returns 2/10 candidates | `IndexUnavailableError("corpus returned 2 candidates (expected 10)")`; ERROR log; no FDR record | +| AC-3 | TileStore returns unordered distances | `IndexUnavailableError("unordered distances")`; ERROR log | +| AC-4 | TileStore raises `IndexUnavailableError("stale handle")` | propagated unchanged | +| AC-5 | top-1 = 0.42, threshold = 0.30 | WARN log with `{distance, threshold, backbone_label}` | +| AC-6 | top-1 = 0.15, threshold = 0.30 | NO WARN log | +| AC-7 | DEBUG ON | DEBUG log with `{frame_id, top10_distances}` | +| AC-8 | DEBUG OFF | NO DEBUG log | +| AC-9 | FDR record fields | `{frame_id, backbone_label, top10_distances, latency_us}` with `latency_us > 0` | +| AC-10 | Source-code lint of strategies | `retrieve_topk` body is exactly one return statement (verifiable via AST inspection) | +| AC-11 | Two bridges with different `descriptor_dim` | each carries through to its candidates' `descriptor_dim` | +| NFR-perf | Microbench `bridge.retrieve` × 1000 with mock TileStore | p95 ≤ 0.5 ms (excluding C6 time) | + +## Constraints + +- **The bridge does NOT implement stale-handle detection** — that's C6's responsibility (AZ-306). The bridge's role is propagation. +- **The defensive INV-4 check is mandatory** — even though C6 should already enforce this, a defended-in-depth check at the consumer side catches future C6 regressions before they propagate to downstream C2.5 ReRanker. +- **The bridge is constructor-injected** — strategies do NOT import `_faiss_bridge` directly; the strategy's `create(...)` factory constructs the bridge and injects it. +- **WARN-threshold and DEBUG flag are config-driven** — no hard-coded values inside the bridge. 
+- **Latency measurement uses monotonic_ns** — wall-clock would drift if the system clock adjusts mid-flight. +- **The bridge is bound to one strategy's `descriptor_dim`** — a single bridge cannot serve multiple strategies with different dims; this is enforced by the per-strategy `create(...)` constructing a bridge with that strategy's dim. +- **No GPU operations in the bridge** — pure CPU; the GPU work is the strategy's `embed_query`. + +## Risks & Mitigation + +**Risk 1: The bridge's defensive INV-4 check duplicates C6's invariant enforcement** +- *Risk*: Two enforcement points may diverge; one might be stricter than the other. +- *Mitigation*: The bridge's check is "trust but verify" (defended-in-depth). C6 is the primary enforcement; the bridge's check exists to catch C6 regressions early. If the contracts diverge, that's a coordinated update across both contracts. + +**Risk 2: Per-frame DEBUG log volume at 3 Hz could flood journald** +- *Risk*: 3 Hz × 24 hours ≈ 260k log lines per flight day, each carrying 10 distances (~2.6M distance values per day). +- *Mitigation*: DEBUG is OFF by default (AC-8); operators enable for forensic investigation only. Test asserts default-OFF. + +**Risk 3: WARN-threshold default (0.30) is uncalibrated** +- *Risk*: 0.30 is a placeholder; production-tuned values come from FT-P-19 telemetry. WARNs may be excessive early in deployment. +- *Mitigation*: `warn_top1_threshold` is config-driven; operators tune per deployment. Default 0.30 is a conservative starting point. + +**Risk 4: AST inspection in AC-10 is brittle** +- *Risk*: AC-10's "exactly one return statement" check via AST inspection may break if a strategy adds a docstring. +- *Mitigation*: AC-10's AST check ignores docstrings, comments, and whitespace; only the AST body of `retrieve_topk` is inspected — body must be a single `Return` node.
+ +**Risk 5: FDR record volume at 3 Hz × 10 distances** +- *Risk*: Per-frame FDR emission at 3 Hz adds 3 records/sec; with `top10_distances` array (~80 bytes), that's ~1 MB / hour of flight. +- *Mitigation*: This is well within the C13 FDR writer's budget (≤ 64 GB cap per AZ-291..AZ-296); 1 MB/hour is negligible. The retrieval-level provenance is critical for post-flight forensics; the cost is justified. + +## Runtime Completeness + +- **Named capability**: cross-strategy bridge between every C2 `VprStrategy.retrieve_topk` and the C6 `TileStore.faiss_topk` query surface; centralises invariant enforcement, WARN-threshold, DEBUG logging, and FDR provenance (architecture / E-C2 / `solution.md` "FAISS HNSW lookup wiring" / C2-ST-01). +- **Production code that must exist**: real `FaissBridge` class with real `tile_store.faiss_topk` calls; real defensive INV-4 check; real `IndexUnavailableError` propagation; real WARN-threshold + DEBUG flag config plumbing; real FDR record emission with latency_us; real per-strategy wire-in shrinking each `retrieve_topk` to one delegation line. +- **Allowed external stubs**: tests MAY use `FakeTileStore` returning controllable `(distances, tile_ids)` tuples (AC-1..AC-9), `FakeFdrClient` (AC-9), `FakeLogger` capturing log records by `kind`; production wiring uses the real C6 `TileStore` + real C13 FDR. +- **Unacceptable substitutes**: per-strategy duplicated retrieve logic (would defeat the entire purpose of this task); skipping the defensive INV-4 check (would let future C6 regressions silently propagate); wrapping `IndexUnavailableError` in another exception (would defeat C2-ST-01); hard-coded WARN threshold (would break operator tuning); always-ON DEBUG logging (would flood journald); skipping FDR record emission (would lose post-flight retrieval provenance); a stateful bridge that caches between calls (the contract is stateless per call). 
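
As a rough illustration of the defensive check and latency measurement this spec requires, the sketch below validates a top-K result and times the lookup with `time.monotonic_ns`. It rests on several assumptions: `IndexUnavailableError` is a local stand-in for the C6 error type, "ordered" is taken to mean ascending distance, and the names `check_topk` and `timed_topk` are hypothetical. Logging and FDR emission are omitted.

```python
import time
from typing import Callable, Sequence, Tuple


class IndexUnavailableError(RuntimeError):
    """Local stand-in for the C6 error type named in the spec."""


def check_topk(distances: Sequence[float], expected_k: int = 10) -> None:
    """Defensive consumer-side check: right count, ascending order (assumed)."""
    if len(distances) != expected_k:
        raise IndexUnavailableError(
            f"corpus returned {len(distances)} candidates (expected {expected_k})"
        )
    if any(a > b for a, b in zip(distances, distances[1:])):
        raise IndexUnavailableError("unordered distances")


def timed_topk(
    query_fn: Callable[[], Tuple[Sequence[float], Sequence[int]]],
    expected_k: int = 10,
) -> Tuple[Sequence[float], Sequence[int], float]:
    """Run the lookup, validate it, and report latency in microseconds.

    Errors raised by `query_fn` are deliberately not caught, so a stale-handle
    error from the store propagates unchanged.
    """
    start_ns = time.monotonic_ns()
    distances, tile_ids = query_fn()
    check_topk(distances, expected_k)
    latency_us = (time.monotonic_ns() - start_ns) / 1_000
    return distances, tile_ids, latency_us
```

Because the helpers hold no state between calls, the sketch is consistent with the stateless-per-call contract stated above.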
diff --git a/_docs/02_tasks/todo/AZ-342_c2_5_rerank_strategy_protocol.md b/_docs/02_tasks/todo/AZ-342_c2_5_rerank_strategy_protocol.md new file mode 100644 index 0000000..f677f31 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-342_c2_5_rerank_strategy_protocol.md @@ -0,0 +1,199 @@ +# C2.5 ReRank Strategy Protocol + Factory + Composition + +**Task**: AZ-342_c2_5_rerank_strategy_protocol +**Name**: C2.5 `ReRankStrategy` Protocol + Factory + Composition +**Description**: Define the public `ReRankStrategy` Protocol (PEP 544 structural interface), the C2.5 DTOs (`RerankCandidate`, `RerankResult`), the error hierarchy (`RerankError` family with `RerankBackboneError`, `RerankAllCandidatesFailedError`), and the composition-root factory `build_rerank_strategy(config, tile_store, lightglue_runtime) -> ReRankStrategy` that selects the concrete re-ranker at startup based on `config.rerank.strategy` with lazy import + `BUILD_RERANK_` flag gating per ADR-002. The shared `LightGlueRuntime` helper (AZ-278 / E-CC-HELPERS) is constructor-injected — neither C2.5 nor C3 owns its lifecycle (R14 fix). This task delivers the foundational scaffolding `InlierCountReRanker` (AZ-343) depends on; no concrete re-ranker is implemented here. +**Complexity**: 2 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-270_compose_root, AZ-278_lightglue_runtime (helper handle consumed via composition root), AZ-303_c6_storage_interfaces (for `TileStore` Public API), AZ-266_log_module +**Component**: c2_5_rerank (epic AZ-256 / E-C2.5) +**Tracker**: AZ-342 +**Epic**: AZ-256 (E-C2.5) + +### Document Dependencies + +- `_docs/02_document/contracts/c2_5_rerank/rerank_strategy_protocol.md` — the public contract this task implements (Protocol surface + DTOs + error hierarchy + factory signature + invariants + test cases). 
+- `_docs/02_document/components/03_c2_5_rerank/description.md` — § 1 architectural pattern (Strategy); § 2 `ReRankStrategy` interface + DTOs; § 5 error handling; § 6 helper ownership (R14 resolution); § 9 logging. +- `_docs/02_document/module-layout.md` — § Per-Component Mapping `c2_5_rerank` (Public API + Internal + Owns + Imports from); § shared/helpers/lightglue_runtime row (R14 helper-ownership decision); § Layering — Layer 3. +- `_docs/02_document/architecture.md` — ADR-001 (Strategy + composition root), ADR-002 (build-time exclusion via CMake `BUILD_*` flags), ADR-009 (interface-first DI; composition root the only place that imports concrete strategies). +- `_docs/02_document/contracts/c2_vpr/vpr_strategy_protocol.md` — `VprResult` DTO (consumed by `rerank`). +- `_docs/02_document/contracts/shared_helpers/lightglue_runtime.md` — `LightGlueRuntime` helper handle consumed by the factory (constructor-injected, NOT instantiated here). +- `_docs/02_document/contracts/c6_tile_cache/tile_store.md` — `TileStore.get_tile_pixels` Public API consumed by the strategy (page-cache-backed reference, not a copy). + +## Problem + +Without this task, `InlierCountReRanker` (AZ-343) and the downstream consumer C3 CrossDomainMatcher (AZ-257) would each invent their own ad-hoc interface, breaking three architectural invariants: + +- **ADR-001 (Strategy)**: re-rank algorithms must be swappable at composition time; without a shared Protocol, swapping (e.g., adding a learned re-ranker in a future cycle) requires rewriting every consumer. +- **ADR-002 (build-time exclusion)**: each re-ranker is gated by `BUILD_RERANK_`; without the lazy-import factory, any single missing module cascades into a hard import error at runtime, defeating per-binary exclusion. +- **ADR-009 (interface-first DI)**: the composition root must be the single place that knows about concrete re-ranker classes; consumers (C3, runtime root) hold typed references to the Protocol only. 
Without the Protocol, every consumer would import the concrete `InlierCountReRanker` directly. + +The drop-and-continue contract (Invariant 8) also matters: without it codified in the Protocol's docstring, an implementer might let a per-candidate failure abort the whole `rerank` call, breaking C3's expectation of partial input tolerance and pushing more flights into the `RerankAllCandidatesFailedError` → VIO-only fallback path than necessary. + +## Outcome + +- `src/gps_denied_onboard/components/c2_5_rerank/interface.py` defining: + - `ReRankStrategy` Protocol with `rerank(frame, vpr_result, n, calibration) -> RerankResult` (PEP 544 structural with `@runtime_checkable`). + - All eight invariants from the contract documented in the Protocol's docstring. +- `src/gps_denied_onboard/components/c2_5_rerank/__init__.py` re-exporting the Protocol + DTOs (Public API per module-layout `c2_5_rerank` mapping: `ReRankStrategy`, `RerankResult`). +- `src/gps_denied_onboard/_types/rerank.py` defining the two frozen + slotted dataclasses: `RerankCandidate`, `RerankResult`. Added under shared `_types/` because `RerankResult` is consumed cross-component (by C3 CrossDomainMatcher). +- `src/gps_denied_onboard/components/c2_5_rerank/errors.py` defining `RerankError`, `RerankBackboneError`, `RerankAllCandidatesFailedError`. +- `src/gps_denied_onboard/runtime_root/rerank_factory.py` exporting `build_rerank_strategy(config, tile_store, lightglue_runtime) -> ReRankStrategy`. The function: + 1. Reads `config.rerank.strategy` (currently only `"inlier_count"` is defined). + 2. Lazy-imports the concrete module via `importlib.import_module(f"gps_denied_onboard.components.c2_5_rerank.{module_name}")` per the strategy resolution table in the contract. + 3. ImportError where `e.msg` contains "No module named" → `ConfigurationError(f"BUILD_RERANK_{strategy.upper()} is OFF for this binary; cannot select strategy={strategy}")`. Other ImportErrors (native library load failures) re-raised unchanged. 
+ 4. Constructs the strategy via its module-level `create(config, tile_store, lightglue_runtime)` factory function (each concrete re-ranker module exports `create` as its public entry-point — keeps `__init__.py` re-exports minimal). + 5. Returns the instance. The runtime root binds it to one ingest thread. +- Composition-root `compose_root` extension: invoke `build_rerank_strategy` after `LightGlueRuntime` is constructed; bind the result to the same C2.5 ingest thread that was bound to C2 (single-thread invariant per INV-1; same thread as C3 since both share `LightGlueRuntime`). +- Config schema extension to AZ-269: `config.rerank.strategy` (enum, default `"inlier_count"`), `config.rerank.top_n` (int, default 3), validated at config load. +- INFO log on every successful `build_rerank_strategy`: `kind="c2_5.rerank.strategy_loaded"` with strategy name + `top_n`. ERROR log on `ConfigurationError` (with the missing flag detail). + +## Scope + +### Included + +- The `ReRankStrategy` Protocol + its docstring encoding all eight invariants from the contract. +- The two DTOs in `_types/rerank.py` (`RerankCandidate`, `RerankResult`). +- The three-class error hierarchy in `c2_5_rerank/errors.py`. +- The composition-root factory `build_rerank_strategy` with lazy-import + ImportError → `ConfigurationError` mapping. +- Config schema extension for `config.rerank.{strategy, top_n}`. +- Strategy resolution table comment in `rerank_factory.py` matching the contract's table verbatim. +- Composition-root wiring path that constructs `LightGlueRuntime` ONCE and passes the same reference to both `build_rerank_strategy` and `build_matcher_strategy` (the C3 factory; cross-task coordination point with AZ-257's protocol task). 
+- Unit tests covering: Protocol conformance for a fake strategy, factory rejection on missing flag (lazy-import → ImportError → `ConfigurationError`), factory acceptance for the valid `"inlier_count"` value, INFO log emission, DTO immutability + slot enforcement, error hierarchy catchability. +- INFO / ERROR log emission per description.md § 9. + +### Excluded + +- Any concrete re-ranker implementation — owned by AZ-343 (`InlierCountReRanker`). +- The `LightGlueRuntime` helper itself — already AZ-278 (E-CC-HELPERS); this task consumes the constructor-injected handle. +- The C6 `TileStore` interface itself — owned by AZ-303; this task references the Public API in the factory signature. +- Component-internal tests beyond Protocol-conformance + factory-validation: C2.5-IT-01 (top-1 promotion rate), C2.5-IT-02 (drop-and-continue smoke), C2.5-IT-03 (helper serial-access), C2.5-PT-01 (latency NFR) are deferred to Step 9 / E-BBT. +- C3 matcher's protocol task and factory — owned by AZ-257's component decomposition. 
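
The factory behaviour described under Outcome (lazy import, then `ImportError` mapped to `ConfigurationError`) can be sketched roughly as follows. This is an illustration under stated assumptions, not the contract: `ConfigurationError` is a local stand-in, the `import_module` parameter and the `make_config` helper are hypothetical additions for testability, unknown strategy names fall through to a module of the same name so the missing-flag path can be exercised, and logging is omitted.

```python
import importlib
from types import SimpleNamespace
from typing import Any, Callable

_PACKAGE = "gps_denied_onboard.components.c2_5_rerank"
# Strategy resolution table; only "inlier_count" is defined in the spec.
_STRATEGY_MODULES = {"inlier_count": "inlier_based_reranker"}


class ConfigurationError(Exception):
    """Local stand-in for the project's configuration error type."""


def build_rerank_strategy(
    config: Any,
    tile_store: Any,
    lightglue_runtime: Any,
    *,
    import_module: Callable[[str], Any] = importlib.import_module,  # hypothetical test seam
) -> Any:
    strategy = config.rerank.strategy
    # Assumption: names outside the table map to a module of the same name.
    module_name = _STRATEGY_MODULES.get(strategy, strategy)
    try:
        module = import_module(f"{_PACKAGE}.{module_name}")
    except ImportError as exc:
        if "No module named" in (exc.msg or ""):
            raise ConfigurationError(
                f"BUILD_RERANK_{strategy.upper()} is OFF for this binary; "
                f"cannot select strategy={strategy}"
            ) from exc
        raise  # native-library load failures keep their original context
    return module.create(config, tile_store, lightglue_runtime)


def make_config(strategy: str, top_n: int = 3) -> SimpleNamespace:
    """Hypothetical helper that builds a config-shaped object for experiments."""
    return SimpleNamespace(rerank=SimpleNamespace(strategy=strategy, top_n=top_n))
```

One design note: `ModuleNotFoundError` also exposes the missing module's name via `exc.name`, so a stricter variant could compare that against the target module instead of matching on the message text. That would be a deviation from the message-based rule written in this spec.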
+ +## Acceptance Criteria + +**AC-1: Protocol conformance — fake strategy passes `runtime_checkable`** +Given a `FakeReRankStrategy` test double implementing `rerank` +When `isinstance(fake, ReRankStrategy)` is evaluated +Then the result is `True`; the same evaluation against an object missing `rerank` returns `False` + +**AC-2: DTO immutability + slots** +Given a constructed `RerankCandidate`, `RerankResult` +When attempting to mutate any field via attribute assignment +Then `FrozenInstanceError` is raised; `__slots__` is non-empty (verified via `cls.__slots__`); the dataclasses use `frozen=True, slots=True` + +**AC-3: Factory rejects missing build flag — ImportError → ConfigurationError** +Given `config.rerank.strategy = "nonexistent_reranker"` (a non-existent module that simulates a missing build flag) AND a `tile_store` test double AND a `lightglue_runtime` test double +When `build_rerank_strategy(config, tile_store, lightglue_runtime)` is called +Then `ConfigurationError` is raised with message containing `"BUILD_RERANK_NONEXISTENT_RERANKER is OFF"`; ONE ERROR log `kind="c2_5.rerank.build_flag_off"` is emitted + +**AC-4: Factory rejects unknown strategy at config-load time** +Given `config.rerank.strategy = "garbage"` AND the strategy resolution table does NOT contain "garbage" +When `load_config(...)` is called +Then `ConfigurationError` is raised at config-load time (the enum validation), NOT at factory time; the factory is never invoked + +**AC-5: Successful factory load emits INFO log** +Given `config.rerank.strategy = "inlier_count"` AND `config.rerank.top_n = 3` AND a valid lazy-importable `inlier_based_reranker` test double module +When `build_rerank_strategy(...)` is called +Then a `ReRankStrategy` instance is returned; ONE INFO log `kind="c2_5.rerank.strategy_loaded"` is emitted with structured fields `{strategy: "inlier_count", top_n: 3}` + +**AC-6: Strategy resolution table — every entry resolves to its module path** +Given each valid 
`config.rerank.strategy` value (currently only `"inlier_count"`) +When `build_rerank_strategy` is called (assuming the module exists as a test double) +Then the call returns a `ReRankStrategy` instance; the resolved module path matches the contract's strategy resolution table verbatim (`gps_denied_onboard.components.c2_5_rerank.inlier_based_reranker`) + +**AC-7: Error hierarchy — every concrete error is catchable as `RerankError`** +Given test instances of `RerankBackboneError`, `RerankAllCandidatesFailedError` +When caught by `except RerankError` +Then both are caught; `isinstance(err, RerankError)` is `True` for each + +**AC-8: Public API surface — `__init__.py` re-exports** +Given `from gps_denied_onboard.components.c2_5_rerank import ReRankStrategy, RerankResult` +When the import is evaluated +Then both names resolve; internal names (e.g., `_validate_inputs`, factory-private helpers) are NOT in the Public API (`__all__` exposes only `ReRankStrategy`, `RerankResult`) + +**AC-9: Strategy bound to single ingest thread by composition root** +Given a `compose_root(config)` invocation that wires C2.5 +When the resulting strategy is bound +Then the strategy is bound to exactly one ingest thread (verifiable via the runtime root's thread-binding registry); a second binding attempt to the same strategy raises `RuntimeError` + +**AC-10: Composition root passes the SAME `LightGlueRuntime` instance to both C2.5 and C3** +Given a `compose_root(config)` invocation that wires both C2.5 and C3 +When the resulting strategies are inspected +Then `c2_5_strategy._lightglue_runtime is c3_strategy._lightglue_runtime` (identity, not equality); ONE INFO log `kind="runtime_root.lightglue_runtime.shared"` is emitted at composition time confirming the shared binding + +**AC-11: `RerankCandidate.tile_pixels_handle` is opaque** +Given a constructed `RerankCandidate(tile_pixels_handle=some_obj)` +When the field is accessed +Then it returns the same `some_obj` (identity); the Protocol does 
NOT type-restrict the handle (it's `object` by design — C6 owns the actual type) + +## Non-Functional Requirements + +**Performance** +- `build_rerank_strategy` p99 ≤ 50 ms — the factory itself is a config read + lazy import + one constructor call. The constructor cost lives inside the concrete re-ranker (TRT engine warm-up — owned by AZ-343), NOT in this task. + +**Compatibility** +- The `ReRankStrategy` Protocol is a major API surface; any change to method signature is a breaking change requiring a coordinated update of every implementation (lockstep — see Versioning in the contract). +- DTO field additions follow the standard "frozen dataclass + new optional field with default" pattern. +- The drop-and-continue contract (Invariant 8) is non-negotiable; documented in the Protocol's docstring as a contract clause that implementations MUST satisfy. + +**Reliability** +- Lazy-import via `importlib.import_module` — a build-time-excluded re-ranker's import never executes (no native library load attempted, no CUDA initialisation). +- Single-thread invariant enforced by composition root binding (AC-9); the strategy itself is not responsible for thread safety. +- Identity-shared `LightGlueRuntime` (AC-10) ensures C2.5 and C3 cannot accidentally use different helper instances (which would either double GPU memory or break the serial-access invariant). 
+
+## Unit Tests
+
+| AC Ref | What to Test | Required Outcome |
+|--------|--------------|------------------|
+| AC-1 | `runtime_checkable` Protocol conformance | Fake strategy passes; partial fake fails |
+| AC-2 | DTO immutability + slots | `FrozenInstanceError` on mutation; `__slots__` non-empty |
+| AC-3 | Factory + nonexistent re-ranker module | `ConfigurationError` with message containing `"BUILD_RERANK_NONEXISTENT_RERANKER is OFF"`; ERROR log emitted |
+| AC-4 | Config load + invalid enum | `ConfigurationError` at config-load time; factory never invoked |
+| AC-5 | Factory + valid load | Strategy instance returned; INFO log emitted with structured fields |
+| AC-6 | Strategy resolution to `inlier_based_reranker` module | Resolves to correct module path |
+| AC-7 | Error catchability | Both concrete errors caught by `except RerankError` |
+| AC-8 | Public API re-exports | `ReRankStrategy`, `RerankResult` resolve; internals not in `__all__` |
+| AC-9 | Single-thread binding | First binding succeeds; second on same instance raises `RuntimeError` |
+| AC-10 | `LightGlueRuntime` identity-shared between C2.5 and C3 | `c2_5._lightglue_runtime is c3._lightglue_runtime`; INFO log emitted |
+| AC-11 | `tile_pixels_handle` opaqueness | Identity preserved; Protocol does not constrain type |
+| NFR-perf-factory | Microbench `build_rerank_strategy` × 100 with mock concrete | p99 ≤ 50 ms |
+
+## Constraints
+
+- **No business logic beyond Protocol + factory + DTOs + errors.** The factory does NOT call `lightglue_runtime` or `tile_store` methods at construction time; those calls happen during `rerank` (per-frame), owned by AZ-343.
+- **Lazy import is mandatory** — direct `from gps_denied_onboard.components.c2_5_rerank.inlier_based_reranker import InlierCountReRanker` in the factory is forbidden (would defeat ADR-002 build-time exclusion).
+- **`@runtime_checkable` MUST be used** — INV-1 isolates the binding-side enforcement of single-thread invariant; runtime_checkable lets composition root assert via `isinstance` without forcing every consumer to import the Protocol. +- **DTOs MUST be `frozen=True, slots=True`** — immutability prevents accidental mutation across thread boundaries; slots reduces memory footprint. +- **Concrete re-ranker modules export `create(config, tile_store, lightglue_runtime)` as their entry-point** — keeps the factory's lazy-import surface uniform; per-strategy constructors stay private. +- **Config schema field `config.rerank.strategy` is an enum** validated at config load — typo'd values fail before the factory runs. +- **The factory does NOT instantiate `LightGlueRuntime`** — that is the runtime root's responsibility, BEFORE this factory runs. AC-10 enforces the identity-share with C3. + +## Risks & Mitigation + +**Risk 1: `runtime_checkable` Protocol checks have known performance cost** +- *Risk*: `isinstance(obj, RuntimeCheckableProtocol)` walks the method table; called per-frame at 3 Hz it could add measurable overhead. +- *Mitigation*: `isinstance` is called ONCE at composition-root binding time (AC-9), NOT per-frame. The per-frame path uses the bound concrete reference. Test asserts the binding-time check is the only `isinstance` call site against `ReRankStrategy`. + +**Risk 2: Lazy-import error message obscures the real failure mode** +- *Risk*: A native library (e.g., LightGlue TRT engine) failing to load triggers `ImportError` from the lazy import, which the factory currently maps to "BUILD flag OFF" — but the actual cause may be a missing `.so` or version mismatch. +- *Mitigation*: The factory catches `ImportError`, inspects `e.msg`; if the message contains "No module named" → "BUILD flag OFF" (the build-time-excluded case); otherwise re-raises the original ImportError preserving the native-library context. 
AC-3 covers the build-flag case; a separate test covers the native-library load case. + +**Risk 3: `compose_root` thread-binding registry / `LightGlueRuntime` identity-share contract is not yet implemented** +- *Risk*: AC-9 + AC-10 reference a "thread-binding registry" and a shared-helper composition that AZ-270 (`compose_root`) and AZ-278 (helper) may not yet provide. +- *Mitigation*: This task's Public API is the factory; the runtime root is responsible for thread binding and helper sharing. If AZ-270 has not yet implemented the registry, this task delivers AC-1..AC-8 + AC-11 + a stub `bind_to_thread(strategy)` interface that AZ-270 fills in. AC-9 / AC-10 are gated on AZ-270's progress and may move to a follow-up task if the registry isn't ready. **Decision**: keep AC-9 / AC-10 in this task; if AZ-270 lacks the registry by implementation time, AZ-270 is the upstream blocker — escalate via the standard tracker dependency mechanism. + +**Risk 4: A future learned re-ranker may need a different constructor signature** +- *Risk*: A future `LearnedReRanker` may need additional dependencies (e.g., a separate `ReRankInferenceRuntime`) that don't fit `create(config, tile_store, lightglue_runtime)`. +- *Mitigation*: The `create` factory pattern is per-module — each module owns its own `create` function. The composition-root factory `build_rerank_strategy` selects the module and invokes its `create`; if a future module needs different deps, the composition root passes them through. Today's signature is `create(config, tile_store, lightglue_runtime)` because every C2.5 strategy will plausibly need those three; if that ever changes, the factory's signature evolves. + +## Runtime Completeness + +- **Named capability**: `ReRankStrategy` Protocol + composition-root factory + ADR-002 build-time exclusion enforcement (architecture / E-C2.5 / `solution.md` "K=10 → N=3 by single-pair LightGlue inlier count" / ADR-001 + ADR-002 + ADR-009). 
+- **Production code that must exist**: real `ReRankStrategy` Protocol + real DTOs + real error hierarchy + real `build_rerank_strategy` factory with real lazy-import + real ImportError mapping + real config schema extension + real composition-root wiring path that identity-shares `LightGlueRuntime` with C3. +- **Allowed external stubs**: tests MAY use `FakeReRankStrategy`, `FakeTileStore`, `FakeLightGlueRuntime`. Production wiring uses the real `InlierCountReRanker` (selected from AZ-343 at composition time) + the real C6 `TileStore` + the real shared `LightGlueRuntime` helper. +- **Unacceptable substitutes**: direct `from gps_denied_onboard.components.c2_5_rerank.inlier_based_reranker import InlierCountReRanker` in the factory (would defeat ADR-002); a `Type[ReRankStrategy]` registry that pre-imports all re-rankers (would defeat lazy-import); skipping the identity-share enforcement (AC-10) and constructing a SECOND `LightGlueRuntime` for C2.5 (would double GPU memory and break the serial-access invariant the helper relies on). + +## Contract + +This task produces/implements the contract at `_docs/02_document/contracts/c2_5_rerank/rerank_strategy_protocol.md`. +Consumers MUST read that file — not this task spec — to discover the interface. diff --git a/_docs/02_tasks/todo/AZ-343_c2_5_inlier_count_reranker.md b/_docs/02_tasks/todo/AZ-343_c2_5_inlier_count_reranker.md new file mode 100644 index 0000000..97f39e3 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-343_c2_5_inlier_count_reranker.md @@ -0,0 +1,227 @@ +# C2.5 InlierCountReRanker — single-pair LightGlue inlier count K=10 → N=3 + +**Task**: AZ-343_c2_5_inlier_count_reranker +**Name**: C2.5 InlierCountReRanker (drop-and-continue) +**Description**: Implement `InlierCountReRanker`, the production-default `ReRankStrategy`. 
For each candidate in C2's top-K=10 `VprResult`, fetch tile pixels from C6, run a single-pair LightGlue forward via the shared `LightGlueRuntime` helper (AZ-278), record the inlier count, then sort descending by inlier count and return the top-N=3 as a `RerankResult`. Implements the drop-and-continue contract (Invariant 8 from the Protocol contract): per-candidate `RerankBackboneError` (LightGlue forward failure) and `TileFetchError` (C6 read failure) are caught inside the loop, the candidate is dropped, an ERROR log + FDR record is emitted, and the success path continues. Zero survivors raise `RerankAllCandidatesFailedError`. Includes the concrete `InlierCountReRankerPreprocessor` if any pre-LightGlue cropping/resizing is needed (single-pair LightGlue input contract MUST be satisfied — owned by AZ-278 helper, but the per-frame side prep happens here). Composition-root wired via the AZ-342 factory (this task's `create` entry-point). +**Complexity**: 3 points +**Dependencies**: AZ-342_c2_5_rerank_strategy_protocol (Protocol + factory + DTOs + errors + composition wiring), AZ-263_initial_structure, AZ-269_config_loader, AZ-278_lightglue_runtime (shared LightGlue helper), AZ-303_c6_storage_interfaces (`TileStore.get_tile_pixels` Public API), AZ-266_log_module, AZ-272_fdr_record_schema +**Component**: c2_5_rerank (epic AZ-256 / E-C2.5) +**Tracker**: AZ-343 +**Epic**: AZ-256 (E-C2.5) + +### Document Dependencies + +- `_docs/02_document/contracts/c2_5_rerank/rerank_strategy_protocol.md` — Protocol contract this task implements (every invariant MUST be satisfied; drop-and-continue is INV-8). +- `_docs/02_document/components/03_c2_5_rerank/description.md` — § 1 architectural pattern; § 2 `RerankResult` semantics (length = N=3 ranked descending by inlier_count); § 5 error handling (drop-and-continue + zero-survivors fallback); § 7 caveats (shared helper serial access, no concurrency); § 9 logging. 
+- `_docs/02_document/module-layout.md` — `c2_5_rerank` Per-Component Mapping (`inlier_based_reranker.py` Internal); `BUILD_RERANK_INLIER_COUNT` row in build-time exclusion map (ON for airborne / research / replay-cli; OFF for operator-tooling).
+- `_docs/02_document/contracts/shared_helpers/lightglue_runtime.md` — `LightGlueRuntime.match_single_pair(query_image, support_image) -> InlierCount` (or equivalent helper API; this task's calls go through that interface).
+- `_docs/02_document/contracts/c6_tile_cache/tile_store.md` — `TileStore.get_tile_pixels(tile_id) -> page-cache-backed handle` semantics (no copy).
+- `_docs/02_document/contracts/c2_vpr/vpr_strategy_protocol.md` — `VprResult` and `VprCandidate` DTOs (consumed at the input boundary).
+- `_docs/02_document/components/03_c2_5_rerank/tests.md` — C2.5-IT-01 (top-1 promotion rate ≥ 0.98); C2.5-IT-02 (drop-and-continue smoke); C2.5-IT-03 (helper serial-access invariant); C2.5-PT-01 (`rerank` p95 ≤ 80 ms for 10 single-pair LightGlue passes; GPU mem ≤ 300 MB shared engine).
+
+## Problem
+
+Without this task:
+
+- The Protocol from AZ-342 has no concrete implementation; the airborne binary cannot start because `compose_root` cannot construct a `ReRankStrategy` for `config.rerank.strategy = "inlier_count"` (the only legal value today).
+- C3 CrossDomainMatcher (AZ-257) has no input source; F3 / F6 cannot run.
+- C2.5-IT-01 (top-1 promotion rate ≥ 0.98), the primary C2.5 acceptance criterion, has no producer; the boundary between cheap retrieval (C2) and expensive matching (C3) is undefended; F3 sees N=10 instead of N=3 candidates and overshoots its latency budget by roughly 3.3× (10/3).
+- The drop-and-continue contract is the ONLY thing standing between a single LightGlue CUDA OOM and a full VIO-only fallback (AC-3.5). Without robust per-candidate error handling, the AC-NEW-7 cache-poisoning safety budget can be triggered by transient backbone errors that have nothing to do with the corpus.
+- The shared `LightGlueRuntime` helper (AZ-278) is constructed but unused unless this task wires it into the per-candidate inlier-counting loop. R14 (apparent C2.5↔C3 cycle) was resolved by helper ownership; that resolution is moot if neither sibling consumer ships. + +## Outcome + +- `src/gps_denied_onboard/components/c2_5_rerank/inlier_based_reranker.py` defining: + - `InlierCountReRanker` class implementing the `ReRankStrategy` Protocol (AZ-342). + - Constructor signature: `__init__(self, tile_store: TileStore, lightglue_runtime: LightGlueRuntime, fdr_client: FdrClient, top_n: int = 3)`. The strategy holds references to the constructor-injected `LightGlueRuntime` (NOT a copy); the helper's lifecycle is owned by the runtime root (per the helper-ownership R14 fix). + - `rerank(frame, vpr_result, n, calibration)`: + 1. `surviving: list[RerankCandidate] = []` + 2. `dropped = 0` + 3. For each `VprCandidate` in `vpr_result.candidates`: + a. Try `tile_pixels_handle = self._tile_store.get_tile_pixels(candidate.tile_id)`. + On `TileFetchError`: emit ERROR log `kind="c2_5.rerank.tile_fetch_error"` + FDR record `kind="rerank.tile_fetch_error"`; `dropped += 1`; continue. + b. Try `inlier_count = self._lightglue_runtime.match_single_pair(query_image=frame.image_bytes_or_decoded, support_image=tile_pixels_handle, calibration=calibration).inlier_count`. + On `LightGlueError` / underlying CUDA / RuntimeError: wrap as `RerankBackboneError`; emit ERROR log `kind="c2_5.rerank.backbone_error"` + FDR record `kind="rerank.backbone_error"` with `tile_id` field; `dropped += 1`; continue. + c. If `inlier_count == 0`: emit DEBUG log `kind="c2_5.rerank.zero_inliers"` (NOT an error — just a no-match candidate); `dropped += 1`; continue. + d. Else: append `RerankCandidate(tile_id=candidate.tile_id, inlier_count=inlier_count, descriptor_distance=candidate.descriptor_distance, descriptor_dim=candidate.descriptor_dim, tile_pixels_handle=tile_pixels_handle)` to `surviving`. + 4. 
If `len(surviving) == 0`: emit ERROR log `kind="c2_5.rerank.all_failed"` + FDR record `kind="rerank.all_failed"` with `frame_id` + `candidates_input` + `candidates_dropped`; raise `RerankAllCandidatesFailedError(...)`.
+    5. Sort `surviving` descending by `inlier_count`; ties broken by `descriptor_distance` ascending (per Invariant 3 deterministic tie-break).
+    6. Truncate to `surviving[:n]`.
+    7. If `len(surviving[:n]) < n`: emit WARN log `kind="c2_5.rerank.fewer_than_n_survivors"` with `{requested: n, returned: len(surviving[:n]), dropped: dropped}` (matches description.md § 9 WARN row).
+    8. Emit DEBUG log `kind="c2_5.rerank.frame_done"` (gated by `config.rerank.debug_per_frame_log`; default false to avoid 3 Hz log volume) with the inlier-count vector. Emit FDR record `kind="rerank.frame_done"` (always, NOT gated) with `{frame_id, candidates_input, candidates_dropped, top_inlier_count, top_tile_id}`.
+    9. Return `RerankResult(frame_id=vpr_result.frame_id, candidates=surviving[:n], reranked_at=monotonic_ns(), rerank_label="inlier_count", candidates_input=len(vpr_result.candidates), candidates_dropped=dropped)`.
+  - Module-level `create(config, tile_store, lightglue_runtime) -> ReRankStrategy`:
+    1. Read `top_n = config.rerank.top_n` (default 3).
+    2. Construct `InlierCountReRanker(tile_store=tile_store, lightglue_runtime=lightglue_runtime, fdr_client=, top_n=top_n)`.
+    3. Return the instance.
+- Composition-root wiring: `runtime_root.compose_root` includes a path that, after constructing the shared `LightGlueRuntime`, invokes `build_rerank_strategy(...)` (the AZ-342 factory) which dispatches to this task's `create`.
+- Logging per description.md § 9:
+  - INFO `kind="c2_5.rerank.ready"` with `{strategy: "inlier_count", N: 3, K: 10}` after construction.
+  - WARN `kind="c2_5.rerank.fewer_than_n_survivors"` per frame when survivors < N.
+  - ERROR `kind="c2_5.rerank.all_failed"` on zero survivors.
+  - ERROR `kind="c2_5.rerank.backbone_error"` per LightGlue failure.
+ - ERROR `kind="c2_5.rerank.tile_fetch_error"` per C6 read failure. + - DEBUG `kind="c2_5.rerank.zero_inliers"` per candidate with zero inliers (gated). + - DEBUG `kind="c2_5.rerank.frame_done"` per frame with inlier vector (gated). +- FDR records emitted: `kind="rerank.frame_done"` (always, per frame), `kind="rerank.backbone_error"` (per error), `kind="rerank.tile_fetch_error"` (per error), `kind="rerank.all_failed"` (per zero-survivors event). + +## Scope + +### Included + +- `InlierCountReRanker` class implementing the `ReRankStrategy` Protocol exactly per the AZ-342 contract (every invariant satisfied). +- Drop-and-continue per-candidate error handling for `RerankBackboneError` AND `TileFetchError`. +- Zero-survivors → `RerankAllCandidatesFailedError` path. +- Deterministic top-N sort: descending by `inlier_count`, ties broken ascending by `descriptor_distance`. +- `RerankCandidate` construction with `tile_pixels_handle` carried as a reference (no copy). +- Module-level `create(config, tile_store, lightglue_runtime)` factory entry-point. +- Composition-root wiring path for `config.rerank.strategy == "inlier_count"` (consumed by AZ-342's factory). +- Logging per description.md § 9 (INFO ready, WARN fewer-than-N, ERROR error paths, DEBUG per-frame distances + zero-inliers). +- FDR record emission for frame-done, error paths, and all-failed. +- Unit tests covering Invariants 1–8, the drop-and-continue contract, the zero-survivors path, the tie-break determinism, the `tile_pixels_handle` reference semantics, the composition-root wiring path. +- `BUILD_RERANK_INLIER_COUNT` CMake flag wiring (per ADR-002): the strategy module is excluded from the operator-tooling binary (operator tooling does not run the per-frame pipeline). + +### Excluded + +- The `ReRankStrategy` Protocol + DTOs + errors + factory — owned by AZ-342 (`AZ-342_c2_5_rerank_strategy_protocol`). 
+- The `LightGlueRuntime` helper itself — already AZ-278 (E-CC-HELPERS); this task consumes the constructor-injected handle and calls `match_single_pair`. +- The C6 `TileStore` interface — owned by AZ-303; this task consumes the Public API. +- The C2 `VprResult` / `VprCandidate` DTOs — owned by AZ-336 (`c2_vpr_strategy_protocol`); this task consumes them at the input boundary. +- LightGlue engine compile (`.onnx` → `.trt`) — owned by AZ-321 (`c10_engine_compiler`); the helper handle wraps the produced engine. +- C3 CrossDomainMatcher — separate epic / task (AZ-257's component decomposition). +- Component-internal acceptance tests beyond Protocol + invariants + drop-and-continue smoke: C2.5-IT-01 (top-1 promotion rate ≥ 0.98 against a fixture corpus), C2.5-PT-01 (latency NFR `rerank` p95 ≤ 80 ms), are deferred to Step 9 / E-BBT. +- Any cross-component re-rank tuning (e.g., learned re-rankers) — future task in a follow-up cycle. + +## Acceptance Criteria + +**AC-1: Protocol conformance** +Given a constructed `InlierCountReRanker` instance +When `isinstance(strategy, ReRankStrategy)` is evaluated +Then the result is `True`; the instance has `rerank` + +**AC-2: Top-N ordering — descending by inlier_count, ties broken ascending by descriptor_distance** +Given a `VprResult` with K=10 candidates whose inlier counts (after rerank) are [412, 198, 287, 153, 287, 0, 65, 412, 89, 234] and descriptor_distances [0.1, 0.4, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] AND `n=3` +When `rerank(frame, vpr_result, n=3, calibration)` is called +Then `RerankResult.candidates[0].inlier_count == 412 AND descriptor_distance == 0.1` (tie-break: lower distance ranked first); `candidates[1].inlier_count == 412 AND descriptor_distance == 0.8`; `candidates[2].inlier_count == 287 AND descriptor_distance == 0.2`; `len(candidates) == 3`; the candidate with `inlier_count == 0` is dropped (DEBUG log emitted) + +**AC-3: Drop-and-continue on `RerankBackboneError`** +Given a `VprResult` with K=10 candidates 
AND a `LightGlueRuntime` test double that raises `LightGlueError` on the 4th call (4th candidate) and succeeds on all others +When `rerank(...)` is called with `n=3` +Then the call returns successfully; `RerankResult.candidates` has 3 survivors selected from the 9 successful candidates; `candidates_dropped == 1` (or higher if zero-inlier candidates were also present); ONE ERROR log `kind="c2_5.rerank.backbone_error"` is emitted with `tile_id` of the 4th candidate; ONE FDR record `kind="rerank.backbone_error"` is emitted + +**AC-4: Drop-and-continue on `TileFetchError`** +Given a `VprResult` with K=10 candidates AND a `TileStore` test double that raises `TileFetchError` on the 7th candidate's `get_tile_pixels` call +When `rerank(...)` is called with `n=3` +Then the call returns successfully; `RerankResult.candidates` has 3 survivors from the 9 fetched candidates; `candidates_dropped >= 1`; ONE ERROR log `kind="c2_5.rerank.tile_fetch_error"` is emitted; ONE FDR record `kind="rerank.tile_fetch_error"` is emitted + +**AC-5: Zero survivors → `RerankAllCandidatesFailedError`** +Given a `VprResult` with K=10 candidates AND a `LightGlueRuntime` test double that raises `LightGlueError` on EVERY call +When `rerank(...)` is called with `n=3` +Then `RerankAllCandidatesFailedError` is raised with message containing the input candidate count; TEN ERROR logs `kind="c2_5.rerank.backbone_error"` are emitted (one per candidate); ONE final ERROR log `kind="c2_5.rerank.all_failed"` is emitted; ONE FDR record `kind="rerank.all_failed"` is emitted with `{candidates_input: 10, candidates_dropped: 10}` + +**AC-6: Fewer than N survivors → WARN log + partial result** +Given a `VprResult` with K=10 candidates AND a configuration where 8 candidates fail (mix of `RerankBackboneError` + zero-inliers) and 2 succeed +When `rerank(...)` is called with `n=3` +Then `RerankResult.candidates` has 2 survivors (NOT padded; NOT raised); `candidates_dropped == 8`; ONE WARN log 
`kind="c2_5.rerank.fewer_than_n_survivors"` with `{requested: 3, returned: 2, dropped: 8}` is emitted + +**AC-7: `tile_pixels_handle` is a reference, NOT a copy** +Given a `RerankResult` returned from `rerank(...)` +When the underlying tile pixel buffer (in the C6 page-cache-backed `tile_pixels_handle`) is mutated externally +Then a re-read via the same `tile_pixels_handle` reflects the mutation (proves identity, not a copy); `RerankResult.candidates[0].tile_pixels_handle is original_handle_returned_by_tile_store_get_tile_pixels` + +**AC-8: `descriptor_distance` carried forward unchanged** +Given a `VprResult` whose top candidate has `descriptor_distance == 0.123456789` +When `rerank(...)` is called and the candidate survives +Then `RerankResult.candidates[i].descriptor_distance == 0.123456789` (bit-exact for the FP type used in `VprCandidate`) + +**AC-9: Deterministic — same inputs → bit-identical RerankResult** +Given the same `(frame, vpr_result, n, calibration)` AND a `LightGlueRuntime` test double whose `match_single_pair` is deterministic +When `rerank(...)` is called 3 times +Then all three returns have identical `candidates` (same `tile_id`s, same `inlier_count`s, same order); `frame_id` matches `vpr_result.frame_id` in all three; `reranked_at` differs across calls (monotonic_ns) but `candidates` does not + +**AC-10: Composition-root wiring — `config.rerank.strategy = "inlier_count"`** +Given `config.rerank.strategy = "inlier_count"` AND `config.rerank.top_n = 3` AND a constructed shared `LightGlueRuntime` AND a constructed `TileStore` +When `compose_root(config)` runs +Then an `InlierCountReRanker` instance is wired into the runtime root; ONE INFO log `kind="c2_5.rerank.ready"` with `{strategy: "inlier_count", N: 3, K: 10}` is emitted; the strategy's `_lightglue_runtime` is identity-equal to the runtime root's shared helper + +**AC-11: FDR emission — frame_done record per frame** +Given a successful `rerank(...)` call returning 3 survivors with inlier 
counts [412, 287, 198] and `candidates_dropped == 2` +When the call completes +Then ONE FDR record `kind="rerank.frame_done"` is emitted with structured fields `{frame_id: , candidates_input: 10, candidates_dropped: 2, top_inlier_count: 412, top_tile_id: }` + +**AC-12: Single-pair LightGlue invocation — frame ↔ tile only** +Given a `VprResult` with K=10 candidates +When `rerank(...)` is called with all candidates succeeding +Then `lightglue_runtime.match_single_pair` is called EXACTLY 10 times (once per candidate); each call's `query_image` is the same `frame.image_bytes_or_decoded` reference; each call's `support_image` differs (one per candidate's `tile_pixels_handle`) + +## Non-Functional Requirements + +**Performance** (deferred validation to C2.5-PT-01 / E-BBT; this task delivers the implementation): +- `rerank` p95 ≤ 80 ms for 10 single-pair LightGlue passes — bounded by 10 × LightGlue forward time (~6-7 ms each on TRT 10.3 FP16 per AZ-278's helper benchmarks) + Python overhead. The Python-side overhead per candidate (fetch handle + log emit + sort) MUST be ≤ 1 ms p95 to keep the LightGlue compute path on budget. +- GPU memory: ≤ 300 MB resident for the shared LightGlue engine — owned by AZ-278 helper; this task consumes one engine instance and does NOT reload. + +**Compatibility** +- The `LightGlueRuntime.match_single_pair` API is owned by AZ-278; this task consumes the published method signature. If AZ-278's API evolves (additional args, different return type), this task is the upstream caller that must update — surfaced by the standard tracker dependency mechanism. +- The `TileStore.get_tile_pixels` Public API is owned by AZ-303; same pattern. + +**Reliability** +- Drop-and-continue is the primary reliability mechanism — a transient LightGlue CUDA OOM on one candidate must NOT propagate to the whole frame. 
+- The strategy is single-threaded by contract (INV-1, AZ-342); composition root binds it to the same ingest thread as C3 (because they share `LightGlueRuntime`). +- Zero-survivors raises `RerankAllCandidatesFailedError`; downstream C5 falls back to VIO-only with provenance `visual_propagated` (AC-3.5 / description.md § 5 hard-failure path). + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| AC-1 | `isinstance(InlierCountReRanker(...), ReRankStrategy)` | `True` | +| AC-2 | Top-N ordering with mixed inlier counts + ties + zeros | Sorted descending by inlier_count; tie-break ascending by descriptor_distance; zero-inliers dropped | +| AC-3 | LightGlue raises on 4th candidate; 9 succeed | 3 survivors from 9; `candidates_dropped >= 1`; ERROR log + FDR record emitted | +| AC-4 | TileStore raises on 7th candidate | 3 survivors from 9; ERROR log + FDR record emitted | +| AC-5 | Every candidate fails | `RerankAllCandidatesFailedError`; 10 ERROR logs + 1 final ERROR log + 1 FDR record `kind="rerank.all_failed"` | +| AC-6 | 8 fail, 2 succeed | 2 survivors (NOT padded); WARN log emitted | +| AC-7 | `tile_pixels_handle` reference semantics | Identity preserved; mutation visible across reads | +| AC-8 | `descriptor_distance` carried forward | Bit-exact match with input | +| AC-9 | Deterministic — 3 calls with same inputs | Identical `candidates`; differing `reranked_at` | +| AC-10 | `compose_root(config="inlier_count")` | Wired; INFO log emitted; helper identity-shared with C3 | +| AC-11 | FDR `kind="rerank.frame_done"` emission | Emitted once per successful call with correct fields | +| AC-12 | Single-pair LightGlue invocation count | Exactly K calls; query_image identity-shared across calls | +| Drop-and-continue mixed | Mixed `RerankBackboneError` + `TileFetchError` + zero-inliers + successes on K=10 | All non-success candidates dropped; logs and FDR records per failure type | + +## Constraints + +- **Single-pair 
LightGlue ONLY** — this strategy does NOT use multi-pair LightGlue or batched inference. Per description.md § 5, the inlier count is a SINGLE-PAIR forward pass. Batched LightGlue is a different optimisation path (deferred to a future cycle if K=10 proves too slow). +- **Drop-and-continue is mandatory** — Invariant 8 from the contract is non-negotiable; any per-candidate exception MUST be caught and converted to a drop event. Re-raising a per-candidate exception is forbidden; the only escape from `rerank` is `RerankAllCandidatesFailedError` (zero survivors) or success. +- **`tile_pixels_handle` reference, not copy** — Invariant 6; copying would defeat AC-4.1 latency budget. The C6 contract guarantees the page-cache-backed handle is valid for the duration of the `rerank` call (TTL covers a per-frame window). +- **Constructor injection only** — no `import gps_denied_onboard.config` inside the strategy module; config is consumed via the `create` factory. +- **`LightGlueRuntime` is constructor-injected, NOT instantiated here** — the runtime root constructs ONE shared instance and passes it to both this strategy AND the C3 matcher (per the helper-ownership R14 fix). +- **Logging respects DEBUG-gating** — per-frame DEBUG logs (zero-inliers, frame-done) are gated by `config.rerank.debug_per_frame_log` (default false); flooding journald at 3 Hz × K=10 = 30 events/sec by default would violate the spirit of description.md § 9. +- **FDR `kind="rerank.frame_done"` is NOT gated** — it is the primary forensic record for AC-NEW-7 cache-poisoning post-flight analysis; emission rate is 3 Hz which fits FDR's 200 Hz aggregate budget (AC-NEW-3 / E-C13 NFR). + +## Risks & Mitigation + +**Risk 1: `LightGlueRuntime.match_single_pair` API does not yet exist on AZ-278's helper** +- *Risk*: AZ-278 is in `_docs/02_tasks/todo/`; its API surface may not include a `match_single_pair` method explicitly — only a generic `match` or `match_batch`. 
+- *Mitigation*: This task documents the expected API surface (`match_single_pair(query_image, support_image, calibration) -> InlierCount`); if AZ-278 ships only a batched API, a thin per-call wrapper around the batched API can stay inside this strategy (one batch of size 1). Surface this to the AZ-278 implementer at decompose-step-4 cross-verification time as a coordination point. + +**Risk 2: 10 single-pair LightGlue calls saturate the GPU stream and serialise behind C3's per-pair work** +- *Risk*: The shared `LightGlueRuntime` requires serial access (Invariant 1 + helper contract); if C3's matcher calls the helper concurrently from a different thread (which it should not), deadlock or cross-frame data corruption could result. +- *Mitigation*: Composition root binds C2.5 and C3 to the SAME single ingest thread (per AZ-342 AC-10); the helper's serial-access invariant is satisfied by single-thread binding. The helper itself MAY add an internal assertion that the calling thread matches the binding thread (owned by AZ-278). C2.5-IT-03 verifies the serial-access invariant. + +**Risk 3: `tile_pixels_handle` lifetime exceeds the C6 page-cache TTL** +- *Risk*: The handle is a reference; if C6 evicts the page before C3 reads from it (in a future frame), C3 sees stale or zero pixels. +- *Mitigation*: C6's contract guarantees the handle is valid for the duration of the per-frame pipeline window (~333 ms at 3 Hz). C3 must consume the handle within the same frame; the per-frame pipeline orchestration (runtime root) enforces no cross-frame retention. `RerankResult` is consumed-once. + +**Risk 4: `inlier_count == 0` is treated as a drop, but a legitimately-low-overlap match might still have value to C3** +- *Risk*: Dropping zero-inlier candidates may discard a candidate that C3 could rescue with its more powerful cross-domain matcher. +- *Mitigation*: Per description.md § 7 caveats, the re-rank correctness depends on inlier count being a meaningful proxy.
Zero inliers means the LightGlue forward pass found NO geometric agreement at all — C3's cross-domain matcher would also fail because it operates on an inferior fixed-domain prior. Dropping zero-inliers is the right call. If a future cycle finds counter-examples, the threshold (`inlier_count > 0` → `inlier_count >= 1` → `inlier_count >= MIN_RERANK_INLIERS`) becomes a config knob. + +**Risk 5: `monotonic_ns()` call on the hot path is non-trivial in CPython** +- *Risk*: 3 Hz × N timestamps = 12 timestamp calls per second; `time.monotonic_ns()` is ~50 ns each; negligible. +- *Mitigation*: No mitigation needed; called out for completeness in case profiling later identifies it. + +## Runtime Completeness + +- **Named capability**: `InlierCountReRanker` — production-default `ReRankStrategy` for K=10 → N=3 by single-pair LightGlue inlier count (architecture / E-C2.5 / `solution.md` "single-pair LightGlue inlier count" / AC-2.5-IT-01 + AC-4.1). +- **Production code that must exist**: real `InlierCountReRanker` calling real `LightGlueRuntime.match_single_pair` with the real shared TRT-compiled LightGlue engine; real `TileStore.get_tile_pixels` page-cache-backed handle fetch; real composition-root wiring through the AZ-342 factory. +- **Allowed external stubs**: tests MAY use `FakeLightGlueRuntime` returning pre-computed inlier counts (AC-2..AC-9), `FakeTileStore` returning a fake handle (AC-4 / AC-7 / AC-10), `FakeFdrClient` (verifying FDR record emission), a synthetic frame fixture; production wiring uses the real C6 + AZ-278 helper + LightGlue engine. 
+- **Unacceptable substitutes**: a pure-Python NumPy implementation of LightGlue inlier counting (would not satisfy C2.5-PT-01 latency at 80 ms p95; would defeat the GPU-bound architectural choice); skipping the drop-and-continue contract and propagating per-candidate exceptions (would break Invariant 8 from the contract); copying tile pixels into the `RerankCandidate` instead of holding the C6 page-cache handle (would violate Invariant 6 and inflate per-frame allocations); calling LightGlue in batched mode without a per-candidate inlier breakdown (would lose the inlier-per-candidate signal needed for ranking); instantiating a SECOND `LightGlueRuntime` for C2.5 instead of consuming the runtime-root-shared one (would double GPU memory and break the helper-ownership R14 fix); ignoring zero-survivors and returning an empty `RerankResult` instead of raising `RerankAllCandidatesFailedError` (would propagate empty input to C3 instead of triggering the C5 VIO-only fallback). diff --git a/_docs/02_tasks/todo/AZ-344_c3_matcher_protocol.md b/_docs/02_tasks/todo/AZ-344_c3_matcher_protocol.md new file mode 100644 index 0000000..5b5b005 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-344_c3_matcher_protocol.md @@ -0,0 +1,190 @@ +# C3 CrossDomainMatcher Protocol + Factory + Composition + +**Task**: AZ-344_c3_matcher_protocol +**Name**: C3 `CrossDomainMatcher` Protocol + Factory + Composition +**Description**: Define the public `CrossDomainMatcher` Protocol (PEP 544 structural interface), the C3 DTOs (`CandidateMatchSet`, `MatchResult`, `MatcherHealth`), the error hierarchy (`MatcherError` family with `MatcherBackboneError`, `InsufficientInliersError`), and the composition-root factory `build_matcher_strategy(config, lightglue_runtime, ransac_filter, inference_runtime) -> CrossDomainMatcher` that selects the concrete matcher at startup based on `config.matcher.strategy` with lazy import + `BUILD_MATCHER_` flag gating per ADR-002. 
Includes the rolling-window `MatcherHealth` accumulator infrastructure (constructor-injected into every concrete matcher; updated inside `match` after each frame). The shared `LightGlueRuntime` (AZ-278) and `RansacFilter` (AZ-282) helpers are constructor-injected — neither owned by C3. This task delivers the foundational scaffolding every concrete matcher (AZ-345..AZ-347) depends on; no concrete backbone is implemented here. +**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-270_compose_root, AZ-278_lightglue_runtime, AZ-282_ransac_filter, AZ-297_c7_runtime_protocol (for `InferenceRuntime` interface), AZ-266_log_module +**Component**: c3_matcher (epic AZ-257 / E-C3) +**Tracker**: AZ-344 +**Epic**: AZ-257 (E-C3) + +### Document Dependencies + +- `_docs/02_document/contracts/c3_matcher/cross_domain_matcher_protocol.md` — the public contract this task implements (Protocol surface + DTOs + error hierarchy + factory signature + 9 invariants + test cases). +- `_docs/02_document/components/04_c3_matcher/description.md` — § 1 architectural pattern (Strategy); § 2 `CrossDomainMatcher` interface + DTOs; § 5 error handling (drop-and-continue + below-threshold + all-failed); § 7 caveats (shared helper serial access); § 9 logging. +- `_docs/02_document/module-layout.md` — `c3_matcher` Per-Component Mapping; `BUILD_MATCHER_` rows; § Layer 3. +- `_docs/02_document/architecture.md` — ADR-001, ADR-002, ADR-009. +- `_docs/02_document/contracts/c2_5_rerank/rerank_strategy_protocol.md` — `RerankResult` DTO consumed at the input boundary. +- `_docs/02_document/contracts/shared_helpers/lightglue_runtime.md` — helper handle. +- `_docs/02_document/contracts/shared_helpers/ransac_filter.md` — helper API consumed for inlier filtering. +- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — `InferenceRuntime` interface consumed by every concrete backbone. 
+ +## Problem + +Without this task, every concrete matcher (AZ-345..AZ-347) and the downstream C3.5 ConditionalRefiner (AZ-258 component) would each invent their own ad-hoc interface, breaking ADR-001 (Strategy), ADR-002 (build-time exclusion), and ADR-009 (interface-first DI). The rolling `MatcherHealth` accumulator also needs ONE owner; without this task, every concrete matcher would re-implement the rolling window → drift between strategies → C5's spoof-promotion gate (AC-NEW-2 / AC-NEW-7) would behave inconsistently across matchers. + +## Outcome + +- `src/gps_denied_onboard/components/c3_matcher/interface.py` defining the `CrossDomainMatcher` Protocol (`@runtime_checkable`) with `match` and `health_snapshot`. Docstring encodes all 9 invariants from the contract. +- `src/gps_denied_onboard/components/c3_matcher/__init__.py` re-exporting the Protocol + DTOs (`CrossDomainMatcher`, `MatchResult`, `MatcherHealth`). +- `src/gps_denied_onboard/_types/matcher.py` defining the three frozen + slotted dataclasses: `CandidateMatchSet`, `MatchResult`, `MatcherHealth`. Cross-component-consumed → lives in shared `_types/`. +- `src/gps_denied_onboard/components/c3_matcher/errors.py` defining `MatcherError`, `MatcherBackboneError`, `InsufficientInliersError`. +- `src/gps_denied_onboard/components/c3_matcher/_health_window.py` defining `RollingHealthWindow` — the 60 s rolling window with O(1) accumulators (`consecutive_low_inlier`, `mean_inliers_60s`, `backbone_error_count_60s`). Provided to every concrete matcher via constructor injection so they share semantics. +- `src/gps_denied_onboard/runtime_root/matcher_factory.py` exporting `build_matcher_strategy(config, lightglue_runtime, ransac_filter, inference_runtime) -> CrossDomainMatcher`. The factory: + 1. Reads `config.matcher.strategy` (one of: `"disk_lightglue"`, `"aliked_lightglue"`, `"xfeat"`). + 2. Lazy-imports per the strategy resolution table. + 3. 
ImportError "No module named" → `ConfigurationError(f"BUILD_MATCHER_{strategy.upper()} is OFF...")`. Other ImportErrors re-raised. + 4. Constructs the strategy via its module-level `create(config, lightglue_runtime, ransac_filter, inference_runtime, health_window)` factory function. + 5. Returns the instance. +- Composition-root `compose_root` extension: invoke `build_matcher_strategy` AFTER `LightGlueRuntime` + `RansacFilter` are constructed; bind the result to the same C2.5 ingest thread. Identity-share the `LightGlueRuntime` instance with C2.5 (per AZ-342 AC-10). +- Config schema extension to AZ-269: `config.matcher.strategy` (enum), `config.matcher.min_inliers_threshold` (int, default 60), `config.matcher.residual_warn_threshold_px` (float, default 2.5). +- INFO log on every successful `build_matcher_strategy`: `kind="c3.matcher.strategy_loaded"` with strategy name + thresholds. +- ERROR log on `ConfigurationError` (specific missing flag). + +## Scope + +### Included +- The `CrossDomainMatcher` Protocol with both `match` and `health_snapshot` methods. +- The three DTOs in `_types/matcher.py`. +- The three-class error hierarchy in `c3_matcher/errors.py`. +- The `RollingHealthWindow` accumulator in `_health_window.py` (constructor-injected into every matcher). +- The composition-root factory with lazy-import + `ConfigurationError` mapping. +- Config schema extension for `config.matcher.{strategy, min_inliers_threshold, residual_warn_threshold_px}`. +- Strategy resolution table comment matching the contract verbatim. +- Composition-root wiring path that identity-shares `LightGlueRuntime` with C2.5. +- Unit tests covering: Protocol conformance (`runtime_checkable`), DTO immutability + slots, factory rejection on missing flag, factory acceptance for valid values, rolling window O(1) accumulator correctness, INFO log emission, error hierarchy catchability. +- INFO / ERROR log emission per description.md § 9. 
+ +### Excluded +- Any concrete matcher implementation — owned by AZ-345 (DISK+LightGlue), AZ-346 (ALIKED+LightGlue), AZ-347 (XFeat). +- The `LightGlueRuntime` helper — already AZ-278. +- The `RansacFilter` helper — already AZ-282. +- The C7 `InferenceRuntime` — owned by AZ-297. +- The C2.5 `RerankResult` DTO — consumed; produced by AZ-342. +- Component-internal acceptance tests beyond Protocol-conformance + factory-validation: C3-IT-01..05 + C3-PT-01 deferred to Step 9 / E-BBT. + +## Acceptance Criteria + +**AC-1: Protocol conformance — `runtime_checkable`** +Given a `FakeMatcher` test double implementing both `match` and `health_snapshot` +When `isinstance(fake, CrossDomainMatcher)` is evaluated +Then result is `True`; an object missing either method returns `False` + +**AC-2: DTO immutability + slots** +All three DTOs use `frozen=True, slots=True`; mutation raises `FrozenInstanceError`; `__slots__` non-empty. + +**AC-3: Factory rejects missing build flag** +Given `config.matcher.strategy = "nonexistent_matcher"` +When `build_matcher_strategy(...)` is called +Then `ConfigurationError("BUILD_MATCHER_NONEXISTENT_MATCHER is OFF...")` is raised; ONE ERROR log `kind="c3.matcher.build_flag_off"` is emitted. + +**AC-4: Factory rejects unknown strategy at config-load time** +Given `config.matcher.strategy = "garbage"` (not in the resolution table) +When `load_config(...)` is called +Then `ConfigurationError` raised at config-load time; the factory is never invoked. + +**AC-5: Successful factory load emits INFO log** +Given `config.matcher.strategy = "disk_lightglue"` AND a valid lazy-importable test double module +When `build_matcher_strategy(...)` is called +Then a `CrossDomainMatcher` instance is returned; ONE INFO log `kind="c3.matcher.strategy_loaded"` is emitted with `{strategy, min_inliers_threshold, residual_warn_threshold_px}`. 
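
The factory behaviour in AC-3 and AC-5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the module paths in `STRATEGY_MODULES` are guesses (the authoritative resolution table lives in the contract), `ConfigurationError` is a local stand-in for the project's own type, and the keyword-only `health_window` parameter is an assumption about where the `RollingHealthWindow` gets constructed.

```python
import importlib
import logging

log = logging.getLogger("c3.matcher")

# Assumed module paths; the authoritative table is in the contract document.
_PKG = "gps_denied_onboard.components.c3_matcher"
STRATEGY_MODULES = {
    "disk_lightglue": f"{_PKG}.disk_lightglue",
    "aliked_lightglue": f"{_PKG}.aliked_lightglue",
    "xfeat": f"{_PKG}.xfeat",
}


class ConfigurationError(Exception):
    """Stand-in for the project's own configuration error type."""


def build_matcher_strategy(config, lightglue_runtime, ransac_filter,
                           inference_runtime, *, health_window=None):
    strategy = config.matcher.strategy  # assumed enum-validated at config load
    try:
        # Lazy import: excluded matchers are never loaded at startup.
        module = importlib.import_module(STRATEGY_MODULES[strategy])
    except ImportError as exc:
        if "No module named" in str(exc):
            log.error("matcher build flag off",
                      extra={"kind": "c3.matcher.build_flag_off"})
            raise ConfigurationError(
                f"BUILD_MATCHER_{strategy.upper()} is OFF") from exc
        raise  # any other ImportError keeps its native context
    matcher = module.create(config, lightglue_runtime, ransac_filter,
                            inference_runtime, health_window)
    log.info("matcher strategy loaded", extra={
        "kind": "c3.matcher.strategy_loaded",
        "strategy": strategy,
        "min_inliers_threshold": config.matcher.min_inliers_threshold,
        "residual_warn_threshold_px": config.matcher.residual_warn_threshold_px,
    })
    return matcher
```

Matching on the message text follows the wording of this spec. A stricter variant might compare `exc.name` against the resolved module path, so that a missing third-party dependency inside an enabled matcher is not reported as a disabled build flag.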
+ +**AC-6: Strategy resolution — every entry resolves to its module path** +Given each of three valid `config.matcher.strategy` values +When `build_matcher_strategy` is called for each +Then resolved module path matches the contract's table verbatim. + +**AC-7: Error hierarchy catchability** +Test instances of `MatcherBackboneError` + `InsufficientInliersError` caught by `except MatcherError`. + +**AC-8: Public API surface — `__init__.py` re-exports** +Given `from gps_denied_onboard.components.c3_matcher import CrossDomainMatcher, MatchResult, MatcherHealth` +When the import is evaluated +Then all three names resolve; internal names (`RollingHealthWindow`, `_health_window`) are NOT in `__all__`. + +**AC-9: Strategy bound to single ingest thread by composition root** +Single-thread binding enforced; second binding attempt raises `RuntimeError`. + +**AC-10: `LightGlueRuntime` is identity-shared between C3 and C2.5** +Given a `compose_root(config)` invocation that wires both C2.5 and C3 +When the resulting strategies are inspected +Then `c3_strategy._lightglue_runtime is c2_5_strategy._lightglue_runtime` (identity); ONE INFO log confirming the shared binding is emitted (the SAME log line as AZ-342 AC-10 — emitted ONCE). + +**AC-11: `RollingHealthWindow` O(1) accumulator correctness** +Given a `RollingHealthWindow` (60 s) AND a sequence of (frame_id, inlier_count, had_backbone_error) events spanning 90 s +When `health_snapshot()` is called at t=60s, t=70s, t=90s +Then `consecutive_low_inlier`, `mean_inliers_60s`, `backbone_error_count_60s` match an independent sliding-window computation; each `health_snapshot()` call is O(1) (microbench p99 ≤ 50 µs). 
+ +**AC-12: `RollingHealthWindow.update()` API** +Given the window has incremental update entry-points (called from inside concrete matchers' `match` after each frame) +When `update(timestamp_ns, best_inlier_count, had_backbone_error)` is called +Then accumulators are updated incrementally; `mean_inliers_60s` is the rolling mean; `consecutive_low_inlier` resets to 0 when a frame's `inlier_count >= min_inliers_threshold`. + +## Non-Functional Requirements + +**Performance** +- `build_matcher_strategy` p99 ≤ 50 ms (factory itself; concrete-strategy construction cost owned by AZ-345..AZ-347). +- `RollingHealthWindow.update` p99 ≤ 5 µs; `health_snapshot` p99 ≤ 50 µs. + +**Compatibility** +- Protocol method-signature changes are major version bumps (lockstep update). +- DTO field additions are minor; field removals are major. + +**Reliability** +- Lazy-import via `importlib.import_module`; build-time-excluded matchers never load CUDA / TensorRT. +- Single-thread invariant enforced at composition-root binding time (AC-9). +- `RollingHealthWindow` is a non-thread-safe single-thread structure; matches the single-thread binding invariant. 
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| AC-1 | Protocol conformance | Fake passes; partial fake fails | +| AC-2 | DTO immutability + slots | `FrozenInstanceError`; non-empty `__slots__` | +| AC-3 | Factory + missing flag | `ConfigurationError`; ERROR log | +| AC-4 | Config load + unknown strategy | `ConfigurationError` at load time | +| AC-5 | Factory + valid load | Strategy returned; INFO log with structured fields | +| AC-6 | All 3 strategy values resolve | Module paths match resolution table | +| AC-7 | Error catchability | All errors caught by `except MatcherError` | +| AC-8 | Public API re-exports | Public names resolve; internals not in `__all__` | +| AC-9 | Single-thread binding | Second binding raises `RuntimeError` | +| AC-10 | Identity-share with C2.5 | `is` identity preserved | +| AC-11 | Rolling window correctness | Matches independent sliding-window computation | +| AC-12 | `update` semantics | `consecutive_low_inlier` resets on a high-inlier frame; `mean_inliers_60s` is rolling | +| NFR-perf-window | `RollingHealthWindow.update` × 100k | p99 ≤ 5 µs | + +## Constraints + +- **Lazy import is mandatory** (ADR-002). +- **`@runtime_checkable` MUST be used.** +- **DTOs MUST be `frozen=True, slots=True`.** +- **Concrete matcher modules export `create(...)` as their entry-point.** +- **`config.matcher.strategy` is an enum** validated at config load. +- **The factory does NOT instantiate `LightGlueRuntime` or `RansacFilter`** — runtime root constructs ONCE and shares with C2.5 (AC-10). +- **`RollingHealthWindow` is single-thread** — no locks; matches single-thread binding invariant. Adding locks would mask binding bugs. + +## Risks & Mitigation + +**Risk 1: `runtime_checkable` Protocol checks have known performance cost** +- *Mitigation*: `isinstance` only at composition-root binding (AC-9), not per-frame. 
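
The conformance check from AC-1, performed once at binding time as the Risk 1 mitigation describes, might look like the sketch below. The parameter and return types are left as `Any` because the real DTOs are defined elsewhere; `FakeMatcher`, `PartialFake` and `bind_strategy` are hypothetical names used for illustration.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CrossDomainMatcher(Protocol):
    def match(self, frame: Any, rerank_result: Any, calibration: Any) -> Any: ...

    def health_snapshot(self) -> Any: ...


class FakeMatcher:
    """Test double that provides both Protocol methods."""

    def match(self, frame, rerank_result, calibration):
        return None

    def health_snapshot(self):
        return None


class PartialFake:
    """Lacks health_snapshot, so it should fail the isinstance check."""

    def match(self, frame, rerank_result, calibration):
        return None


def bind_strategy(strategy):
    # Checked once at composition-root binding time, never per frame.
    if not isinstance(strategy, CrossDomainMatcher):
        raise TypeError("strategy does not implement CrossDomainMatcher")
    return strategy
```

Note that a `runtime_checkable` `isinstance` check only verifies that the methods exist; it does not validate their signatures, so signature drift still has to be caught by type checking or contract tests.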
+ +**Risk 2: Lazy-import error message obscures real failure mode** +- *Mitigation*: factory catches `ImportError`, inspects message; "No module named" → "BUILD flag OFF"; otherwise re-raises preserving native context. + +**Risk 3: `RollingHealthWindow` 60 s window data structure choice (deque vs ring buffer)** +- *Mitigation*: implementation detail; AC-11 + AC-12 + NFR-perf-window are the contract. Any structure satisfying O(1) update + O(1) snapshot is acceptable. + +**Risk 4: `compose_root` thread-binding registry / shared-helper composition not yet implemented in AZ-270** +- *Mitigation*: same as AZ-342 Risk 4. Keep AC-9 / AC-10; if AZ-270 lacks the registry, escalate via tracker dependency mechanism. + +## Runtime Completeness + +- **Named capability**: `CrossDomainMatcher` Protocol + composition-root factory + `RollingHealthWindow` accumulator + ADR-002 build-time exclusion. +- **Production code that must exist**: real Protocol + real DTOs + real error hierarchy + real `build_matcher_strategy` factory + real `RollingHealthWindow` + real config schema extension + real composition-root wiring path that identity-shares `LightGlueRuntime` with C2.5. +- **Allowed external stubs**: `FakeMatcher`, `FakeLightGlueRuntime`, `FakeRansacFilter`, `FakeInferenceRuntime` for tests. Production wiring uses real concretes. +- **Unacceptable substitutes**: direct `from .disk_lightglue import DiskLightGlueMatcher` in the factory (defeats ADR-002); a `Type[CrossDomainMatcher]` registry that pre-imports all matchers (defeats lazy-import); making `RollingHealthWindow` thread-safe with locks (would mask single-thread binding bugs); skipping the identity-share with C2.5 (would double GPU memory). + +## Contract + +This task produces/implements the contract at `_docs/02_document/contracts/c3_matcher/cross_domain_matcher_protocol.md`. +Consumers MUST read that file — not this task spec — to discover the interface. 
diff --git a/_docs/02_tasks/todo/AZ-345_c3_disk_lightglue.md b/_docs/02_tasks/todo/AZ-345_c3_disk_lightglue.md new file mode 100644 index 0000000..ae5b0a4 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-345_c3_disk_lightglue.md @@ -0,0 +1,209 @@ +# C3 DISK+LightGlue Primary Matcher + +**Task**: AZ-345_c3_disk_lightglue +**Name**: C3 DISK+LightGlue Primary Matcher +**Description**: Implement `DiskLightGlueMatcher`, the production-default `CrossDomainMatcher` (per D-C3-1 = (a)). For each top-N=3 candidate in a `RerankResult`: extract DISK keypoints + descriptors from the nav-camera frame and the candidate tile via the C7 `InferenceRuntime` (TensorRT 10.3 FP16 primary, ONNX-Runtime fallback); match keypoints via the shared `LightGlueRuntime` helper (AZ-278); filter inliers + compute median reprojection residual via the shared `RansacFilter` helper (AZ-282); record the result in a `CandidateMatchSet`. Sort surviving candidates descending by inlier count (tie-break: lower median residual ranked higher); return the best as `MatchResult.best_candidate_idx`. Implements the drop-and-continue contract (Invariant 4) for per-candidate `MatcherBackboneError`. Updates the constructor-injected `RollingHealthWindow` after each frame. Composition-root wired via the AZ-344 factory. 
+**Complexity**: 5 points +**Dependencies**: AZ-344 (Protocol + factory + DTOs + errors + RollingHealthWindow), AZ-263_initial_structure, AZ-269_config_loader, AZ-278_lightglue_runtime (shared LightGlue helper), AZ-282_ransac_filter (shared RANSAC helper), AZ-298_c7_tensorrt_runtime (DISK forward via TRT), AZ-299_c7_onnxrt_fallback (DISK forward via ONNX-RT fallback), AZ-303_c6_storage_interfaces (`tile_pixels_handle` from `RerankResult`; tile pixel decode), AZ-281_engine_filename_schema (DISK engine self-describing filename), AZ-321_c10_engine_compiler (DISK + LightGlue engine compile path), AZ-266_log_module, AZ-272_fdr_record_schema +**Component**: c3_matcher (epic AZ-257 / E-C3) +**Tracker**: AZ-345 +**Epic**: AZ-257 (E-C3) + +### Document Dependencies + +- `_docs/02_document/contracts/c3_matcher/cross_domain_matcher_protocol.md` — Protocol contract (every invariant satisfied; drop-and-continue is INV-4). +- `_docs/02_document/components/04_c3_matcher/description.md` — § 1 D-C3-1 = (a) production-default; § 5 error handling; § 7 shared helper serial access; § 9 logging. +- `_docs/02_document/module-layout.md` — `c3_matcher` Per-Component Mapping (`disk_lightglue.py` Internal); `BUILD_MATCHER_DISK_LIGHTGLUE` row (ON for airborne / research / replay-cli). +- `_docs/02_document/contracts/shared_helpers/lightglue_runtime.md` — single-pair / multi-pair API. +- `_docs/02_document/contracts/shared_helpers/ransac_filter.md` — RANSAC + median residual API. +- `_docs/02_document/contracts/c2_5_rerank/rerank_strategy_protocol.md` — `RerankResult` consumed at input boundary. +- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — DISK forward via `InferenceRuntime`. 
+- `_docs/02_document/components/04_c3_matcher/tests.md` — C3-IT-01 (best-candidate inlier count p5 ≥ 80); C3-IT-02 (deterministic best_candidate_idx); C3-IT-03 (cross-domain MRE p95 < 2.5 px); C3-IT-04 (tilt ±20° + 350m outliers); C3-IT-05 (`InsufficientInliersError` propagation); C3-PT-01 (latency p95 ≤ 180 ms; per-candidate ≤ 60 ms; GPU mem ≤ 800 MB). + +## Problem + +Without this task: `compose_root` cannot wire when `config.matcher.strategy = "disk_lightglue"` (the default value); F3 / F6 cannot run; AC-1.1 (best-candidate inlier count p5 ≥ 80) has no producer; AC-2.2 (cross-domain MRE p95 < 2.5 px) is unmeasurable; AC-NEW-7 cache-poisoning safety budget loses its primary detection signal (low-inlier frames in MatcherHealth). The DISK+LightGlue choice is locked per Mode B Fact #110 / D-C3-1; without this task the locked decision is unrealised. + +## Outcome + +- `src/gps_denied_onboard/components/c3_matcher/disk_lightglue.py` defining: + - `DiskLightGlueMatcher` class implementing the `CrossDomainMatcher` Protocol (AZ-344). + - Constructor: `__init__(self, runtime: InferenceRuntime, lightglue_runtime: LightGlueRuntime, ransac_filter: RansacFilter, fdr_client: FdrClient, health_window: RollingHealthWindow, config: MatcherConfig)`. The strategy holds the DISK engine ID (returned by `runtime.load_engine`) plus references to the constructor-injected `LightGlueRuntime` + `RansacFilter`. + - `match(frame, rerank_result, calibration)`: + 1. Decode + preprocess the nav-camera frame ONCE (resize / normalise per DISK's input contract). + 2. Run DISK forward on the query frame → `(query_keypoints, query_descriptors)`. + 3. `survivors: list[CandidateMatchSet] = []`, `dropped = 0`. + 4. For each `RerankCandidate` in `rerank_result.candidates`: + a. Decode + preprocess the candidate tile (from `tile_pixels_handle`). + b. Try DISK forward on the tile → `(tile_keypoints, tile_descriptors)`. 
On failure: wrap as `MatcherBackboneError`; emit ERROR log + FDR record `kind="matcher.backbone_error"` with `tile_id` + `phase="disk_forward"`; `dropped += 1`; continue. + c. Try `lightglue_runtime.match_pair(query_keypoints, query_descriptors, tile_keypoints, tile_descriptors)` → `correspondences` (raw matches before RANSAC). On failure: wrap as `MatcherBackboneError`; phase="lightglue_match"; drop; continue. + d. `ransac_result = ransac_filter.filter(correspondences, threshold_px=config.ransac_threshold_px)` → `RansacResult(inlier_correspondences, ransac_outlier_count, per_candidate_residual_px)`. The helper handles RANSAC + median residual computation. + e. If `ransac_result.inlier_correspondences.shape[0] == 0`: emit DEBUG log `kind="c3.matcher.zero_inliers"`; `dropped += 1`; continue. + f. Append `CandidateMatchSet(tile_id=candidate.tile_id, inlier_count=ransac_result.inlier_correspondences.shape[0], inlier_correspondences=ransac_result.inlier_correspondences, ransac_outlier_count=ransac_result.ransac_outlier_count, per_candidate_residual_px=ransac_result.per_candidate_residual_px)` to `survivors`. + 5. Determine `survivor_max_inliers = max(s.inlier_count for s in survivors)` (or 0 if empty). + 6. If `len(survivors) == 0` OR `survivor_max_inliers < config.min_inliers_threshold`: emit ERROR log `kind="c3.matcher.insufficient_inliers"` + FDR record `kind="matcher.insufficient_inliers"`; `health_window.update(now, best_inlier_count=0, had_backbone_error=(dropped > 0))`; raise `InsufficientInliersError`. + 7. Sort `survivors` descending by `inlier_count`; ties broken by `per_candidate_residual_px` ascending. The first survivor is the best. + 8. `best = survivors[0]`. If `best.per_candidate_residual_px > config.residual_warn_threshold_px`: emit WARN log `kind="c3.matcher.residual_above_threshold"` (will trigger AdHoP at C3.5). + 9. `health_window.update(now, best_inlier_count=best.inlier_count, had_backbone_error=(dropped > 0))`. + 10. 
Emit FDR record `kind="matcher.frame_done"` with `{frame_id, candidates_input, candidates_dropped, best_inlier_count, best_residual_px, best_tile_id}`. + 11. Return `MatchResult(frame_id=rerank_result.frame_id, per_candidate=survivors, best_candidate_idx=0, reprojection_residual_px=best.per_candidate_residual_px, matched_at=monotonic_ns(), matcher_label="disk_lightglue", candidates_input=len(rerank_result.candidates), candidates_dropped=dropped)`. + - `health_snapshot()`: returns `self._health_window.snapshot()`. + - Module-level `create(config, lightglue_runtime, ransac_filter, inference_runtime, health_window) -> CrossDomainMatcher`: + 1. `disk_weights_path = config.matcher.disk_weights_path` (TRT engine produced by AZ-321). + 2. Load DISK engine via `inference_runtime.load_engine(disk_weights_path)`. + 3. Construct `DiskLightGlueMatcher(...)`. +- Composition-root wiring path for `config.matcher.strategy == "disk_lightglue"`. +- Logging per description.md § 9: INFO ready; WARN residual-above-threshold; ERROR insufficient-inliers + backbone-error; DEBUG per-frame inlier+residual list (gated). +- FDR records: `matcher.frame_done` (always per frame), `matcher.backbone_error` (per error), `matcher.insufficient_inliers` (per all-failed event). + +## Scope + +### Included +- `DiskLightGlueMatcher` class implementing `CrossDomainMatcher` exactly per the AZ-344 contract. +- DISK forward via C7 `InferenceRuntime` (TRT primary; ONNX-RT fallback chain owned by C7 — this task consumes the unified interface). +- LightGlue matching via shared helper. +- RANSAC + median residual via shared `RansacFilter` helper. +- Drop-and-continue per-candidate error handling (Invariant 4). +- Below-threshold all-failed → `InsufficientInliersError`. +- Deterministic best-candidate selection (Invariant 3). +- `RollingHealthWindow.update` after each frame. +- Composition-root wiring path. +- Logging + FDR record emission per description.md § 9. 
+- Unit tests covering Invariants 1–9, drop-and-continue, below-threshold, deterministic ordering, `tile_pixels_handle` reference semantics, composition-root wiring path. +- `BUILD_MATCHER_DISK_LIGHTGLUE` flag wiring (ON in airborne / research / replay-cli; OFF in operator-tooling). + +### Excluded +- The Protocol + DTOs + errors + factory + `RollingHealthWindow` — owned by AZ-344. +- The `LightGlueRuntime` helper — owned by AZ-278. +- The `RansacFilter` helper — owned by AZ-282. +- The C7 `InferenceRuntime` — owned by AZ-297..AZ-300. +- DISK engine compile (.onnx → .trt) — owned by AZ-321; this task consumes the produced engine. +- ALIKED+LightGlue (AZ-346) and XFeat (AZ-347). +- Component-internal acceptance tests beyond Invariants 1–9 + drop-and-continue smoke: C3-IT-01 (recall floor), C3-IT-03 (cross-domain MRE), C3-IT-04 (tilt outliers), and C3-PT-01 (latency NFR) are deferred to Step 9 / E-BBT. + +## Acceptance Criteria + +**AC-1: Protocol conformance** +`isinstance(DiskLightGlueMatcher(...), CrossDomainMatcher)` returns `True`. + +**AC-2: Best-candidate selection — argmax(inlier_count) + tie-break** +Given a `RerankResult` with N=3 candidates whose computed inlier counts are [120, 80, 120] and median residuals [1.4, 1.0, 1.1] +When `match(...)` is called +Then `best_candidate_idx == 0`, where the index refers to the sorted `per_candidate` list: the winner is the third input candidate, with `inlier_count=120` AND `residual=1.1` (lower than the other 120-inlier candidate's 1.4); `per_candidate[0].inlier_count == 120 AND per_candidate_residual_px == 1.1`; `per_candidate[1].inlier_count == 120 AND per_candidate_residual_px == 1.4`; `per_candidate[2].inlier_count == 80`.
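The AC-2 scenario can be reproduced with a composite sort key. The sketch below is illustrative only: `CandidateStub` is a hypothetical stand-in carrying just the ranking fields of `CandidateMatchSet`, and the tile IDs are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateStub:
    """Hypothetical stand-in with only the fields used for ranking."""
    tile_id: str
    inlier_count: int
    per_candidate_residual_px: float


def rank_survivors(survivors):
    """Most inliers first; among equal inlier counts, lower residual first."""
    return sorted(
        survivors,
        key=lambda s: (-s.inlier_count, s.per_candidate_residual_px),
    )


# Inlier counts [120, 80, 120] and residuals [1.4, 1.0, 1.1], as in AC-2.
candidates = [
    CandidateStub("tile_a", 120, 1.4),
    CandidateStub("tile_b", 80, 1.0),
    CandidateStub("tile_c", 120, 1.1),
]
ranked = rank_survivors(candidates)
best_candidate_idx = 0  # the best survivor is always first after sorting
```

Because `sorted` is stable, candidates that tie on both inlier count and residual keep their input order, so repeated calls on the same input give the same ranking.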
+ +**AC-3: Drop-and-continue on per-candidate `MatcherBackboneError`** +Given an `InferenceRuntime` test double that raises `RuntimeError` on the 2nd candidate's DISK forward and succeeds on others +When `match(...)` is called +Then `len(per_candidate) == 2`; `candidates_dropped == 1`; ONE ERROR log `kind="c3.matcher.backbone_error"` is emitted with `tile_id` + `phase="disk_forward"`; ONE FDR record `kind="matcher.backbone_error"` is emitted; success path continues. + +**AC-4: Drop-and-continue on per-candidate LightGlue failure** +Given a `LightGlueRuntime` test double that raises on the 1st candidate's match call +When `match(...)` is called +Then the candidate is dropped with `phase="lightglue_match"`; ERROR log + FDR record emitted; remaining candidates processed. + +**AC-5: Below-threshold → `InsufficientInliersError`** +Given `config.matcher.min_inliers_threshold = 60` AND every candidate's RANSAC inlier count is < 60 +When `match(...)` is called +Then `InsufficientInliersError` is raised; ONE ERROR log `kind="c3.matcher.insufficient_inliers"` + ONE FDR record `kind="matcher.insufficient_inliers"` are emitted; `health_window.update(now, best_inlier_count=0, had_backbone_error=False)` is invoked. + +**AC-6: All-failed → `InsufficientInliersError`** +Given every candidate's DISK forward raises +When `match(...)` is called +Then `InsufficientInliersError` is raised; per-candidate ERROR logs + final ERROR log emitted; `health_window.update(now, best_inlier_count=0, had_backbone_error=True)` is invoked. + +**AC-7: WARN log on residual above threshold** +Given the best candidate's `per_candidate_residual_px = 4.2` AND `config.matcher.residual_warn_threshold_px = 2.5` +When `match(...)` returns +Then ONE WARN log `kind="c3.matcher.residual_above_threshold"` with `{residual_px: 4.2, threshold_px: 2.5}` is emitted. 
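The drop-and-continue behaviour exercised by AC-3 and AC-4 might look roughly like the loop below. It is a simplified sketch under stated assumptions: `process_candidates`, the callables passed to it, and the `on_error` hook are hypothetical, and `MatcherBackboneError` is a local stand-in for the AZ-344 error type, whose real constructor may differ.

```python
class MatcherBackboneError(RuntimeError):
    """Local stand-in for the AZ-344 error type (assumed shape)."""

    def __init__(self, tile_id, phase, cause):
        super().__init__(f"{phase} failed for {tile_id}: {cause}")
        self.tile_id = tile_id
        self.phase = phase


def process_candidates(tile_ids, extract, match_pair, on_error):
    """Run every candidate; a failing candidate is dropped, never propagated."""
    survivors, dropped = [], 0
    for tile_id in tile_ids:
        phase = "disk_forward"
        try:
            features = extract(tile_id)
            phase = "lightglue_match"
            correspondences = match_pair(features)
        except Exception as exc:
            # Wrap and report (ERROR log + FDR record in the real matcher),
            # then move on to the next candidate.
            on_error(MatcherBackboneError(tile_id, phase, exc))
            dropped += 1
            continue
        survivors.append((tile_id, correspondences))
    return survivors, dropped


def flaky_extract(tile_id):
    """Test double: fails on the second candidate only."""
    if tile_id == "t2":
        raise RuntimeError("engine fault")
    return f"features:{tile_id}"


errors = []
survivors, dropped = process_candidates(
    ["t1", "t2", "t3"], flaky_extract, lambda f: [f], errors.append
)
```

Tracking the current `phase` in a local variable lets a single `except` clause label the failure correctly for both the extraction and the matching stage.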
+ +**AC-8: `health_window.update` invoked after every `match` (success or failure)** +Given any `match(...)` call (success, partial drop, all-failed) +When the call completes (returns normally OR raises `InsufficientInliersError`) +Then `health_window.update(...)` is invoked exactly ONCE for that frame; `best_inlier_count` matches the actual best inlier count (0 on all-failed); `had_backbone_error == True` if any candidate dropped due to backbone failure. + +**AC-9: `inlier_correspondences` shape contract** +Given a successful `match(...)` +When inspecting any `CandidateMatchSet` +Then `inlier_correspondences.shape == (inlier_count, 4)`; `dtype == float32`. + +**AC-10: Deterministic — same inputs → bit-identical MatchResult** +Given fixed inputs and deterministic test doubles +When `match(...)` is called 3 times +Then all three returns have identical `per_candidate` content (same inlier_counts, same residuals, same best_candidate_idx). + +**AC-11: Composition-root wiring** +Given `config.matcher.strategy = "disk_lightglue"` AND a constructed shared `LightGlueRuntime` AND `RansacFilter` AND `InferenceRuntime` +When `compose_root(config)` runs +Then a `DiskLightGlueMatcher` instance is wired; ONE INFO log `kind="c3.matcher.ready"` with `{strategy: "disk_lightglue", min_inliers_threshold, residual_warn_threshold_px}` is emitted; the strategy's `_lightglue_runtime` is identity-equal to the runtime root's shared helper. + +**AC-12: FDR `matcher.frame_done` per frame** +Given a successful `match(...)` returning best candidate with inlier_count=120 and residual=1.1, dropped=1 +When the call completes +Then ONE FDR record `kind="matcher.frame_done"` is emitted with structured fields `{frame_id, candidates_input: 3, candidates_dropped: 1, best_inlier_count: 120, best_residual_px: 1.1, best_tile_id: }`. 
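AC-5, AC-6 and AC-8 together require one health update per frame on both the success and the failure path. A minimal sketch of that gate follows. `finish_frame` and `RecordingHealthWindow` are hypothetical names, `InsufficientInliersError` is a local stand-in, and passing a separate `backbone_error_count` (rather than the total drop count) is an assumption that follows AC-8's wording.

```python
class InsufficientInliersError(RuntimeError):
    """Local stand-in for the AZ-344 error type."""


class RecordingHealthWindow:
    """Hypothetical test double that records each update call."""

    def __init__(self):
        self.updates = []

    def update(self, now, best_inlier_count, had_backbone_error):
        self.updates.append((best_inlier_count, had_backbone_error))


def finish_frame(survivor_inliers, backbone_error_count, min_inliers,
                 health_window, now=0):
    """Apply the threshold gate and update the health window exactly once."""
    best = max(survivor_inliers, default=0)  # 0 when no candidate survived
    had_error = backbone_error_count > 0
    if best < min_inliers:
        # Failure path: record a zero-inlier frame, then raise.
        health_window.update(now, best_inlier_count=0,
                             had_backbone_error=had_error)
        raise InsufficientInliersError(
            f"best inlier count {best} is below {min_inliers}"
        )
    # Success path: record the actual best inlier count.
    health_window.update(now, best_inlier_count=best,
                         had_backbone_error=had_error)
    return best
```

Each branch calls `update` once and the two branches are mutually exclusive, which is what a test for AC-8 would assert with a recording double like the one above.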
+ +## Non-Functional Requirements + +**Performance** (deferred validation to C3-PT-01): +- `match` p95 ≤ 180 ms (3 candidates × ~60 ms DISK forward + LightGlue match + RANSAC). +- Per-candidate p95 ≤ 60 ms. +- GPU memory ≤ 800 MB combined (DISK engine + LightGlue engine resident). + +**Compatibility** +- DISK engine file format owned by C10 + C7; this task consumes via `config.matcher.disk_weights_path`. +- Upstream DISK research code drop pinned per Plan-phase; weight changes require C10 rebuild + C3-IT-03 re-run. + +**Reliability** +- Drop-and-continue per candidate (Invariant 4). +- Single-thread by contract (INV-1). +- `InsufficientInliersError` triggers C5 VIO-only fallback (AC-3.5); does NOT crash. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| AC-1 | Protocol conformance | `isinstance` returns `True` | +| AC-2 | Best-candidate + tie-break | Lower residual wins among tied inliers | +| AC-3 | DISK forward fails on 2nd | 2 survivors; ERROR log + FDR record | +| AC-4 | LightGlue fails on 1st | 2 survivors; phase="lightglue_match" | +| AC-5 | All below threshold | `InsufficientInliersError`; health update | +| AC-6 | All forwards fail | `InsufficientInliersError`; per-candidate logs | +| AC-7 | Residual > warn threshold | WARN log emitted | +| AC-8 | Health update invoked once per `match` | One update per call regardless of outcome | +| AC-9 | Correspondences shape | (I, 4) float32; I == inlier_count | +| AC-10 | Determinism | 3 calls return identical content | +| AC-11 | `compose_root` wiring | Wired; INFO log; helper identity-shared | +| AC-12 | FDR `frame_done` emission | Correct structured fields | + +## Constraints + +- **Drop-and-continue is mandatory** — Invariant 4; per-candidate exceptions never propagate. +- **Median residual, not mean** — Invariant 8; computed inside `RansacFilter`. +- **Constructor injection only** — no `import gps_denied_onboard.config` inside the strategy module. 
+- **`LightGlueRuntime` and `RansacFilter` are constructor-injected** — never instantiated here. +- **DISK engine load at `create` time, NOT at first frame** — engine-output assertion fires at startup. +- **Tile pixel decode is per-call** — but the underlying `tile_pixels_handle` is page-cache-backed (not copied into the strategy). +- **`RollingHealthWindow.update` is called EXACTLY once per `match`** — including the all-failed path. + +## Risks & Mitigation + +**Risk 1: DISK upstream code drop ships an unsupported ONNX op for TRT 10.3** +- *Mitigation*: engine compile is C10's responsibility (AZ-321). If C10 cannot build the engine, this task is blocked upstream — surface via tracker dependency mechanism. + +**Risk 2: `LightGlueRuntime.match_pair` API not yet defined** +- *Mitigation*: AZ-278 defines the helper API; this task consumes whatever AZ-278 ships. If only single-pair is provided, this task wraps single-pair calls in a per-candidate loop (already structured that way). Surface to AZ-278 implementer at decompose-step-4. + +**Risk 3: Tile pixel decode is non-trivial cost on hot path** +- *Mitigation*: tile pixels arrive as page-cache-backed handles from C6; decode (JPEG → ndarray) happens once per candidate. If profiling shows this is a bottleneck, a future optimization pre-decodes adjacent tiles in C6's mmap layer. + +**Risk 4: Deterministic best-candidate tie-break depends on stable sort** +- *Mitigation*: Python's `list.sort()` is stable; the implementation uses `sorted(survivors, key=lambda s: (-s.inlier_count, s.per_candidate_residual_px))` which is deterministic. Test AC-2 asserts the exact ordering on a tie scenario. + +**Risk 5: `RollingHealthWindow` drift between matcher implementations** +- *Mitigation*: ONE `RollingHealthWindow` class owned by AZ-344; constructor-injected into every concrete matcher. AZ-345/AZ-346/AZ-347 use the same instance type via the same constructor injection. 
+ +## Runtime Completeness + +- **Named capability**: `DiskLightGlueMatcher` — production-default `CrossDomainMatcher` for cross-domain feature matching (architecture / E-C3 / `solution.md` / D-C3-1 / AC-1.1 + AC-2.2 + AC-3.1). +- **Production code that must exist**: real `DiskLightGlueMatcher` calling real C7 `InferenceRuntime` with real TRT-compiled DISK engine; real shared `LightGlueRuntime` calls; real shared `RansacFilter` for inlier filtering + median residual; real `RollingHealthWindow.update` after each frame; real composition-root wiring. +- **Allowed external stubs**: `FakeInferenceRuntime`, `FakeLightGlueRuntime`, `FakeRansacFilter`, `FakeFdrClient`, synthetic frame fixtures for unit tests. +- **Unacceptable substitutes**: a Python+NumPy implementation of DISK forward (would not satisfy C3-PT-01 latency); a different RANSAC implementation per matcher (would defeat AZ-282 helper); skipping `RollingHealthWindow.update` on the all-failed path (would lose the health signal C5 needs); calling `LightGlueRuntime` in batch mode without per-candidate inlier breakdown; using the mean residual instead of the median (would violate INV-8). diff --git a/_docs/02_tasks/todo/AZ-346_c3_aliked_lightglue.md b/_docs/02_tasks/todo/AZ-346_c3_aliked_lightglue.md new file mode 100644 index 0000000..e1f57f0 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-346_c3_aliked_lightglue.md @@ -0,0 +1,120 @@ +# C3 ALIKED+LightGlue Secondary Matcher + +**Task**: AZ-346_c3_aliked_lightglue +**Name**: C3 ALIKED+LightGlue Secondary Matcher +**Description**: Implement `AlikedLightGlueMatcher`, the secondary `CrossDomainMatcher`. Same architecture as `DiskLightGlueMatcher` (AZ-345) — DISK is replaced by ALIKED for the per-frame keypoint+descriptor extraction step; LightGlue + RANSAC stages are unchanged. Selectable via `config.matcher.strategy = "aliked_lightglue"`. 
ALIKED is the candidate alternative if D-C3-1 IT-12 verdict shifts away from DISK; until then it ships as the secondary path linked into airborne / research binaries (per ADR-002, both backbones can be linked; only one is selected at runtime). +**Complexity**: 3 points +**Dependencies**: AZ-344 (Protocol + factory + DTOs + errors + RollingHealthWindow), AZ-263_initial_structure, AZ-269_config_loader, AZ-278_lightglue_runtime, AZ-282_ransac_filter, AZ-298_c7_tensorrt_runtime, AZ-299_c7_onnxrt_fallback, AZ-303_c6_storage_interfaces, AZ-281_engine_filename_schema (ALIKED engine self-describing filename), AZ-321_c10_engine_compiler (ALIKED engine compile path), AZ-266_log_module, AZ-272_fdr_record_schema +**Component**: c3_matcher (epic AZ-257 / E-C3) +**Tracker**: AZ-346 +**Epic**: AZ-257 (E-C3) + +### Document Dependencies + +- `_docs/02_document/contracts/c3_matcher/cross_domain_matcher_protocol.md` — Protocol contract (every invariant satisfied; mirrors AZ-345's contract behavior). +- `_docs/02_document/components/04_c3_matcher/description.md` — § 1 ALIKED secondary; § 5 same error handling; § 9 logging. +- `_docs/02_document/module-layout.md` — `c3_matcher` Per-Component Mapping (`aliked_lightglue.py` Internal); `BUILD_MATCHER_ALIKED_LIGHTGLUE` row. +- `_docs/02_document/contracts/shared_helpers/lightglue_runtime.md`. +- `_docs/02_document/contracts/shared_helpers/ransac_filter.md`. +- `_docs/02_document/contracts/c2_5_rerank/rerank_strategy_protocol.md`. +- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md`. + +## Problem + +Without this task: D-C3-1 IT-12 evaluation has no comparison point against DISK; if a future cycle's IT-12 verdict shifts the production-default to ALIKED, the airborne binary cannot be re-configured without a new task; the ADR-002 build-time exclusion machinery is under-tested (only one matcher would exist). ALIKED is also the documented fallback if DISK's licensing or upstream maintenance changes mid-cycle. 
+ +## Outcome + +- `src/gps_denied_onboard/components/c3_matcher/aliked_lightglue.py` defining: + - `AlikedLightGlueMatcher` class implementing the `CrossDomainMatcher` Protocol. + - Constructor identical shape to `DiskLightGlueMatcher` (AZ-345); the only differences are: ALIKED engine loaded instead of DISK, `matcher_label = "aliked_lightglue"`, ALIKED-specific preprocessor (resize / normalise per the upstream ALIKED contract). + - `match` method: identical control flow to AZ-345's `match` — drop-and-continue, RANSAC + median residual, deterministic best-candidate selection, `RollingHealthWindow.update`, FDR `matcher.frame_done`. The ONLY difference is the keypoint+descriptor extraction step calls the ALIKED engine instead of DISK. + - `health_snapshot()` delegates to the constructor-injected `RollingHealthWindow`. + - Module-level `create(config, lightglue_runtime, ransac_filter, inference_runtime, health_window) -> CrossDomainMatcher`: + 1. `aliked_weights_path = config.matcher.aliked_weights_path` (TRT engine produced by AZ-321). + 2. Load ALIKED engine via `inference_runtime.load_engine(...)`. + 3. Construct `AlikedLightGlueMatcher(...)`. +- Composition-root wiring path for `config.matcher.strategy == "aliked_lightglue"`. +- `BUILD_MATCHER_ALIKED_LIGHTGLUE` flag wiring (per ADR-002): ON in airborne + research binaries; OFF in operator-tooling. +- ALIKED-specific preprocessor lives next to the strategy in the same module (NOT in `helpers/` — preprocessing parameters are weights-coupled per the same rule applied in AZ-337 / AZ-345). +- All logging + FDR records identical structure to AZ-345 with `matcher_label = "aliked_lightglue"`. + +## Scope + +### Included +- `AlikedLightGlueMatcher` implementation per the `CrossDomainMatcher` Protocol. +- ALIKED forward via C7 `InferenceRuntime`. +- LightGlue matching via shared helper. +- RANSAC + median residual via `RansacFilter`. +- Same drop-and-continue + below-threshold + best-candidate selection semantics as AZ-345. 
+- Same `RollingHealthWindow.update` invocation pattern. +- Composition-root wiring path. +- ALIKED-specific preprocessor inline. +- Unit tests covering Invariants 1–9 + drop-and-continue + below-threshold + deterministic ordering, parametrised so they share fixtures with AZ-345's tests where possible. +- `BUILD_MATCHER_ALIKED_LIGHTGLUE` flag wiring. + +### Excluded +- The Protocol + DTOs + errors + factory + `RollingHealthWindow` — owned by AZ-344. +- `LightGlueRuntime` (AZ-278) and `RansacFilter` (AZ-282) helpers. +- C7 runtime stack (AZ-297..AZ-300). +- ALIKED engine compile (AZ-321). +- Component-internal acceptance tests beyond Protocol + invariants smoke: deferred to Step 9 / E-BBT. +- DISK matcher (AZ-345) and XFeat matcher (AZ-347). + +## Acceptance Criteria + +**AC-1 through AC-12**: identical contract to AZ-345 AC-1..AC-12 with `matcher_label = "aliked_lightglue"` and ALIKED-specific tile preprocessing. The Protocol invariants are the same; the implementation is the same modulo backbone. Tests parametrise across both backbones so any divergence is caught. + +**AC-special-1: ALIKED engine output schema is asserted at `create` time** +Given a TRT engine whose ALIKED output dimensionality differs from the upstream-published value (e.g., descriptor_dim != expected) +When `AlikedLightGlueMatcher.create(...)` is called +Then `ConfigurationError` is raised with the offending shape; the strategy is NOT instantiated. + +**AC-special-2: Strategy selection — `config.matcher.strategy == "aliked_lightglue"`** +Given the runtime composition with `config.matcher.strategy = "aliked_lightglue"` AND `BUILD_MATCHER_ALIKED_LIGHTGLUE = ON` +When `compose_root(config)` runs +Then an `AlikedLightGlueMatcher` is instantiated; ONE INFO log `kind="c3.matcher.ready"` with `{strategy: "aliked_lightglue", ...}` is emitted; `_lightglue_runtime` identity-equal to the runtime root's shared helper. 
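The create-time check described in AC-special-1 could be as small as the helper below. This is a sketch under assumptions: `assert_descriptor_dim` is a hypothetical helper, `ConfigurationError` is a local stand-in for the project's error type, and the dimensions used in the usage lines are arbitrary illustrative values, not ALIKED's published ones.

```python
class ConfigurationError(ValueError):
    """Local stand-in for the project's configuration error type."""


def assert_descriptor_dim(output_shape, expected_dim):
    """Fail fast if the engine's descriptor output has the wrong last axis."""
    shape = tuple(output_shape)
    if not shape or shape[-1] != expected_dim:
        raise ConfigurationError(
            f"engine descriptor output shape {shape} does not end in "
            f"expected dimension {expected_dim}"
        )


# Arbitrary illustrative values: a (keypoints, descriptor_dim) output.
assert_descriptor_dim((2048, 128), expected_dim=128)  # passes silently
```

Running the check inside `create`, before the matcher object is constructed, means a mismatched engine stops startup instead of surfacing on the first frame.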
+ +## Non-Functional Requirements + +**Performance** (deferred validation to C3-PT-01): +- Same envelope as AZ-345: `match` p95 ≤ 180 ms; per-candidate ≤ 60 ms; GPU mem ≤ 800 MB. + +**Compatibility** +- ALIKED engine file format owned by C10 + C7; consumed via `config.matcher.aliked_weights_path`. + +**Reliability** +- Same as AZ-345: drop-and-continue, single-thread by contract, `InsufficientInliersError` triggers VIO-only fallback. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| AC-1..AC-12 | Identical to AZ-345 AC-1..AC-12 with ALIKED label | Same outcomes; `matcher_label = "aliked_lightglue"` | +| AC-special-1 | ALIKED engine output shape mismatch | `ConfigurationError` at create time | +| AC-special-2 | `compose_root(config="aliked_lightglue")` | Wired; INFO log emitted; helper identity-shared | +| Parametrised drop-and-continue | Run AZ-345's drop-and-continue tests against ALIKED matcher fixture | Same drop-and-continue semantics | + +## Constraints + +- **Same constraints as AZ-345** — drop-and-continue mandatory, median residual, constructor injection, helpers constructor-injected, ALIKED engine load at `create` time, `RollingHealthWindow.update` called exactly once per `match`. +- **ALIKED-specific preprocessing parameters are hard-coded** — weights-coupled (same rule as DISK and UltraVPR); making them config-knobs would let an operator silently break the AC-1.1 inlier floor. +- **Both DISK and ALIKED engines may be linked into the same binary** — ADR-002 allows multiple backbones at link time; only `config.matcher.strategy` selects which is instantiated. NOT mutually exclusive at build time (operator-tooling excludes both via `BUILD_MATCHER_*` flags OFF). 
+ +## Risks & Mitigation + +**Risk 1: ALIKED upstream code drop preprocessing differs from DISK in non-obvious ways** +- *Mitigation*: ALIKED preprocessor lives next to the strategy with hard-coded parameters; tests assert the preprocessor matches the upstream-published values; engine compile (AZ-321) consumes the same parameters. + +**Risk 2: ALIKED's keypoint count distribution differs from DISK** (e.g., ALIKED returns more or fewer keypoints by default) +- *Mitigation*: LightGlue and RANSAC are agnostic to keypoint count distribution; the median residual + inlier count metrics are normalised. C3-IT-01 (deferred) measures this empirically. + +**Risk 3: Switching from DISK to ALIKED at runtime requires a corpus rebuild** +- *Mitigation*: NO. C2's descriptor index (built by C10) is for VPR retrieval, not for cross-domain matching. C3 operates per-frame on raw tile pixels; switching matcher backbones does not require corpus rebuild. Documented in description.md § 8 (independent paths). + +## Runtime Completeness + +- **Named capability**: `AlikedLightGlueMatcher` — secondary `CrossDomainMatcher` (architecture / E-C3 / `solution.md` / AC-1.1 partition). +- **Production code that must exist**: real `AlikedLightGlueMatcher` calling real C7 `InferenceRuntime` with real TRT-compiled ALIKED engine; same shared `LightGlueRuntime` + `RansacFilter` + `RollingHealthWindow` invocation pattern as AZ-345. +- **Allowed external stubs**: same as AZ-345 — `FakeInferenceRuntime`, `FakeLightGlueRuntime`, `FakeRansacFilter`, `FakeFdrClient`. +- **Unacceptable substitutes**: same as AZ-345 — Python+NumPy ALIKED forward; per-strategy RANSAC; skipping `RollingHealthWindow.update` on all-failed path; using mean residual instead of median. 
diff --git a/_docs/02_tasks/todo/AZ-347_c3_xfeat.md b/_docs/02_tasks/todo/AZ-347_c3_xfeat.md new file mode 100644 index 0000000..13ad720 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-347_c3_xfeat.md @@ -0,0 +1,140 @@ +# C3 XFeat Alternate Lightweight Matcher + +**Task**: AZ-347_c3_xfeat +**Name**: C3 XFeat Alternate Lightweight Matcher +**Description**: Implement `XFeatMatcher`, the lightweight alternate `CrossDomainMatcher`. XFeat combines feature extraction AND matching in a single forward pass (no separate LightGlue stage); selectable via `config.matcher.strategy = "xfeat"`. Target use case: low-power / thermal-throttled scenarios where DISK+LightGlue's combined cost (~180 ms p95) exceeds the C4 hybrid's degraded budget. Drop-and-continue + below-threshold + best-candidate selection contracts inherited from the Protocol unchanged. RANSAC + median residual still computed via the shared `RansacFilter`. +**Complexity**: 3 points +**Dependencies**: AZ-344 (Protocol + factory + DTOs + errors + RollingHealthWindow), AZ-263_initial_structure, AZ-269_config_loader, AZ-282_ransac_filter, AZ-298_c7_tensorrt_runtime, AZ-299_c7_onnxrt_fallback, AZ-303_c6_storage_interfaces, AZ-281_engine_filename_schema (XFeat engine self-describing filename), AZ-321_c10_engine_compiler (XFeat engine compile path), AZ-266_log_module, AZ-272_fdr_record_schema +**Component**: c3_matcher (epic AZ-257 / E-C3) +**Tracker**: AZ-347 +**Epic**: AZ-257 (E-C3) + +### Document Dependencies + +- `_docs/02_document/contracts/c3_matcher/cross_domain_matcher_protocol.md` — Protocol contract. +- `_docs/02_document/components/04_c3_matcher/description.md` — § 1 XFeat alternate (lightweight); § 5 error handling; § 9 logging. +- `_docs/02_document/module-layout.md` — `BUILD_MATCHER_XFEAT` row. +- `_docs/02_document/contracts/shared_helpers/ransac_filter.md` — RANSAC filtering API. +- `_docs/02_document/contracts/c2_5_rerank/rerank_strategy_protocol.md`. 
+- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md`. + +## Problem + +Without this task: there is no lightweight matcher option for thermal-throttled scenarios; if the C4 hybrid switches to Jacobian (per ADR-006 / D-CROSS-LATENCY-1) but C3's per-frame budget still allows the heavy DISK+LightGlue path, the system has no mechanism to reduce C3's cost too. XFeat is also the documented mandatory simple-baseline alternative for the IT-12 comparative study (AC-2.1a engine rule applied at the matcher level, with NetVLAD acting at the VPR level). + +## Outcome + +- `src/gps_denied_onboard/components/c3_matcher/xfeat.py` defining: + - `XFeatMatcher` class implementing the `CrossDomainMatcher` Protocol. + - Constructor: `__init__(self, runtime: InferenceRuntime, ransac_filter: RansacFilter, fdr_client: FdrClient, health_window: RollingHealthWindow, config: MatcherConfig)`. Note: NO `lightglue_runtime` argument — XFeat does not use LightGlue. + - `match(frame, rerank_result, calibration)`: + 1. Decode + preprocess the nav-camera frame ONCE. + 2. For each `RerankCandidate` in `rerank_result.candidates`: + a. Decode + preprocess the candidate tile. + b. Run XFeat forward (single pass: outputs combined `correspondences` directly — XFeat fuses extraction + matching). + c. On failure: drop-and-continue (`MatcherBackboneError`, `phase="xfeat_forward"`). + d. RANSAC + median residual via `ransac_filter.filter(correspondences, threshold_px=...)` — same helper as DISK+LightGlue. + e. Append `CandidateMatchSet` if the RANSAC inlier count is > 0 (zero-inlier candidates are dropped, as in AZ-345). + 3. Below-threshold / all-failed → `InsufficientInliersError` (same semantics as AZ-345). + 4. Sort survivors descending by `inlier_count`; ties broken by `per_candidate_residual_px` ascending. + 5. WARN on residual above threshold; INFO on ready; FDR `matcher.frame_done` per frame. + 6. `RollingHealthWindow.update` after each frame (success or failure). + 7. `matcher_label = "xfeat"`.
+ - Module-level `create(config, lightglue_runtime, ransac_filter, inference_runtime, health_window) -> CrossDomainMatcher`: + 1. `lightglue_runtime` is accepted in the signature for factory uniformity but NOT stored / used. + 2. `xfeat_weights_path = config.matcher.xfeat_weights_path` (TRT engine produced by AZ-321). + 3. Load XFeat engine via `inference_runtime.load_engine(...)`. + 4. Construct `XFeatMatcher(...)`. +- Composition-root wiring path for `config.matcher.strategy == "xfeat"`. +- `BUILD_MATCHER_XFEAT` flag wiring (ON in research; ON in airborne if config selects it; OFF in operator-tooling). +- All logging + FDR records identical structure to AZ-345 with `matcher_label = "xfeat"`. + +## Scope + +### Included +- `XFeatMatcher` implementation per the `CrossDomainMatcher` Protocol. +- XFeat forward via C7 `InferenceRuntime`. +- RANSAC + median residual via shared `RansacFilter` (NO LightGlue). +- Same drop-and-continue + below-threshold + best-candidate selection as AZ-345. +- Same `RollingHealthWindow.update` invocation pattern. +- Composition-root wiring path. +- XFeat-specific preprocessor inline. +- Unit tests covering Invariants 1–9 + drop-and-continue + below-threshold + deterministic ordering. Parametrised across XFeat-specific test fixtures (lightweight model output is different shape from DISK). +- `BUILD_MATCHER_XFEAT` flag wiring. + +### Excluded +- The Protocol + DTOs + errors + factory + `RollingHealthWindow` — owned by AZ-344. +- `RansacFilter` (AZ-282). +- `LightGlueRuntime` (AZ-278) — XFeat does NOT consume this helper; the factory's signature includes it for uniformity but XFeat's `create` ignores the parameter. +- C7 runtime stack (AZ-297..AZ-300). +- XFeat engine compile (AZ-321). +- Component-internal acceptance tests beyond Protocol + invariants smoke. +- DISK matcher (AZ-345) and ALIKED matcher (AZ-346). 
+ +## Acceptance Criteria + +**AC-1 through AC-10**: identical contract to AZ-345 AC-1..AC-10 (Protocol conformance, best-candidate selection, drop-and-continue, below-threshold, residual WARN, health update, correspondences shape, determinism). `matcher_label = "xfeat"`. + +**AC-11: Composition-root wiring** +Given `config.matcher.strategy = "xfeat"` AND `BUILD_MATCHER_XFEAT = ON` +When `compose_root(config)` runs +Then an `XFeatMatcher` instance is wired; ONE INFO log `kind="c3.matcher.ready"` with `{strategy: "xfeat", ...}` is emitted. The strategy does NOT hold a reference to `LightGlueRuntime` (verifiable via `not hasattr(strategy, "_lightglue_runtime")` OR `strategy._lightglue_runtime is None`). + +**AC-12: FDR `matcher.frame_done` per frame** +Same shape as AZ-345 AC-12 with `matcher_label = "xfeat"`. + +**AC-special-1: XFeat single-pass forward — no LightGlue call** +Given a `match(...)` call where the `LightGlueRuntime` test double is provided to the factory +When the call completes +Then `lightglue_runtime.match_*` is NEVER invoked (verified by mock assertion `lightglue_runtime.match_pair.assert_not_called()`). + +**AC-special-2: XFeat lower latency than DISK+LightGlue (informational, not gated)** +Given identical hardware and identical inputs +When `match(...)` is microbenchmarked × 100 frames +Then XFeat's per-call p95 is < AZ-345's per-call p95 (informational metric; if XFeat is NOT faster, that's a backbone misconfiguration, not a contract violation. Documented in the test report; does NOT block this AC). + +## Non-Functional Requirements + +**Performance** (deferred to C3-PT-01): +- `match` p95 ≤ 100 ms (informational target; XFeat is the lightweight option). NOT a hard gate; the hard gate is C3-PT-01's overall envelope. +- GPU memory ≤ 300 MB (XFeat single engine; smaller than DISK+LightGlue). + +**Compatibility** +- XFeat engine file format owned by C10 + C7. + +**Reliability** +- Same as AZ-345. 
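The AC-special-1 and AC-11 checks above (LightGlue accepted by the factory but never stored or called) could be exercised with a standard-library mock along these lines. `XFeatMatcherSketch`, this `create` signature, and the `forward` callable are hypothetical simplifications for illustration; only the `unittest.mock` calls are real API.

```python
from unittest.mock import MagicMock


class XFeatMatcherSketch:
    """Hypothetical stand-in that holds no LightGlue reference at all."""

    def __init__(self, forward):
        self._forward = forward

    def match(self, frame):
        return self._forward(frame)


def create(config, lightglue_runtime, forward):
    # Accepted for factory uniformity, then dropped without being stored.
    del lightglue_runtime
    return XFeatMatcherSketch(forward)


lightglue = MagicMock(name="LightGlueRuntime")
matcher = create(config={}, lightglue_runtime=lightglue, forward=lambda f: ["corr"])
result = matcher.match(frame=object())

# The specific assertion named in AC-special-1.
lightglue.match_pair.assert_not_called()
# Broader check: no method on the double was called at all.
assert lightglue.mock_calls == []
# AC-11 style check: the strategy keeps no reference.
assert not hasattr(matcher, "_lightglue_runtime")
```

Checking `mock_calls` in addition to `match_pair` covers the `match_*` wording of the AC without enumerating method names.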
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| AC-1..AC-10 | Identical to AZ-345 AC-1..AC-10 with `matcher_label = "xfeat"` | Same outcomes | +| AC-11 | `compose_root(config="xfeat")` | Wired; INFO log; no LightGlue dependency | +| AC-12 | FDR `frame_done` emission | Correct fields; `matcher_label = "xfeat"` | +| AC-special-1 | LightGlue NOT invoked | `lightglue_runtime.match_pair.assert_not_called()` | +| AC-special-2 | Latency comparison | (Informational; not gated) | + +## Constraints + +- **Same constraints as AZ-345** — drop-and-continue mandatory, median residual, constructor injection, helpers constructor-injected, engine load at `create` time, `RollingHealthWindow.update` called exactly once per `match`. +- **`LightGlueRuntime` is NOT consumed** — the factory's `create` signature accepts it for uniformity (so AZ-344's factory can call all three matchers' `create` with the same args) but XFeatMatcher does NOT store or use it. Test AC-special-1 enforces this. +- **XFeat-specific preprocessing parameters are hard-coded** (weights-coupled, same rule as DISK and ALIKED). + +## Risks & Mitigation + +**Risk 1: XFeat output schema differs from DISK+LightGlue output (correspondences format)** +- *Mitigation*: XFeat outputs `correspondences` ndarray of shape `(M, 4)` with columns `(px_query, py_query, px_tile, py_tile)` — same as the post-LightGlue output of DISK+LightGlue. The shared `RansacFilter` consumes this format identically. If XFeat's upstream output differs, this task adapts inside the strategy. + +**Risk 2: XFeat's RANSAC inlier counts may be systematically lower** (lighter-weight model produces noisier matches) +- *Mitigation*: AC-2.1a engine rule applies (XFeat is the simple baseline at the matcher level); the ≥ 80 inlier count floor (AC-1.1) may not hold for XFeat. 
C3-IT-01 measures this; if XFeat fails AC-1.1 on Derkachi, it remains the "engine rule" comparison baseline, NOT the production default — same engine-rule semantics as NetVLAD at C2. + +**Risk 3: Linking three backbones into one binary erodes GPU memory headroom** +- *Mitigation*: per ADR-002 / D-C7-13, only the SELECTED backbone's engine is loaded at `create` time. Linking does NOT load engines; loading happens lazily in each backbone's `create`. The factory only invokes ONE `create` per binary lifetime. + +## Runtime Completeness + +- **Named capability**: `XFeatMatcher` — alternate lightweight `CrossDomainMatcher` (architecture / E-C3 / `solution.md` / AC-2.1a engine rule at matcher level). +- **Production code that must exist**: real `XFeatMatcher` calling real C7 `InferenceRuntime` with real TRT-compiled XFeat engine; real shared `RansacFilter` for inlier filtering + median residual; real `RollingHealthWindow.update` after each frame; real composition-root wiring. +- **Allowed external stubs**: `FakeInferenceRuntime`, `FakeRansacFilter`, `FakeFdrClient`, `FakeLightGlueRuntime` (passed but unused). +- **Unacceptable substitutes**: a Python+NumPy XFeat forward (would not satisfy the lightweight-target latency); using a different RANSAC implementation; storing/calling `LightGlueRuntime` (would defeat XFeat's single-pass design).
diff --git a/_docs/02_tasks/todo/AZ-348_c3_5_refiner_protocol.md b/_docs/02_tasks/todo/AZ-348_c3_5_refiner_protocol.md new file mode 100644 index 0000000..b68fe28 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-348_c3_5_refiner_protocol.md @@ -0,0 +1,215 @@ +# C3.5 ConditionalRefiner Protocol + Factory + PassthroughRefiner + Composition + +**Task**: AZ-348_c3_5_refiner_protocol +**Name**: C3.5 `ConditionalRefiner` Protocol + Factory + `PassthroughRefiner` + Composition +**Description**: Define the public `ConditionalRefiner` Protocol (PEP 544 `@runtime_checkable`), the C3.5 DTO additions (`refinement_label`, `refinement_added_latency_ms` extending `MatchResult`), the error hierarchy (`RefinerError`, `RefinerBackboneError`, `RefinerConfigError`), the composition-root factory `build_refiner_strategy(config, ransac_filter, inference_runtime) -> ConditionalRefiner` selecting between strategies at startup based on `config.refiner.strategy`, AND the trivial `PassthroughRefiner` concrete strategy (always-passthrough; non-conditional baseline used by smoke tests + IT-12 comparison). Both refiner strategies are linked into the production binary unconditionally (NO `BUILD_REFINER_*` flag — runtime selection only per ADR-001; ADR-002 build-time exclusion does NOT apply because both strategies are tiny and the AdHoP TRT engine is shared C7 infrastructure). The shared `RansacFilter` (AZ-282) is constructor-injected. This task delivers the foundational scaffolding `AdHoPRefiner` (TBD AZ-?) depends on; the AdHoP backbone implementation is NOT in scope here. 
+**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-270_compose_root, AZ-282_ransac_filter (helper handle), AZ-297_c7_runtime_protocol (for `InferenceRuntime` interface), AZ-344 (for `MatchResult` DTO defined in `_types/matcher.py` — extended in-place by this task), AZ-266_log_module +**Component**: c3_5_adhop (epic AZ-258 / E-C3.5) +**Tracker**: AZ-348 +**Epic**: AZ-258 (E-C3.5) + +### Document Dependencies + +- `_docs/02_document/contracts/c3_5_adhop/conditional_refiner_protocol.md` — the public contract this task implements (Protocol surface + DTO extension + error hierarchy + factory signature + 9 invariants + producer/consumer split). +- `_docs/02_document/components/05_c3_5_adhop/description.md` — § 1 architectural pattern (Strategy with two concrete impls); § 2 `ConditionalRefiner` interface + DTO enrichments (`refinement_label`, `refinement_added_latency_ms`); § 5 error handling (passthrough fall-through on `RefinerBackboneError`); § 9 logging. +- `_docs/02_document/module-layout.md` — `c3_5_adhop` Per-Component Mapping (this task ALSO updates the canonical Public API symbol from `AdHoPRefinementStrategy` to `ConditionalRefiner` so the document agrees with `description.md` § 2 and the contract). +- `_docs/02_document/architecture.md` — ADR-001 (Strategy + composition root), ADR-009 (interface-first DI). ADR-002 explicitly does NOT apply here (no `BUILD_REFINER_*` flag). +- `_docs/02_document/contracts/c3_matcher/cross_domain_matcher_protocol.md` — `MatchResult` DTO consumed AND extended by this task. +- `_docs/02_document/contracts/shared_helpers/ransac_filter.md` — helper API consumed (held by reference; not invoked by `PassthroughRefiner`; held for parity + future use by `AdHoPRefiner`). +- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — `InferenceRuntime` interface; held by reference but NOT invoked by `PassthroughRefiner`. 
+ +## Problem + +Without this task, `AdHoPRefiner` and the downstream C4 PoseEstimator would each invent their own ad-hoc interface for the conditional-refinement boundary, breaking ADR-001 (Strategy is the documented architectural pattern) and ADR-009 (consumers must hold typed references to a Protocol, not a concrete class). Additionally, the `MatchResult` DTO must grow two NEW fields (`refinement_label`, `refinement_added_latency_ms`) so the per-frame FDR records carry refinement provenance and the NFT-PERF-01 invocation-rate accounting (test C3.5-IT-03) is well-defined; doing this DTO extension in the producer task (AZ-344, C3 Protocol) would couple C3 to a C3.5 concept and violate the layering invariant (C3 is upstream of C3.5). + +The `PassthroughRefiner` is bundled into this task because (a) it is a 1-pt no-op trivially defined alongside the Protocol, (b) the Protocol's tests need a real `ConditionalRefiner` instance to verify `runtime_checkable` conformance, and (c) it serves as the documented IT-12 "no-refinement" comparison baseline at the refinement level (engine rule per AC-2.1a applied at the refinement layer). + +## Outcome + +- `src/gps_denied_onboard/components/c3_5_adhop/interface.py` defining the `ConditionalRefiner` Protocol with `refine_if_needed` and `was_invoked`. Docstring encodes all 9 invariants from the contract verbatim. +- `src/gps_denied_onboard/components/c3_5_adhop/__init__.py` re-exporting the Protocol + the new MatchResult fields' default-construction helper (no separate DTO; the fields live on `MatchResult`). +- `src/gps_denied_onboard/_types/matcher.py` (existing file from AZ-344) **extended in-place** to add the two NEW fields: + - `refinement_label: str = "passthrough"` + - `refinement_added_latency_ms: float = 0.0` + Both with default values so AZ-344's `MatchResult(...)` construction without C3.5 still produces a valid downstream-readable instance. 
The task ALSO updates AZ-344's frozen-dataclass tests (slot count, repr) to reflect the two new fields without changing AZ-344's per-field assertions. +- `src/gps_denied_onboard/components/c3_5_adhop/errors.py` defining `RefinerError`, `RefinerBackboneError`, `RefinerConfigError`. +- `src/gps_denied_onboard/components/c3_5_adhop/passthrough_refiner.py` defining: + - `PassthroughRefiner` class implementing the `ConditionalRefiner` Protocol. + - Constructor: `__init__(self, ransac_filter, inference_runtime)`. Both helpers held by reference; neither invoked. + - `refine_if_needed(frame, mr, residual_threshold_px)`: + - Validate `residual_threshold_px > 0` (raise `ValueError` per Invariant 9; defensive). + - Set `_was_invoked = False`. + - Return `mr` unchanged (input == output, byte-identical correspondences ndarray references). `refinement_label` already defaults to `"passthrough"` and `refinement_added_latency_ms` defaults to `0.0` — no `dataclasses.replace` needed. + - `was_invoked() -> bool`: return `self._was_invoked`. + - Module-level `create(config, ransac_filter, inference_runtime) -> ConditionalRefiner` factory function. +- `src/gps_denied_onboard/runtime_root/refiner_factory.py` exporting `build_refiner_strategy(config, ransac_filter, inference_runtime) -> ConditionalRefiner`. The factory: + 1. Reads `config.refiner.strategy` (one of: `"adhop"`, `"passthrough"`). + 2. Validates `config.refiner.residual_threshold_px > 0`; rejects with `RefinerConfigError` otherwise. + 3. Imports the concrete module via the strategy resolution table (NOT lazy — both modules are linked unconditionally per ADR-001). + 4. Constructs via the module-level `create(config, ransac_filter, inference_runtime)` factory function. + 5. Returns the instance. +- Composition-root `compose_root` extension: invoke `build_refiner_strategy` AFTER `RansacFilter` and the C7 `InferenceRuntime` are constructed; bind the result to the same C-frame ingest thread.
+- Config schema extension to AZ-269: + - `config.refiner.strategy` (enum: `"adhop"` | `"passthrough"`; default `"adhop"`). + - `config.refiner.residual_threshold_px` (float, default `2.5`). + - `config.refiner.invocation_rate_warn_threshold` (float, default `0.25`). +- INFO log on every successful `build_refiner_strategy`: `kind="c3_5.refiner.strategy_loaded"` with `{strategy, residual_threshold_px}`. +- ERROR log on `RefinerConfigError`. +- Documentation update: update `module-layout.md` `c3_5_adhop` Public API symbol from `AdHoPRefinementStrategy` to `ConditionalRefiner` (the canonical name per `description.md` and the contract). + +## Scope + +### Included +- The `ConditionalRefiner` Protocol with both `refine_if_needed` and `was_invoked` methods. +- DTO extension to `MatchResult` (two NEW default-valued fields). +- The three-class error hierarchy. +- The trivial `PassthroughRefiner` concrete strategy. +- The composition-root factory. +- Config schema extension for `config.refiner.{strategy, residual_threshold_px, invocation_rate_warn_threshold}`. +- Composition-root wiring path that identity-shares `RansacFilter` with C3 and C4. +- Strategy resolution table comment matching the contract verbatim. +- Documentation correction in `module-layout.md` (rename `AdHoPRefinementStrategy` → `ConditionalRefiner`). +- Unit tests covering: Protocol conformance (`runtime_checkable`); DTO field defaults + slots; `PassthroughRefiner` byte-identical-correspondences passthrough (Invariant 5); `PassthroughRefiner` `was_invoked()` always False (Invariant 8); factory rejection on unknown strategy (`RefinerConfigError`); factory rejection on `residual_threshold_px <= 0` (`RefinerConfigError`); INFO + ERROR log emission. + +### Excluded +- The `AdHoPRefiner` concrete strategy — owned by the AdHoP task (TBD AZ-?). +- The AdHoP TRT engine compile path — owned by AZ-321. +- The `RansacFilter` helper — already AZ-282. +- The C7 `InferenceRuntime` — owned by AZ-297. 
+- Component-internal acceptance tests beyond Protocol-conformance + factory-validation: C3.5-IT-01..03 + C3.5-PT-01 deferred to E-BBT (AZ-262). + +## Acceptance Criteria + +**AC-1: Protocol conformance — `runtime_checkable`** +Given a `FakeRefiner` test double implementing both `refine_if_needed` and `was_invoked` +When `isinstance(fake, ConditionalRefiner)` is evaluated +Then result is `True`; an object missing either method returns `False`. + +**AC-2: `MatchResult` DTO field additions are backward-compatible** +Given AZ-344's `MatchResult(...)` call sites that do NOT pass the new fields +When the construction is evaluated post-extension +Then construction succeeds; `refinement_label == "passthrough"`; `refinement_added_latency_ms == 0.0`. AZ-344's existing tests still pass. + +**AC-3: `MatchResult` immutability + slots preserved** +After the extension, `MatchResult` is still `frozen=True, slots=True`; mutation raises `FrozenInstanceError`; `__slots__` includes the two new field names. + +**AC-4: Factory rejects unknown strategy** +Given `config.refiner.strategy = "garbage"` +When `build_refiner_strategy(...)` is called +Then `RefinerConfigError("Unknown refiner strategy: garbage")` is raised; ONE ERROR log `kind="c3_5.refiner.strategy_unknown"` is emitted. + +**AC-5: Factory rejects invalid threshold** +Given `config.refiner.residual_threshold_px = 0` (or negative) +When `build_refiner_strategy(...)` is called +Then `RefinerConfigError` is raised; ONE ERROR log `kind="c3_5.refiner.invalid_threshold"` is emitted. + +**AC-6: Successful factory load emits INFO log** +Given `config.refiner.strategy = "passthrough"` AND `residual_threshold_px = 2.5` +When `build_refiner_strategy(...)` is called +Then a `ConditionalRefiner` instance is returned; ONE INFO log `kind="c3_5.refiner.strategy_loaded"` with `{strategy: "passthrough", residual_threshold_px: 2.5}` is emitted. 
+ +**AC-7: Strategy resolution table** +Given each of `"adhop"` AND `"passthrough"` for `config.refiner.strategy` +When `build_refiner_strategy(...)` is called for each +Then resolved module path matches the contract's table verbatim. (For `"adhop"`, the factory imports the AdHoP module successfully iff that module exists; in this task the AdHoP module is a placeholder + raises `NotImplementedError` so AC-7 for `"adhop"` reaches the import + class lookup but NOT the `__init__`. The full-success path for `"adhop"` belongs to the AdHoP task.) + +**AC-8: Public API surface — `__init__.py` re-exports** +Given `from gps_denied_onboard.components.c3_5_adhop import ConditionalRefiner` +When the import is evaluated +Then `ConditionalRefiner` resolves; `PassthroughRefiner`, `RefinerError`, etc. are NOT in `__all__` (kept internal). + +**AC-9: Strategy bound to single ingest thread by composition root** +Given a `compose_root(config)` invocation +When the resulting strategy is bound +Then a second binding attempt from a different thread raises `RuntimeError`. + +**AC-10: `PassthroughRefiner` byte-identical correspondences (Invariant 5)** +Given a `MatchResult` with non-empty `inlier_correspondences` ndarrays +When `PassthroughRefiner.refine_if_needed(frame, mr, threshold)` is invoked +Then for every candidate `i`, `np.array_equal(out.per_candidate[i].inlier_correspondences, mr.per_candidate[i].inlier_correspondences) is True` AND dtypes match exactly. The output's ndarray IS the input's ndarray (same object reference). `out.refinement_label == "passthrough"`. `out.refinement_added_latency_ms == 0.0`. + +**AC-11: `PassthroughRefiner.was_invoked()` always False (Invariant 8)** +Given a fresh `PassthroughRefiner` +When `refine_if_needed` is called any number of times +Then every subsequent `was_invoked()` call returns False. 
+ +**AC-12: Threshold validation in `refine_if_needed` (Invariant 9, defensive)** +Given `PassthroughRefiner.refine_if_needed(frame, mr, residual_threshold_px=0)` +When called +Then `ValueError` is raised. (The composition root MUST also have caught this earlier; this in-method check is defensive.) + +**AC-13: Error hierarchy catchability** +Test instances of `RefinerBackboneError` + `RefinerConfigError` are caught by `except RefinerError`. + +**AC-14: `module-layout.md` symbol rename applied** +After this task completes, `_docs/02_document/module-layout.md` § c3_5_adhop Public API rows reference `ConditionalRefiner` (NOT `AdHoPRefinementStrategy`). + +## Non-Functional Requirements + +**Performance** +- `build_refiner_strategy` p99 ≤ 20 ms (factory-only; no engine load happens here for `"passthrough"`; AdHoP engine load owned by the AdHoP task). +- `PassthroughRefiner.refine_if_needed` p99 ≤ 0.5 ms (per Invariant 7 + C3.5-PT-01 passthrough target). + +**Compatibility** +- Protocol method-signature changes are MAJOR version bumps (lockstep update of all consumers + concrete strategies). +- `MatchResult` field additions are MINOR; field removals are MAJOR. + +**Reliability** +- Single-thread invariant enforced at composition-root binding time (AC-9). +- `PassthroughRefiner` is stateless except for the `_was_invoked` flag; concurrent calls are unsafe, and the single-thread invariant rules them out.
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| AC-1 | Protocol conformance | Fake passes; partial fake fails | +| AC-2 | `MatchResult` backward-compatible defaults | Existing call sites still produce valid instances | +| AC-3 | `MatchResult` slots + immutability | `FrozenInstanceError`; new fields in `__slots__` | +| AC-4 | Unknown-strategy rejection | `RefinerConfigError`; ERROR log | +| AC-5 | Invalid-threshold rejection | `RefinerConfigError`; ERROR log | +| AC-6 | Successful factory load | INFO log with structured fields | +| AC-7 | Strategy resolution paths | Both `"adhop"` and `"passthrough"` resolve to expected module paths | +| AC-8 | Public API re-exports | `ConditionalRefiner` in `__all__`; internals not | +| AC-9 | Single-thread binding | Second binding raises `RuntimeError` | +| AC-10 | Passthrough byte-identical | `np.array_equal` + same `dtype`; `refinement_label == "passthrough"` | +| AC-11 | `was_invoked()` always False on Passthrough | Every call returns False | +| AC-12 | Threshold validation | `ValueError` on `<= 0` | +| AC-13 | Error catchability | All caught by `except RefinerError` | +| AC-14 | `module-layout.md` symbol rename | `ConditionalRefiner` referenced; `AdHoPRefinementStrategy` not | + +## Constraints + +- **`@runtime_checkable` MUST be used** on the Protocol. +- **`MatchResult` MUST remain `frozen=True, slots=True`** after the extension. +- **`PassthroughRefiner.refine_if_needed` MUST return the input `MatchResult` unchanged** (same object reference) when the field defaults already hold. No `dataclasses.replace` is needed; if the input already has `refinement_label != "passthrough"` (impossible from a C3 producer but possible in tests), the strategy MAY rewrite via `replace` — but the task ships the simpler same-reference path because every C3 producer outputs the default values. 
+- **Both refiner modules linked unconditionally** — no `BUILD_REFINER_*` flag (NOT ADR-002 territory). The composition root validates the config-load-time strategy enum. +- **The factory does NOT instantiate `RansacFilter` or `InferenceRuntime`** — runtime root constructs ONCE. +- **`config.refiner.strategy` is an enum** validated at config load. +- **`module-layout.md` symbol rename is part of this task** — fixes the documented Public API symbol to match the contract. + +## Risks & Mitigation + +**Risk 1: Extending `MatchResult` in-place creates churn for AZ-344 tests** +- *Mitigation*: the two new fields are default-valued, so every existing AZ-344 construction-call site stays valid. AZ-344's frozen-dataclass tests assert specific fields; this task updates them to reflect the additions without changing per-field semantics. + +**Risk 2: `PassthroughRefiner` returning the input by reference is a sharp tool** +- *Mitigation*: documented in the contract (Invariant 5 explicitly states "byte-identical"). Downstream consumers of `MatchResult` MUST treat it as immutable (which `frozen=True` enforces). If a future consumer mutates ndarrays in-place (NumPy doesn't honour dataclass frozen for mutable members), it will corrupt the C3 producer's view — but no current consumer does this, and the contract codifies the expectation. + +**Risk 3: `module-layout.md` symbol rename creates churn for unrelated documentation readers** +- *Mitigation*: the symbol rename is a pure documentation correction; no production code references `AdHoPRefinementStrategy` because AZ-258 has not been implemented yet. This task is the right place to fix the inconsistency before any consumer encodes the wrong name. + +**Risk 4: `compose_root` thread-binding registry not yet implemented in AZ-270** +- *Mitigation*: same as AZ-342 / AZ-344 Risk 4. Keep AC-9; if AZ-270 lacks the registry, escalate via tracker dependency mechanism. 
+ +## Runtime Completeness + +- **Named capability**: `ConditionalRefiner` Protocol + `PassthroughRefiner` reference impl + composition-root factory + `MatchResult` DTO extension. +- **Production code that must exist**: real Protocol + real DTO extension + real error hierarchy + real `PassthroughRefiner` (byte-identical passthrough) + real `build_refiner_strategy` factory + real config schema extension + real composition-root wiring path that identity-shares `RansacFilter` with C3 and C4. +- **Allowed external stubs**: `FakeRefiner`, `FakeRansacFilter`, `FakeInferenceRuntime` for tests. Production wiring uses real concretes. +- **Unacceptable substitutes**: making `PassthroughRefiner` return a deep copy of the input (defeats Invariant 5's byte-identical guarantee + adds latency to the steady-state path); deferring the `module-layout.md` rename to "later" (would leave the documented symbol diverged from the contract for the AdHoP task to consume). + +## Contract + +This task produces/implements the contract at `_docs/02_document/contracts/c3_5_adhop/conditional_refiner_protocol.md`. +Consumers MUST read that file — not this task spec — to discover the interface. diff --git a/_docs/02_tasks/todo/AZ-349_c3_5_adhop_refiner.md b/_docs/02_tasks/todo/AZ-349_c3_5_adhop_refiner.md new file mode 100644 index 0000000..3d83693 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-349_c3_5_adhop_refiner.md @@ -0,0 +1,209 @@ +# C3.5 AdHoPRefiner — TRT engine + perspective preconditioning + conditional gate + +**Task**: AZ-349_c3_5_adhop_refiner +**Name**: C3.5 `AdHoPRefiner` — production-default conditional refiner +**Description**: Implement `AdHoPRefiner`, the production-default `ConditionalRefiner` strategy. 
The strategy applies the conditional gate (`mr.reprojection_residual_px <= residual_threshold_px` → passthrough; otherwise → invoke), runs the OrthoLoC AdHoP TRT engine forward via the C7 `InferenceRuntime` to perform perspective preconditioning, recomputes inlier correspondences via the shared `RansacFilter` (AZ-282) on the preconditioned features, and writes the refined correspondences + new median residual back into a fresh `MatchResult` with `refinement_label = "adhop"`. On `RefinerBackboneError` (TRT exception, OOM, NaN, shape mismatch), the strategy CATCHES the exception inside `refine_if_needed`, logs ERROR + emits FDR record, and returns the input `MatchResult` unchanged with `refinement_label = "passthrough"` AND `was_invoked()` = True (Invariant 4 — passthrough fall-through). The strategy ALSO maintains a 60 s rolling invocation-rate counter for the WARN log when the rate exceeds `config.refiner.invocation_rate_warn_threshold` (per description.md § 9). Composition-root wired via the `build_refiner_strategy` factory (TBD AZ-? from the C3.5 Protocol task) when `config.refiner.strategy = "adhop"`. +**Complexity**: 5 points +**Dependencies**: AZ-348 (Protocol + factory + DTOs + errors + `PassthroughRefiner`), AZ-263_initial_structure, AZ-269_config_loader, AZ-282_ransac_filter (shared RANSAC helper), AZ-298_c7_tensorrt_runtime (AdHoP forward via TRT), AZ-299_c7_onnxrt_fallback (AdHoP forward via ONNX-RT fallback), AZ-281_engine_filename_schema (AdHoP engine self-describing filename), AZ-321_c10_engine_compiler (AdHoP engine compile path), AZ-266_log_module, AZ-272_fdr_record_schema +**Component**: c3_5_adhop (epic AZ-258 / E-C3.5) +**Tracker**: AZ-349 +**Epic**: AZ-258 (E-C3.5) + +### Document Dependencies + +- `_docs/02_document/contracts/c3_5_adhop/conditional_refiner_protocol.md` — the public contract this task implements; producer/consumer split assigns this task as the AdHoPRefiner consumer. 
+- `_docs/02_document/components/05_c3_5_adhop/description.md` — § 1 architectural pattern; § 2 `ConditionalRefiner` interface + DTO enrichments (`refinement_label`, `refinement_added_latency_ms`); § 5 error handling (passthrough fall-through on `RefinerBackboneError`); § 7 caveats (threshold tuning); § 9 logging (per-frame DEBUG, rolling-rate WARN at 0.25, ERROR on backbone failure). +- `_docs/02_document/components/05_c3_5_adhop/tests.md` — C3.5-IT-01 (residual reduction ≥ 90% of invocations); C3.5-IT-02 (passthrough fall-through bit-identical); C3.5-IT-03 (invocation rate < 0.30 on Derkachi normal); C3.5-PT-01 (latency budget). +- `_docs/02_document/contracts/shared_helpers/ransac_filter.md` — RANSAC filtering API. +- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — `InferenceRuntime` API. +- `_docs/02_document/architecture.md` — ADR-001, R10 (latency under thermal throttle). + +## Problem + +Without this task, C3.5 has no real refinement path; the system would always run the `PassthroughRefiner`, defeating the AC-2.2 hard-frame portion (cross-domain MRE < 2.5 px after refinement). The conditional gate also keeps AC-4.1's E2E latency budget intact on the steady-state path (passthrough costs ~0.5 ms while the AdHoP invocation costs ~30–90 ms; running it every frame would blow the F3 budget). Without proper passthrough fall-through on `RefinerBackboneError`, a single TRT failure would cascade into a frame skip → C5 visual_propagated → false health degradation → false F6 satellite re-localisation trigger. + +## Outcome + +- `src/gps_denied_onboard/components/c3_5_adhop/adhop_refiner.py` defining: + - `AdHoPRefiner` class implementing the `ConditionalRefiner` Protocol. + - Constructor: `__init__(self, runtime: InferenceRuntime, ransac_filter: RansacFilter, fdr_client: FdrClient, config: RefinerConfig, adhop_engine_handle)`. The `adhop_engine_handle` is the loaded TRT engine returned by `runtime.load_engine(...)` at `create` time. 
+ - `_was_invoked: bool` (private flag; reset to False at every call). + - `_invocation_window`: a rolling-60s counter of (timestamp_ns, was_invoked) tuples for the rate WARN log. Bounded to ~180 entries at 3 Hz; pruned by lazy expiry. + - `refine_if_needed(frame, mr, residual_threshold_px) -> MatchResult`: + 1. `start_ns = time.monotonic_ns()`. + 2. Defensive: `if residual_threshold_px <= 0: raise ValueError(...)`. + 3. **Gate**: `if mr.reprojection_residual_px <= residual_threshold_px:` + - `self._was_invoked = False`. + - Append `(now, False)` to `_invocation_window`; prune. + - Return `mr` unchanged (object reference; refinement_label stays at default `"passthrough"`). + 4. **Invoked path**: `self._was_invoked = True`. Append `(now, True)` to `_invocation_window`; prune. + 5. Compute current invocation rate over the 60s window. If rate > `config.refiner.invocation_rate_warn_threshold` (default 0.25), emit ONE WARN log per minute (rate-limited): `kind="c3_5.refiner.invocation_rate_high"` with `{rate, target_threshold, frame_id}`. + 6. **Try**: + a. Decode + preprocess the nav-camera frame ONCE. + b. For the BEST candidate only (`mr.per_candidate[mr.best_candidate_idx]`): decode + preprocess the candidate tile pixels via the existing `tile_pixels_handle`. + c. Run AdHoP TRT engine forward (single forward; output: perspective-preconditioned correspondences). Implementation detail: AdHoP takes the original correspondences AS INPUT and produces refined correspondences with the perspective preconditioning applied; this is method-agnostic per OrthoLoC. + d. RANSAC + median residual via `self._ransac_filter.filter(refined_correspondences, threshold_px=...)` — same helper as C3. + e. Build a NEW `CandidateMatchSet` for the best candidate with the refined `inlier_correspondences`, refined `inlier_count`, refined `per_candidate_residual_px`. Other candidates in `per_candidate` left unchanged. + f. 
Build a NEW `MatchResult` via `dataclasses.replace(mr, per_candidate=[...new best, *unchanged others], reprojection_residual_px=new_best_residual, refinement_label="adhop", refinement_added_latency_ms=elapsed_ms)`. + g. Log `kind="c3_5.refiner.frame_done"` at DEBUG level per description.md § 9 (promoted to INFO only on the SUCCESS-after-many-failures recovery transition; details in implementation). + h. FDR `refiner.frame_done` record with `{frame_id, was_invoked: true, refinement_label: "adhop", refinement_added_latency_ms, pre_residual_px, post_residual_px, inlier_count_before, inlier_count_after}`. + i. Return the new `MatchResult`. + 7. **Except `RefinerBackboneError`** (or any TRT-runtime failure that the `InferenceRuntime` raises and a guard layer maps to `RefinerBackboneError` per ADR-001 error contract): + - ERROR log `kind="c3_5.refiner.backbone_error"` with `{frame_id, exc_type, phase}`. + - FDR `refiner.frame_done` record with `{frame_id, was_invoked: true, refinement_label: "passthrough", refinement_added_latency_ms: elapsed_ms_so_far, error: true}`. + - Return the input `mr` unchanged (refinement_label stays at default `"passthrough"`). + - **Critical**: the exception is NEVER re-raised out of `refine_if_needed` (Invariant 4). Other exception types (e.g., `MemoryError`) ARE re-raised because the runtime contract is that only the documented `RefinerBackboneError` class is convertible to passthrough. + - `was_invoked() -> bool`: return `self._was_invoked`. + - Module-level `create(config, ransac_filter, inference_runtime) -> ConditionalRefiner`: + 1. `adhop_weights_path = config.refiner.adhop_weights_path` (TRT engine produced by AZ-321). + 2. Load AdHoP engine via `inference_runtime.load_engine(adhop_weights_path)` — happens ONCE at startup. + 3. Construct `AdHoPRefiner(runtime=inference_runtime, ransac_filter=ransac_filter, fdr_client=..., config=config.refiner, adhop_engine_handle=adhop_engine_handle)`.
+- Composition-root wiring path: when `config.refiner.strategy == "adhop"` AND the AdHoP engine compile artifact is present, the AZ-? factory invokes `adhop_refiner.create(...)`. +- All FDR + log records have `refinement_label` set per the post-refinement outcome. + +## Scope + +### Included +- `AdHoPRefiner` implementation per the `ConditionalRefiner` Protocol. +- Conditional gate (`<=` semantics, inclusive, deterministic). +- AdHoP TRT engine forward via C7 `InferenceRuntime`. +- RANSAC + median residual recomputation via the shared `RansacFilter`. +- Passthrough fall-through on `RefinerBackboneError` (Invariant 4). +- Bit-identical correspondence preservation when the gate decides passthrough (Invariant 5). +- 60 s rolling invocation-rate counter + WARN log emission (rate-limited to ONE warning per minute). +- Per-frame FDR record emission with full provenance fields. +- Composition-root wiring path. +- Unit tests covering Invariants 1–9 + gate semantics + passthrough fall-through + invocation-rate accounting. + +### Excluded +- The Protocol + DTO extension + errors + factory + `PassthroughRefiner` — owned by AZ-? (Protocol task). +- The `RansacFilter` helper — already AZ-282. +- The C7 `InferenceRuntime` — owned by AZ-297..AZ-300. +- The AdHoP TRT engine compile path — owned by AZ-321. +- C3.5-IT-01..03 + C3.5-PT-01 component-internal acceptance tests — deferred to E-BBT (AZ-262). Unit tests in this task cover the per-method invariant smoke. +- The `OrthoLoC` upstream code drop — vendored separately (Plan-phase pin); this task consumes the runtime-loadable AdHoP engine only. + +## Acceptance Criteria + +**AC-1: Protocol conformance** +`AdHoPRefiner` instance passes `isinstance(refiner, ConditionalRefiner)`. + +**AC-2: Gate inclusive semantics (Invariant 3)** +Given `mr.reprojection_residual_px == residual_threshold_px` (equality) +When `refine_if_needed` is called +Then the strategy returns `mr` unchanged AND `was_invoked()` is False. 
+And: `mr.reprojection_residual_px = residual_threshold_px + 1e-6` triggers the invoked path AND `was_invoked()` is True. + +**AC-3: Successful AdHoP refinement produces enriched `MatchResult`** +Given a `MatchResult` with `reprojection_residual_px = 5.0`, `residual_threshold_px = 2.5`, and a stub AdHoP engine that produces refined correspondences with median residual `1.2` +When `refine_if_needed` is called +Then the output `MatchResult` has: +- `refinement_label == "adhop"` +- `reprojection_residual_px ≈ 1.2` +- `refinement_added_latency_ms > 0` +- The best candidate's `inlier_correspondences` reflects the refined coordinates (NOT byte-identical to input). +- `was_invoked()` returns True. + +**AC-4: Passthrough fall-through on `RefinerBackboneError` (Invariant 4)** +Given a stub AdHoP engine that raises `RefinerBackboneError` on forward +When `refine_if_needed` is called with a residual above threshold +Then: +- The output `MatchResult` IS the input `mr` (object reference). +- `refinement_label == "passthrough"` (default value preserved). +- `was_invoked()` returns True (the attempt counted). +- ERROR log `kind="c3_5.refiner.backbone_error"` emitted ONCE. +- FDR `refiner.frame_done` record emitted with `error: true`. +- The exception is NEVER re-raised (test asserts no exception escapes). + +**AC-5: Other exception types DO re-raise** +Given a stub AdHoP engine that raises `MemoryError` (NOT `RefinerBackboneError`) +When `refine_if_needed` is called +Then `MemoryError` propagates out (not converted to passthrough). Documents the closed-set semantics of Invariant 4. + +**AC-6: Bit-identical correspondences on gate-decided passthrough (Invariant 5)** +Given `mr.reprojection_residual_px = 1.0` AND `residual_threshold_px = 2.5` (gate → passthrough) +When `refine_if_needed` is called +Then for every candidate `i`, `out.per_candidate[i].inlier_correspondences IS mr.per_candidate[i].inlier_correspondences` (same object reference). 
Output's `refinement_label` stays at default `"passthrough"`. + +**AC-7: `_invocation_window` accuracy** +Given a sequence of 30 frames at 3 Hz with 10 invoked + 20 gate-passthroughs +When `refine_if_needed` has been called for each +Then the strategy's internal rate calculation reports `10/30 ≈ 0.333` (the 30 frames span 10 s, so all of them fall inside the 60 s rolling window). + +**AC-8: Invocation-rate WARN is rate-limited** +Given the invocation rate exceeds `invocation_rate_warn_threshold = 0.25` for 60 consecutive seconds +When the strategy is exercised over that period +Then ONE (and only ONE) WARN log per 60 s window is emitted (rate-limited). + +**AC-9: `was_invoked()` semantics match Invariant 8** +- Gate-decided passthrough → False. +- AdHoP-success → True. +- AdHoP-fall-through (backbone error) → True. + +**AC-10: Composition-root wiring** +Given `config.refiner.strategy = "adhop"` AND the AdHoP engine artifact path is valid +When `compose_root(config)` runs +Then an `AdHoPRefiner` instance is wired; ONE INFO log `kind="c3_5.refiner.ready"` with `{strategy: "adhop", residual_threshold_px}` is emitted; the strategy holds a reference to the SAME `RansacFilter` instance as C3 + C4 (identity-shared). + +**AC-11: FDR `refiner.frame_done` shape** +Every `refine_if_needed` call (regardless of gate decision) emits exactly ONE FDR record with the documented field set. + +## Non-Functional Requirements + +**Performance** +- `refine_if_needed` p95 (gate-passthrough) ≤ 1 ms (per C3.5-PT-01 with margin). +- `refine_if_needed` p95 (AdHoP-invoked) ≤ 90 ms target / 150 ms hard limit (per C3.5-PT-01). +- `_invocation_window` update p99 ≤ 5 µs (deque-based or ring-buffer; pruning is amortised O(1)). + +**Compatibility** +- AdHoP engine file format owned by C10 + C7. + +**Reliability** +- Single-thread by contract (Invariant 1); no internal locking. +- TRT errors NEVER cascade out of `refine_if_needed` (Invariant 4). 
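The rolling invocation-rate accounting described in AC-7, AC-8 and the performance note can be sketched as follows. This is a minimal illustration, not the task's implementation: the class name `InvocationWindow`, its method names, and the logger name are assumptions made for the example; only the 60 s window, the 0.25 default threshold, the lazy prune, and the one-WARN-per-60-s rule come from the spec.

```python
import logging
from collections import deque

WINDOW_NS = 60 * 1_000_000_000  # 60 s rolling window, in nanoseconds


class InvocationWindow:
    """Hypothetical helper: rolling (timestamp_ns, was_invoked) record."""

    def __init__(self, warn_threshold=0.25, logger=None):
        self._entries = deque()
        self._invoked_count = 0  # running count keeps rate() O(1)
        self._warn_threshold = warn_threshold
        self._last_warn_ns = None
        self._log = logger or logging.getLogger("c3_5.refiner")

    def record(self, now_ns, was_invoked):
        self._entries.append((now_ns, was_invoked))
        self._invoked_count += int(was_invoked)
        # Lazy expiry: drop entries older than the window (amortised O(1)).
        while self._entries and now_ns - self._entries[0][0] > WINDOW_NS:
            _, expired_flag = self._entries.popleft()
            self._invoked_count -= int(expired_flag)

    def rate(self):
        if not self._entries:
            return 0.0
        return self._invoked_count / len(self._entries)

    def maybe_warn(self, now_ns, frame_id):
        """Emit at most one WARN per 60 s while the rate exceeds the threshold."""
        rate = self.rate()
        if rate <= self._warn_threshold:
            return False
        if self._last_warn_ns is not None and now_ns - self._last_warn_ns < WINDOW_NS:
            return False
        self._last_warn_ns = now_ns
        self._log.warning(
            "c3_5.refiner.invocation_rate_high",
            extra={"rate": rate, "target_threshold": self._warn_threshold,
                   "frame_id": frame_id},
        )
        return True


window = InvocationWindow()
for i in range(30):  # 30 frames at ~3 Hz, every third frame invoked
    window.record(i * 333_333_333, was_invoked=(i % 3 == 0))
```

Keeping a running invoked count rather than re-summing the deque is one way to stay inside the microsecond-level update budget; a ring buffer would serve equally well.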
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| AC-1 | Protocol conformance | Passes `isinstance` | +| AC-2 | Gate `<=` inclusive | equality → passthrough; +ε → invoked | +| AC-3 | AdHoP success path | `refinement_label="adhop"`; refined residual; latency > 0 | +| AC-4 | Backbone error → passthrough fall-through | Same-reference output; ERROR log; no escape | +| AC-5 | Other exceptions re-raise | `MemoryError` propagates | +| AC-6 | Gate passthrough byte-identical | Same object reference | +| AC-7 | Invocation rate accuracy | `10/30 ≈ 0.333` (compared with a tolerance) | +| AC-8 | WARN log rate-limited | One per 60 s | +| AC-9 | `was_invoked()` semantics | Three cases match | +| AC-10 | Composition wiring | Ready log; identity-shared `RansacFilter` | +| AC-11 | FDR record shape | Exactly one per call; documented fields | + +## Constraints + +- **Single-threaded** by contract; the `_invocation_window` is non-thread-safe by design. +- **AdHoP backbone errors NEVER propagate out** of `refine_if_needed` (Invariant 4). Other exception types DO propagate (AC-5 closed-set semantics). +- **Gate uses `<=` not `<`** — equality at the threshold means passthrough (deterministic, documented). +- **Engine load is ONCE at `create` time** — never lazy on first frame; ensures the F1 takeoff cold-start cost is bounded and deterministic. +- **`RansacFilter` is constructor-injected, identity-shared** with C3 + C4 — composition root constructs ONE instance. +- **WARN log is rate-limited to ONE per 60 s** — avoids log flooding when the threshold is mis-tuned. + +## Risks & Mitigation + +**Risk 1: AdHoP TRT engine OOM on Jetson under thermal-throttle conditions** +- *Mitigation*: passthrough fall-through (Invariant 4) on `RefinerBackboneError`; downstream pose estimator handles the same `MatchResult` shape regardless of refinement outcome. 
The repeated-OOM scenario is detectable via the `_invocation_window` "almost-always-error" pattern → operator-tooling pre-flight raises the residual threshold per R10. + +**Risk 2: AdHoP refinement produces a result with HIGHER residual than the input** (degenerate frames) +- *Mitigation*: per the contract, the strategy returns whatever AdHoP produces; downstream pose estimator gates on the new residual. C3.5-IT-01 measures the improvement rate (≥ 90%); the remaining ≤10% are accepted (documented). + +**Risk 3: `_invocation_window` deque grows unbounded if `refine_if_needed` is called from the wrong thread / without pruning** +- *Mitigation*: lazy prune at every append; bounded by ~180 entries at 3 Hz × 60 s. Memory upper bound is ~6 KB. Single-thread invariant covers the racing concern. + +**Risk 4: AdHoP weights / engine file path missing at startup** +- *Mitigation*: `create(...)` raises `RefinerConfigError` (caught at composition root) before the strategy is wired in. F1 takeoff abort follows the existing error handling pattern. + +## Runtime Completeness + +- **Named capability**: `AdHoPRefiner` — production-default `ConditionalRefiner` (architecture / E-C3.5 / `solution.md` / R10). +- **Production code that must exist**: real `AdHoPRefiner` calling real C7 `InferenceRuntime` with real TRT-compiled AdHoP engine; real shared `RansacFilter` for inlier filtering + median residual; real conditional gate at the documented `<=` semantics; real passthrough fall-through on `RefinerBackboneError`; real 60 s rolling invocation-rate counter + rate-limited WARN log; real per-frame FDR record emission; real composition-root wiring. +- **Allowed external stubs**: `FakeInferenceRuntime`, `FakeRansacFilter`, `FakeFdrClient` for tests. 
+- **Unacceptable substitutes**: a Python+NumPy AdHoP forward (would not satisfy the latency budget); using a different RANSAC implementation; allowing `RefinerBackboneError` to propagate out (Invariant 4 violation); deferring the `_invocation_window` to a future task (C3.5-IT-03 fails without it); using `<` instead of `<=` for the gate (would create a deterministic-replay divergence on equality-ish frames). diff --git a/_docs/02_tasks/todo/AZ-355_c4_pose_protocol.md b/_docs/02_tasks/todo/AZ-355_c4_pose_protocol.md new file mode 100644 index 0000000..c3418ea --- /dev/null +++ b/_docs/02_tasks/todo/AZ-355_c4_pose_protocol.md @@ -0,0 +1,144 @@ +# C4 PoseEstimator Protocol + Factory + DTOs + Composition + +**Task**: AZ-355_c4_pose_protocol +**Name**: C4 `PoseEstimator` Protocol + Factory + DTOs + Composition +**Description**: Define the public `PoseEstimator` Protocol (PEP 544 `@runtime_checkable`), the C4 DTOs (`PoseEstimate`, `LatLonAlt`, `Quat`, `CovarianceMode` enum, `PoseSourceLabel` enum), the error hierarchy (`PoseEstimatorError`, `PnpFailureError`, `CovarianceDegradedWarning` — note: a `Warning` subclass NOT an `Exception`), and the composition-root factory `build_pose_estimator(config, ransac_filter, wgs_converter, se3_utils, isam2_graph_handle) -> PoseEstimator`. The shared `RansacFilter` (AZ-282), `WgsConverter` (AZ-279), and `SE3Utils` (AZ-277) helpers are constructor-injected. The C5 iSAM2 graph handle is constructor-injected from the runtime root (ADR-003 shared substrate; C4 NEVER owns the graph). This task delivers the foundational scaffolding the Marginals (TBD AZ-?) and Hybrid (TBD AZ-?) tasks depend on; no PnP / GTSAM / Jacobian implementation is in scope here. 
+**Complexity**: 3 points +**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-270_compose_root, AZ-282_ransac_filter, AZ-279_wgs_converter, AZ-277_se3_utils, AZ-266_log_module +**Component**: c4_pose (epic AZ-259 / E-C4) +**Tracker**: AZ-355 +**Epic**: AZ-259 (E-C4) + +### Document Dependencies + +- `_docs/02_document/contracts/c4_pose/pose_estimator_protocol.md` — the public contract this task implements. +- `_docs/02_document/components/06_c4_pose/description.md` — § 1 architectural pattern (single concrete impl behind a Protocol); § 2 `PoseEstimator` interface + `PoseEstimate` DTO; § 5 error handling; § 9 logging. +- `_docs/02_document/module-layout.md` — `c4_pose` Per-Component Mapping; joint native ownership note (C5 owns `cpp/gtsam_bindings/`; C4 imports READ-ONLY). +- `_docs/02_document/architecture.md` — ADR-001, ADR-003 (shared GTSAM substrate), ADR-006 (Jacobian fallback acceptance), ADR-009. +- `_docs/02_document/contracts/c3_5_adhop/conditional_refiner_protocol.md` — `MatchResult` shape consumed at input. + +## Problem + +Without this task, the C4 PnP/Marginals consumers (TBD) and the downstream C5 StateEstimator (AZ-260) would each invent their own ad-hoc interface for the pose-estimator boundary, breaking ADR-001 (Strategy + composition root) and ADR-009 (interface-first DI). The DTO surface (`PoseEstimate`, `CovarianceMode`, `PoseSourceLabel`) is also consumed by C5 + C8 + C13 — defining it once in `_types/pose.py` prevents drift across consumers. The error hierarchy choice (`CovarianceDegradedWarning` as a `Warning` subclass, NOT `Exception`) is also non-obvious and needs to be codified before any caller writes a `try/except` around `estimate(...)` and accidentally swallows the warning. + +## Outcome + +- `src/gps_denied_onboard/components/c4_pose/interface.py` defining the `PoseEstimator` Protocol with `estimate` + `current_covariance_mode`. 
+- `src/gps_denied_onboard/components/c4_pose/__init__.py` re-exporting `PoseEstimator`, `PoseEstimate`, `EstimatorOutput` (per module-layout `c4_pose` Public API row), `CovarianceMode`, `PoseSourceLabel`. +- `src/gps_denied_onboard/_types/pose.py` defining the frozen + slotted dataclasses `LatLonAlt`, `Quat`, `PoseEstimate`, plus the `CovarianceMode` and `PoseSourceLabel` enums. +- `src/gps_denied_onboard/components/c4_pose/errors.py` defining `PoseEstimatorError`, `PnpFailureError`, and `CovarianceDegradedWarning` (the latter as a `Warning` subclass). +- `src/gps_denied_onboard/runtime_root/pose_factory.py` exporting `build_pose_estimator(config, ransac_filter, wgs_converter, se3_utils, isam2_graph_handle) -> PoseEstimator`. Single-strategy resolution table (`"opencv_gtsam"` only). Lazy-imports the concrete module via `importlib.import_module(...)` for symmetry with other component factories. +- Composition-root `compose_root` extension: invoke `build_pose_estimator` AFTER `RansacFilter`, `WgsConverter`, `SE3Utils`, and the C5 iSAM2 graph handle are constructed; bind the result to the SAME ingest thread as C5. +- Config schema extension to AZ-269: `config.pose.strategy` (default `"opencv_gtsam"`), `config.pose.ransac_iterations` (default 200), `config.pose.ransac_reprojection_threshold_px` (default 4.0), `config.pose.thermal_throttle_threshold_celsius` (default 75.0; informational). +- INFO log on every successful `build_pose_estimator`: `kind="c4.pose.strategy_loaded"` with strategy name + thresholds. +- `ISam2GraphHandle` Protocol stub at `src/gps_denied_onboard/components/c4_pose/_isam2_handle.py` (READ-ONLY view; allows C5 to provide a duck-typed handle without prematurely defining C5's graph internals). Documents the ONE method C4 needs: `get_pose_key(frame_id) -> int`. + +## Scope + +### Included +- The `PoseEstimator` Protocol with `estimate` + `current_covariance_mode`. +- The five DTOs / enums in `_types/pose.py`. 
+- The error hierarchy (note: `CovarianceDegradedWarning` is a `Warning` subclass that is emitted via `warnings.warn`, not a raised error). +- The composition-root factory. +- Config schema extension. +- The `ISam2GraphHandle` Protocol stub (consumed-side surface only; concrete impl owned by E-C5 / AZ-260). +- Composition-root wiring path. +- Unit tests covering Protocol conformance, DTO immutability + slots, factory rejection on unknown strategy, factory acceptance, INFO log emission, error-hierarchy distinction (`CovarianceDegradedWarning` IS-A `Warning` and is emitted, not raised). + +### Excluded +- The `OpenCVGtsamPoseEstimator` concrete implementation — owned by the Marginals task (TBD). +- The Jacobian fallback path + thermal switch — owned by the Hybrid task (TBD). +- The GTSAM `Marginals` factor add — owned by the Marginals task. +- The C5 iSAM2 graph implementation — owned by AZ-260. +- C4-IT-01..04 + C4-PT-01 — deferred to E-BBT (AZ-262). +- The C7 `ThermalState` source — owned by AZ-302. + +## Acceptance Criteria + +**AC-1: Protocol conformance — `runtime_checkable`** +A `FakePoseEstimator` test double implementing both methods passes `isinstance`; missing-method fakes fail. + +**AC-2: DTOs are frozen + slots** +`LatLonAlt`, `Quat`, `PoseEstimate` are `frozen=True, slots=True`. Mutation raises `FrozenInstanceError`. `__slots__` non-empty. + +**AC-3: Enums have the documented values** +`CovarianceMode` has exactly `MARGINALS` and `JACOBIAN` (string-valued). `PoseSourceLabel` has exactly `SATELLITE_ANCHORED`, `VISUAL_PROPAGATED`, `DEAD_RECKONED`. + +**AC-4: `CovarianceDegradedWarning` IS-A `Warning` and is emitted, not raised** +`issubclass(CovarianceDegradedWarning, Warning)` is True. Because Python's built-in `Warning` itself derives from `Exception`, `issubclass(CovarianceDegradedWarning, Exception)` is necessarily also True and MUST NOT be asserted False. The distinction that matters is behavioural: the class is emitted via `warnings.warn` rather than raised, so under the default warning filters `try/except Exception` does NOT catch it (an `error` filter, e.g. `python -W error`, would turn it into a raised exception). 
Test verifies that a `try/except Exception` around `warnings.warn(CovarianceDegradedWarning(...))` does NOT catch the warning. + +**AC-5: `PnpFailureError` IS-A `Exception`** +`issubclass(PnpFailureError, PoseEstimatorError)` AND `issubclass(PnpFailureError, Exception)` both True. + +**AC-6: Factory rejects unknown strategy** +`config.pose.strategy = "garbage"` → `PoseEstimatorConfigError` raised; ERROR log emitted. + +**AC-7: Factory accepts `"opencv_gtsam"` and emits INFO log** +Successful construction; ONE INFO log `kind="c4.pose.strategy_loaded"` with structured fields. + +**AC-8: Public API surface — `__init__.py` re-exports** +`from gps_denied_onboard.components.c4_pose import PoseEstimator, PoseEstimate, CovarianceMode, PoseSourceLabel` resolves; `_isam2_handle` and internal classes NOT in `__all__`. + +**AC-9: Strategy bound to single ingest thread (same thread as C5)** +Composition root binds C4 + C5 to the same thread; binding C4 to a different thread raises `RuntimeError`. + +**AC-10: `ISam2GraphHandle` Protocol stub conforms to `runtime_checkable`** +A test double implementing `get_pose_key(frame_id) -> int` passes `isinstance(fake, ISam2GraphHandle)`. + +## Non-Functional Requirements + +**Performance** +- `build_pose_estimator` p99 ≤ 50 ms. + +**Compatibility** +- Protocol method-signature changes are MAJOR; DTO field additions are MINOR. + +**Reliability** +- Single-thread invariant enforced at composition-root binding (AC-9). +- `CovarianceDegradedWarning` semantics codified — callers MUST use `warnings.catch_warnings` if they need to programmatically observe warnings. 
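The warning-versus-error behaviour that AC-4, AC-5 and the reliability note rely on can be checked with a few lines of standard-library Python. The sketch below is illustrative only: the class bodies and the helper functions (`estimate_degraded`, `call_with_except`, `call_and_observe`, `call_under_error_filter`) are hypothetical stand-ins, while the class names come from the task text. One fact worth keeping in view is that Python's built-in `Warning` derives from `Exception`; what keeps a warning out of `except Exception` is that `warnings.warn` does not raise under the default filters.

```python
import warnings


class PoseEstimatorError(Exception):
    """Base class for raised C4 errors (name from the spec; body assumed)."""


class PnpFailureError(PoseEstimatorError):
    """Raised on PnP failure."""


class CovarianceDegradedWarning(Warning):
    """Emitted via warnings.warn; signals degraded covariance, not failure."""


def estimate_degraded():
    """Hypothetical stand-in for estimate(): warns and still returns a result."""
    warnings.warn(CovarianceDegradedWarning("covariance degraded"), stacklevel=2)
    return "pose"


def call_with_except():
    """try/except Exception does not see a warning that is merely emitted."""
    caught = False
    with warnings.catch_warnings():
        warnings.simplefilter("default")
        try:
            result = estimate_degraded()
        except Exception:
            caught = True
            result = None
    return caught, result


def call_and_observe():
    """Callers that need the warning programmatically record it explicitly."""
    with warnings.catch_warnings(record=True) as seen:
        warnings.simplefilter("always")
        result = estimate_degraded()
    return result, [w.category for w in seen]


def call_under_error_filter():
    """With an 'error' filter the same warning is raised like an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        try:
            estimate_degraded()
        except CovarianceDegradedWarning:
            return True
    return False
```

A test suite that runs with warnings promoted to errors would therefore see the warning as a raised exception, so tests for this behaviour are safer when they pin the filter inside `warnings.catch_warnings()` as shown.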
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| AC-1 | Protocol conformance | Fake passes; partial fake fails | +| AC-2 | DTO immutability + slots | `FrozenInstanceError`; non-empty `__slots__` | +| AC-3 | Enum values | Documented values present | +| AC-4 | Warning vs Exception hierarchy | `try/except Exception` does NOT catch the warning | +| AC-5 | `PnpFailureError` IS-A Exception | `try/except Exception` catches | +| AC-6 | Unknown-strategy rejection | `PoseEstimatorConfigError`; ERROR log | +| AC-7 | Successful factory load | INFO log with structured fields | +| AC-8 | Public API re-exports | Public names resolve; internals not | +| AC-9 | Single-thread binding | Second binding (different thread) raises `RuntimeError` | +| AC-10 | `ISam2GraphHandle` Protocol | Fake passes | + +## Constraints + +- **`@runtime_checkable` MUST be used** on both `PoseEstimator` and `ISam2GraphHandle`. +- **DTOs MUST be `frozen=True, slots=True`.** +- **`CovarianceDegradedWarning` MUST be a `Warning` subclass** (NOT `Exception`). Documents R10 acceptance. +- **The factory does NOT instantiate `RansacFilter`, `WgsConverter`, `SE3Utils`, or the iSAM2 graph handle** — runtime root constructs ONCE and passes references. +- **Single-thread binding** with C5 is enforced at composition root; ADR-003 shared GTSAM substrate is non-thread-safe. + +## Risks & Mitigation + +**Risk 1: `ISam2GraphHandle` Protocol stub couples C4 prematurely to C5** +- *Mitigation*: the stub defines ONLY `get_pose_key(frame_id) -> int` — the minimal surface C4 needs to attach factors. C5 (AZ-260) implements the concrete handle; if C5's graph design changes, the stub may grow but the Protocol surface stays stable as long as C4's needs don't change. + +**Risk 2: `CovarianceDegradedWarning` semantics confuse callers expecting an exception-like flow** +- *Mitigation*: documented in the contract (Invariant 9) AND codified in AC-4. 
Description.md § 5 also states explicitly "NOT a fatal condition". + +**Risk 3: Composition root needs to construct C4 and C5 in lockstep (chicken-and-egg)** +- *Mitigation*: ADR-003 documents this. The composition root constructs the iSAM2 graph FIRST (C5), then C4 (passing the handle). The Protocol task creates the stub Protocol so C5 can be implemented in parallel without C4 implementations being ready. + +## Runtime Completeness + +- **Named capability**: `PoseEstimator` Protocol + `PoseEstimate` DTO + `ISam2GraphHandle` Protocol stub + composition-root factory. +- **Production code that must exist**: real Protocol + real DTOs + real error hierarchy + real factory + real config schema extension + real composition-root wiring path that binds C4 to the same thread as C5. +- **Allowed external stubs**: `FakePoseEstimator`, `FakeISam2GraphHandle`, `FakeRansacFilter`, `FakeWgsConverter`, `FakeSE3Utils` for tests. +- **Unacceptable substitutes**: making `CovarianceDegradedWarning` an `Exception` subclass (would change warning semantics for all callers); skipping the `ISam2GraphHandle` stub (would force C4 implementations to import C5's concrete graph type → cycle). + +## Contract + +This task produces/implements the contract at `_docs/02_document/contracts/c4_pose/pose_estimator_protocol.md`. +Consumers MUST read that file — not this task spec — to discover the interface. diff --git a/_docs/02_tasks/todo/AZ-358_c4_opencv_gtsam_marginals.md b/_docs/02_tasks/todo/AZ-358_c4_opencv_gtsam_marginals.md new file mode 100644 index 0000000..92ac9b3 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-358_c4_opencv_gtsam_marginals.md @@ -0,0 +1,194 @@ +# C4 OpenCVGtsamPoseEstimator — Marginals (steady-state) path + +**Task**: AZ-358_c4_opencv_gtsam_marginals +**Name**: C4 `OpenCVGtsamPoseEstimator` — `solvePnPRansac` + GTSAM `Marginals` factor add (steady-state path) +**Description**: Implement `OpenCVGtsamPoseEstimator`, the production-default `PoseEstimator`. 
STEADY-STATE PATH ONLY (`thermal_state.throttle == False`): runs OpenCV `solvePnPRansac` with `SOLVEPNP_IPPE` on the inlier correspondences from `MatchResult.per_candidate[best_candidate_idx].inlier_correspondences`; on success, adds a `GenericProjectionFactorCal3DS2` to C5's shared iSAM2 graph (via the `ISam2GraphHandle.get_pose_key(frame_id)` API); recovers the posterior 6×6 covariance via `gtsam.Marginals(graph, values).marginalCovariance(pose_key)`; converts the local-tangent-plane pose to WGS84 via the shared `WgsConverter`; assembles a `PoseEstimate` with `covariance_mode = MARGINALS`, `source_label = SATELLITE_ANCHORED`. RANSAC convergence failure or degenerate geometry raises `PnpFailureError` (per Invariant 9). When `thermal_state.throttle == True`, raises `NotImplementedError("Jacobian path owned by Hybrid task")` — replaced by AZ-? (Hybrid task) when that lands. +**Complexity**: 5 points +**Dependencies**: AZ-355 (Protocol + DTOs + factory + `ISam2GraphHandle` Protocol stub), AZ-381 (E-C5 — supplies the concrete `ISam2GraphHandle` impl + iSAM2 graph; co-developed per ADR-003), AZ-282 (RANSAC helper, used internally for residual recomputation if needed), AZ-279 (`WgsConverter`), AZ-277 (`SE3Utils`), AZ-269 (config), AZ-266 (logging), AZ-272 (FDR record schema), AZ-263 +**Component**: c4_pose (epic AZ-259 / E-C4) +**Tracker**: AZ-358 +**Epic**: AZ-259 (E-C4) + +### Document Dependencies + +- `_docs/02_document/contracts/c4_pose/pose_estimator_protocol.md` — producer/consumer split; Marginals task scope. +- `_docs/02_document/components/06_c4_pose/description.md` — § 1 (D-C4-1=(b) IPPE; D-C4-2=(b) Marginals); § 5 error handling; § 7 `Marginals.marginalCovariance(pose_key)` is the dominant cost (~30–90 ms). +- `_docs/02_document/components/06_c4_pose/tests.md` — C4-IT-01 (WGS84 accuracy), C4-IT-02 (SPD covariance), C4-IT-04 (shared-graph integration). +- `_docs/02_document/architecture.md` — ADR-003 (shared GTSAM substrate), ADR-009 (interface-first DI). 
+- `_docs/02_document/contracts/shared_helpers/wgs_converter.md`. +- `_docs/02_document/contracts/shared_helpers/ransac_filter.md`. + +## Problem + +Without this task, C5 has no source of pose anchors with native 6×6 covariance — meaning every frame would be `visual_propagated` and AC-1.4 (95% covariance + source label) cannot be satisfied. The Marginals path is also the production-default per D-C4-2 = (b); the Jacobian fallback is degraded-mode only. Defining the steady-state path FIRST (before the Hybrid task) is also necessary so the Hybrid task has a working baseline against which to measure the Jacobian-degraded accuracy delta (~5–10% per ADR-006). + +## Outcome + +- `src/gps_denied_onboard/components/c4_pose/opencv_gtsam_estimator.py` defining: + - `OpenCVGtsamPoseEstimator` class implementing the `PoseEstimator` Protocol. + - Constructor: `__init__(self, config, ransac_filter, wgs_converter, se3_utils, isam2_graph_handle, fdr_client)`. + - `_last_covariance_mode: CovarianceMode` (private; reset every `estimate` call). + - `estimate(match_result, calibration, thermal_state)`: + 1. **If `thermal_state.throttle == True`**: raise `NotImplementedError("Jacobian path owned by Hybrid task; install AZ-? to enable.")`. Code review hook for the Hybrid task to delete this branch and replace with the actual implementation. + 2. **Steady-state path** (`thermal_state.throttle == False`): + a. Extract inlier 2D points + 3D world points from `match_result.per_candidate[match_result.best_candidate_idx].inlier_correspondences` + the candidate's `tile_id` → look up tile world coordinates via the calibration's georeferencing. + b. Call `cv2.solvePnPRansac(world_pts, image_pts, K, dist, flags=cv2.SOLVEPNP_IPPE, iterationsCount=config.pose.ransac_iterations, reprojectionError=config.pose.ransac_reprojection_threshold_px)`. + c. On RANSAC failure: raise `PnpFailureError(f"PnP convergence failure: frame={match_result.frame_id}")`. ERROR log + FDR record. 
The exception escapes (per Invariant 9). + d. On success: convert `(rvec, tvec)` to a GTSAM `Pose3` via `SE3Utils`. + e. Add a `gtsam.GenericProjectionFactorCal3DS2(image_pts, noise_model, pose_key, landmark_keys, calibration_gtsam)` to C5's iSAM2 graph via `isam2_graph_handle.add_factor(...)` (the handle exposes a write API; AZ-260 implements). NOTE: the Protocol stub in AZ-? defines only `get_pose_key`; this task EXTENDS the stub with `add_factor` and `compute_marginals` (or pushes the Protocol extension upstream into the AZ-? Protocol task during co-development). + f. Trigger an iSAM2 update via `isam2_graph_handle.update()`. + g. Compute `covariance_6x6 = gtsam.Marginals(graph, values).marginalCovariance(pose_key)`. + h. Verify SPD: `np.linalg.cholesky(covariance_6x6)` succeeds; if not, log ERROR + raise `PnpFailureError("non-SPD covariance from Marginals; numerical instability")` (defensive). + i. Convert local-tangent-plane pose to WGS84 via `wgs_converter.local_to_wgs84(pose)`. + j. Assemble `PoseEstimate(frame_id, position_wgs84, orientation, covariance_6x6, MARGINALS, SATELLITE_ANCHORED, last_anchor_age_ms_from_isam2_handle, monotonic_ns())`. + k. `self._last_covariance_mode = MARGINALS`. + l. INFO log on first-frame ready; DEBUG log per frame `kind="c4.pose.frame_done"` with `{frame_id, inliers, residual, mode}`. + m. FDR `pose.frame_done` record. + 3. Return the `PoseEstimate`. + - `current_covariance_mode() -> CovarianceMode`: return `self._last_covariance_mode` (initialised to `MARGINALS` at construction; updated per call). + - Module-level `create(config, ransac_filter, wgs_converter, se3_utils, isam2_graph_handle) -> PoseEstimator` factory function. +- `ISam2GraphHandle` Protocol extension (in AZ-?'s `_isam2_handle.py` — co-developed): + - `add_factor(factor) -> None`: add a factor to the iSAM2 graph. + - `update() -> None`: trigger an iSAM2 update. + - `compute_marginals() -> gtsam.Marginals`: returns the Marginals object for covariance recovery. 
+ - `last_anchor_age_ms() -> int`: tracked by C5; broadcast to C4 via the handle. + The extension is part of THIS task's scope; the Protocol stub in AZ-? is updated in lockstep. + +## Scope + +### Included +- `OpenCVGtsamPoseEstimator` Marginals path (steady-state). +- `solvePnPRansac` with `SOLVEPNP_IPPE`. +- GTSAM factor add to C5's iSAM2 graph via `ISam2GraphHandle`. +- `Marginals.marginalCovariance(pose_key)` for native 6×6 covariance recovery. +- WGS84 conversion via shared `WgsConverter`. +- SPD-invariant defensive check. +- `PnpFailureError` raise on convergence failure or degenerate geometry. +- `NotImplementedError` placeholder for the Jacobian path (replaced by Hybrid task). +- `ISam2GraphHandle` Protocol extension (`add_factor`, `update`, `compute_marginals`, `last_anchor_age_ms`). +- Composition-root wiring path. +- Unit tests covering: PnP success path on synthetic correspondences; PnP failure (degenerate geometry) → `PnpFailureError`; SPD covariance assertion; WGS84 conversion correctness against `WgsConverter` test vectors; thermal-throttle → `NotImplementedError`. + +### Excluded +- The Jacobian fallback path — owned by the Hybrid task. +- The C5 iSAM2 graph implementation — owned by AZ-260; this task consumes via Protocol. +- C4-IT-01..04 + C4-PT-01 — deferred to E-BBT (AZ-262). +- The `ThermalState` source — owned by AZ-302. +- The camera calibration loader (intrinsics + distortion + extrinsics) — owned by C5 / shared. + +## Acceptance Criteria + +**AC-1: PnP success on synthetic correspondences** +Given a 50-point inlier set with known ground-truth pose +When `estimate(...)` runs +Then the returned `position_wgs84` is within 1 m of ground truth (synthetic; real-world tolerances per C4-IT-01). + +**AC-2: PnP RANSAC failure → `PnpFailureError`** +Given an inlier set with all-collinear points (degenerate geometry) +When `estimate(...)` is called +Then `PnpFailureError` is raised; ONE ERROR log; ONE FDR `pose.frame_done` record with `error: true`. 
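The extended `ISam2GraphHandle` surface listed in the scope above, together with a recording test double of the kind the stub-handle criterion (AC-7) calls for, might look like the sketch below. The method names are taken from the task text; the `Any` return types, the `RecordingHandle` and `PartialHandle` classes, and the `run_frame` driver are assumptions made for illustration, and no GTSAM types are imported.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ISam2GraphHandle(Protocol):
    """Consumed-side view of C5's graph; method names follow the task text."""

    def get_pose_key(self, frame_id: int) -> int: ...
    def add_factor(self, factor: Any) -> None: ...
    def update(self) -> None: ...
    def compute_marginals(self) -> Any: ...  # a gtsam.Marginals in production
    def last_anchor_age_ms(self) -> int: ...


class RecordingHandle:
    """Hypothetical test double that records the order of calls."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def get_pose_key(self, frame_id: int) -> int:
        self.calls.append("get_pose_key")
        return frame_id

    def add_factor(self, factor: Any) -> None:
        self.calls.append("add_factor")

    def update(self) -> None:
        self.calls.append("update")

    def compute_marginals(self) -> Any:
        self.calls.append("compute_marginals")
        return object()  # placeholder for the Marginals object

    def last_anchor_age_ms(self) -> int:
        self.calls.append("last_anchor_age_ms")
        return 0


class PartialHandle:
    """Implements only the original stub method; fails the isinstance check."""

    def get_pose_key(self, frame_id: int) -> int:
        return frame_id


def run_frame(handle: ISam2GraphHandle, frame_id: int, factor: Any) -> int:
    """Hypothetical driver issuing one frame's calls in the expected order."""
    pose_key = handle.get_pose_key(frame_id)
    handle.add_factor(factor)
    handle.update()
    handle.compute_marginals()
    return pose_key
```

Note that `isinstance` against a `runtime_checkable` Protocol only checks that the method names exist, not their signatures, so signature drift between C4 and C5 still needs a type checker or explicit tests.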
+ +**AC-3: SPD covariance invariant** +Given a successful PnP + Marginals run +When the resulting `covariance_6x6` is checked +Then `np.linalg.cholesky(covariance_6x6)` succeeds (matrix is positive-definite); the matrix is symmetric to 1e-10 tolerance. + +**AC-4: `covariance_mode == MARGINALS` on success** +Every successful `estimate` returns `PoseEstimate.covariance_mode == CovarianceMode.MARGINALS` AND `current_covariance_mode() == MARGINALS`. + +**AC-5: `source_label == SATELLITE_ANCHORED` on success (Invariant 7)** +Every successful `estimate` returns `PoseEstimate.source_label == PoseSourceLabel.SATELLITE_ANCHORED`. C4 NEVER emits `VISUAL_PROPAGATED`. + +**AC-6: WGS84 conversion uses shared `WgsConverter`** +Given a known local-tangent-plane pose AND a known origin +When `estimate` runs +Then the WGS84 conversion exactly matches `WgsConverter.local_to_wgs84(pose)` test vectors. (This is implementation hygiene — verifies no inline math.) + +**AC-7: Factor add against C5 iSAM2 handle** +Given a stub `ISam2GraphHandle` recording all calls +When `estimate(...)` runs +Then the call sequence is: `add_factor(factor)` × 1 → `update()` × 1 → `compute_marginals()` × 1 → marginal recovered. No second `update()` per frame. + +**AC-8: Thermal throttle → `NotImplementedError`** +Given `thermal_state.throttle = True` +When `estimate(...)` is called +Then `NotImplementedError("Jacobian path owned by Hybrid task; install AZ-? to enable.")` is raised. + +**AC-9: Non-SPD covariance defensive raise** +Given a stubbed `ISam2GraphHandle.compute_marginals()` returning a non-SPD matrix (synthetically corrupted) +When `estimate(...)` runs +Then `PnpFailureError("non-SPD covariance from Marginals; numerical instability")` is raised; ERROR log emitted. 
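The SPD check behind AC-3 and AC-9 can be illustrated without any dependencies. The task text performs the check with `np.linalg.cholesky`; the sketch below swaps in a small hand-rolled Cholesky factorisation purely so the example is self-contained, and the helper names (`is_symmetric`, `cholesky_lower`, `assert_spd`) are hypothetical. The error message and the `1e-10` symmetry tolerance are the ones quoted in the criteria.

```python
import math


class PnpFailureError(Exception):
    """Stand-in for the C4 error class named in the spec."""


def is_symmetric(m, tol=1e-10):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol
               for i in range(n) for j in range(i + 1, n))


def cholesky_lower(m):
    """Return lower-triangular L with L * L^T == m.

    Raises ValueError when m is not positive-definite, mirroring the role
    that a failed NumPy Cholesky plays in the spec's check.
    """
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                d = m[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive-definite")
                low[i][j] = math.sqrt(d)
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low


def assert_spd(cov):
    """Map a symmetry or positive-definiteness failure to PnpFailureError."""
    message = "non-SPD covariance from Marginals; numerical instability"
    if not is_symmetric(cov):
        raise PnpFailureError(message)
    try:
        cholesky_lower(cov)
    except ValueError as exc:
        raise PnpFailureError(message) from exc


# A well-conditioned 6x6 diagonal covariance passes the check.
good_cov = [[0.01 if i == j else 0.0 for j in range(6)] for i in range(6)]
assert_spd(good_cov)
```

In production code the NumPy call is the natural choice; the point here is only the control flow, where any factorisation failure is converted into the single documented error type.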
+ +**AC-10: Composition-root wiring** +Given `config.pose.strategy = "opencv_gtsam"` AND a valid `ISam2GraphHandle` +When `compose_root(config)` runs +Then an `OpenCVGtsamPoseEstimator` is wired; ONE INFO log `kind="c4.pose.ready"` with `{strategy: "opencv_gtsam", default_covariance: "MARGINALS"}` is emitted; the strategy holds the SAME `RansacFilter`, `WgsConverter`, `SE3Utils` instances as C3 / C5 (identity-shared). + +**AC-11: FDR `pose.frame_done` record shape** +Every `estimate` call (success OR `PnpFailureError`) emits exactly ONE FDR record with documented fields: `{frame_id, inliers, residual, mode, covariance_norm, position_wgs84, error}`. + +## Non-Functional Requirements + +**Performance** +- `estimate` p95 (MARGINALS, K=15) ≤ 90 ms target / 130 ms hard limit (per C4-PT-01 / AC-4.1). +- `solvePnPRansac` portion p95 ≤ 30 ms. +- `Marginals.marginalCovariance(pose_key)` portion p95 ≤ 60 ms (the dominant cost per description.md § 7). + +**Compatibility** +- OpenCV ≥ 4.12.0 (CVE-2025-53644 mitigation per description.md). +- GTSAM Python bindings — version per Plan-phase pin. + +**Reliability** +- `PnpFailureError` is the ONLY exception escaping (Invariant 9). All other failure modes (degenerate geometry, non-SPD covariance, etc.) MAP to `PnpFailureError`. +- SPD invariant defensive check covers GTSAM numerical instability; if it triggers, an upstream issue exists in C5's graph health. 
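One way to enforce the "only `PnpFailureError` escapes" rule is a thin wrapper around `estimate`. The sketch below is illustrative only: `only_pnp_failure` is a hypothetical decorator name, `PnpFailureError` is a local stand-in, and letting `NotImplementedError` pass through is an assumption made so that the AC-8 placeholder still surfaces unchanged.

```python
import functools


class PnpFailureError(RuntimeError):
    """Local stand-in for the project's error type (hypothetical definition)."""


def only_pnp_failure(fn):
    """Map every unexpected exception raised by `fn` to PnpFailureError."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except (PnpFailureError, NotImplementedError):
            # Already the documented error, or the AC-8 placeholder (assumption).
            raise
        except Exception as exc:
            # Keep the original exception as __cause__ for diagnostics.
            raise PnpFailureError(f"{type(exc).__name__}: {exc}") from exc

    return wrapper
```

Chaining with `from exc` keeps the underlying failure visible in tracebacks while callers only need to catch one exception type.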
+ +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| AC-1 | Synthetic PnP success | Position within 1 m of GT | +| AC-2 | Degenerate geometry → `PnpFailureError` | Exception raised; ERROR log; FDR error record | +| AC-3 | SPD covariance | Cholesky succeeds; symmetric to 1e-10 | +| AC-4 | Mode == MARGINALS | Both `PoseEstimate.covariance_mode` and `current_covariance_mode()` | +| AC-5 | Source label == SATELLITE_ANCHORED | Always on success | +| AC-6 | `WgsConverter` use | Output matches helper vectors | +| AC-7 | iSAM2 handle call sequence | `add_factor` → `update` → `compute_marginals`, each ×1 | +| AC-8 | Thermal throttle → `NotImplementedError` | Exception raised with documented message | +| AC-9 | Non-SPD covariance defensive | `PnpFailureError("non-SPD ...")` | +| AC-10 | Composition wiring | INFO log; identity-shared helpers | +| AC-11 | FDR record shape | Exactly one per call; documented fields | + +## Constraints + +- **Single-threaded** by contract (Invariant 1); same thread as C5. +- **Steady-state path ONLY** — `thermal_state.throttle == True` raises `NotImplementedError`. The Hybrid task replaces this branch. +- **`solvePnPRansac` flag MUST be `SOLVEPNP_IPPE`** per D-C4-1 = (b). +- **`Marginals` is the covariance-recovery primitive** — not Jacobian-based math (the Hybrid task's job). +- **No inline math for WGS84 conversion** — must use `WgsConverter` (AC-6). +- **`ISam2GraphHandle` extension MUST be applied to AZ-?'s Protocol stub** in lockstep — both tasks update the same `_isam2_handle.py` file. This is documented as a co-developed scope per ADR-003. + +## Risks & Mitigation + +**Risk 1: GTSAM `Marginals` is non-thread-safe + the iSAM2 graph is shared with C5** +- *Mitigation*: single-thread invariant (Invariant 1); composition root binds C4 + C5 to the same thread. 
The handle's `update()` + `compute_marginals()` are called sequentially within `estimate`; C5's update path runs in a different code section but on the same thread. + +**Risk 2: `solvePnPRansac` in OpenCV 4.12 has a known subtle behaviour change for `SOLVEPNP_IPPE`** that could affect convergence on the Derkachi fixture +- *Mitigation*: pin OpenCV version per Plan-phase pin. C4-IT-01 verifies WGS84 accuracy on Derkachi; if the verdict regresses, escalate. + +**Risk 3: The `ISam2GraphHandle` extension creates a chicken-and-egg with AZ-260 (C5)** +- *Mitigation*: ADR-003 acknowledges this. The Protocol extension is owned by THIS task; the concrete impl is owned by AZ-260; both are co-developed. The Protocol task (AZ-?) ships the minimal `get_pose_key` surface; this task extends to `add_factor`/`update`/`compute_marginals`/`last_anchor_age_ms`. AZ-260 implements all four. + +**Risk 4: `Marginals.marginalCovariance(pose_key)` cost (~30–90 ms) blows the AC-4.1 budget under load** +- *Mitigation*: the Hybrid task addresses this via the Jacobian fallback under thermal throttle. This task's NFR target is 90 ms p95 — within budget; the ~5–10% accuracy loss of the Jacobian path is the trade documented in ADR-006. + +## Runtime Completeness + +- **Named capability**: `OpenCVGtsamPoseEstimator` Marginals path — production-default `PoseEstimator`. +- **Production code that must exist**: real OpenCV `solvePnPRansac` call; real GTSAM factor add against C5's iSAM2 handle; real `Marginals.marginalCovariance(pose_key)`; real WGS84 conversion via `WgsConverter`; real `PnpFailureError` raise on failure; real SPD-invariant defensive check; real FDR record emission; real composition-root wiring. +- **Allowed external stubs**: `FakeISam2GraphHandle`, `FakeRansacFilter`, `FakeWgsConverter`, `FakeSE3Utils`, `FakeFdrClient`. Production wiring uses real concretes (real OpenCV, real GTSAM). 
+- **Unacceptable substitutes**: a SciPy-only PnP implementation (would not have the `IPPE` solver behaviour); a Jacobian-derived covariance on the steady-state path (would be the Hybrid task's job); inline WGS84 math (violates AC-6); a synthetic Marginals object that returns canned matrices (would skip the actual GTSAM integration test). + +## Contract + +This task implements the steady-state portion of `_docs/02_document/contracts/c4_pose/pose_estimator_protocol.md`. +The Hybrid task (TBD) replaces the `NotImplementedError` branch with the actual Jacobian implementation. diff --git a/_docs/02_tasks/todo/AZ-361_c4_jacobian_thermal_hybrid.md b/_docs/02_tasks/todo/AZ-361_c4_jacobian_thermal_hybrid.md new file mode 100644 index 0000000..d1a5df2 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-361_c4_jacobian_thermal_hybrid.md @@ -0,0 +1,187 @@ +# C4 D-CROSS-LATENCY-1 Hybrid — Jacobian fallback + thermal-state-driven mode switch + +**Task**: AZ-361_c4_jacobian_thermal_hybrid +**Name**: C4 D-CROSS-LATENCY-1 hybrid — Jacobian-degraded covariance + per-frame thermal-state-driven mode switch +**Description**: Extend `OpenCVGtsamPoseEstimator` (TBD AZ-? Marginals task) with the D-CROSS-LATENCY-1 hybrid: when `thermal_state.throttle == True`, REPLACE the `Marginals.marginalCovariance(pose_key)` path with a Jacobian-derived 6×6 covariance computed directly from the OpenCV `solvePnPRansac` outputs (`rvec`, `tvec`, inlier residuals) using the standard PnP Jacobian + reprojection-residual variance. The pose estimate itself is still `solvePnPRansac` output; ONLY the covariance recovery path differs. The mode-switch decision is made PER FRAME at the start of `estimate(...)` based on `thermal_state.throttle`. Switching MARGINALS → JACOBIAN or back happens immediately on the next call (Invariant 4 — mode-switch latency ≤ 1 frame). The Jacobian path also emits `CovarianceDegradedWarning` via `warnings.warn(...)` once per 60 s window (filterwarnings-based rate-limiting). Replaces the AZ-? 
Marginals task's `NotImplementedError("Jacobian path owned by Hybrid task...")` branch with the actual implementation. +**Complexity**: 3 points +**Dependencies**: AZ-358 (Marginals path + class scaffold), AZ-355 (`CovarianceDegradedWarning` + `CovarianceMode` enum), AZ-302_c7_thermal_publisher (`ThermalState` source), AZ-277_se3_utils, AZ-279_wgs_converter, AZ-269_config_loader, AZ-266_log_module, AZ-272_fdr_record_schema, AZ-263 +**Component**: c4_pose (epic AZ-259 / E-C4) +**Tracker**: AZ-361 +**Epic**: AZ-259 (E-C4) + +### Document Dependencies + +- `_docs/02_document/contracts/c4_pose/pose_estimator_protocol.md` — Hybrid task scope. +- `_docs/02_document/components/06_c4_pose/description.md` — § 1 D-C4-2 = (a) Jacobian fallback under throttle; § 5 `CovarianceDegradedWarning` semantics; § 7 ~5–15 ms Jacobian cost. +- `_docs/02_document/components/06_c4_pose/tests.md` — C4-IT-03 (D-CROSS-LATENCY-1 mode switch within 1 frame), C4-PT-01 (JACOBIAN p95 ≤ 15 ms). +- `_docs/02_document/architecture.md` — ADR-006 (~5–10% accuracy loss accepted under throttle), R10. + +## Problem + +Without this task, the system has only the Marginals path; under thermal throttle (sustained Jetson high-temperature mode), `Marginals.marginalCovariance(pose_key)` consumes ~60 ms of the AC-4.1 latency budget AND GPU/CPU contention with C7 inference makes the overall pipeline miss the 400 ms p95 budget. The D-CROSS-LATENCY-1 hybrid trades ~5–10% covariance accuracy (per ADR-006) for ~5–15 ms Jacobian-path latency, restoring the latency budget. AC-NEW-5 (operating envelope; thermal-throttle-driven covariance degradation hybrid) requires this path to exist and to switch per-frame within 1 frame of the thermal flag flipping. + +## Outcome + +- `src/gps_denied_onboard/components/c4_pose/opencv_gtsam_estimator.py` — REMOVE the `NotImplementedError` branch; ADD the Jacobian path. 
+- New private method `_estimate_marginals_path(...)` extracted from the existing Marginals body (refactor; no behaviour change). +- New private method `_estimate_jacobian_path(...)`: + 1. Run `cv2.solvePnPRansac` (same call as Marginals path) — pose extraction is identical. + 2. From the inlier 2D points + 3D world points + reprojected coordinates, compute the per-point reprojection residuals. + 3. Compute the PnP Jacobian `J` (2N×6: two reprojection-residual rows per inlier, six pose columns, so that `Jᵀ J` in step 5 is 6×6) at the converged pose using OpenCV's `cv2.projectPoints` with derivative output, whose first six `jacobian` columns are the derivatives with respect to `rvec` and `tvec` (or via `SE3Utils.numerical_jacobian` if OpenCV does not expose the analytical Jacobian for IPPE). + 4. Compute residual variance σ² as `mean(residuals²)` (isotropic noise model — documented as the simplification the Jacobian path makes per ADR-006). + 5. Compute the 6×6 information matrix `Λ = (1/σ²) · Jᵀ J`. + 6. Compute covariance `Σ = inv(Λ + ε·I)` with ε = 1e-9 (ridge regularisation for numerical stability when `Λ` is ill-conditioned). + 7. Verify SPD: `np.linalg.cholesky(Σ)` succeeds; on failure raise `PnpFailureError("non-SPD Jacobian covariance; numerical instability")` (defensive). + 8. Convert pose to WGS84 via `WgsConverter.local_to_wgs84(pose)` — SAME conversion path as Marginals. + 9. Assemble `PoseEstimate(..., covariance_mode = JACOBIAN, source_label = SATELLITE_ANCHORED)`. + 10. `self._last_covariance_mode = JACOBIAN`. + 11. **`warnings.warn(CovarianceDegradedWarning("Jacobian covariance engaged; thermal_throttle=true"), stacklevel=2)`** — first time per 60 s window; rate-limited via `_jacobian_warn_window` (timestamp of last emission). Subsequent invocations within the window are suppressed, i.e. `warnings.simplefilter("once")` semantics scoped to the window: only the first call emits. + 12. 
WARN log `kind="c4.pose.covariance_degraded"` with `{frame_id, thermal_state}` — emitted ONCE per 60 s window (rate-limited). + 13. FDR `pose.frame_done` record with `mode: "jacobian"`.
+- `estimate(...)` becomes a dispatcher: read `thermal_state.throttle` at entry; call `_estimate_marginals_path(...)` or `_estimate_jacobian_path(...)`. NO state buffering — strictly per-call. +- Note: the Jacobian path does NOT add a factor to C5's iSAM2 graph (per design — under throttle, the system runs lighter; C5 receives a `PoseEstimate` but no graph factor add). DOCUMENT this trade explicitly: under throttle, the iSAM2 graph stops growing; recovery happens automatically when the thermal flag flips back. + +> **Cross-task interaction with AZ-260 (C5)**. C5's iSAM2 update path needs to handle the case where C4 emits a `PoseEstimate` without a corresponding factor add (Jacobian path). C5's `add_pose_anchor(pose_estimate)` MUST inspect `pose_estimate.covariance_mode` and skip the graph-resync work for `JACOBIAN`. This requirement is captured in AZ-260's task spec; this task does NOT modify C5 — it only documents the requirement. + +## Scope + +### Included +- Refactor of existing `OpenCVGtsamPoseEstimator.estimate(...)` into a dispatcher + two private path methods. +- New `_estimate_jacobian_path(...)` implementation. +- Per-frame thermal-state-driven dispatch. +- `CovarianceDegradedWarning` emission via `warnings.warn` (NOT raise). +- 60 s rate-limiting on the WARN log AND the warnings.warn emission. +- SPD-invariant defensive check on Jacobian covariance. +- Removal of `NotImplementedError` from AZ-? Marginals task's body. +- Documentation update to the contract / description: explicit note that the Jacobian path does NOT add to C5's iSAM2 graph. +- Unit tests covering: thermal flag flip → mode switch within 1 frame; Jacobian covariance is SPD; Jacobian covariance produces accuracy WITHIN ~5–10% of Marginals on a synthetic baseline; `CovarianceDegradedWarning` emitted via `warnings.warn` not raise; rate-limiting of the warning + WARN log. + +### Excluded +- The Marginals path — already shipped by AZ-?. 
+- C5's iSAM2 update-path adjustment (`add_pose_anchor` mode inspection) — owned by AZ-260; this task only documents the requirement. +- The `ThermalState` source — owned by AZ-302. +- C4-IT-03 + C4-PT-01 component-internal acceptance tests — deferred to E-BBT (AZ-262); unit tests in this task cover the per-method invariants. + +## Acceptance Criteria + +**AC-1: Per-frame mode dispatch** +Given an alternating `thermal_state.throttle` sequence (False, True, False, True, ...) over 10 frames +When `estimate(...)` is called for each frame +Then `current_covariance_mode()` returns the correct mode after EACH call (no buffering, no hysteresis). + +**AC-2: Mode-switch latency ≤ 1 frame (Invariant 4)** +Given the thermal flag flips between two consecutive `estimate(...)` calls +When the second call runs +Then the new mode IS the new flag's mode (no carry-over from the previous call). + +**AC-3: Jacobian covariance is SPD** +Given a successful Jacobian path run +Then `np.linalg.cholesky(covariance_6x6)` succeeds; the matrix is symmetric to 1e-10 tolerance. + +**AC-4: `covariance_mode == JACOBIAN` on Jacobian path** +When the Jacobian path runs successfully +Then `PoseEstimate.covariance_mode == CovarianceMode.JACOBIAN` AND `current_covariance_mode() == JACOBIAN`. + +**AC-5: `source_label == SATELLITE_ANCHORED` on success regardless of path (Invariant 7)** +Both Marginals and Jacobian paths emit `SATELLITE_ANCHORED` on success. + +**AC-6: `CovarianceDegradedWarning` emitted via `warnings.warn`** +When the Jacobian path runs +Then `warnings.warn(CovarianceDegradedWarning("..."))` is called; verified via `warnings.catch_warnings()` test harness. The warning is NOT raised as an exception. + +**AC-7: `warnings.warn` rate-limited to ONE per 60 s window** +Given the Jacobian path runs at 3 Hz for 70 s (total 210 calls) +When the warnings emitted are counted +Then exactly 2 warnings are emitted (one for window 0–60 s, one for window 60–120 s). 
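The 60 s window in AC-7 can be illustrated with an explicit timestamp check and an injectable clock, which makes the 3 Hz / 70 s scenario testable without sleeping. This is a sketch under assumptions: `JacobianWarnLimiter` and its `clock` parameter are hypothetical names, and `CovarianceDegradedWarning` is redefined locally as a stand-in.

```python
import time
import warnings
from typing import Callable, Optional


class CovarianceDegradedWarning(UserWarning):
    """Local stand-in for the project's warning class (hypothetical definition)."""


class JacobianWarnLimiter:
    """Emit at most one CovarianceDegradedWarning per `window_s` seconds."""

    def __init__(self, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window_s = window_s
        self._clock = clock  # injectable so tests can drive time deterministically
        self._last_emit: Optional[float] = None

    def maybe_warn(self) -> bool:
        """Warn if the window has elapsed; return True when a warning was emitted."""
        now = self._clock()
        if self._last_emit is not None and now - self._last_emit < self._window_s:
            return False  # still inside the current window: stay silent
        self._last_emit = now
        warnings.warn(
            CovarianceDegradedWarning("Jacobian covariance engaged; thermal_throttle=true"),
            stacklevel=2,
        )
        return True
```

In a test, wrap the calls in `warnings.catch_warnings(record=True)` with `warnings.simplefilter("always")`, so that Python's own per-location deduplication does not hide emissions the limiter allowed.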
+ +**AC-8: WARN log rate-limited similarly** +Same scenario as AC-7 → exactly 2 WARN log records `kind="c4.pose.covariance_degraded"`. + +**AC-9: Marginals path unchanged by refactor** +Given a Marginals-only test fixture (already exercised in AZ-?) +When the AZ-? tests are re-run after this task's refactor +Then ALL AZ-? AC-1..AC-11 still pass without modification. + +**AC-10: Non-SPD Jacobian covariance defensive raise** +Given an inlier set producing a near-singular `JᵀJ` (synthetically degenerate) +When the Jacobian path runs +Then `PnpFailureError("non-SPD Jacobian covariance; numerical instability")` is raised; ERROR log emitted. + +**AC-11: Jacobian accuracy within ~5–10% of Marginals on synthetic baseline** +Given 100 synthetic frames AND ground-truth pose +When BOTH paths are run on the same input (test harness toggles `thermal_state.throttle`) +Then the Jacobian path's RMSE is within 1.10× the Marginals path's RMSE (10% tolerance per ADR-006). Informational; does NOT block this AC if the actual ratio is between 1.0 and 1.10. + +**AC-12: Jacobian path skips iSAM2 factor add** +Given the Jacobian path runs +Then NO `isam2_graph_handle.add_factor(...)` call AND NO `isam2_graph_handle.update()` call is made; only `last_anchor_age_ms()` is read (for the `last_satellite_anchor_age_ms` field). + +**AC-13: FDR `pose.frame_done` distinguishes path** +The `mode` field in the FDR record matches the path: `"marginals"` or `"jacobian"`. + +## Non-Functional Requirements + +**Performance** +- `_estimate_jacobian_path` p95 ≤ 15 ms target / 25 ms hard limit (per C4-PT-01). +- Mode-switch latency: zero — the dispatch is a single `if` at call entry. + +**Compatibility** +- OpenCV ≥ 4.12.0 (same as AZ-?). +- The Jacobian computation MUST work for both analytical (if OpenCV exposes it for IPPE) AND numerical (`SE3Utils.numerical_jacobian`) paths; choose analytical when available, fall back to numerical. 
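The covariance recovery in the Jacobian path (residual variance, information matrix, ridge-regularised inverse, SPD check) can be sketched with NumPy as below. The sketch assumes `J` is stacked as a 2N×6 matrix (two residual rows per inlier, six pose columns) so that `Jᵀ J` is 6×6. The function name, the floor on σ², and the final symmetrisation are assumptions for illustration, not part of the spec; `PnpFailureError` is a local stand-in.

```python
import numpy as np


class PnpFailureError(RuntimeError):
    """Local stand-in for the project's error type (hypothetical definition)."""


def jacobian_covariance(J: np.ndarray, residuals: np.ndarray,
                        eps: float = 1e-9) -> np.ndarray:
    """Return a 6x6 pose covariance from a (2N, 6) Jacobian and (2N,) residuals."""
    # Isotropic noise model: sigma^2 = mean(residuals^2).
    sigma2 = float(np.mean(residuals ** 2))
    # Assumption: floor sigma^2 so perfect synthetic data does not divide by zero.
    sigma2 = max(sigma2, 1e-12)
    # Information matrix: Lambda = (1 / sigma^2) * J^T J.
    info = (J.T @ J) / sigma2
    # Ridge-regularised inverse: Sigma = inv(Lambda + eps * I).
    cov = np.linalg.inv(info + eps * np.eye(6))
    # Assumption: remove rounding-level asymmetry before the SPD check.
    cov = 0.5 * (cov + cov.T)
    try:
        np.linalg.cholesky(cov)
    except np.linalg.LinAlgError as exc:
        raise PnpFailureError(
            "non-SPD Jacobian covariance; numerical instability"
        ) from exc
    return cov
```

The same function works whether `J` comes from an analytical derivative output or from a numerical fallback, since it only depends on the stacked matrix shape.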
+ +**Reliability** +- SPD-invariant defensive check covers the Jacobian path. +- `CovarianceDegradedWarning` rate-limiting prevents log flooding under sustained throttle. + +## Unit Tests + +| AC Ref | What to Test | Required Outcome | +|--------|--------------|------------------| +| AC-1 | Alternating thermal flag → alternating mode | Each call's mode matches the flag | +| AC-2 | Mode-switch within 1 frame | New mode on the next call | +| AC-3 | Jacobian SPD | Cholesky succeeds | +| AC-4 | Jacobian mode reporting | Both `PoseEstimate` and `current_covariance_mode()` | +| AC-5 | Source label always SATELLITE_ANCHORED | Both paths | +| AC-6 | `warnings.warn` not raise | Captured via `warnings.catch_warnings()` | +| AC-7 | Warning rate-limited | 2 warnings over 70 s | +| AC-8 | WARN log rate-limited | 2 records over 70 s | +| AC-9 | AZ-? tests still pass | Re-run all 11 ACs after refactor | +| AC-10 | Near-singular Jacobian → defensive raise | `PnpFailureError` | +| AC-11 | Jacobian within 10% of Marginals | RMSE ratio ≤ 1.10 | +| AC-12 | Jacobian skips iSAM2 add | No `add_factor`/`update` calls | +| AC-13 | FDR mode field | `"marginals"` or `"jacobian"` | + +## Constraints + +- **Single-threaded** by contract; same thread as C5 + the Marginals task. +- **No buffering** — mode dispatch is per-call; thermal flag is read at call entry. +- **Jacobian path skips iSAM2 factor add** — design choice per C5's degraded mode requirement. +- **`CovarianceDegradedWarning` is emitted via `warnings.warn`** — NEVER raised. +- **Rate-limiting is 60 s** for both the warning and the WARN log. +- **Refactor of Marginals path MUST be behaviour-preserving** — AZ-?'s tests pass unchanged (AC-9). + +## Risks & Mitigation + +**Risk 1: OpenCV may not expose an analytical Jacobian for `SOLVEPNP_IPPE`** +- *Mitigation*: implementation falls back to `SE3Utils.numerical_jacobian` (forward-difference, 1e-6 step). Numerical Jacobian costs ~5 ms additional but stays within the 15 ms target. 
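A forward-difference Jacobian with a 1e-6 step, as named in the mitigation above, might look like the sketch below. The real `SE3Utils.numerical_jacobian` signature is not given in this document, so this free function and its parameters are hypothetical; it also assumes additive perturbation of the parameter vector (for example, `rvec` and `tvec` concatenated).

```python
import numpy as np
from typing import Callable


def numerical_jacobian(f: Callable[[np.ndarray], np.ndarray],
                       x: np.ndarray, step: float = 1e-6) -> np.ndarray:
    """Forward-difference Jacobian of `f` at `x`; result has shape (len(f(x)), len(x))."""
    x = np.asarray(x, dtype=float)
    f0 = np.asarray(f(x), dtype=float).ravel()
    J = np.empty((f0.size, x.size))
    for k in range(x.size):
        xp = x.copy()
        xp[k] += step  # perturb one parameter at a time
        J[:, k] = (np.asarray(f(xp), dtype=float).ravel() - f0) / step
    return J
```

For a 6-parameter pose this costs seven evaluations of `f` (one baseline plus one per column), which is where the additional latency of the numerical path comes from.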
+ +**Risk 2: Jacobian-derived covariance is overly optimistic on hard frames** +- *Mitigation*: ADR-006 documents the ~5–10% accuracy loss; AC-11 verifies within tolerance. AC-NEW-5 (full operating envelope verification) belongs to E-BBT / NFT-LIM-04 — workstation-baseline portion only here. + +**Risk 3: The `_jacobian_warn_window` rate-limiting might miss a warning if the first frame after the 60 s rollover is a Marginals frame** +- *Mitigation*: window is reset on every Jacobian call; rollover boundaries are not special-cased. The intent is "at least one warning per active throttle window of duration ≥ 60 s", which is satisfied. + +**Risk 4: Refactoring AZ-?'s `estimate(...)` may introduce subtle behaviour changes** +- *Mitigation*: AC-9 mandates AZ-?'s full test suite passes unchanged. The refactor is mechanical (extract method); review hook ensures fidelity. + +## Runtime Completeness + +- **Named capability**: D-CROSS-LATENCY-1 hybrid — Jacobian fallback + per-frame thermal-driven mode switch. +- **Production code that must exist**: real Jacobian computation (analytical or numerical); real per-frame thermal-driven dispatch; real SPD defensive check; real `warnings.warn` emission with rate-limiting; real WARN log emission with rate-limiting; refactor of Marginals path into a private method; FDR record `mode` field distinguishes paths. +- **Allowed external stubs**: `FakeISam2GraphHandle`, `FakeWgsConverter`, `FakeSE3Utils`, `FakeFdrClient` — same as AZ-?. +- **Unacceptable substitutes**: a Marginals call with a "skip the cubic part" hack (would not deliver the latency budget); raising `CovarianceDegradedWarning` instead of emitting via `warnings.warn` (changes the user-facing semantics); skipping the rate limiter (would flood logs under sustained throttle); buffering thermal state across frames (would violate Invariant 4); emitting warnings inside the Marginals path "for symmetry" (no — the warning is documentation that the system is in degraded mode). 
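The per-frame dispatch with no buffered thermal state can be sketched as a single branch at call entry. Everything below is a simplified stand-in: the class name, the `frame` argument, and the dictionary return value are hypothetical, and the two path methods only record the mode instead of running PnP.

```python
from dataclasses import dataclass
from enum import Enum


class CovarianceMode(Enum):
    MARGINALS = "marginals"
    JACOBIAN = "jacobian"


@dataclass(frozen=True)
class ThermalState:
    throttle: bool


class DispatchingEstimator:
    """Toy estimator showing only the dispatch logic, not the real pose maths."""

    def __init__(self) -> None:
        self._last_covariance_mode = CovarianceMode.MARGINALS

    def estimate(self, frame, thermal_state: ThermalState) -> dict:
        # The flag is read once, at entry, on every call: no hysteresis, no carry-over.
        if thermal_state.throttle:
            return self._estimate_jacobian_path(frame)
        return self._estimate_marginals_path(frame)

    def _estimate_marginals_path(self, frame) -> dict:
        self._last_covariance_mode = CovarianceMode.MARGINALS
        return {"frame_id": frame, "mode": "marginals"}

    def _estimate_jacobian_path(self, frame) -> dict:
        self._last_covariance_mode = CovarianceMode.JACOBIAN
        return {"frame_id": frame, "mode": "jacobian"}

    def current_covariance_mode(self) -> CovarianceMode:
        return self._last_covariance_mode
```

Because the only state written is the mode of the call that just ran, an alternating flag sequence yields an alternating mode sequence, which is the behaviour AC-1 and AC-2 describe.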
+ +## Contract + +This task implements the Jacobian / hybrid portion of `_docs/02_document/contracts/c4_pose/pose_estimator_protocol.md`. +After this task lands, the contract's invariants (especially Invariants 3, 4, 6, 9) are FULLY exercised. diff --git a/_docs/02_tasks/todo/AZ-381_c5_state_protocol.md b/_docs/02_tasks/todo/AZ-381_c5_state_protocol.md new file mode 100644 index 0000000..13d7d24 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-381_c5_state_protocol.md @@ -0,0 +1,105 @@ +# C5 StateEstimator Protocol + Factory + DTOs + Composition + Concrete ISam2GraphHandle + +**Task**: AZ-381_c5_state_protocol +**Name**: C5 `StateEstimator` Protocol + Factory + DTOs + Composition + concrete `ISam2GraphHandle` +**Description**: Define the public `StateEstimator` Protocol (PEP 544 `@runtime_checkable`), the C5 DTOs (`EstimatorOutput`, `EstimatorHealth`, `IsamState` enum), the error hierarchy (`StateEstimatorError`, `EstimatorDegradedError`, `EstimatorFatalError`, `StateEstimatorConfigError`), the composition-root factory `build_state_estimator(...) -> tuple[StateEstimator, ISam2GraphHandle]`, AND the CONCRETE `ISam2GraphHandle` implementation extending the AZ-355 Protocol stub with `add_factor`/`update`/`compute_marginals`/`last_anchor_age_ms` methods. The handle is constructed alongside the iSAM2 graph (initially empty here; populated by AZ-382 iSAM2 wiring task) and passed by reference to C4 via the runtime root. Strategy resolution per ADR-002 with `BUILD_STATE_` gating. Shared helpers (`ImuPreintegrator` AZ-276, `SE3Utils` AZ-277, `WgsConverter` AZ-279) constructor-injected. Config schema extension for `state.{strategy, keyframe_window_size, spoof_promotion_min_stable_s, spoof_promotion_visual_consistency_tol_m, no_estimate_fallback_s}`. No iSAM2 graph internals or factor-add logic in scope here. 
+**Complexity**: 3 points +**Dependencies**: AZ-263, AZ-269, AZ-270, AZ-276 (`ImuPreintegrator`), AZ-277 (`SE3Utils`), AZ-279 (`WgsConverter`), AZ-273 (`FdrClient`), AZ-355 (C4's `ISam2GraphHandle` Protocol stub — extended here), AZ-266 +**Component**: c5_state (epic AZ-260 / E-C5) +**Tracker**: AZ-381 +**Epic**: AZ-260 (E-C5) + +### Document Dependencies + +- `_docs/02_document/contracts/c5_state/state_estimator_protocol.md` — the public contract this task implements. +- `_docs/02_document/components/07_c5_state/description.md` — § 1, § 2, § 5 error handling, § 9 logging. +- `_docs/02_document/architecture.md` — ADR-001, ADR-002, ADR-003, ADR-009. +- `_docs/02_document/contracts/c4_pose/pose_estimator_protocol.md` — `ISam2GraphHandle` Protocol stub source. +- `_docs/02_document/module-layout.md` — `c5_state` Per-Component Mapping. + +## Problem + +Without this task, C4 has no concrete `ISam2GraphHandle` to inject (only the Protocol stub from AZ-355) — meaning the runtime root cannot wire C4 + C5 together. The DTO surface (`EstimatorOutput`, `EstimatorHealth`) is also consumed by C8, C13, and the orthorectifier — defining it once in `_types/state.py` prevents drift. The eight downstream consumer tasks (iSAM2 wiring, factor adds, marginals, spoof gate, ESKF, smoothed history, AC-5.2, orthorectifier) depend on the Protocol surface + the handle being available. + +## Outcome + +- `src/gps_denied_onboard/components/c5_state/interface.py` — `StateEstimator` Protocol with all 6 methods. +- `src/gps_denied_onboard/components/c5_state/__init__.py` — re-exports `StateEstimator`, `EstimatorOutput`, `EstimatorHealth`. +- `src/gps_denied_onboard/_types/state.py` — `EstimatorOutput`, `EstimatorHealth`, `IsamState` enum (frozen + slots). +- `src/gps_denied_onboard/components/c5_state/errors.py` — error hierarchy. +- `src/gps_denied_onboard/components/c5_state/_isam2_handle.py` — concrete `ISam2GraphHandleImpl(ISam2GraphHandle)` class with all four methods. 
Body: empty stubs that raise `NotImplementedError("iSAM2 wiring task owns this body")` until AZ-382 lands. Each method's `NotImplementedError` message names the responsible task ID for traceability. +- `src/gps_denied_onboard/runtime_root/state_factory.py` — `build_state_estimator(...)` returning the tuple. Lazy-import per ADR-002. +- Composition-root extension: invoke `build_state_estimator` AFTER the shared helpers; pass the returned `ISam2GraphHandle` to `build_pose_estimator` (C4); bind C4 + C5 to the SAME ingest thread. +- Config schema extension for the five `state.*` fields. +- INFO log on successful build: `kind="c5.state.strategy_loaded"`. + +## Scope + +### Included +- `StateEstimator` Protocol with 6 methods. +- DTOs (`EstimatorOutput`, `EstimatorHealth`, `IsamState`) in `_types/state.py`. +- Error hierarchy. +- Concrete `ISam2GraphHandleImpl` skeleton (body owned by AZ-382 iSAM2 wiring task). +- Composition-root factory + thread binding. +- Config schema extension. +- Unit tests: Protocol conformance, DTO immutability + slots, factory rejection on unknown strategy + missing build flag, `ISam2GraphHandleImpl` methods exist (each raises `NotImplementedError`), thread binding. + +### Excluded +- iSAM2 + `IncrementalFixedLagSmoother` body — owned by AZ-382 (next task). +- Factor adds (VIO + Pose + IMU) — owned by AZ-383. +- Marginals + outputs — owned by AZ-384. +- Source-label state machine + spoof gate — owned by AZ-385. +- ESKF baseline — owned by AZ-386. +- Smoothed-history → FDR — owned by AZ-387. +- AC-5.2 fallback — owned by AZ-388. +- Orthorectifier sub-path — owned by AZ-389. +- Component-internal acceptance tests C5-IT-01..07 + C5-PT-01 + C5-ST-01 — deferred to E-BBT (AZ-262). + +## Acceptance Criteria + +**AC-1: Protocol conformance** — `runtime_checkable` `isinstance` returns True for a fake with all 6 methods. + +**AC-2: DTOs frozen + slots** — `FrozenInstanceError` on mutation; `__slots__` non-empty.
+ +**AC-3: `IsamState` enum has 4 values** — `INIT`, `TRACKING`, `DEGRADED`, `LOST`. + +**AC-4: Factory rejects missing build flag** — `config.state.strategy = "nonexistent"` → `StateEstimatorConfigError("BUILD_STATE_NONEXISTENT is OFF...")`. + +**AC-5: Factory rejects unknown strategy at config-load** — `config.state.strategy = "garbage"` → `StateEstimatorConfigError` at config load. + +**AC-6: Factory returns the tuple** — both `StateEstimator` AND `ISam2GraphHandle` are returned from a successful build; INFO log with `{strategy, keyframe_window_size}`. + +**AC-7: Thread binding** — composition root binds C5 to ONE ingest thread (the same as C4); second binding raises `RuntimeError`. + +**AC-8: `ISam2GraphHandleImpl` skeleton** — instance is `isinstance(handle, ISam2GraphHandle)`; calling `add_factor`, `update`, `compute_marginals`, `last_anchor_age_ms` each raises `NotImplementedError(f"Body owned by ...")` with the correct task ID in the message. + +**AC-9: Public API re-exports** — `from gps_denied_onboard.components.c5_state import StateEstimator, EstimatorOutput, EstimatorHealth` resolves; internals not in `__all__`. + +**AC-10: Error hierarchy catchability** — every error caught by `except StateEstimatorError`. + +## Non-Functional Requirements + +- `build_state_estimator` p99 ≤ 50 ms. + +## Constraints + +- `@runtime_checkable` on Protocol; DTOs `frozen=True, slots=True`. +- Lazy-import per ADR-002. +- Single-thread binding enforced (AC-7). +- The `ISam2GraphHandleImpl` skeleton's `NotImplementedError` messages MUST name the responsible task ID — AZ-382 iSAM2 wiring is the receiver. + +## Risks & Mitigation + +- **Risk**: AZ-382 iSAM2 task lands before this task → cycle. *Mitigation*: this task ships first; AZ-382 imports `ISam2GraphHandleImpl` and replaces method bodies. +- **Risk**: AZ-355 stub Protocol may differ slightly from AZ-358's extension. 
*Mitigation*: this task verifies isinstance against the FINAL Protocol shape (post-AZ-358 extension) — both AZ-358 and this task update the Protocol stub in lockstep. + +## Runtime Completeness + +- **Named capability**: `StateEstimator` Protocol + DTOs + factory + concrete `ISam2GraphHandle` skeleton. +- **Production code**: real Protocol, real DTOs, real error hierarchy, real factory, real `ISam2GraphHandleImpl` skeleton with `NotImplementedError` bodies, real composition wiring. +- **Allowed external stubs**: test fakes only. +- **Unacceptable substitutes**: hardcoding the C5 strategy class in C4's factory (defeats ADR-009); skipping the concrete `ISam2GraphHandleImpl` (would force AZ-382 iSAM2 wiring to also reshape Protocol). + +## Contract + +Implements `_docs/02_document/contracts/c5_state/state_estimator_protocol.md`. diff --git a/_docs/02_tasks/todo/AZ-382_c5_isam2_smoother_wiring.md b/_docs/02_tasks/todo/AZ-382_c5_isam2_smoother_wiring.md new file mode 100644 index 0000000..faab104 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-382_c5_isam2_smoother_wiring.md @@ -0,0 +1,101 @@ +# C5 GtsamIsam2StateEstimator — iSAM2 + IncrementalFixedLagSmoother K=10–20 wiring + +**Task**: AZ-382_c5_isam2_smoother_wiring +**Name**: C5 `GtsamIsam2StateEstimator` skeleton — iSAM2 + IncrementalFixedLagSmoother (K=10–20) wiring +**Description**: Implement the `GtsamIsam2StateEstimator` class skeleton with the GTSAM iSAM2 + `gtsam_unstable.IncrementalFixedLagSmoother` lifecycle: graph + `Values` containers; key-management policy (`gtsam.symbol('x', frame_id_int)` for poses, `'b'` for bias, `'v'` for velocity); window size K from `config.state.keyframe_window_size` (default 15; D-C5-3 K=10–20). REPLACE the `NotImplementedError` skeleton bodies in `_isam2_handle.py` (from AZ-381) with the actual `add_factor`/`update`/`compute_marginals`/`last_anchor_age_ms` implementations against this estimator's iSAM2 graph. 
The estimator's `add_vio`/`add_pose_anchor`/`add_fc_imu` methods still raise `NotImplementedError("Factor adds owned by AZ-383")` until the next task lands. `current_estimate`/`smoothed_history`/`health_snapshot` similarly. This task delivers the foundational graph + handle wiring on which all subsequent C5 tasks depend. +**Complexity**: 5 points +**Dependencies**: AZ-381 (Protocol + factory + handle skeleton), AZ-263, AZ-269, AZ-266, AZ-272 (FDR record schema) +**Component**: c5_state (epic AZ-260 / E-C5) +**Tracker**: AZ-382 +**Epic**: AZ-260 (E-C5) + +### Document Dependencies + +- `_docs/02_document/contracts/c5_state/state_estimator_protocol.md` — Protocol surface. +- `_docs/02_document/components/07_c5_state/description.md` — § 5 (iSAM2 + IncrementalFixedLagSmoother dependencies); § 7 (single-writer thread; missing-key silent failure mitigation). +- `_docs/02_document/architecture.md` — ADR-003 (shared GTSAM substrate); D-C5-3 (K=10–20 keyframe window). + +## Problem + +Without this task, the GTSAM iSAM2 graph does not exist; C4's `add_factor` calls have no target; `Marginals` cannot be computed; the `ISam2GraphHandleImpl` skeleton's `NotImplementedError` bodies would block all C4 wiring. Every subsequent C5 task (factor adds, marginals, source-label, smoothed-history) needs the graph + handle to be real. + +## Outcome + +- `src/gps_denied_onboard/components/c5_state/gtsam_isam2_estimator.py` defining: + - `GtsamIsam2StateEstimator` class implementing `StateEstimator`. + - Constructor: `__init__(self, config, imu_preintegrator, se3_utils, wgs_converter, fdr_client)`. + - Internal: `self._isam2 = gtsam.ISAM2(parameters)`, `self._smoother = gtsam.IncrementalFixedLagSmoother(K * frame_period_s)`, `self._graph = gtsam.NonlinearFactorGraph()`, `self._values = gtsam.Values()`, `self._key_for_frame: dict[UUID, int]`. + - Module-level `create(config, imu_preintegrator, se3_utils, wgs_converter, fdr_client) -> StateEstimator` factory function. 
+ - `add_*` methods raise `NotImplementedError("Factor adds owned by AZ-383")` (next task). + - `current_estimate`/`smoothed_history`/`health_snapshot` raise `NotImplementedError("Marginals + outputs owned by AZ-384")`. +- Replace bodies of `ISam2GraphHandleImpl` in `_isam2_handle.py` (from AZ-381) — REAL `add_factor` (calls `self._graph.add(factor)`), REAL `update` (calls `self._isam2.update(graph, values)` + `self._smoother.update(graph, values, timestamps)`), REAL `compute_marginals` (returns `gtsam.Marginals(graph, values)`), REAL `last_anchor_age_ms` (tracks `last_anchor_monotonic_ns` updated by AZ-383 when a satellite-anchored pose is added). +- DEBUG log on construction: `kind="c5.state.isam2_initialised"` with `{keyframe_window_size, total_factors_initial: 0}`. +- Defensive logging: every internal mutation of `_isam2`/`_smoother` MUST be wrapped in success/failure logging per § 7 (R05 mitigation: "every add logs success/false"). + +## Scope + +### Included +- `GtsamIsam2StateEstimator` class scaffold. +- iSAM2 + `IncrementalFixedLagSmoother` initialisation; key-management policy. +- `ISam2GraphHandleImpl` body replacement (the four real methods). +- Defensive success/failure logging for every iSAM2 / smoother mutation (R05 mitigation). +- Config schema check: `keyframe_window_size` in [10, 20]. +- Composition-root invocation path: factory `create(...)` returns the estimator; `state_factory` extracts the `_isam2_handle` reference. +- Unit tests: graph + values construction; handle method bodies reachable; key-management policy assigns unique keys per frame; defensive logging emits on every mutation. + +### Excluded +- Factor adds (VIO/Pose/IMU) — owned by AZ-383. +- Marginals + outputs — owned by AZ-384. +- Source-label + spoof gate — owned by AZ-385. +- AC-5.2 fallback — owned by AZ-388. +- ESKF — owned by AZ-386.
+- Smoothed history → FDR path body (the FDR write itself is owned by the smoothed-history task, AZ-387; this task only ensures `_smoother` is constructed). +- C5-IT/PT/ST tests — deferred to E-BBT. + +## Acceptance Criteria + +**AC-1: Construction** — `GtsamIsam2StateEstimator(...)` instantiates without error; `_isam2`, `_smoother`, `_graph`, `_values`, `_key_for_frame` are all initialised. + +**AC-2: Key-management policy** — frame-IDs map to unique GTSAM keys via `gtsam.symbol('x', counter)`; `_key_for_frame` is consulted before assigning new keys. + +**AC-3: Window size respected** — `IncrementalFixedLagSmoother` instantiated with `K * frame_period_s` (e.g., K=15 × 0.333 s ≈ 5 s window). + +**AC-4: Window size validation** — `keyframe_window_size = 5` (out of [10, 20] range) → `StateEstimatorConfigError`. + +**AC-5: `ISam2GraphHandleImpl.add_factor` real body** — calls `self._estimator._graph.add(factor)`; success logged; failure logged + raises `EstimatorDegradedError`. + +**AC-6: `ISam2GraphHandleImpl.update` real body** — calls `self._estimator._isam2.update(graph, values)` AND `self._estimator._smoother.update(...)`; success logged; failure logged + raises `EstimatorFatalError`. + +**AC-7: `ISam2GraphHandleImpl.compute_marginals` real body** — returns `gtsam.Marginals(self._estimator._isam2.getFactorsUnsafe(), self._estimator._isam2.getCurrentEstimate())`. + +**AC-8: `ISam2GraphHandleImpl.last_anchor_age_ms` real body** — returns `(monotonic_ns() - self._estimator._last_anchor_ns) // 1_000_000` (integer milliseconds). Initialised to 0 (no anchor yet). + +**AC-9: Defensive logging on every mutation** — `add_factor`/`update` log SUCCESS or FAILURE with structured fields (R05 mitigation). + +**AC-10: `add_*` and `current_estimate` raise `NotImplementedError`** — with messages `"Factor adds owned by AZ-383"` / `"Marginals + outputs owned by AZ-384"`. + +## Non-Functional Requirements + +- Construction p99 ≤ 100 ms. +- `add_factor` p99 ≤ 1 ms (just appends to local graph).
+- `update` p99 ≤ 30 ms steady state. +- `compute_marginals` p99 ≤ 60 ms (the dominant per-frame cost). + +## Constraints + +- Single-writer thread. +- iSAM2 + smoother are GTSAM-pinned (Plan-phase); both live in the same `gtsam` namespace. +- Defensive logging is mandatory (R05 — silent factor-add failure mitigation). +- `keyframe_window_size` MUST be in [10, 20] per D-C5-3. + +## Risks & Mitigation + +- **Risk: GTSAM `IncrementalFixedLagSmoother` API differs across pin versions** — Plan-phase pins lock the version. +- **Risk: Missing-key silent failure (R05)** — defensive logging on every mutation; AC-9 enforces. +- **Risk: `_isam2_handle.py` body replacement creates a merge-conflict with AZ-381** — AZ-381 skeleton ships first; this task strictly replaces method bodies; no Protocol changes. + +## Runtime Completeness + +- **Named capability**: `GtsamIsam2StateEstimator` skeleton + concrete `ISam2GraphHandleImpl` bodies. +- **Production code**: real iSAM2 + smoother construction; real handle method bodies that mutate the graph. +- **Unacceptable substitutes**: a `MockGraph` in production code; suppressing the defensive logs (R05). diff --git a/_docs/02_tasks/todo/AZ-383_c5_factor_adds.md b/_docs/02_tasks/todo/AZ-383_c5_factor_adds.md new file mode 100644 index 0000000..d2511dc --- /dev/null +++ b/_docs/02_tasks/todo/AZ-383_c5_factor_adds.md @@ -0,0 +1,87 @@ +# C5 GtsamIsam2StateEstimator — VIO + Pose + IMU factor adds + +**Task**: AZ-383_c5_factor_adds +**Name**: C5 `GtsamIsam2StateEstimator` — `add_vio` / `add_pose_anchor` / `add_fc_imu` factor add bodies +**Description**: Implement the three `add_*` factor add bodies on `GtsamIsam2StateEstimator`. `add_vio(VioOutput)` → `BetweenFactorPose3` between consecutive pose keys with VIO-derived noise model. 
`add_pose_anchor(PoseEstimate)` → if `pose.covariance_mode == MARGINALS`: `PriorFactorPose3` (or equivalent prior factor) on the pose key; if `JACOBIAN`: skip iSAM2 add (per AZ-361 contract — the running estimate is updated but the graph stops growing under throttle). `add_fc_imu(ImuWindow)` → `CombinedImuFactor` using the shared `ImuPreintegrator` (AZ-276) for the preintegrated measurement. Out-of-order timestamps → `EstimatorDegradedError`. Each factor add is wrapped in success/failure logging (R05). Updates `_last_anchor_ns` when a satellite-anchored pose is added (consumed by `last_anchor_age_ms`). +**Complexity**: 5 points +**Dependencies**: AZ-382 (graph + handle), AZ-381 (DTOs + handle), AZ-276 (`ImuPreintegrator`), AZ-358 (`PoseEstimate.covariance_mode` semantics), AZ-263, AZ-269, AZ-266, AZ-272 +**Component**: c5_state (epic AZ-260 / E-C5) +**Tracker**: AZ-383 +**Epic**: AZ-260 (E-C5) + +### Document Dependencies + +- `_docs/02_document/contracts/c5_state/state_estimator_protocol.md` — Invariants 2 (timestamp ordering), 3 (mode dispatch). +- `_docs/02_document/components/07_c5_state/description.md` — § 5 error handling. +- `_docs/02_document/contracts/shared_helpers/imu_preintegrator.md` — `preintegrate(window) -> CombinedImuFactor`. +- `_docs/02_document/contracts/c4_pose/pose_estimator_protocol.md` — `covariance_mode` semantics. + +## Problem + +Without this task, the iSAM2 graph stays empty; no factors are added; `current_estimate()` (next task) cannot recover any meaningful state. + +## Outcome + +- `add_vio(vio: VioOutput)`: build `BetweenFactorPose3(prev_pose_key, curr_pose_key, vio.relative_pose, noise_model)` from `vio.covariance_relative_pose`; append via `_isam2_handle.add_factor(...)`; log SUCCESS or FAILURE. +- `add_pose_anchor(pose: PoseEstimate)`: + - If `pose.covariance_mode == MARGINALS`: build `PriorFactorPose3(pose_key, pose.position+orientation, noise_model_from_cov)`; `add_factor` + `update`.
Set `_last_anchor_ns = pose.emitted_at`. + - If `pose.covariance_mode == JACOBIAN`: skip iSAM2 factor add; INFO log `kind="c5.state.skip_isam2_jacobian_path"`. Set `_last_anchor_ns = pose.emitted_at` (still counts as a recent anchor for AC-1.3 binning). +- `add_fc_imu(imu_window: ImuWindow)`: `cif = imu_preintegrator.preintegrate(imu_window) -> CombinedImuFactor`; `add_factor(cif)`. +- Timestamp ordering: store `_last_added_timestamp_ns`; reject any `add_*` call whose timestamp is < `_last_added_timestamp_ns` with `EstimatorDegradedError`. +- After every successful `add_*` that adds a factor (i.e. every path except the JACOBIAN skip), trigger `_isam2_handle.update()` per the contract. + +## Scope + +### Included +- All three `add_*` method bodies. +- Mode-dispatch in `add_pose_anchor` per AZ-361 cross-task contract. +- Timestamp-ordering enforcement (Invariant 2). +- Defensive logging on every factor add (R05). +- Unit tests: VIO factor adds against a stub graph; pose-anchor MARGINALS path adds factor + updates `_last_anchor_ns`; pose-anchor JACOBIAN path SKIPS factor add but updates `_last_anchor_ns`; IMU factor adds via `ImuPreintegrator`; out-of-order timestamps raise `EstimatorDegradedError`; success/failure logging fires. + +### Excluded +- `current_estimate` / `smoothed_history` / `health_snapshot` — owned by the next task (AZ-384). +- Source-label / spoof gate — owned by AZ-385. +- AC-5.2 fallback — owned by AZ-388. + +## Acceptance Criteria + +**AC-1: VIO factor add** — Stub `_isam2_handle.add_factor` records calls; AC asserts a `BetweenFactorPose3` is added with the correct keys and noise model. + +**AC-2: Pose-anchor MARGINALS path** — `add_pose_anchor(pose)` with `covariance_mode = MARGINALS` adds a `PriorFactorPose3`; `_isam2_handle.update()` is called; `_last_anchor_ns = pose.emitted_at`. + +**AC-3: Pose-anchor JACOBIAN path skips factor add** — no `add_factor`/`update` call; INFO log emitted; `_last_anchor_ns` STILL updated.
+ +**AC-4: IMU factor add via `ImuPreintegrator`** — `imu_preintegrator.preintegrate.assert_called_once_with(imu_window)`; the returned `CombinedImuFactor` is added. + +**AC-5: Timestamp ordering** — out-of-order `add_*` raises `EstimatorDegradedError`; ERROR log; FDR record. + +**AC-6: Defensive logging on every factor add (R05)** — SUCCESS log on every successful add; FAILURE log + `EstimatorDegradedError` on iSAM2 failure. + +**AC-7: `add_*` triggers `update()` exactly once per call** — verified via stub; the JACOBIAN skip path (AC-3) is the one exception and triggers none. + +**AC-8: `_last_anchor_ns` accuracy** — after a satellite-anchored pose at t0, `last_anchor_age_ms()` returns `(t1 - t0) / 1e6` at later time t1. + +## Non-Functional Requirements + +- `add_vio` p99 ≤ 5 ms (factor build + add). +- `add_pose_anchor` p99 ≤ 30 ms (incl. iSAM2 update). +- `add_fc_imu` p99 ≤ 10 ms (incl. preintegration via `ImuPreintegrator`). + +## Constraints + +- Single-writer thread (Invariant 1). +- Mode-dispatch in `add_pose_anchor` is mandatory (cross-task contract with AZ-361). +- Timestamp ordering is enforced; out-of-order is `EstimatorDegradedError`. +- Defensive logging is mandatory (R05). + +## Risks & Mitigation + +- **Risk: `gtsam.PriorFactorPose3` API changed in pinned version** — verify against pinned GTSAM during impl. +- **Risk: Mode dispatch oversight** — explicit AC (AC-3) verifies the JACOBIAN path skips the iSAM2 add. + +## Runtime Completeness + +- **Named capability**: VIO / Pose / IMU factor add path. +- **Production code**: real GTSAM factor types, real `ImuPreintegrator.preintegrate`, real handle update calls. +- **Unacceptable substitutes**: stub factor types in production; omitting the JACOBIAN-path skip (would corrupt the graph under throttle).
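The mode dispatch and timestamp-ordering rules above can be illustrated with a small, GTSAM-free sketch. Everything here is an assumption made for illustration: `PoseStub`, `RecordingHandle` and `AnchorDispatchSketch` are hypothetical stand-ins, the tuple `("prior", ...)` takes the place of a real `PriorFactorPose3`, and logging is omitted. It is not the estimator this task delivers.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class CovarianceMode(Enum):
    MARGINALS = auto()
    JACOBIAN = auto()


class EstimatorDegradedError(RuntimeError):
    """Stand-in for the error type named in the task spec."""


@dataclass
class PoseStub:
    """Hypothetical minimal PoseEstimate: only the fields the dispatch needs."""
    emitted_at: int  # monotonic nanoseconds
    covariance_mode: CovarianceMode


@dataclass
class RecordingHandle:
    """Hypothetical stub for the iSAM2 handle; records calls instead of mutating a graph."""
    calls: list = field(default_factory=list)

    def add_factor(self, factor):
        self.calls.append(("add_factor", factor))

    def update(self):
        self.calls.append(("update",))


class AnchorDispatchSketch:
    def __init__(self, handle):
        self._handle = handle
        self._last_added_timestamp_ns = 0
        self._last_anchor_ns = 0

    def add_pose_anchor(self, pose):
        # Invariant 2: reject anything older than the last accepted add.
        if pose.emitted_at < self._last_added_timestamp_ns:
            raise EstimatorDegradedError("out-of-order pose anchor")
        if pose.covariance_mode is CovarianceMode.MARGINALS:
            # A real implementation would build a prior factor here.
            self._handle.add_factor(("prior", pose.emitted_at))
            self._handle.update()
        # JACOBIAN path: no factor add and no update, but the anchor still counts.
        self._last_added_timestamp_ns = pose.emitted_at
        self._last_anchor_ns = pose.emitted_at
```

With a recording stub like this, AC-2, AC-3 and AC-5 reduce to assertions on `handle.calls` and on `_last_anchor_ns`.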
diff --git a/_docs/02_tasks/todo/AZ-384_c5_marginals_outputs.md b/_docs/02_tasks/todo/AZ-384_c5_marginals_outputs.md new file mode 100644 index 0000000..7a3d9c8 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-384_c5_marginals_outputs.md @@ -0,0 +1,88 @@ +# C5 GtsamIsam2StateEstimator — Marginals + current_estimate / smoothed_history / health_snapshot + +**Task**: AZ-384_c5_marginals_outputs +**Name**: C5 `GtsamIsam2StateEstimator` — Marginals + output methods +**Description**: Implement `current_estimate()`, `smoothed_history(n)`, and `health_snapshot()` on `GtsamIsam2StateEstimator`. `current_estimate()`: get the current pose key from the most-recent frame; recover the 6×6 covariance via `_isam2_handle.compute_marginals().marginalCovariance(pose_key)`; convert local-tangent-plane pose to WGS84 via `WgsConverter`; assemble `EstimatorOutput(smoothed=False, source_label = , last_satellite_anchor_age_ms = handle.last_anchor_age_ms())`. `smoothed_history(n)`: return up to `min(n, K)` smoothed past keyframes from `IncrementalFixedLagSmoother.calculateEstimate()` projection; each entry has `smoothed=True`. `health_snapshot()`: report `IsamState` (INIT/TRACKING/DEGRADED/LOST) based on convergence quality; `keyframe_count = len(_smoother.timestamps())`; `cov_norm_growing_for_s` (rolling counter incremented when frame-to-frame cov norm rises monotonically per AC-NEW-8); `spoof_promotion_blocked` (queries the source-label state machine — owned by AZ-385; this task introduces a stub that returns False until AZ-385 lands). SPD-invariant defensive check on every emitted covariance. 
+**Complexity**: 3 points +**Dependencies**: AZ-383 (graph populated with factors), AZ-382 / AZ-381 (handle + Protocol), AZ-279 (`WgsConverter`), AZ-277 (`SE3Utils`), AZ-263, AZ-269, AZ-266, AZ-272 +**Component**: c5_state (epic AZ-260 / E-C5) +**Tracker**: AZ-384 +**Epic**: AZ-260 (E-C5) + +### Document Dependencies + +- `_docs/02_document/contracts/c5_state/state_estimator_protocol.md` — Invariants 4 (current_estimate fresh), 6/7 (smoothed history bounded + flagged), 10 (SPD). +- `_docs/02_document/components/07_c5_state/description.md` — § 2 outputs; § 7 Marginals dominant cost. + +## Problem + +Without this task, the system has no way to read the posterior pose; downstream C8 cannot emit FC corrections; FDR has no smoothed history. + +## Outcome + +- `current_estimate()` body: get current pose + Marginals 6×6; WGS84-convert via helper; assemble `EstimatorOutput`. +- `smoothed_history(n)` body: iterate smoother's active keyframes; build `EstimatorOutput(smoothed=True)` for each. +- `health_snapshot()` body: `IsamState` derivation + `keyframe_count` + `cov_norm_growing_for_s` rolling counter + `spoof_promotion_blocked` (from injected source-label state machine; default `False` until AZ-385). +- `_cov_norm_window` private rolling-window counter (60 s lazy-pruned) for AC-NEW-8 monotonicity check. +- SPD-invariant defensive check before every `EstimatorOutput` emission; on failure raise `EstimatorFatalError`. +- Constructor extension: optional `source_label_state_machine` arg (default `None`; AZ-385 wires it up). + +## Scope + +### Included +- All three method bodies. +- `_cov_norm_window` rolling-window counter. +- SPD defensive check. +- Source-label state machine injection point (Optional, default None). +- Unit tests: synthetic graph with known pose → `current_estimate()` returns expected pose+covariance; `smoothed_history(20)` bounded by K=15; SPD invariant; `IsamState` derivation; `cov_norm_growing_for_s` monotonicity counter accuracy.
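As an illustration of the SPD defensive check listed above, here is a dependency-free sketch. The acceptance criteria name `np.linalg.cholesky` for the real check; this version spells out the same Cholesky test in plain Python so the failure condition is visible. The function name `assert_spd` and the symmetry tolerance are assumptions, not part of the contract.

```python
import math


class EstimatorFatalError(RuntimeError):
    """Stand-in for the error type named in the task spec."""


def assert_spd(cov, sym_tol=1e-9):
    """Raise EstimatorFatalError unless `cov` (list of lists) is symmetric positive definite.

    Returns the lower-triangular Cholesky factor on success.
    """
    n = len(cov)
    for i in range(n):
        if len(cov[i]) != n:
            raise EstimatorFatalError("covariance is not square")
        for j in range(i):
            if abs(cov[i][j] - cov[j][i]) > sym_tol:
                raise EstimatorFatalError("covariance is not symmetric")
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = cov[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                # `not acc > 0.0` also rejects NaN, which `acc <= 0.0` would let through.
                if not acc > 0.0:
                    raise EstimatorFatalError("covariance is not positive definite")
                lower[i][j] = math.sqrt(acc)
            else:
                lower[i][j] = acc / lower[j][j]
    return lower
```

A caller would run this on the 6×6 block immediately before constructing each `EstimatorOutput`, so a non-SPD matrix never leaves the estimator.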
+ +### Excluded +- Source-label state machine impl — owned by AZ-385. +- Spoof gate logic body — owned by AZ-385. +- AC-5.2 fallback — owned by AZ-388. +- ESKF baseline — owned by AZ-386. + +## Acceptance Criteria + +**AC-1: `current_estimate` returns fresh `EstimatorOutput`** — every call returns a new instance with `smoothed=False`. + +**AC-2: SPD covariance** — `np.linalg.cholesky(out.covariance_6x6)` succeeds for every emitted output; non-SPD raises `EstimatorFatalError`. + +**AC-3: WGS84 conversion** — uses shared `WgsConverter`; output matches helper test vectors. + +**AC-4: `smoothed_history(n)` bounded by K** — `len(smoothed_history(100)) <= K` (K=15); each has `smoothed=True`. + +**AC-5: `current_estimate` has `smoothed=False`** — distinguishes from history. + +**AC-6: `health_snapshot.isam2_state` matches convergence quality** — INIT before first factor; TRACKING after; DEGRADED on inflated cov; LOST on `EstimatorFatalError`. + +**AC-7: `keyframe_count` accuracy** — matches `IncrementalFixedLagSmoother.timestamps().size()`. + +**AC-8: `cov_norm_growing_for_s`** — increments while consecutive frames show monotone-rising cov norm; resets to 0 on a non-rising frame. + +**AC-9: `spoof_promotion_blocked` via injected state machine** — queries `source_label_state_machine.is_spoof_promotion_blocked()`; default `False` if no state machine wired. + +**AC-10: `last_satellite_anchor_age_ms` pass-through** — every `EstimatorOutput.last_satellite_anchor_age_ms == handle.last_anchor_age_ms()`. + +## Non-Functional Requirements + +- `current_estimate` p95 ≤ 60 ms (Marginals dominant). +- `smoothed_history(K)` p99 ≤ 20 ms. +- `health_snapshot` p99 ≤ 5 µs (O(1) accumulator reads). + +## Constraints + +- Single-writer thread. +- SPD defensive check is mandatory. +- `WgsConverter` use is mandatory (no inline math).
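The `cov_norm_growing_for_s` semantics from AC-8 can be sketched as a small streak tracker. This is a simplification under stated assumptions: it keeps only the start time of the current rising streak rather than the 60 s lazy-pruned `_cov_norm_window` the Outcome describes, and it measures the streak from the timestamp of the frame preceding the first rise. The class and method names are hypothetical.

```python
class CovNormGrowthTracker:
    """Tracks how long the covariance norm has been rising frame over frame."""

    def __init__(self):
        self._prev_norm = None
        self._prev_t_s = None
        self._streak_start_s = None

    def observe(self, t_s, cov_norm):
        """Record one frame and return the current rising-streak duration in seconds."""
        rising = self._prev_norm is not None and cov_norm > self._prev_norm
        if rising:
            if self._streak_start_s is None:
                # The streak starts at the frame the norm began rising from.
                self._streak_start_s = self._prev_t_s
            growing_for_s = t_s - self._streak_start_s
        else:
            # Equal or falling norm resets the counter, per AC-8.
            self._streak_start_s = None
            growing_for_s = 0.0
        self._prev_norm = cov_norm
        self._prev_t_s = t_s
        return growing_for_s
```

Because `observe` is O(1) and allocation-free, `health_snapshot` can read the last returned value without touching the graph.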
+ +## Risks & Mitigation + +- **Risk: `IncrementalFixedLagSmoother.calculateEstimate()` returns full Values not just keyframe poses** — filter by key; verify against pinned GTSAM API. +- **Risk: SPD invariant fails under iSAM2 numerical instability** — defensive raise (AC-2) maps to `EstimatorFatalError`; AC-5.2 fallback then triggers. + +## Runtime Completeness + +- **Named capability**: posterior pose + covariance recovery + smoothed history. +- **Production code**: real Marginals, real WGS84 conversion, real rolling-window counter, real SPD defensive check. +- **Unacceptable substitutes**: synthetic Marginals; inline WGS84 math. diff --git a/_docs/02_tasks/todo/AZ-385_c5_source_label_spoof_gate.md b/_docs/02_tasks/todo/AZ-385_c5_source_label_spoof_gate.md new file mode 100644 index 0000000..6bd3fff --- /dev/null +++ b/_docs/02_tasks/todo/AZ-385_c5_source_label_spoof_gate.md @@ -0,0 +1,97 @@ +# C5 SourceLabelStateMachine + spoof-promotion gate + +**Task**: AZ-385_c5_source_label_spoof_gate +**Name**: C5 `SourceLabelStateMachine` + AC-NEW-2 / AC-NEW-8 spoof-promotion gate +**Description**: Implement the `SourceLabelStateMachine` (component-internal helper per description.md § 6) governing the `source_label` value emitted in every `EstimatorOutput`. The state machine inputs: most-recent `PoseEstimate` arrival; FC `GpsHealth` (from `add_fc_imu` window's gps_health field); time since last `SATELLITE_ANCHORED` add. Outputs: `SATELLITE_ANCHORED` | `VISUAL_PROPAGATED` | `DEAD_RECKONED` per the gate logic. Spoof-promotion gate (AC-NEW-2 / AC-NEW-8): NEVER re-introduce a previously-spoofed FC GPS source until BOTH (i) FC `gps_health == STABLE_NON_SPOOFED` for ≥10 s (config `spoof_promotion_min_stable_s`) AND (ii) the next satellite-anchored frame agrees with the FC GPS within `spoof_promotion_visual_consistency_tol_m` (default 30 m). Document EVERY reject in FDR + GCS STATUSTEXT (R07; C5-ST-01 — logging cannot be silenced). 
Inject the state machine into `GtsamIsam2StateEstimator` via the constructor extension point (AZ-384); wire `health_snapshot.spoof_promotion_blocked` to query the state machine. +**Complexity**: 5 points +**Dependencies**: AZ-384 (outputs + state-machine injection point), AZ-381 / AZ-382 / AZ-383, AZ-263, AZ-269, AZ-266, AZ-272 (FDR), AZ-391 (E-C8 — `GpsHealth` from inbound subscription; co-developed), AZ-397 (E-C8 — GCS STATUSTEXT broadcast via QgcTelemetryAdapter; co-developed; if not ready, this task ships against the AZ-390 Protocol contract surface) +**Component**: c5_state (epic AZ-260 / E-C5) +**Tracker**: AZ-385 +**Epic**: AZ-260 (E-C5) + +### Document Dependencies + +- `_docs/02_document/contracts/c5_state/state_estimator_protocol.md` — Invariants 5 (label reflects gate state), 8 (spoof-rejection logging cannot be silenced). +- `_docs/02_document/components/07_c5_state/description.md` — § 5 spoof-promotion gate semantics; § 6 `SourceLabelStateMachine` helper ownership; § 9 logging. +- `_docs/02_document/architecture.md` — ADR-008 + spoof gate logic; R07. + +## Problem + +Without this task, every `EstimatorOutput` would carry a static `SATELLITE_ANCHORED` label regardless of actual gate state — defeating AC-NEW-2 / AC-NEW-8 (spoofing-promotion gate) and AC-3.5 (VIO-only fallback under matcher failure). Spoof-rejection events would not be logged → R07 silent rollback risk. + +## Outcome + +- `src/gps_denied_onboard/components/c5_state/_source_label_sm.py` defining: + - `SourceLabelStateMachine` class (component-internal; not in `__all__`). + - State: current label, `_last_anchored_frame_ns`, `_gps_health_stable_since_ns`, `_pending_promotion`, `_promotion_blocked` (latched until consistency check). + - Method: `update(now_ns, gps_health, last_satellite_anchor_age_ms, last_visual_consistent_with_gps) -> PoseSourceLabel`. + - Method: `is_spoof_promotion_blocked() -> bool`. 
+ - Defensive logging: every state transition emits an INFO log; every rejected promotion emits a WARN log + FDR record + GCS STATUSTEXT (mandatory; unconditional; cannot be silenced via config — AC-NEW-2 / R07 mitigation). + - Configurable thresholds from `config.state.{spoof_promotion_min_stable_s, spoof_promotion_visual_consistency_tol_m}`. +- Wire the state machine into `GtsamIsam2StateEstimator.__init__` via the AZ-384 injection point. +- `health_snapshot.spoof_promotion_blocked` queries `source_label_state_machine.is_spoof_promotion_blocked()`. +- `current_estimate.source_label` is computed from the state machine for every emission. +- Unit tests: gate transitions; ≥10 s STABLE_NON_SPOOFED requirement; consistency check requirement; rejected promotion emits FDR + GCS STATUSTEXT; logging cannot be suppressed (test attempts to suppress via mocked log handler — assertion still fires the FDR + GCS broadcast). + +## Scope + +### Included +- `SourceLabelStateMachine` impl. +- Wire-up to `GtsamIsam2StateEstimator`. +- Unconditional logging path for spoof-rejection (cannot be silenced). +- FDR record + GCS STATUSTEXT emission on every rejected promotion. +- Unit tests covering all gate transitions + reject-logging mandatory invariant. + +### Excluded +- The `GpsHealth` source — owned by AZ-261 (E-C8 inbound); this task imports the DTO surface only. +- The GCS STATUSTEXT broadcast wire — owned by AZ-261 (C8 outbound); this task calls a Protocol method `gcs_broadcast.statustext(text)`; concrete impl in AZ-261. +- AC-5.2 fallback — owned by AZ-388. +- C5-IT-06 / C5-IT-07 / C5-ST-01 — deferred to E-BBT. + +## Acceptance Criteria + +**AC-1: Initial label is `INIT`-mapped to `DEAD_RECKONED`** — until first successful pose anchor, label is `DEAD_RECKONED`. + +**AC-2: First successful anchor → `SATELLITE_ANCHORED`** — only if no spoof-block is active. + +**AC-3: Stale anchor → `VISUAL_PROPAGATED`** — when `last_satellite_anchor_age_ms` exceeds threshold AND VIO is healthy.
+ +**AC-4: Spoof detection → block promotion** — when FC reports `gps_health == SPOOFED`, set `_promotion_blocked = True`; subsequent attempts to promote rejected; FDR + GCS STATUSTEXT log the reject. + +**AC-5: Spoof recovery requires BOTH conditions** — `gps_health == STABLE_NON_SPOOFED` for ≥10 s AND visual-consistent next anchor within tol_m. Either alone does NOT lift the block. + +**AC-6: Logging cannot be silenced** — even when log level is set to ERROR (suppressing INFO/WARN/DEBUG), the FDR record AND GCS STATUSTEXT still emit on every reject. Test: mock the logger to drop all records, assert FDR client + GCS broadcast were still called. + +**AC-7: `is_spoof_promotion_blocked()`** — returns True when `_promotion_blocked` is set; False otherwise. + +**AC-8: `health_snapshot.spoof_promotion_blocked`** — wired through. + +**AC-9: Configurable thresholds** — changing `spoof_promotion_min_stable_s` from 10 to 30 changes the gate timing; verified via parametrised test. + +**AC-10: State transition logging** — every label change emits ONE INFO log `kind="c5.state.source_label_changed"` with `{from, to, reason}`. + +**AC-11: Reject FDR record shape** — `kind="c5.state.spoof_rejected"` with `{reason, gps_health, time_since_stable_s, visual_consistency_delta_m}`. + +**AC-12: GCS STATUSTEXT severity** — `WARNING` per MAVLink convention; message format `"GPS spoof rejected: "` ≤ 50 chars (MAVLink `STATUSTEXT.text` max). + +## Non-Functional Requirements + +- `update` p99 ≤ 5 µs (state machine is O(1)). +- Logging path is on the hot path; FDR + GCS broadcast are buffered (AZ-273 `FdrClient.put_nowait`); GCS broadcast is non-blocking. + +## Constraints + +- Logging cannot be silenced (AC-6; R07 mitigation). +- `gps_health` semantics owned by C8 inbound (AZ-261); this task consumes the DTO surface. +- Single-writer thread (consistent with C5). 
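The two-condition recovery rule (AC-5) and the unconditional reject path (AC-6) can be sketched as follows. This is a reduced illustration, not the `SourceLabelStateMachine` itself: it covers only the promotion gate, the `GpsHealth.UNKNOWN` member and the method names `observe_gps_health` / `try_promote` are hypothetical, and `on_reject` is a plain callback standing in for the FDR record plus GCS STATUSTEXT emission.

```python
from enum import Enum, auto


class GpsHealth(Enum):
    SPOOFED = auto()
    STABLE_NON_SPOOFED = auto()
    UNKNOWN = auto()  # hypothetical catch-all, not named in the spec


class SpoofPromotionGateSketch:
    def __init__(self, on_reject, min_stable_s=10.0, tol_m=30.0):
        self._on_reject = on_reject  # required: there is no way to construct a silent gate
        self._min_stable_s = min_stable_s
        self._tol_m = tol_m
        self._blocked = False
        self._stable_since_ns = None

    def is_spoof_promotion_blocked(self):
        return self._blocked

    def observe_gps_health(self, now_ns, health):
        if health is GpsHealth.SPOOFED:
            self._blocked = True
            self._stable_since_ns = None
        elif health is GpsHealth.STABLE_NON_SPOOFED:
            if self._stable_since_ns is None:
                self._stable_since_ns = now_ns
        else:
            self._stable_since_ns = None

    def try_promote(self, now_ns, visual_consistency_delta_m):
        """Return True if promotion is allowed; emit a reject record otherwise."""
        if not self._blocked:
            return True
        stable_s = 0.0
        if self._stable_since_ns is not None:
            stable_s = (now_ns - self._stable_since_ns) / 1e9
        stable_ok = stable_s >= self._min_stable_s
        consistent_ok = visual_consistency_delta_m <= self._tol_m
        if stable_ok and consistent_ok:
            self._blocked = False
            return True
        # Neither condition alone lifts the block; every reject is reported.
        self._on_reject({
            "kind": "c5.state.spoof_rejected",
            "reason": "unstable_gps" if not stable_ok else "visual_inconsistent",
            "time_since_stable_s": stable_s,
            "visual_consistency_delta_m": visual_consistency_delta_m,
        })
        return False
```

Making the reject callback a mandatory constructor argument is one way to honour "cannot be silenced": the emission does not pass through the logger, so log-level configuration has no effect on it.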
+ +## Risks & Mitigation + +- **Risk: AZ-261 `GpsHealth` DTO not yet defined** — this task ships against the documented schema surface; if AZ-261 changes the schema, both tasks update in lockstep. +- **Risk: GCS STATUSTEXT broadcast is best-effort** — AZ-261's broadcast may drop messages under load; this task records to FDR REGARDLESS of GCS broadcast success. + +## Runtime Completeness + +- **Named capability**: source-label state machine + spoof-promotion gate. +- **Production code**: real state machine, real FDR record + GCS STATUSTEXT emission, real configurable thresholds. +- **Unacceptable substitutes**: a label-rotation timer (would not detect spoof); silently swallowing reject logs (R07); a config flag that disables the spoof gate (defeats AC-NEW-2). diff --git a/_docs/02_tasks/todo/AZ-386_c5_eskf_baseline.md b/_docs/02_tasks/todo/AZ-386_c5_eskf_baseline.md new file mode 100644 index 0000000..f6a88d6 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-386_c5_eskf_baseline.md @@ -0,0 +1,96 @@ +# C5 EskfStateEstimator — mandatory simple-baseline + +**Task**: AZ-386_c5_eskf_baseline +**Name**: C5 `EskfStateEstimator` — mandatory simple-baseline (IT-12 engine rule at C5) +**Description**: Implement `EskfStateEstimator`, the mandatory simple-baseline `StateEstimator` per AC-2.1a engine rule applied at the state-estimator level. ESKF (Error-State Kalman Filter) over a 16-state vector (position 3 + velocity 3 + orientation 3 + accel bias 3 + gyro bias 3 + IMU dt scalar). Update on `add_vio` (relative-pose measurement); update on `add_pose_anchor` (absolute-pose measurement; respects `pose.covariance_mode` per AZ-383 contract — JACOBIAN does NOT skip the ESKF update because ESKF doesn't have a graph; it integrates as a normal measurement). `add_fc_imu` propagates the prediction step using the FC IMU window. `current_estimate` returns the current state + 6×6 covariance from the error-state covariance matrix (project from 16×16 down to 6×6 pose subspace).
`smoothed_history(n)` returns recent past states from a circular buffer (NOT actually smoothed since ESKF is forward-only; entries have `smoothed=False` per honesty — the simple-baseline doesn't pretend to smooth). `health_snapshot` reports a simplified `IsamState` derivation. Selectable via `config.state.strategy = "eskf"` + `BUILD_STATE_ESKF` flag. +**Complexity**: 5 points +**Dependencies**: AZ-381 (Protocol + DTOs), AZ-276 (`ImuPreintegrator` consumed for IMU prediction step), AZ-277 (`SE3Utils`), AZ-279 (`WgsConverter`), AZ-263, AZ-269, AZ-266, AZ-272 +**Component**: c5_state (epic AZ-260 / E-C5) +**Tracker**: AZ-386 +**Epic**: AZ-260 (E-C5) + +### Document Dependencies + +- `_docs/02_document/contracts/c5_state/state_estimator_protocol.md` — Protocol surface; `EstimatorOutput` shape. +- `_docs/02_document/components/07_c5_state/description.md` — § 1 (mandatory simple-baseline; AC-2.1a engine rule applied at C5). +- `_docs/02_document/architecture.md` — AC-2.1a engine rule semantics. + +## Problem + +Without this task, IT-12 (engine rule comparative study at the state-estimator level) has no baseline to compare iSAM2 against. ADR-002 also requires the mandatory simple-baseline to exist as a real binary that can be selected at runtime; without it, the IT-12 verdict is unprovable. + +## Outcome + +- `src/gps_denied_onboard/components/c5_state/eskf_baseline.py` defining: + - `EskfStateEstimator` class implementing `StateEstimator` Protocol. + - 16-state error-state Kalman filter implementation (NumPy-based; no GTSAM). + - All 6 Protocol methods implemented per the description above. + - Module-level `create(config, imu_preintegrator, se3_utils, wgs_converter, fdr_client) -> StateEstimator`. +- `BUILD_STATE_ESKF` build flag wiring (ON in research; OFF in airborne-default per ADR-002 build-time exclusion). 
+- Honest reporting: `smoothed_history` entries flagged `smoothed=False` (because ESKF doesn't smooth); `health_snapshot.isam2_state` mapped to a simplified ESKF state model (TRACKING when filter is healthy; DEGRADED when innovation magnitude exceeds threshold; LOST on filter divergence). +- `_last_anchor_ns` tracked for `last_satellite_anchor_age_ms` (same semantics as the iSAM2 estimator). +- Unit tests: ESKF prediction step accuracy on synthetic IMU sequence; relative-pose update; absolute-pose update; convergence on synthetic data; SPD covariance; configurable measurement noise; honest `smoothed=False` reporting. + +## Scope + +### Included +- `EskfStateEstimator` impl. +- 16-state error-state Kalman filter NumPy impl. +- All 6 Protocol methods. +- `BUILD_STATE_ESKF` flag wiring. +- SPD-invariant defensive check on every emitted covariance. +- Unit tests + parametrised configuration tests. + +### Excluded +- iSAM2 estimator — already AZ-382. +- Source-label state machine — owned by AZ-385 (this task uses the same injection point). +- Smoothed history → FDR — owned by AZ-387. +- AC-5.2 fallback — owned by AZ-388. + +## Acceptance Criteria + +**AC-1: Protocol conformance** — passes `isinstance` against `StateEstimator`. + +**AC-2: ESKF prediction step accuracy** — on synthetic IMU sequence with known ground-truth trajectory, position drift < 1 m over 5 s. + +**AC-3: Relative-pose update** — `add_vio` updates the state with the VIO measurement; covariance shrinks on consistent measurements. + +**AC-4: Absolute-pose update** — `add_pose_anchor` updates the state with the absolute measurement regardless of `covariance_mode` (no skip; ESKF doesn't have a graph). + +**AC-5: SPD covariance** — every emitted `EstimatorOutput.covariance_6x6` is SPD; non-SPD raises `EstimatorFatalError`. + +**AC-6: `smoothed_history(n)` honest `smoothed=False`** — every entry has `smoothed=False` (ESKF doesn't smooth). 
+ +**AC-7: `BUILD_STATE_ESKF=OFF` rejection** — factory rejection via `StateEstimatorConfigError` per AZ-381 Protocol task contract. + +**AC-8: Source-label state machine integration** — same injection point as iSAM2 estimator (AZ-385 wires both). + +**AC-9: Filter divergence handling** — when innovation exceeds 10× the measurement-covariance norm, raise `EstimatorFatalError`; AC-5.2 fallback fires downstream. + +**AC-10: Composition wiring** — `config.state.strategy = "eskf"` + `BUILD_STATE_ESKF=ON` → factory returns `EskfStateEstimator` instance. + +## Non-Functional Requirements + +- `add_vio` p99 ≤ 5 ms. +- `add_pose_anchor` p99 ≤ 10 ms. +- `current_estimate` p99 ≤ 5 ms. +- Memory ≤ 5 MB resident (ESKF state vector + buffers). + +## Constraints + +- NumPy-based; no GTSAM dependency. +- 16-state vector dimension is fixed. +- Single-writer thread. +- SPD-invariant defensive check is mandatory. +- Honest reporting: `smoothed=False` (no pretending to smooth). + +## Risks & Mitigation + +- **Risk: ESKF impl bugs** — comprehensive unit tests with synthetic ground truth (AC-2..AC-4). +- **Risk: Filter divergence under spoofed measurements** — AC-9 detects via innovation magnitude. + +## Runtime Completeness + +- **Named capability**: ESKF mandatory simple-baseline `StateEstimator`. +- **Production code**: real NumPy ESKF impl, real prediction + update steps, real SPD-invariant defensive check. +- **Unacceptable substitutes**: a wrapped GTSAM ISAM2 (defeats the simple-baseline contract); `smoothed=True` lies (defeats honesty). 
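The 16×16 to 6×6 projection mentioned in the Description amounts to selecting the position and orientation rows and columns of the error-state covariance. The sketch below assumes the state ordering given in the Description (position, velocity, orientation, accel bias, gyro bias, IMU dt) and a position-then-orientation ordering for the 6×6 output. Both orderings are assumptions; the real layout is fixed by the `EstimatorOutput` contract.

```python
# Indices of the pose sub-state inside the 16-dimensional error state,
# assuming the ordering: position 0-2, velocity 3-5, orientation 6-8,
# accel bias 9-11, gyro bias 12-14, IMU dt 15.
POSE_INDICES = (0, 1, 2, 6, 7, 8)


def pose_covariance_6x6(p16):
    """Extract the 6x6 pose block from a 16x16 covariance given as a list of lists."""
    if len(p16) != 16 or any(len(row) != 16 for row in p16):
        raise ValueError("expected a 16x16 covariance matrix")
    return [[p16[i][j] for j in POSE_INDICES] for i in POSE_INDICES]
```

Selecting a principal sub-block preserves symmetry and positive definiteness, so an SPD 16×16 matrix yields an SPD 6×6 block; the defensive SPD check on the emitted covariance still applies because the filter's full covariance can itself degrade numerically.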
diff --git a/_docs/02_tasks/todo/AZ-387_c5_smoothed_history_fdr.md b/_docs/02_tasks/todo/AZ-387_c5_smoothed_history_fdr.md new file mode 100644 index 0000000..7e0e983 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-387_c5_smoothed_history_fdr.md @@ -0,0 +1,74 @@ +# C5 Smoothed-history → FDR path (NOT to FC) + +**Task**: AZ-387_c5_smoothed_history_fdr +**Name**: C5 smoothed past-keyframe → FDR path (AC-4.5 revised; NOT to FC) +**Description**: After every successful `current_estimate()`, emit the most-recent smoothed past-keyframe (when one becomes available from `IncrementalFixedLagSmoother.calculateEstimate()` retroactive update) to FDR via `FdrClient` (AZ-273). The FDR record carries `smoothed=True`. CRITICAL: the smoothed past-keyframe stream MUST go ONLY to FDR — NEVER routed to C8 outbound (the FC stream is forward-time only per AC-4.5 revised). Wire-up: this task adds the post-`current_estimate` hook to `GtsamIsam2StateEstimator` (and `EskfStateEstimator` — but ESKF reports `smoothed=False`, so this hook is a no-op for ESKF and the hook respects that). Defensive check: the C8 outbound encoder MUST receive ONLY non-smoothed estimates (verified at the C8 boundary in AZ-261 tests; documented here as a cross-task invariant). +**Complexity**: 3 points +**Dependencies**: AZ-384 (`smoothed_history` impl), AZ-386 (for the no-op hook on ESKF), AZ-273 (`FdrClient`), AZ-272 (FDR record schema), AZ-263, AZ-269, AZ-266 +**Component**: c5_state (epic AZ-260 / E-C5) +**Tracker**: AZ-387 +**Epic**: AZ-260 (E-C5) + +### Document Dependencies + +- `_docs/02_document/contracts/c5_state/state_estimator_protocol.md` — Invariants 6, 7. +- `_docs/02_document/components/07_c5_state/description.md` — § 7 caveats (AC-4.5 internal smoothing is onboard only). + +## Problem + +Without this task, smoothed past-keyframes are not persisted to FDR — defeating AC-4.5 (revised) post-flight forensics. 
A naive impl that routes smoothed estimates to C8 outbound would also break the FC contract (AC-4.5 forbids retroactive corrections to the FC). + +## Outcome + +- New private hook `_emit_smoothed_to_fdr_if_any()` called after every `current_estimate()` in `GtsamIsam2StateEstimator`. Body: read recent smoothed past-keyframes via `_smoother.calculateEstimate()`; for each newly-smoothed past-keyframe (delta from last emission), emit `EstimatorOutput(smoothed=True)` via `fdr_client.put(...)`. +- `EskfStateEstimator` hook: no-op (ESKF doesn't smooth). +- Composition root invariant: the `EstimatorOutput` stream feeding C8 outbound is filtered to `smoothed=False` only (this filter is enforced in C8 inbound — AZ-261 — but documented here). +- Unit tests: synthetic graph with delayed convergence → smoothed past-keyframes appear after some frames; FDR records emitted with correct shape; ESKF hook is no-op (FDR call count for ESKF == 0 over a full replay); no smoothed estimates leak to a C8-stub queue. + +## Scope + +### Included +- `_emit_smoothed_to_fdr_if_any()` impl. +- Hook into `current_estimate()` for both estimators. +- ESKF no-op handling (honest behaviour). +- Unit tests covering both estimators. + +### Excluded +- The C8 outbound filter — owned by AZ-261; this task documents the invariant. +- The FDR record schema — already AZ-272. +- iSAM2 estimator body — AZ-382 / AZ-384. + +## Acceptance Criteria + +**AC-1: iSAM2 emits smoothed past-keyframes to FDR** — synthetic graph with 20 frames of delayed convergence → smoothed entries appear in FDR after the smoother's window catches up. + +**AC-2: FDR records have `smoothed=True`** — every emitted record carries the flag. + +**AC-3: ESKF emits zero smoothed FDR records** — over a full 60-frame replay, `FdrClient.put` is never called from the ESKF hook. + +**AC-4: No leak to C8 outbound** — a stub C8-outbound queue receives ZERO smoothed estimates over the same replay; only `smoothed=False` records reach it. 
+ +**AC-5: Idempotency** — emitting the same smoothed past-keyframe twice is prevented (via a `_last_emitted_smoothed_frame_id` watermark). + +**AC-6: FDR record shape** — `kind="c5.state.smoothed_history"`, fields per the AZ-272 FDR record schema. + +## Non-Functional Requirements + +- Hook adds ≤ 5 ms to `current_estimate` p99 (per smoothed entry; usually 0–1 entries per call). + +## Constraints + +- Smoothed estimates ONLY go to FDR. +- ESKF hook MUST be a no-op (honesty). +- Idempotency via watermark. + +## Risks & Mitigation + +- **Risk: `IncrementalFixedLagSmoother` retro-emit timing** — verify against pinned GTSAM API; tests cover the typical case where smoothed entries appear with a few-frame delay. +- **Risk: Smoothed estimate accidentally routed to C8** — AZ-261's filter is the enforcement point; this task documents the invariant. + +## Runtime Completeness + +- **Named capability**: smoothed past-keyframe → FDR path. +- **Production code**: real `_smoother.calculateEstimate()` query; real FDR emission with `smoothed=True`; real watermark idempotency. +- **Unacceptable substitutes**: routing smoothed estimates to C8 (AC-4.5 violation); ESKF emitting fake smoothed records (honesty violation). 
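The watermark idempotency of AC-5 can be illustrated with a short sketch. It assumes keyframes carry a monotonically increasing integer `frame_id` and that the FDR client exposes `put(kind, payload)`; the real `FdrClient.put` signature is defined by AZ-273 and may differ. `SmoothedKeyframe`, `FdrSink`, and `SmoothedHistoryEmitter` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class SmoothedKeyframe:
    """Hypothetical minimal view of a smoothed past-keyframe."""
    frame_id: int


class FdrSink(Protocol):
    """Assumed shape of the FDR client; the real signature lives in AZ-273."""
    def put(self, kind: str, payload: dict) -> None: ...


class SmoothedHistoryEmitter:
    KIND = "c5.state.smoothed_history"

    def __init__(self, fdr_client: FdrSink) -> None:
        self._fdr_client = fdr_client
        # Watermark: highest frame_id already written to FDR.
        self._last_emitted_smoothed_frame_id = -1

    def emit_new(self, smoothed: Iterable[SmoothedKeyframe]) -> int:
        """Emit only keyframes above the watermark; return how many were sent."""
        emitted = 0
        for keyframe in sorted(smoothed, key=lambda k: k.frame_id):
            if keyframe.frame_id <= self._last_emitted_smoothed_frame_id:
                continue  # already emitted on an earlier call
            self._fdr_client.put(
                self.KIND, {"frame_id": keyframe.frame_id, "smoothed": True}
            )
            self._last_emitted_smoothed_frame_id = keyframe.frame_id
            emitted += 1
        return emitted
```

An ESKF estimator would simply never call `emit_new`, which keeps its FDR call count at zero as AC-3 requires.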
diff --git a/_docs/02_tasks/todo/AZ-388_c5_ac52_fallback.md b/_docs/02_tasks/todo/AZ-388_c5_ac52_fallback.md new file mode 100644 index 0000000..590a802 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-388_c5_ac52_fallback.md @@ -0,0 +1,85 @@ +# C5 AC-5.2 fallback — 3 s no-estimate detector + signal emission + +**Task**: AZ-388_c5_ac52_fallback +**Name**: C5 AC-5.2 fallback path — 3 s no-estimate detector + downstream signal +**Description**: Implement the AC-5.2 contract: when `current_estimate()` cannot return a fresh `EstimatorOutput` for ≥3 s (config `state.no_estimate_fallback_s`, default 3.0) — either because every call raised `EstimatorFatalError` OR the keyframe window has been empty — emit ONE downstream signal `kind="c5.state.no_estimate_fallback_engaged"` to FDR + GCS STATUSTEXT (severity CRITICAL) instructing C8 outbound to switch to FC IMU-only emission. The signal is emitted ONCE per fallback engagement (rate-limited by an `_in_fallback` boolean); on recovery (a fresh successful `current_estimate()`), emit ONE recovery signal `kind="c5.state.no_estimate_fallback_recovered"`. Add a private rolling counter `_last_successful_estimate_ns`; check at the top of every `current_estimate()` call AND on a separate watchdog tick (driven by C8 outbound's 5 Hz call cadence; the watchdog is implemented as a method `check_fallback_state(now_ns) -> bool` returning the current fallback state). +**Complexity**: 3 points +**Dependencies**: AZ-384 (`current_estimate` body), AZ-386 (same hook for ESKF), AZ-273 (FDR), AZ-272, AZ-390 (E-C8 — `GcsAdapter` Protocol surface), AZ-397 (E-C8 — `QgcTelemetryAdapter` concrete STATUSTEXT broadcast), AZ-263, AZ-269, AZ-266 +**Component**: c5_state (epic AZ-260 / E-C5) +**Tracker**: AZ-388 +**Epic**: AZ-260 (E-C5) + +### Document Dependencies + +- `_docs/02_document/contracts/c5_state/state_estimator_protocol.md` — Invariant 9. +- `_docs/02_document/components/07_c5_state/description.md` — § 5 (AC-5.2 fallback). 
+ +## Problem + +Without this task, a sustained iSAM2 numerical failure or empty keyframe window would leave the system silently emitting stale or no estimates; C8 outbound would have no signal to switch to FC IMU-only; FC would receive degraded-quality estimates indefinitely. + +## Outcome + +- `_last_successful_estimate_ns` private counter (set on every successful `current_estimate()`). +- `_in_fallback` boolean (latched on engagement, cleared on recovery). +- Hook in `current_estimate()` (both estimators): + - On entry: if `now_ns - _last_successful_estimate_ns > no_estimate_fallback_s * 1e9` AND not `_in_fallback` → engage fallback (emit signal, set flag, log). + - On successful return: if `_in_fallback` → emit recovery signal, clear flag, log. +- New public method `check_fallback_state(now_ns) -> bool`: idempotent watchdog check returning the current fallback state. C8 outbound calls this at its 5 Hz cadence to drive the FC IMU-only switch even when no `current_estimate()` is being called. +- Engagement signal: FDR `kind="c5.state.no_estimate_fallback_engaged"` + GCS STATUSTEXT (CRITICAL severity, message "Onboard estimator lost; FC IMU-only"). +- Recovery signal: FDR `kind="c5.state.no_estimate_fallback_recovered"` + GCS STATUSTEXT (NOTICE severity). +- Configurable threshold; AC-5.2 default 3.0 s. +- Unit tests: 3 s no estimate → engagement signal fires once; sustained no-estimate over 30 s → still ONE engagement signal (rate-limited); successful estimate after engagement → recovery signal fires once; `check_fallback_state` returns correct state from external watchdog. + +## Scope + +### Included +- `_last_successful_estimate_ns` counter + `_in_fallback` flag. +- Hooks in both estimators' `current_estimate`. +- Public `check_fallback_state(now_ns)` watchdog API. +- Engagement + recovery signal emission. +- Rate-limiting (one signal per state transition). +- Unit tests across both estimators. 
+ +### Excluded +- The actual FC IMU-only emission — owned by AZ-261 (C8 outbound). +- C5-IT-05 component-internal acceptance test — deferred to E-BBT. + +## Acceptance Criteria + +**AC-1: Engagement after 3 s of no estimate** — synthetic timeline with no successful `current_estimate` for 3.5 s → ONE engagement signal fires. + +**AC-2: Engagement is one-shot** — sustained 30 s no-estimate → still ONE engagement signal (rate-limited). + +**AC-3: Recovery signal** — after engagement, a successful `current_estimate` → ONE recovery signal. + +**AC-4: `check_fallback_state` watchdog** — even without `current_estimate` being called, the watchdog method correctly reports `True` after 3 s. + +**AC-5: GCS STATUSTEXT severity correct** — engagement = CRITICAL; recovery = NOTICE. + +**AC-6: Configurable threshold** — `no_estimate_fallback_s = 5.0` → engagement at 5 s instead of 3 s. + +**AC-7: Both estimators participate** — iSAM2 + ESKF both fire engagement / recovery signals correctly. + +**AC-8: FDR record shapes** — engagement: `{reason: "no_successful_estimate_for_s"}`; recovery: `{recovered_after_s}`. + +## Non-Functional Requirements + +- `check_fallback_state` p99 ≤ 5 µs. +- Hook overhead in `current_estimate` < 10 µs. + +## Constraints + +- One signal per state transition (rate-limited). +- Both estimators MUST participate. +- Config threshold MUST be respected. + +## Risks & Mitigation + +- **Risk: `_last_successful_estimate_ns` race with watchdog** — single-writer thread; both update from the same thread (composition root binds C5 + C8 outbound 5 Hz tick handler). + +## Runtime Completeness + +- **Named capability**: AC-5.2 fallback detector. +- **Production code**: real counter, real signal emission, real watchdog method. +- **Unacceptable substitutes**: spamming engagement signals on every check (rate-limit violation); silently dropping recovery (would leave C8 in IMU-only forever). 
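The engage/recover latch described in this task can be sketched as a small state holder. This is a sketch under stated assumptions, not the estimator code: `NoEstimateFallbackDetector`, its `emit` callback, the `start_ns` baseline, and the choice to let the watchdog itself fire the engagement signal are all illustrative, and FDR plus STATUSTEXT emission are collapsed into one callback.

```python
from typing import Callable


class NoEstimateFallbackDetector:
    ENGAGED = "c5.state.no_estimate_fallback_engaged"
    RECOVERED = "c5.state.no_estimate_fallback_recovered"

    def __init__(
        self,
        start_ns: int,
        emit: Callable[[str, dict], None],
        no_estimate_fallback_s: float = 3.0,
    ) -> None:
        self._threshold_ns = int(no_estimate_fallback_s * 1e9)
        self._last_successful_estimate_ns = start_ns
        self._in_fallback = False
        self._emit = emit

    def check_fallback_state(self, now_ns: int) -> bool:
        """Watchdog tick: engage at most once per outage, then report state."""
        stale_ns = now_ns - self._last_successful_estimate_ns
        if not self._in_fallback and stale_ns > self._threshold_ns:
            self._in_fallback = True  # latch, so repeated ticks stay silent
            self._emit(self.ENGAGED, {"reason": "no_successful_estimate_for_s"})
        return self._in_fallback

    def on_successful_estimate(self, now_ns: int) -> None:
        """Call after every successful current_estimate()."""
        if self._in_fallback:
            outage_s = (now_ns - self._last_successful_estimate_ns) / 1e9
            self._in_fallback = False
            self._emit(self.RECOVERED, {"recovered_after_s": outage_s})
        self._last_successful_estimate_ns = now_ns
```

Because both methods mutate the same two fields, the sketch relies on the single-thread binding named under Risks & Mitigation and takes no lock.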
diff --git a/_docs/02_tasks/todo/AZ-389_c5_orthorectifier_c6.md b/_docs/02_tasks/todo/AZ-389_c5_orthorectifier_c6.md new file mode 100644 index 0000000..e4b1b5b --- /dev/null +++ b/_docs/02_tasks/todo/AZ-389_c5_orthorectifier_c6.md @@ -0,0 +1,88 @@ +# C5 Orthorectifier → C6 mid-flight tile gen sub-path + +**Task**: AZ-389_c5_orthorectifier_c6 +**Name**: C5 internal orthorectifier — produces mid-flight tile candidates for C6 +**Description**: Implement the orthorectifier sub-path inside C5: when a frame has converged in the iSAM2 graph (≥1 satellite anchor + visual consistency), apply the camera intrinsics + extrinsics + the C5-known pose to orthorectify the nav-camera frame into a tile-aligned image patch; emit a `MidFlightTileCandidate(tile_id, pixels, quality_metadata, source_pose)` to C6 (via the storage interface AZ-303 `tile_store.put_mid_flight_candidate(...)`). Quality metadata: `inlier_count`, `cov_norm`, `pose_age_ms`. The orthorectifier is C5-internal (per epic spec § Scope: "orthorectifier (lives within C5 as an internal subcomponent)"); it consumes the converged pose + nav frame from a per-frame buffer; it emits at most ONE candidate per frame (gated by quality thresholds: `cov_norm < threshold` AND `inlier_count > floor`). Triggered after a successful `current_estimate()` call when quality conditions hold. +**Complexity**: 3 points +**Dependencies**: AZ-384 (`current_estimate` body + cov norm), AZ-385 (only emit candidates when source_label == SATELLITE_ANCHORED), AZ-303 (`TileStore.put_mid_flight_candidate`), AZ-263, AZ-269, AZ-266, AZ-272 (FDR) +**Component**: c5_state (epic AZ-260 / E-C5) +**Tracker**: AZ-389 +**Epic**: AZ-260 (E-C5) + +### Document Dependencies + +- `_docs/02_document/contracts/c5_state/state_estimator_protocol.md`. +- `_docs/02_document/components/07_c5_state/description.md` — orthorectifier mention; § 1 downstream "C6 (mid-flight tile gen via orthorectifier)". 
+- `_docs/02_document/contracts/c6_tile_cache/tile_store.md` — `put_mid_flight_candidate` API. + +## Problem + +Without this task, the system never emits mid-flight tile candidates → C6's cache never grows in flight → AC-NEW-3 (mid-flight tile gen) is unachievable. + +## Outcome + +- `src/gps_denied_onboard/components/c5_state/_orthorectifier.py` defining: + - `Orthorectifier` class (component-internal; not in `__all__`). + - Method: `try_emit_candidate(frame, pose_estimate, cov_6x6, inlier_count, source_label) -> MidFlightTileCandidate | None`. + - Quality gates: `cov_norm < cov_threshold` AND `inlier_count > inlier_floor` AND `source_label == SATELLITE_ANCHORED`. + - Orthorectification math: project nav-camera frame to tile plane via camera intrinsics + extrinsics + pose; nearest-neighbour or bilinear sampling. +- Hook in `GtsamIsam2StateEstimator.current_estimate()` post-emission (or post-`add_pose_anchor` — implementer choice; gated to fire AT MOST once per frame). +- ESKF estimator: also has the hook (mid-flight tile gen is independent of state-estimator strategy). +- Configurable thresholds in `config.state.orthorectifier.{cov_norm_threshold, inlier_floor}`. +- Defensive: skip emission silently if quality gates fail (NOT a degraded-mode error; tile gen is opportunistic per AC-NEW-3). +- DEBUG log on every emission attempt; INFO log on first emission per flight. +- Unit tests: known pose + frame → expected orthorectified output; quality-gate skip behaviour; emission rate-limit (once per frame). + +## Scope + +### Included +- `Orthorectifier` impl. +- Hook in `current_estimate` for both estimators. +- Quality-gate logic. +- Configurable thresholds. +- Unit tests. + +### Excluded +- The C6 `tile_store.put_mid_flight_candidate` body — owned by AZ-303 / E-C6. +- C6's downstream tile-cache eviction integration — owned by AZ-308. 
+- The orthorectification kernel optimisation — production-acceptable kernel uses NumPy or OpenCV `cv2.warpPerspective`; CUDA optimisation is a feature-cycle improvement. + +## Acceptance Criteria + +**AC-1: Orthorectification correctness** — synthetic camera pose + planar tile → output pixels match expected projection within 1-pixel tolerance. + +**AC-2: Quality gate skip** — `cov_norm > threshold` → no candidate emitted; DEBUG log only. + +**AC-3: Source label gate** — `source_label != SATELLITE_ANCHORED` → no emission. + +**AC-4: Once-per-frame rate limit** — even if `current_estimate` is called multiple times for the same frame, at most ONE candidate is emitted. + +**AC-5: Both estimators participate** — iSAM2 + ESKF both attempt candidate emission. + +**AC-6: Composition wiring** — the orthorectifier is constructed inside the estimator at `__init__` time; `tile_store` is constructor-injected. + +**AC-7: First-emission INFO log** — `kind="c5.state.first_mid_flight_candidate"` with `{frame_id, tile_id, cov_norm}`. + +**AC-8: Defensive skip on missing inputs** — if `frame` or `pose_estimate` is None, skip silently with DEBUG log (NOT an error). + +## Non-Functional Requirements + +- `try_emit_candidate` p95 ≤ 30 ms (orthorectification kernel cost). +- Memory ≤ 50 MB resident (frame buffer + working memory). + +## Constraints + +- Component-internal (not in C5 `__all__`). +- Once-per-frame rate limit. +- Quality gates are mandatory; AC-NEW-3 gain is contingent on emitted candidates being high-quality. + +## Risks & Mitigation + +- **Risk: Orthorectification produces low-quality tiles under degenerate pose** — quality gates filter; if still problematic, AZ-308 cache-eviction policy filters at storage time. +- **Risk: AZ-303 `put_mid_flight_candidate` API not yet stable** — this task ships against the documented API surface. + +## Runtime Completeness + +- **Named capability**: orthorectifier → mid-flight tile candidate emission. 
+- **Production code**: real orthorectification kernel (NumPy or OpenCV), real quality gates, real tile_store.put_mid_flight_candidate call. +- **Unacceptable substitutes**: emitting raw nav-frame pixels (not orthorectified); skipping the quality gates (AC-NEW-3 corruption). diff --git a/_docs/02_tasks/todo/AZ-390_c8_adapter_protocol.md b/_docs/02_tasks/todo/AZ-390_c8_adapter_protocol.md new file mode 100644 index 0000000..cc01ca2 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-390_c8_adapter_protocol.md @@ -0,0 +1,103 @@ +# C8 FcAdapter / GcsAdapter Protocols + DTOs + Factories + Composition + +**Task**: AZ-390_c8_adapter_protocol +**Name**: C8 `FcAdapter` + `GcsAdapter` Protocols + DTOs + errors + composition factories +**Description**: Define the public `FcAdapter` and `GcsAdapter` Protocols (PEP 544 `@runtime_checkable`), the C8 DTOs (`PortConfig`, `FcKind` enum, `FcTelemetryFrame`, `TelemetryKind` enum + payload union, `FlightStateSignal`, `FlightState` enum, `GpsHealth`, `GpsStatus` enum, `Severity` enum, `EmittedExternalPosition`, `OperatorCommand`), the error hierarchy (`FcAdapterError` family + `GcsAdapterError` family per the contract), and the composition-root factories `build_fc_adapter(...) -> FcAdapter` + `build_gcs_adapter(...) -> GcsAdapter` with strategy resolution (`config.fc.adapter`, `config.gcs.adapter`) and `BUILD_FC_` / `BUILD_GCS_` flag gating per ADR-002. Composition root binds C8 outbound (`emit_external_position`, `emit_status_text`, `request_source_set_switch`) to a single emit thread; C8 inbound (`subscribe_telemetry`) fires on the inbound decode thread. Shared helpers (`WgsConverter` AZ-279, `SE3Utils` AZ-277, `FdrClient` AZ-273, `Clock`) constructor-injected. Config schema extension for `fc.{adapter, port_device, port_baud, signing_key_source}` and `gcs.{adapter, port_device, port_baud, summary_rate_hz}`. 
No wire encoding, no signing logic, no telemetry decoding in scope here — pure scaffolding the seven downstream consumer tasks depend on. +**Complexity**: 3 points +**Dependencies**: AZ-263, AZ-269, AZ-270, AZ-273 (`FdrClient`), AZ-277 (`SE3Utils`), AZ-279 (`WgsConverter`), AZ-266 +**Component**: c8_fc_adapter (epic AZ-261 / E-C8) +**Tracker**: AZ-390 +**Epic**: AZ-261 (E-C8) + +### Document Dependencies + +- `_docs/02_document/contracts/c8_fc_adapter/fc_adapter_protocol.md` — the public contract this task implements. +- `_docs/02_document/components/10_c8_fc_adapter/description.md` — § 1 overview, § 2 interfaces, § 5 implementation details. +- `_docs/02_document/architecture.md` — ADR-001, ADR-002, ADR-009. +- `_docs/02_document/module-layout.md` — `c8_fc_adapter` Per-Component Mapping. + +## Problem + +Without this task, no concrete C8 adapter has a Protocol to register against; the runtime root cannot wire C8 to C1 / C5 (which receive `ImuWindow` / `AttitudeWindow` / `GpsHealth` / `FlightStateSignal` exclusively via the constructor-injected `FcAdapter` interface); the seven downstream consumer tasks (inbound subscription, covariance projector, AP outbound, iNav outbound, signing handshake, source-set switch, GCS adapter) have no shared DTO surface to encode/decode against. + +## Outcome + +- `src/gps_denied_onboard/components/c8_fc_adapter/interface.py` — `FcAdapter`, `GcsAdapter` Protocols with all methods per the contract. +- `src/gps_denied_onboard/components/c8_fc_adapter/__init__.py` — re-exports `FcAdapter`, `GcsAdapter`, `EmittedExternalPosition`. +- `src/gps_denied_onboard/_types/fc.py` — `PortConfig`, `FcKind`, `FcTelemetryFrame`, `TelemetryKind`, `FlightStateSignal`, `FlightState`, `GpsHealth`, `GpsStatus`, `Severity`, `EmittedExternalPosition`, `OperatorCommand` (all frozen + slots). +- `src/gps_denied_onboard/components/c8_fc_adapter/errors.py` — full error hierarchy. 
+- `src/gps_denied_onboard/runtime_root/fc_factory.py` — `build_fc_adapter(...)` + `build_gcs_adapter(...)`. Lazy-import per ADR-002. +- Composition-root extension: invoke `build_fc_adapter` AFTER C5; invoke `build_gcs_adapter` AFTER `build_fc_adapter`; bind outbound to ONE emit thread (single-writer invariant). +- Config schema extension for `fc.*` + `gcs.*` fields. +- INFO log on successful build: `kind="c8.adapter.strategy_loaded"` with `{fc_kind, gcs_kind}`. + +## Scope + +### Included +- Both Protocols with all methods. +- All DTOs + enums. +- Error hierarchy. +- Both factories + composition-root wiring. +- Single-writer thread enforcement for outbound. +- Config schema extension. +- Unit tests: Protocol conformance, DTO immutability + slots, factory rejection on unknown strategy + missing build flag, single-thread enforcement. + +### Excluded +- Inbound MAVLink + MSP2 decoder bodies — owned by the inbound subscription task (AZ-391). +- `CovarianceProjector` — owned by the covariance projector task (AZ-392). +- `PymavlinkArdupilotAdapter` outbound body — owned by the AP outbound task (AZ-393). +- `Msp2InavAdapter` outbound body — owned by the iNav outbound task (AZ-394). +- MAVLink 2.0 signing handshake — owned by signing task. +- D-C8-2 source-set switch body — owned by source-set task. +- `QgcTelemetryAdapter` body — owned by the GCS task (AZ-397). +- C8-IT/PT/ST tests — deferred to E-BBT (AZ-262). + +## Acceptance Criteria + +**AC-1: Protocol conformance** — `runtime_checkable` `isinstance` returns True for fakes implementing each Protocol's full method set. + +**AC-2: DTOs frozen + slots** — `FrozenInstanceError` on mutation; `__slots__` non-empty for every DTO. + +**AC-3: Enum membership** — `FcKind` has 2 values (ARDUPILOT_PLANE, INAV); `FlightState` has 5 (INIT/ARMED/IN_FLIGHT/ON_GROUND/FAILED); `GpsStatus` has 5 (NO_FIX/DEGRADED/STABLE/STABLE_NON_SPOOFED/SPOOFED); `Severity` has 3 (INFO=6, WARNING=4, ERROR=3 — values mirror MAVLink STATUSTEXT severities).
+ +**AC-4: Factory rejects missing build flag** — `config.fc.adapter = "ardupilot_plane"` with `BUILD_FC_ARDUPILOT_PLANE=OFF` → `FcAdapterConfigError("BUILD_FC_ARDUPILOT_PLANE is OFF...")`. + +**AC-5: Factory rejects unknown strategy at config-load** — `config.fc.adapter = "garbage"` → `FcAdapterConfigError` at config load (NOT at build time). + +**AC-6: Single-writer thread for outbound** — composition root binds outbound to ONE thread; second binding raises `RuntimeError`. + +**AC-7: GCS factory parallel coverage** — same set of acceptance behaviours for `build_gcs_adapter` against the `GcsAdapter` Protocol. + +**AC-8: Public API re-exports** — `from gps_denied_onboard.components.c8_fc_adapter import FcAdapter, GcsAdapter, EmittedExternalPosition` resolves; internal modules NOT in `__all__`. + +**AC-9: Error hierarchy catchability** — every FC error caught by `except FcAdapterError`; every GCS error caught by `except GcsAdapterError`. `SourceSetSwitchNotSupportedError` is also a `SourceSetSwitchError` (sub-typed for iNav rejection). + +**AC-10: INFO log on build** — successful build logs `kind="c8.adapter.strategy_loaded"` once per adapter with the strategy name + port device. + +## Non-Functional Requirements + +- `build_fc_adapter` p99 ≤ 50 ms. +- `build_gcs_adapter` p99 ≤ 50 ms. + +## Constraints + +- `@runtime_checkable` on both Protocols; DTOs `frozen=True, slots=True`. +- Lazy-import per ADR-002. +- Single-thread binding enforced for outbound (AC-6). +- Public API surface limited to the two re-export sets (per `module-layout.md`). + +## Risks & Mitigation + +- **Risk**: Protocol surface changes after consumer tasks land. *Mitigation*: this task ships first; downstream tasks reference the Protocol shape locked here. Any extension is additive (new method on the Protocol implies a default no-op fallback or a follow-up Protocol version bump documented in the contract). +- **Risk**: Single-thread binding bug breaks the multi-consumer (C1 + C5) inbound path. 
*Mitigation*: AC-6 covers ONLY outbound; inbound subscribe-callback semantics are documented as fire-on-decode-thread + consumer responsibility (Invariant 8). + +## Runtime Completeness + +- **Named capability**: C8 Protocols + DTOs + factories. +- **Production code**: real Protocols, real DTOs, real error hierarchy, real factories, real composition-root wiring. +- **Allowed external stubs**: test fakes only; no production code may import `FcAdapterStub` outside tests. +- **Unacceptable substitutes**: hardcoding the C8 strategy class in the runtime root (defeats ADR-009); skipping the Protocol surface. + +## Contract + +Implements `_docs/02_document/contracts/c8_fc_adapter/fc_adapter_protocol.md`. diff --git a/_docs/02_tasks/todo/AZ-391_c8_inbound_subscription.md b/_docs/02_tasks/todo/AZ-391_c8_inbound_subscription.md new file mode 100644 index 0000000..446e6dc --- /dev/null +++ b/_docs/02_tasks/todo/AZ-391_c8_inbound_subscription.md @@ -0,0 +1,101 @@ +# C8 Inbound subscription — MAVLink + MSP2 telemetry decoders + +**Task**: AZ-391_c8_inbound_subscription +**Name**: C8 inbound subscription path — IMU/attitude/GPS-health/MAV_STATE producer +**Description**: Implement the inbound telemetry decode path for both `PymavlinkArdupilotAdapter` and `Msp2InavAdapter`. Decode AP wire frames (`RAW_IMU`/`SCALED_IMU2`, `ATTITUDE`, `GPS_RAW_INT`/`GPS2_RAW`, `HEARTBEAT`, `MAV_STATE` from `HEARTBEAT.system_status`, `STATUSTEXT`) via pymavlink. Decode iNav wire frames (`MSP2_INAV_ANALOG`, attitude+IMU stream) via YAMSPy. Both paths produce a unified `FcTelemetryFrame` stream + maintain bounded telemetry rings (drop-oldest on overflow per § 7) for `ImuWindow` (AZ-276 helper consumer side), `AttitudeWindow`, `GpsHealth`, `FlightStateSignal`. The decode thread is independent from the outbound emit thread (per Invariant 8). `subscribe_telemetry(callback)` returns a `Subscription` handle; multiple subscribers fan out from a single decode loop. 
Out-of-order timestamp drop + WARN log per Invariant 7. AC-5.1 warm-start: at first GPS_RAW_INT with valid fix, populate `FlightStateSignal.last_valid_gps_hint_wgs84` + `last_valid_gps_age_ms` for the C5 warm-start consumer. AC-5.1 surface deadline ≤ 1 s after C8 ready. +**Complexity**: 5 points +**Dependencies**: AZ-390 (Protocol + DTOs + composition), AZ-263, AZ-269, AZ-266, AZ-272 (FDR), AZ-273 (`FdrClient`), AZ-276 (`ImuPreintegrator` consumer side — this task feeds raw IMU samples) +**Component**: c8_fc_adapter (epic AZ-261 / E-C8) +**Tracker**: AZ-391 +**Epic**: AZ-261 (E-C8) + +### Document Dependencies + +- `_docs/02_document/contracts/c8_fc_adapter/fc_adapter_protocol.md` — Invariants 1, 7, 8. +- `_docs/02_document/components/10_c8_fc_adapter/description.md` — § 1 inbound, § 4 caching strategy, § 7 race conditions. +- `_docs/02_document/architecture.md` — § 5 External Integrations (per-message rate/auth/failure-mode table). + +## Problem + +Without this task, C8 has no inbound — C1 (VIO) gets no FC IMU prior, C5 (StateEstimator) gets no FC IMU window or GpsHealth or warm-start hint, and the per-FC ports never decode the wire stream. The Protocol from AZ-390 has `subscribe_telemetry` declared but no body. + +## Outcome + +- `src/gps_denied_onboard/components/c8_fc_adapter/_inbound_mavlink.py` — AP inbound decoder (pymavlink-based loop, `MAVLink_message_handler`, frame → `FcTelemetryFrame` translation). +- `src/gps_denied_onboard/components/c8_fc_adapter/_inbound_msp2.py` — iNav inbound decoder (YAMSPy + INAV-Toolkit; periodic poll loop since MSP2 is request-response). +- `src/gps_denied_onboard/components/c8_fc_adapter/_telemetry_rings.py` — bounded ring buffers per kind; drop-oldest semantics; thread-safe deque-based. +- `src/gps_denied_onboard/components/c8_fc_adapter/_subscription.py` — `Subscription` handle + multi-subscriber fan-out. 
+- Body of `subscribe_telemetry` on both `PymavlinkArdupilotAdapter` (AZ-393 produces the class shell — this task fills the inbound body) and `Msp2InavAdapter` (AZ-394 — same) — registers the callback against the multi-subscriber bus. +- Body of `current_flight_state` on both adapters — returns the latest `FlightStateSignal` from the cached ring. +- WARN log on out-of-order frame: `kind="c8.inbound.out_of_order_frame_dropped"` with `{kind, prev_ns, this_ns}`. +- DEBUG log on every decode error: `kind="c8.inbound.decode_error"`. + +## Scope + +### Included +- AP MAVLink 2.0 inbound decoder (5 message types). +- iNav MSP2 inbound decoder (poll loop + decode). +- Bounded telemetry rings + drop-oldest. +- Multi-subscriber fan-out + `Subscription` handle. +- AC-5.1 warm-start hint surfacing. +- Out-of-order drop + log per Invariant 7. +- Unit tests: AP frame decode, iNav frame decode, ring overflow drops oldest, multi-subscriber fan-out, out-of-order drop logged, warm-start hint surfaces within 1 s of first GPS_RAW_INT. + +### Excluded +- Outbound encoding paths — owned by AP / iNav outbound tasks. +- Signing handshake — owned by signing task. +- `CovarianceProjector` — owned by projector task. +- `GcsAdapter` inbound (operator commands) — owned by GCS task. +- C8-IT/PT/ST tests — deferred to E-BBT. + +## Acceptance Criteria + +**AC-1: AP RAW_IMU decode** — pymavlink `RAW_IMU` frame → `FcTelemetryFrame(kind=IMU_SAMPLE, payload=ImuSample(...))` with timestamp + 6-axis values; AC fails if `received_at` not set to `monotonic_ns()` at decode boundary. + +**AC-2: AP ATTITUDE decode** — `ATTITUDE` frame → `FcTelemetryFrame(kind=ATTITUDE, payload=AttitudeSample(...))` with roll/pitch/yaw. 
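The bounded drop-oldest rings and the multi-subscriber fan-out listed under Outcome can be sketched with the standard library alone. This is an illustrative sketch, not the contents of `_telemetry_rings.py` or `_subscription.py`: the class names `TelemetryRing` and `TelemetryBus`, the `dropped` counter, and the `cancel()` method on `Subscription` are assumptions.

```python
import threading
from collections import deque
from typing import Callable, Deque, Dict, Generic, List, TypeVar

T = TypeVar("T")


class TelemetryRing(Generic[T]):
    """Bounded buffer; deque(maxlen=...) evicts the oldest item on overflow."""

    def __init__(self, capacity: int) -> None:
        self._items: Deque[T] = deque(maxlen=capacity)
        self._lock = threading.Lock()
        self.dropped = 0

    def push(self, item: T) -> None:
        with self._lock:
            if len(self._items) == self._items.maxlen:
                self.dropped += 1  # the append below evicts the oldest entry
            self._items.append(item)

    def snapshot(self) -> List[T]:
        with self._lock:
            return list(self._items)


class Subscription:
    def __init__(self, cancel: Callable[[], None]) -> None:
        self._cancel = cancel

    def cancel(self) -> None:
        self._cancel()


class TelemetryBus(Generic[T]):
    """Fans one decoded frame out to every registered callback."""

    def __init__(self) -> None:
        self._callbacks: Dict[int, Callable[[T], None]] = {}
        self._next_token = 0
        self._lock = threading.Lock()

    def subscribe(self, callback: Callable[[T], None]) -> Subscription:
        with self._lock:
            token = self._next_token
            self._next_token += 1
            self._callbacks[token] = callback
        return Subscription(lambda: self._remove(token))

    def _remove(self, token: int) -> None:
        with self._lock:
            self._callbacks.pop(token, None)

    def publish(self, frame: T) -> None:
        # Copy under the lock, call outside it, so a callback may unsubscribe.
        with self._lock:
            callbacks = list(self._callbacks.values())
        for callback in callbacks:
            callback(frame)
```

The sketch calls subscribers synchronously on the publishing thread, so slow consumers would still need their own non-blocking enqueue, as the task's constraints on the decode thread require.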
+ +**AC-3: AP GPS_RAW_INT → GpsHealth** — `GPS_RAW_INT.fix_type` mapped to `GpsStatus` per the documented table (NO_FIX/DEGRADED/STABLE); `STABLE_NON_SPOOFED` requires the GPS_RAW_INT.signed_flag (or equivalent) to be set; `SPOOFED` requires the FC's spoofing-detection telemetry (not always present — degraded to STABLE if absent). + +**AC-4: AP HEARTBEAT → FlightState** — `HEARTBEAT.system_status` mapped to `FlightState` per the table. + +**AC-5: iNav MSP2 decode** — `MSP2_INAV_ANALOG` + attitude/IMU poll responses → matching `FcTelemetryFrame`s with the SAME unified DTO shape as AP. iNav has no spoofing-detection — `GpsStatus.SPOOFED` is unreachable for iNav. + +**AC-6: Bounded ring drop-oldest** — push 1000 frames into a 100-capacity ring; assert oldest 900 dropped; ring contains the latest 100; INFO log emitted at first overflow with `kind="c8.inbound.ring_overflow"`. + +**AC-7: Multi-subscriber fan-out** — register 3 subscribers; emit one frame; assert all 3 callbacks invoked; cancel one subscription; emit another frame; assert remaining 2 invoked. + +**AC-8: AC-5.1 warm-start hint within 1 s** — `current_flight_state()` returns `FlightStateSignal` with `last_valid_gps_hint_wgs84 != None` within 1 s of the first GPS_RAW_INT decode. + +**AC-9: Out-of-order drop + WARN** — inject a frame with `received_at` < previous frame of same kind; assert frame dropped + WARN log emitted. + +**AC-10: Decode-error isolation** — corrupt frame → DEBUG log + frame dropped; subsequent valid frames still processed (the decoder MUST NOT crash on a single malformed frame). + +## Non-Functional Requirements + +- Inbound IMU callback p95 ≤ 1 ms (C8-PT-01 budget). +- AP decode loop: 200 Hz IMU sustained without dropping > 1% of frames. +- iNav poll loop: 100 Hz attitude+IMU sustained. + +## Constraints + +- Single decode thread per adapter; thread-safe ring access. +- pymavlink is bundled unmodified per D-C8-3. +- YAMSPy + INAV-Toolkit at the project's pinned versions. 
+- The decode thread MUST NOT block on subscriber callbacks longer than 100 µs; slow subscribers must use a non-blocking enqueue + drain on their own thread. + +## Risks & Mitigation + +- **Risk: pymavlink message-handler performance under 200 Hz IMU** — *Mitigation*: profile early; if marginal, offload decode to a small C extension. Project's pin of pymavlink is known to handle this rate. +- **Risk: iNav poll-rate drift on slow UART** — *Mitigation*: configurable poll period; degrade gracefully (lower rate, log WARN once per minute). +- **Risk: Out-of-order frames silently mask bugs (R05-style)** — *Mitigation*: AC-9 mandates WARN on every drop; aggregate counter + INFO log every 60 s. + +## Runtime Completeness + +- **Named capability**: C8 inbound telemetry decode + multi-subscriber fan-out. +- **Production code**: real pymavlink decoder, real YAMSPy decoder, real ring buffers, real subscription handles. +- **Allowed external stubs**: SITL fakes for tests; no production stubs. +- **Unacceptable substitutes**: a periodic-fake-IMU generator in production. + +## Contract + +Implements `_docs/02_document/contracts/c8_fc_adapter/fc_adapter_protocol.md` — `subscribe_telemetry`, `current_flight_state`, Invariants 1, 7, 8. diff --git a/_docs/02_tasks/todo/AZ-392_c8_covariance_projector.md b/_docs/02_tasks/todo/AZ-392_c8_covariance_projector.md new file mode 100644 index 0000000..1ec600a --- /dev/null +++ b/_docs/02_tasks/todo/AZ-392_c8_covariance_projector.md @@ -0,0 +1,87 @@ +# C8 CovarianceProjector — honest 6×6 → 2×2 → equivalent_radius + +**Task**: AZ-392_c8_covariance_projector +**Name**: C8 `CovarianceProjector` — honest 6×6 → 2×2 → equivalent_radius helper (D-C8-8 = (b)) +**Description**: Implement the `CovarianceProjector` class — a C8-internal helper that projects a 6×6 GTSAM `Marginals` covariance from `EstimatorOutput.covariance_6x6` into the per-FC scalar accuracy field. 
Steps: 6×6 → 3×3 position sub-matrix (top-left 3×3 block) → 2×2 horizontal sub-matrix (rows/cols 0,1) → `equivalent_radius` per the AC-4.3 formula `sqrt(0.5 * (sigma_xx + sigma_yy + sqrt((sigma_xx - sigma_yy)^2 + 4*sigma_xy^2)))` (the square root of the largest eigenvalue of the 2×2 block). Two output paths: (i) `to_ardupilot_horiz_accuracy_m(cov_6x6) -> float` returning meters for AP `GPS_INPUT.horiz_accuracy`; (ii) `to_inav_h_pos_accuracy_mm(cov_6x6) -> int` returning millimeters (clamped to the uint16 maximum of 65535 per AC-5) for iNav `MSP2_SENSOR_GPS.hPosAccuracy`. Documented as the "honest covariance projection" per IT-10. Constructor injection (no static methods — per coderule SRP, this helper has variant-specific output formatting that belongs on an instance). NaN / non-SPD input → `FcEmitError("non-SPD covariance from C5; refusing emit")`. +**Complexity**: 3 points +**Dependencies**: AZ-390 (error hierarchy includes `FcEmitError`), AZ-263, AZ-269, AZ-266, AZ-272 (FDR for the SPD-violation log) +**Component**: c8_fc_adapter (epic AZ-261 / E-C8) +**Tracker**: AZ-392 +**Epic**: AZ-261 (E-C8) + +### Document Dependencies + +- `_docs/02_document/contracts/c8_fc_adapter/fc_adapter_protocol.md` — Invariant 4 (honest projection); error path for non-SPD. +- `_docs/02_document/components/10_c8_fc_adapter/description.md` — § 5 covariance projection formula; § 6 helper ownership table. + +## Problem + +Without this task, both AP and iNav outbound paths must each implement covariance projection independently — risking drift between adapters AND violating Invariant 4 (honest projection). The Frobenius-norm equivalence requirement (within 1% per C8-IT-01) demands a single canonical implementation. + +## Outcome + +- `src/gps_denied_onboard/components/c8_fc_adapter/_covariance_projector.py` — `CovarianceProjector` class with two methods. +- Re-exported as `c8_fc_adapter.helpers.CovarianceProjector` for internal C8 use only (NOT in Public API per `module-layout.md`).
+- Constructor: `CovarianceProjector(fdr_client: FdrClient)` — for SPD-violation logging. +- Unit tests: known-input vs hand-computed expected output; Frobenius-norm equivalence within 1% on 100 synthetic 6×6 SPD matrices; non-SPD raises; NaN raises; clamping at iNav uint16 max documented. + +## Scope + +### Included +- `CovarianceProjector` class with both per-FC methods. +- 6×6 → 3×3 → 2×2 → eigenvalue formula. +- AP meters output + iNav millimeters output. +- SPD validation + NaN check. +- Unit tests including Frobenius-norm equivalence (the C8-IT-01 acceptance test exercises this end-to-end; this task's unit tests cover the projector body). + +### Excluded +- AP / iNav wire encoding — owned by outbound tasks (they consume this projector). +- The C8-IT-01 end-to-end test — deferred to E-BBT. + +## Acceptance Criteria + +**AC-1: Hand-computed reference** — for `cov_6x6` with known position-block `[[4, 1, 0], [1, 9, 0], [0, 0, 16]]`, `to_ardupilot_horiz_accuracy_m(...)` returns `sqrt(0.5 * (4 + 9 + sqrt((4-9)^2 + 4)))` = `sqrt(6.5 + sqrt(29)/2)` ≈ 3.0319 (assert within 1e-9 numerical tolerance). + +**AC-2: Frobenius-norm equivalence** — on 100 synthetic 6×6 SPD matrices, the 2×2 horizontal-block Frobenius norm is within 1% of the 3×3 position-block horizontal-component Frobenius norm. (This is a slightly weaker form of C8-IT-01 — exact equivalence is exercised end-to-end.) + +**AC-3: AP units = meters** — `to_ardupilot_horiz_accuracy_m` returns a `float`; documented as meters. + +**AC-4: iNav units = millimeters** — `to_inav_h_pos_accuracy_mm` returns an `int`; documented as millimeters; conversion is `m * 1000.0` rounded half-up to int. + +**AC-5: iNav int clamping** — covariance with `equivalent_radius > 65.535 m` (uint16 max in mm) → returned value clamped to 65535; WARN log emitted with `kind="c8.cov_projector.inav_clamped"` on every clamp event.
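The reduction and unit conversion in AC-1, AC-4 and AC-5 can be sketched roughly as below. This is an illustrative sketch, not the task's implementation. It uses plain nested lists and the standard library so that it runs without dependencies (the task itself names numpy). The function names `equivalent_radius_m` and `to_inav_mm` are hypothetical, `ValueError` stands in for `FcEmitError`, and the positive-definiteness check covers only the 2×2 horizontal block, which is an assumption about the scope of the SPD guard. FDR and WARN logging are omitted.

```python
import math

UINT16_MAX = 65535  # iNav hPosAccuracy ceiling in millimeters, per AC-5


def _horizontal_block(cov_6x6):
    """Return (a, b, d) = (sigma_xx, sigma_xy, sigma_yy) from rows/cols 0,1."""
    if len(cov_6x6) != 6 or any(len(row) != 6 for row in cov_6x6):
        raise ValueError("expected a 6x6 covariance")
    if any(math.isnan(v) for row in cov_6x6 for v in row):
        # Stand-in for FcEmitError("NaN covariance from C5; refusing emit")
        raise ValueError("NaN covariance")
    a, b, d = cov_6x6[0][0], cov_6x6[0][1], cov_6x6[1][1]
    # Sylvester's criterion on the 2x2 block (assumed scope of the SPD guard)
    if not (a > 0 and a * d - b * b > 0):
        # Stand-in for FcEmitError("non-SPD covariance from C5; refusing emit")
        raise ValueError("non-SPD covariance")
    return a, b, d


def equivalent_radius_m(cov_6x6):
    """Square root of the largest eigenvalue of the 2x2 horizontal block."""
    a, b, d = _horizontal_block(cov_6x6)
    lam_max = 0.5 * (a + d + math.sqrt((a - d) ** 2 + 4.0 * b * b))
    return math.sqrt(lam_max)


def to_inav_mm(cov_6x6):
    """Meters -> millimeters, rounded half-up, clamped to the uint16 range."""
    mm = math.floor(equivalent_radius_m(cov_6x6) * 1000.0 + 0.5)
    return min(mm, UINT16_MAX)


def make_cov(position_block):
    """Hypothetical test helper: embed a 3x3 position block in a 6x6 identity."""
    cov = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
    for i in range(3):
        for j in range(3):
            cov[i][j] = float(position_block[i][j])
    return cov
```

Both output paths go through the same `_horizontal_block` reduction, which mirrors the constraint that variant-specific code is limited to the final unit conversion.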
+ +**AC-6: Non-SPD raises** — input with negative eigenvalue (e.g., `[[1, 0, 0], [0, -1, 0], [0, 0, 1]]` in the position block) → `FcEmitError("non-SPD covariance from C5; refusing emit")`. + +**AC-7: NaN raises** — input with any NaN entry → `FcEmitError("NaN covariance from C5; refusing emit")`. + +**AC-8: SPD-violation FDR log** — every SPD-violation (AC-6) and NaN (AC-7) emits an FDR record `kind="c8.cov_projector.spd_violation"` BEFORE raising. + +**AC-9: Per-FC same source** — `to_ardupilot_horiz_accuracy_m(cov) * 1000` equals `to_inav_h_pos_accuracy_mm(cov)` ± 1 (rounding), for any well-conditioned input. + +**AC-10: No global state** — two independent `CovarianceProjector` instances do not share state (instance method; not static — per SRP). + +## Non-Functional Requirements + +- p99 ≤ 100 µs per call (constant-time per § 5). + +## Constraints + +- Instance method (NOT static) per coderule SRP — variant-specific output formatting on the instance. +- Public API: NOT exposed outside C8 (helper-only). +- numpy is the only allowed dependency for the eigenvalue computation. +- The two methods MUST share the same intermediate 3×3 / 2×2 reduction code path — variant-specific code is the unit conversion at the end (per coderule SRP). + +## Risks & Mitigation + +- **Risk: numpy eigvalsh vs custom 2×2 closed-form gives different results within numerical tolerance** — *Mitigation*: use the closed-form `sqrt(0.5 * (a + d + sqrt((a-d)^2 + 4*b^2)))` directly; faster + bit-stable. +- **Risk: SPD-violation cascades into a tight error loop in production** — *Mitigation*: the projector raises; the AP / iNav outbound task is responsible for handling `FcEmitError` + dropping the frame + continuing. C5's emit thread does not retry on SPD violation. + +## Runtime Completeness + +- **Named capability**: honest 6×6 → 2×2 → equivalent_radius projection for both FC variants. 
+- **Production code**: real numpy-based reduction; real per-FC unit conversion; real SPD + NaN guards; real FDR logging. +- **Unacceptable substitutes**: a constant fixed-radius output ("3.0 m always") — defeats AC-4.3 and Invariant 4. + +## Contract + +Implements `_docs/02_document/contracts/c8_fc_adapter/fc_adapter_protocol.md` — Invariant 4. diff --git a/_docs/02_tasks/todo/AZ-393_c8_ardupilot_outbound.md b/_docs/02_tasks/todo/AZ-393_c8_ardupilot_outbound.md new file mode 100644 index 0000000..fcdd663 --- /dev/null +++ b/_docs/02_tasks/todo/AZ-393_c8_ardupilot_outbound.md @@ -0,0 +1,104 @@ +# C8 PymavlinkArdupilotAdapter — outbound GPS_INPUT + STATUSTEXT + NAMED_VALUE_FLOAT + +**Task**: AZ-393_c8_ardupilot_outbound +**Name**: C8 `PymavlinkArdupilotAdapter` outbound — `GPS_INPUT` 5 Hz + provenance side-channel +**Description**: Implement the `PymavlinkArdupilotAdapter.emit_external_position(EstimatorOutput)` body: encode `EstimatorOutput` into a MAVLink 2.0 `GPS_INPUT` frame (lat/lon/alt from WGS84 conversion via injected `WgsConverter`; `horiz_accuracy` from the injected `CovarianceProjector.to_ardupilot_horiz_accuracy_m`; `vel_n/vel_e/vel_d` from the velocity sub-vector if present in the C5 estimate); write to the pymavlink connection via `mav.gps_input_send(...)`. Side-channel: emit `NAMED_VALUE_FLOAT` with `name="src_lbl"` carrying the `EstimatorOutput.source_label` enum value (encoded as float per the documented enum-to-float mapping); also emit `STATUSTEXT(severity=INFO, "src=