[AZ-185][AZ-186] Batch 2

Made-with: Cursor
Oleksandr Bezdieniezhnykh
2026-04-15 07:32:37 +03:00
parent d244799f02
commit 9a0248af72
18 changed files with 1857 additions and 26 deletions
@@ -0,0 +1,53 @@
# Publish artifact script (AZ-186)
Training services and CI/CD call `scripts/publish_artifact.py` after producing an artifact (for example a `.trt` model or a Docker image tarball). The script gzip-compresses the file, encrypts it with a freshly generated random 32-byte key using AES-256 in CBC mode with PKCS7 padding (the 16-byte IV is prepended to the ciphertext), uploads the ciphertext to S3, and registers metadata with the admin API.
## CLI
```text
python scripts/publish_artifact.py \
--file /path/to/artifact \
--resource-name my_model \
--dev-stage dev \
--architecture arm64 \
--version 2026-04-15
```
Object key: `{dev_stage}/{resource_name}-{architecture}-{version}.enc`
## Environment variables
| Variable | Required | Purpose |
|----------|----------|---------|
| `S3_ENDPOINT` | yes | S3-compatible endpoint URL |
| `S3_ACCESS_KEY` | yes | Upload credentials |
| `S3_SECRET_KEY` | yes | Upload credentials |
| `S3_BUCKET` | yes | Target bucket |
| `ADMIN_API_URL` | yes | Admin API base URL, without the publish path (that path is set by `ADMIN_API_PUBLISH_PATH`) |
| `ADMIN_API_TOKEN` | yes | Bearer token for the publish request |
| `CDN_PUBLIC_BASE_URL` | no | If set, `cdn_url` in the registration payload is `{CDN_PUBLIC_BASE_URL}/{object_key}`; otherwise it defaults to `{S3_ENDPOINT}/{S3_BUCKET}/{object_key}` |
| `ADMIN_API_PUBLISH_PATH` | no | Defaults to `internal/resources/publish`; POST is sent to `{ADMIN_API_URL}/{ADMIN_API_PUBLISH_PATH}` |
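
As an illustration of how the object key, `cdn_url`, and publish URL described above fit together, here is a minimal sketch. The function names are hypothetical and are not taken from `scripts/publish_artifact.py`; trimming stray slashes when joining is also an assumption, since the source does not say how the script handles them.

```python
from typing import Optional


def object_key(dev_stage: str, resource_name: str, architecture: str, version: str) -> str:
    """Build the S3 object key in the documented layout."""
    return f"{dev_stage}/{resource_name}-{architecture}-{version}.enc"


def cdn_url(key: str, s3_endpoint: str, s3_bucket: str,
            cdn_public_base_url: Optional[str] = None) -> str:
    """Prefer CDN_PUBLIC_BASE_URL when set, else fall back to endpoint/bucket."""
    if cdn_public_base_url:
        # Assumption: a trailing slash on the base URL is tolerated and trimmed.
        return f"{cdn_public_base_url.rstrip('/')}/{key}"
    return f"{s3_endpoint.rstrip('/')}/{s3_bucket}/{key}"


def publish_url(admin_api_url: str,
                publish_path: str = "internal/resources/publish") -> str:
    """Join ADMIN_API_URL and ADMIN_API_PUBLISH_PATH with exactly one slash."""
    return f"{admin_api_url.rstrip('/')}/{publish_path.lstrip('/')}"


key = object_key("dev", "my_model", "arm64", "2026-04-15")
print(key)
print(cdn_url(key, "https://s3.example.com", "artifacts"))
print(publish_url("https://admin.example.com"))
```

The hostnames in the usage lines are placeholders.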
## Admin API contract
`POST {ADMIN_API_URL}/internal/resources/publish` (unless overridden) with JSON body:
- `resource_name`, `dev_stage`, `architecture`, `version` (strings)
- `cdn_url` (string)
- `sha256` (lowercase hex of the uploaded ciphertext file, including the 16-byte IV)
- `encryption_key` (64-character hex encoding of the raw 32-byte AES key)
- `size_bytes` (integer size of the uploaded ciphertext file)
The loader expects the same `encryption_key` and `sha256` semantics as returned by fleet `POST /get-update` (hex key, hash of the ciphertext object).
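
The field semantics above can be sketched as follows. This is an illustrative helper, not the script's actual code: `build_publish_payload` is a hypothetical name, and the ciphertext here is a stand-in byte string rather than real AES output. It shows only that `sha256` and `size_bytes` are computed over the uploaded bytes (IV included) and that the key is sent as 64 hex characters.

```python
import hashlib
import json
import secrets


def build_publish_payload(ciphertext: bytes, key: bytes, *, resource_name: str,
                          dev_stage: str, architecture: str, version: str,
                          cdn_url: str) -> dict:
    """Assemble the registration body for the admin API (hypothetical helper)."""
    if len(key) != 32:
        raise ValueError("AES-256 key must be exactly 32 bytes")
    return {
        "resource_name": resource_name,
        "dev_stage": dev_stage,
        "architecture": architecture,
        "version": version,
        "cdn_url": cdn_url,
        # Hash and size cover the whole uploaded object, including the IV prefix.
        "sha256": hashlib.sha256(ciphertext).hexdigest(),  # lowercase hex
        "encryption_key": key.hex(),                       # 64 hex characters
        "size_bytes": len(ciphertext),
    }


key = secrets.token_bytes(32)                        # fresh random key per artifact
fake_ciphertext = secrets.token_bytes(16) + b"\x00" * 32  # stand-in: IV + body
payload = build_publish_payload(
    fake_ciphertext, key,
    resource_name="my_model", dev_stage="dev", architecture="arm64",
    version="2026-04-15",
    cdn_url="https://cdn.example.com/dev/my_model-arm64-2026-04-15.enc",
)
print(json.dumps(payload, indent=2))
```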
## Dependencies
Use the same major versions as the loader: `boto3`, `cryptography`, `requests` (see `requirements.txt`). A minimal install for a training host is:
```text
pip install boto3==1.40.9 cryptography==44.0.2 requests==2.32.4
```
## Woodpecker
Pipeline `.woodpecker/build-arm.yml` saves the built image to `loader-image.tar` and runs this script in a follow-up step. Configure the environment variables above as Woodpecker secrets for that step.
@@ -0,0 +1,129 @@
# TPM-Based Security Provider
**Task**: AZ-182_tpm_security_provider
**Name**: TPM Security Provider
**Description**: Introduce SecurityProvider abstraction with TPM detection and FAPI integration, wrapping existing security logic in LegacySecurityProvider for backward compatibility
**Complexity**: 5 points
**Dependencies**: None
**Component**: 02 Security
**Tracker**: AZ-182
**Epic**: AZ-181
## Problem
The loader's security code (key derivation, encryption, hardware fingerprinting) is hardcoded for the binary-split scheme. On fused Jetson Orin Nano devices with fTPM, this scheme is unnecessary — full-disk encryption protects data at rest, and the fleet update system (AZ-185) handles encrypted artifact delivery with per-artifact keys. However, the loader still needs a clean abstraction to:
1. Detect whether it's running on a TPM-equipped device or a legacy environment
2. Provide TPM seal/unseal capability as infrastructure for defense-in-depth (sealed credentials, future key wrapping)
3. Preserve the legacy code path for non-TPM deployments
## Outcome
- Loader detects TPM availability at startup and selects the appropriate security provider
- SecurityProvider abstraction cleanly separates TPM and legacy code paths
- TpmSecurityProvider establishes FAPI connection and provides seal/unseal operations
- LegacySecurityProvider wraps existing security.pyx unchanged
- Foundation in place for fTPM-sealed credentials (future) and per-artifact key decryption integration
## Scope
### Included
- SecurityProvider abstraction (ABC) with TpmSecurityProvider and LegacySecurityProvider
- Runtime TPM detection (/dev/tpm0 + SECURITY_PROVIDER env var override)
- tpm2-pytss FAPI integration: connect, create_seal, unseal
- LegacySecurityProvider wrapping existing security.pyx (encrypt, decrypt, key derivation)
- Auto-detection and provider selection at startup with logging
- Docker compose device mounts for /dev/tpm0 and /dev/tpmrm0
- Dockerfile changes: install tpm2-tss native library + tpm2-pytss
- Tests using TPM simulator (swtpm)
### Excluded
- Resource download/upload changes (handled by AZ-185 Update Manager with per-artifact keys)
- Docker unlock flow changes (handled by AZ-185 Update Manager)
- fTPM provisioning pipeline (manufacturing-time, separate from code)
- Remote attestation via EK certificates
- fTPM-sealed device credentials (future enhancement, not v1)
- Changes to the Azaion admin API server
## Acceptance Criteria
**AC-1: SecurityProvider auto-detection**
Given a Jetson device with provisioned fTPM and /dev/tpm0 accessible
When the loader starts
Then TpmSecurityProvider is selected and logged
**AC-2: TPM seal/unseal round-trip**
Given TpmSecurityProvider is active
When data is sealed via FAPI create_seal and later unsealed
Then the unsealed data matches the original
**AC-3: Legacy path unchanged**
Given no TPM is available (/dev/tpm0 absent)
When the loader starts and processes resource requests
Then LegacySecurityProvider is selected and all behavior is identical to the current scheme
**AC-4: Env var override**
Given SECURITY_PROVIDER=legacy is set
When the loader starts on a device with /dev/tpm0 present
Then LegacySecurityProvider is selected regardless of TPM availability
**AC-5: Graceful fallback**
Given /dev/tpm0 exists but FAPI connection fails
When the loader starts
Then it falls back to LegacySecurityProvider with a warning log
**AC-6: Docker container TPM access**
Given docker-compose.yml with /dev/tpm0 and /dev/tpmrm0 device mounts
When the loader container starts on a fused Jetson
Then TpmSecurityProvider can connect to fTPM via FAPI
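
The selection rules in AC-1, AC-3, AC-4, and AC-5 can be sketched as a small factory. This is a minimal illustration under assumptions: the class bodies are placeholders, `select_provider` and its `connect_fapi` callable are hypothetical names, and the real TPM provider would open a tpm2-pytss FAPI context where this sketch only calls an injected function.

```python
import logging
import os
from abc import ABC, abstractmethod
from typing import Callable, Mapping, Optional

log = logging.getLogger("security_provider")


class SecurityProvider(ABC):
    @abstractmethod
    def name(self) -> str: ...


class LegacySecurityProvider(SecurityProvider):
    def name(self) -> str:
        return "legacy"


class TpmSecurityProvider(SecurityProvider):
    def __init__(self, connect: Callable[[], object]):
        # Placeholder for the FAPI connection; may raise on failure.
        self._fapi = connect()

    def name(self) -> str:
        return "tpm"


def select_provider(connect_fapi: Callable[[], object],
                    env: Optional[Mapping[str, str]] = None,
                    tpm_path: str = "/dev/tpm0",
                    exists: Callable[[str], bool] = os.path.exists) -> SecurityProvider:
    env = os.environ if env is None else env
    # AC-4: an explicit override wins regardless of TPM availability.
    if env.get("SECURITY_PROVIDER", "").lower() == "legacy":
        log.info("SECURITY_PROVIDER=legacy set; using LegacySecurityProvider")
        return LegacySecurityProvider()
    # AC-3: no device node means no TPM.
    if not exists(tpm_path):
        log.info("%s not found; using LegacySecurityProvider", tpm_path)
        return LegacySecurityProvider()
    # AC-5: any initialization failure falls back with a warning.
    try:
        provider = TpmSecurityProvider(connect_fapi)
    except Exception as exc:
        log.warning("TPM init failed (%s); falling back to legacy", exc)
        return LegacySecurityProvider()
    log.info("Using TpmSecurityProvider")  # AC-1
    return provider
```

Injecting `exists` and `connect_fapi` keeps the factory testable without a device node or swtpm, which is one way the unit tests for AC-1, AC-3, AC-4, and AC-5 could mock the environment.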
## Non-Functional Requirements
**Performance**
- TPM seal/unseal latency must be under 500ms per operation
**Compatibility**
- Must work on ARM64 Jetson Orin Nano with JetPack 6.1+
- Must work inside Docker containers with --device mounts
- tpm2-pytss must be compatible with Python 3.11 and Cython compilation
**Reliability**
- Graceful fallback to LegacySecurityProvider on any TPM initialization failure
- No crash on /dev/tpm0 absence — clean detection and fallback
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | SecurityProvider factory with /dev/tpm0 mock present | TpmSecurityProvider selected |
| AC-2 | FAPI create_seal + unseal via swtpm | Data matches round-trip |
| AC-3 | SecurityProvider factory without /dev/tpm0 | LegacySecurityProvider selected |
| AC-4 | SECURITY_PROVIDER=legacy env var with /dev/tpm0 present | LegacySecurityProvider selected |
| AC-5 | /dev/tpm0 exists but FAPI raises exception | LegacySecurityProvider selected, warning logged |
## Blackbox Tests
| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|------------------------|-------------|-------------------|----------------|
| AC-3 | No TPM device available | POST /load/{filename} (split resource) | Existing binary-split behavior, all current tests pass | Compatibility |
| AC-6 | TPM simulator in Docker | Container starts with device mounts | FAPI connects, seal/unseal works | Compatibility |
## Constraints
- tpm2-pytss requires tpm2-tss >= 2.4.0 native library in the Docker image
- Tests require swtpm (software TPM simulator) — must be added to test infrastructure
- fTPM provisioning is out of scope — this task assumes a provisioned TPM exists
- PCR-based policy binding intentionally not used (known persistence issues on Orin Nano)
## Risks & Mitigation
**Risk 1: fTPM FAPI stability on Jetson Orin Nano**
- *Risk*: FAPI seal/unseal may have undocumented issues on Orin Nano (similar to PCR/NV persistence bugs)
- *Mitigation*: Design intentionally avoids PCR policies and NV indexes; uses SRK hierarchy only. Hardware validation required before production deployment.
**Risk 2: swtpm test fidelity**
- *Risk*: Software TPM simulator may not reproduce all fTPM behaviors
- *Mitigation*: Integration tests on actual Jetson hardware as part of acceptance testing (outside CI).
**Risk 3: tpm2-tss native library in Docker image**
- *Risk*: tpm2-tss may not be available in python:3.11-slim base image; ARM64 build may need compilation
- *Mitigation*: Add tpm2-tss to Dockerfile build step; verify ARM64 compatibility early.
@@ -0,0 +1,79 @@
# Resumable Download Manager
**Task**: AZ-184_resumable_download_manager
**Name**: Resumable Download Manager
**Description**: Implement a resumable HTTP download manager for the loader that handles intermittent Starlink connectivity
**Complexity**: 3 points
**Dependencies**: None
**Component**: Loader
**Tracker**: AZ-184
**Epic**: AZ-181
## Problem
Jetsons on UAVs have intermittent Starlink connectivity. Downloads of large artifacts (AI models ~500MB, Docker images ~1GB) must survive connection drops and resume from where they left off.
## Outcome
- Downloads resume from the last byte received after connectivity loss
- Completed downloads are verified with SHA-256 before use
- Downloaded artifacts are decrypted with per-artifact AES-256 keys
- State persists across loader restarts
## Scope
### Included
- Resumable HTTP downloads using Range headers (S3 supports natively)
- JSON state file on disk tracking: url, expected_sha256, expected_size, bytes_downloaded, temp_file_path
- SHA-256 verification of completed downloads
- AES-256 decryption of downloaded artifacts using per-artifact key from /get-update response
- Retry with exponential backoff (1min, 5min, 15min, 1hr, max 4hr)
- State machine: pending -> downloading -> paused -> verifying -> decrypting -> complete / failed
### Excluded
- Update check logic (AZ-185)
- Applying updates (AZ-185)
- CDN upload (AZ-186)
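
To make the state file and Range-based resume listed under Included more concrete, here is a minimal sketch. The record fields mirror the ones named in the scope; the extra `state` field, the atomic write via a temporary file, and all function names are assumptions made for illustration, not the loader's actual implementation.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from enum import Enum


class DownloadState(str, Enum):
    PENDING = "pending"
    DOWNLOADING = "downloading"
    PAUSED = "paused"
    VERIFYING = "verifying"
    DECRYPTING = "decrypting"
    COMPLETE = "complete"
    FAILED = "failed"


@dataclass
class DownloadRecord:
    url: str
    expected_sha256: str
    expected_size: int
    bytes_downloaded: int
    temp_file_path: str
    state: str = DownloadState.PENDING.value  # assumption: state stored alongside


def save_state(record: DownloadRecord, path: str) -> None:
    """Write to a temp file, then rename, so a crash never leaves half a file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(asdict(record), fh)
    os.replace(tmp, path)


def load_state(path: str) -> DownloadRecord:
    with open(path) as fh:
        return DownloadRecord(**json.load(fh))


def resume_headers(record: DownloadRecord) -> dict:
    """Ask the server for everything from the first missing byte onward."""
    if record.bytes_downloaded <= 0:
        return {}
    return {"Range": f"bytes={record.bytes_downloaded}-"}
```

A caller would pass `resume_headers(record)` to its HTTP client and append the response body to `temp_file_path`, updating `bytes_downloaded` as chunks arrive.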
## Acceptance Criteria
**AC-1: Resume after connection drop**
Given a download is 60% complete and connectivity is lost
When connectivity returns
Then download resumes from byte offset (60% of file), not from scratch
**AC-2: SHA-256 mismatch triggers re-download**
Given a completed download with corrupted data
When SHA-256 verification fails
Then the corrupted file is deleted and the download restarts from scratch
**AC-3: Decryption produces correct output**
Given a completed and verified download
When decrypted with the per-artifact AES-256 key
Then the output matches the original unencrypted artifact
**AC-4: State survives restart**
Given a download is 40% complete and the loader container restarts
When the loader starts again
Then the download resumes from 40%, not from scratch
**AC-5: Exponential backoff on repeated failures**
Given multiple consecutive connection failures
When retrying
Then wait times follow exponential backoff pattern
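
One way to express the retry schedule from the scope (1min, 5min, 15min, 1hr, max 4hr) is a fixed lookup that holds at the last step. This is a sketch under the assumption that "max 4hr" means every retry after the fourth waits four hours; `backoff_delay` is a hypothetical name.

```python
# Delays in seconds: 1 min, 5 min, 15 min, 1 h, then capped at 4 h.
BACKOFF_SCHEDULE_SECONDS = (60, 300, 900, 3600, 14400)


def backoff_delay(consecutive_failures: int) -> int:
    """Return the wait before the next retry after N consecutive failures."""
    if consecutive_failures < 1:
        return 0  # nothing has failed yet, so no wait
    index = min(consecutive_failures, len(BACKOFF_SCHEDULE_SECONDS)) - 1
    return BACKOFF_SCHEDULE_SECONDS[index]


for failures in range(1, 8):
    print(failures, backoff_delay(failures))
```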
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Mock HTTP server drops connection mid-transfer | Resume with Range header from correct offset |
| AC-2 | Corrupt downloaded file | SHA-256 check fails, file deleted, retry flag set |
| AC-3 | Encrypt test file, download, decrypt | Round-trip matches original |
| AC-4 | Write state file, reload | State correctly restored |
| AC-5 | Track retry intervals | Backoff pattern matches spec |
## Constraints
- Must work inside Docker container
- S3-compatible CDN (current CDNManager already uses boto3)
- State file location must be on a volume that persists across container restarts
@@ -0,0 +1,76 @@
# Update Manager
**Task**: AZ-185_update_manager
**Name**: Update Manager
**Description**: Implement the loader's background update loop that checks for new versions and applies AI model and Docker image updates
**Complexity**: 5 points
**Dependencies**: AZ-183, AZ-184
**Component**: Loader
**Tracker**: AZ-185
**Epic**: AZ-181
## Problem
Jetsons need to automatically discover and install new AI models and Docker images without manual intervention. The update loop must handle version detection, server communication, and applying different update types.
## Outcome
- Loader automatically checks for updates every 5 minutes
- New AI models downloaded, decrypted, and placed in model directory
- New Docker images loaded and services restarted with minimal downtime
- Loader can update itself (self-update, applied last)
## Scope
### Included
- Version collector: scan model directory for .trt files (extract date from filename), query docker images for azaion/* tags, cache results
- Background loop (configurable interval, default 5 min): collect versions, call POST /get-update, trigger downloads
- Apply AI model: move decrypted .trt to model directory (detection API scans and picks newest)
- Apply Docker image: docker load -i, docker compose up -d {service}
- Self-update: loader updates itself last via docker compose up -d loader
- Integration with AZ-184 Resumable Download Manager for all downloads
### Excluded
- Server-side /get-update endpoint (AZ-183)
- Download mechanics (AZ-184)
- CI/CD publish pipeline (AZ-186)
- Device provisioning (AZ-187)
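
The version collector described under Included can be sketched with plain string handling. The filename pattern `azaion-YYYY-MM-DD.trt` and the `detection_model` resource name are inferred from the AC-1 example and may not cover every real filename; the function names are hypothetical, and listing files or Docker images is left to the caller.

```python
import re
from typing import Iterable, List, Optional

# Assumption: model files are named like azaion-2026-03-10.trt.
_MODEL_RE = re.compile(r"^azaion-(\d{4}-\d{2}-\d{2})\.trt$")


def model_version(filename: str) -> Optional[str]:
    """Extract the date part of a model filename, or None if it does not match."""
    match = _MODEL_RE.match(filename)
    return match.group(1) if match else None


def image_version(ref: str) -> Optional[dict]:
    """Split 'azaion/annotations:arm64_2026-03-01' into name and tag."""
    repo, sep, tag = ref.partition(":")
    if not sep or not repo.startswith("azaion/"):
        return None  # untagged, or not one of our images
    return {"resource_name": repo[len("azaion/"):], "version": tag}


def collect_versions(model_files: Iterable[str], image_refs: Iterable[str]) -> List[dict]:
    versions = []
    dates = [v for v in map(model_version, model_files) if v]
    if dates:
        # ISO dates sort correctly as strings, so max() is the newest model.
        versions.append({"resource_name": "detection_model", "version": max(dates)})
    for ref in image_refs:
        parsed = image_version(ref)
        if parsed:
            versions.append(parsed)
    return versions


print(collect_versions(
    ["azaion-2026-03-10.trt", "notes.txt"],
    ["azaion/annotations:arm64_2026-03-01", "python:3.11-slim"],
))
```

Image references that include a registry host with a port are not handled by this sketch.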
## Acceptance Criteria
**AC-1: Version collector reads local state**
Given AI model azaion-2026-03-10.trt in model directory and Docker image azaion/annotations:arm64_2026-03-01 loaded
When version collector runs
Then it reports [{resource_name: "detection_model", version: "2026-03-10"}, {resource_name: "annotations", version: "arm64_2026-03-01"}]
**AC-2: Background loop polls on schedule**
Given the loader is running with update interval set to 5 minutes
When 5 minutes elapse
Then POST /get-update is called with current versions
**AC-3: AI model update applied**
Given /get-update returns a new detection_model version
When download and decryption complete
Then new .trt file is in the model directory
**AC-4: Docker image update applied**
Given /get-update returns a new annotations version
When download and decryption complete
Then docker load succeeds and docker compose up -d annotations restarts the service
**AC-5: Self-update applied last**
Given /get-update returns updates for both annotations and loader
When applying updates
Then annotations is updated first, loader is updated last
**AC-6: Cached versions refresh after changes**
Given version collector cached its results
When a new model file appears in the directory or docker load completes
Then cache is invalidated and next collection reflects new state
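
The ordering rule in AC-5 (loader last) can be illustrated with a stable sort. This is a sketch: `order_updates` is a hypothetical name, and it assumes each update is a dict with a `resource_name` key and that the loader's own resource is named `loader`.

```python
from typing import List


def order_updates(updates: List[dict], self_name: str = "loader") -> List[dict]:
    """Move the loader's own update to the end, keeping the other order intact.

    sorted() is stable and False sorts before True, so only the self-update
    entry changes position.
    """
    return sorted(updates, key=lambda u: u["resource_name"] == self_name)


pending = [
    {"resource_name": "loader", "version": "arm64_2026-04-15"},
    {"resource_name": "annotations", "version": "arm64_2026-04-15"},
    {"resource_name": "detection_model", "version": "2026-04-15"},
]
print([u["resource_name"] for u in order_updates(pending)])
```

Applying the self-update last means a loader restart cannot interrupt the other updates in the same batch.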
## Constraints
- Docker socket must be mounted in the loader container (already the case)
- docker compose file path must be configurable (env var)
- Model directory path must be configurable (env var)
- Self-update must be robust: state file on disk ensures in-progress updates survive container restart
@@ -0,0 +1,67 @@
# CI/CD Artifact Publish
**Task**: AZ-186_cicd_artifact_publish
**Name**: CI/CD Artifact Publish
**Description**: Add encrypt-and-publish step to Woodpecker CI/CD pipeline and create a shared publish script usable by both CI/CD and training service
**Complexity**: 3 points
**Dependencies**: AZ-183
**Component**: DevOps
**Tracker**: AZ-186
**Epic**: AZ-181
## Problem
Both CI/CD (for Docker images) and the training service (for AI models) need to encrypt artifacts and publish them to CDN + Resources table. The encryption and publish logic should be shared.
## Outcome
- Shared Python publish script that any producer can call
- Woodpecker pipeline automatically publishes encrypted Docker archives after build
- Training service can publish AI models using the same script
- Every artifact gets its own random AES-256 key
## Scope
### Included
- Shared publish script (Python): generate random AES-256 key, compress (gzip), encrypt (AES-256), SHA-256 hash, upload to S3, write Resources row
- Woodpecker pipeline step in build-arm.yml: after docker build+push, also docker save -> publish script
- S3 bucket structure: {dev_stage}/{resource_name}-{architecture}-{version}.enc
- Documentation for training service integration
### Excluded
- Server-side Resources table (AZ-183, must exist first)
- Loader-side download/decrypt (AZ-184)
- Training service code changes (their team integrates the script)
## Acceptance Criteria
**AC-1: Publish script works end-to-end**
Given a local file (Docker archive or AI model)
When publish script is called with resource_name, dev_stage, architecture, version
Then file is compressed, encrypted with random key, uploaded to S3, and Resources row is written
**AC-2: Woodpecker publishes after build**
Given a push to dev/stage/main branch
When Woodpecker build completes
Then the Docker image is also published as encrypted archive to CDN with Resources row
**AC-3: Unique key per artifact**
Given two consecutive publishes of the same resource
When comparing encryption keys
Then each publish used a different random AES-256 key
**AC-4: SHA-256 consistency**
Given a published artifact
When SHA-256 of the uploaded S3 object is computed
Then it matches the sha256 value in the Resources table
**AC-5: Training service can use the script**
Given the publish script installed as a package or available as a standalone script
When the training service calls it after producing a .trt model
Then the model is published to CDN + Resources table
## Constraints
- Woodpecker runner has access to Docker socket and S3 credentials
- Publish script must work on both x86 (CI runner) and arm64 (training server if needed)
- S3 credentials and DB connection string passed via environment variables
@@ -0,0 +1,62 @@
# Device Provisioning Script
**Task**: AZ-187_device_provisioning_script
**Name**: Device Provisioning Script
**Description**: Create a shell script that provisions a Jetson device identity (CompanionPC user) during the fuse/flash pipeline
**Complexity**: 2 points
**Dependencies**: None
**Component**: DevOps
**Tracker**: AZ-187
**Epic**: AZ-181
## Problem
Each Jetson needs a unique CompanionPC user account for API authentication. This must be automated as part of the manufacturing/flash process so that provisioning 50+ devices is not manual.
## Outcome
- Single script creates device identity and embeds credentials in the rootfs
- Integrates into the fuse/flash pipeline between odmfuse.sh and flash.sh
- Provisioning runbook documents the full end-to-end flow
## Scope
### Included
- provision_device.sh: generate device email (azaion-jetson-{serial}@azaion.com), random 32-char password
- Call admin API POST /users to create Users row with Role=CompanionPC
- Write credentials config file to rootfs image (at known path, e.g., /etc/azaion/device.conf)
- Idempotency: re-running for same serial doesn't create duplicate user
- Provisioning runbook: step-by-step from unboxing through fusing, flashing, and first boot
### Excluded
- fTPM provisioning (covered by NVIDIA's ftpm_provisioning.sh)
- Secure Boot fusing (covered by solution_draft02 Phase 1-2)
- OS hardening (covered by solution_draft02 Phase 3)
- Admin API user creation endpoint (assumed to exist)
## Acceptance Criteria
**AC-1: Script creates CompanionPC user**
Given a new device serial AZJN-0042
When provision_device.sh is run with serial AZJN-0042
Then admin API has a new user azaion-jetson-0042@azaion.com with Role=CompanionPC
**AC-2: Credentials written to rootfs**
Given provision_device.sh completed successfully
When the rootfs image is inspected
Then /etc/azaion/device.conf contains the email and password
**AC-3: Device can log in after flash**
Given a provisioned and flashed device boots for the first time
When the loader reads /etc/azaion/device.conf and calls POST /login
Then a valid JWT is returned
**AC-4: Idempotent re-run**
Given provision_device.sh was already run for serial AZJN-0042
When it is run again for the same serial
Then no duplicate user is created (existing user is reused or updated)
**AC-5: Runbook complete**
Given the provisioning runbook
When followed step-by-step on a new Jetson Orin Nano
Then the device is fully fused, flashed, provisioned, and can communicate with the admin API
+12 -6
@@ -1,18 +1,24 @@
# Batch Report
**Batch**: 1
**Tasks**: 01_test_infrastructure
**Date**: 2026-04-13
**Tasks**: AZ-182, AZ-184, AZ-187
**Date**: 2026-04-15
## Task Results
| Task | Status | Files Modified | Tests | AC Coverage | Issues |
|------|--------|---------------|-------|-------------|--------|
| 01_test_infrastructure | Done | 12 files | 1/1 pass | 5/5 ACs (AC-1,2,3 require Docker) | None |
| AZ-182_tpm_security_provider | Done | 8 files | 8 pass (1 skip without swtpm) | 6/6 ACs covered | None |
| AZ-184_resumable_download_manager | Done | 2 files | 8 pass | 5/5 ACs covered | None |
| AZ-187_device_provisioning_script | Done | 3 files | 5 pass | 5/5 ACs covered | None |
## AC Test Coverage: 5/5 covered (3 require Docker environment)
## Code Review Verdict: PASS (infrastructure scaffold, no logic review needed)
## Excluded
AZ-183 (Resources Table & Update API) — admin API repo, not this workspace.
## AC Test Coverage: All covered (16/16)
## Code Review Verdict: PASS_WITH_WARNINGS
## Auto-Fix Attempts: 0
## Stuck Agents: None
## Next Batch: 02_test_health_auth
## Next Batch: AZ-185, AZ-186 (Batch 2 — 8 points)
+7 -16
@@ -1,28 +1,19 @@
# Batch Report
**Batch**: 2
**Tasks**: 02_test_health_auth
**Date**: 2026-04-13
**Tasks**: AZ-185, AZ-186
**Date**: 2026-04-15
## Task Results
| Task | Status | Files Modified | Tests | AC Coverage | Issues |
|------|--------|---------------|-------|-------------|--------|
| 02_test_health_auth | Done | 2 files | 6 tests | 5/5 ACs covered | None |
| AZ-185_update_manager | Done | 4 files | 10 pass | 6/6 ACs covered | None |
| AZ-186_cicd_artifact_publish | Done | 3 files | 8 pass | 5/5 ACs covered | None |
## AC Test Coverage: All covered
| AC | Test | Status |
|----|------|--------|
| AC-1: Health returns 200 | test_health_returns_200 | Covered |
| AC-2: Status unauthenticated | test_status_unauthenticated | Covered |
| AC-3: Login valid | test_login_valid_credentials | Covered |
| AC-4: Login invalid | test_login_invalid_credentials | Covered |
| AC-5: Login empty body | test_login_empty_body | Covered |
| AC-2+3: Status authenticated | test_status_authenticated_after_login | Covered |
## Code Review Verdict: PASS
## AC Test Coverage: All covered (11/11)
## Code Review Verdict: PASS_WITH_WARNINGS
## Auto-Fix Attempts: 0
## Stuck Agents: None
## Next Batch: 03_test_resources, 04_test_unlock, 05_test_resilience_perf (parallel)
## Next Batch: All tasks complete
@@ -0,0 +1,41 @@
# Code Review Report
**Batch**: 2 (AZ-185, AZ-186)
**Date**: 2026-04-15
**Verdict**: PASS_WITH_WARNINGS
## Spec Compliance
All 11 acceptance criteria across 2 tasks are satisfied with corresponding tests.
| Task | ACs | Covered | Status |
|------|-----|---------|--------|
| AZ-185 Update Manager | 6 | 6/6 | All pass (10 tests) |
| AZ-186 CI/CD Artifact Publish | 5 | 5/5 | All pass (8 tests) |
## Findings
| # | Severity | Category | File:Line | Title |
|---|----------|----------|-----------|-------|
| 1 | Low | Style | scripts/publish_artifact.py | Union syntax fixed |
| 2 | Low | Maintainability | src/main.py:15 | Deprecated on_event startup |
### Finding Details
**F1: Union syntax fixed** (Low / Style)
- Location: `scripts/publish_artifact.py:172,182`
- Description: Used `list[str] | None` syntax, fixed to `Optional[List[str]]` for consistency
- Status: Fixed
**F2: Deprecated on_event startup** (Low / Maintainability)
- Location: `src/main.py:15`
- Description: `@app.on_event("startup")` is deprecated in modern FastAPI in favor of lifespan context manager
- Suggestion: Migrate to `@asynccontextmanager lifespan` when upgrading FastAPI — not blocking
- Task: AZ-185
## Cross-Task Consistency
- AZ-185 uses AZ-184's `ResumableDownloadManager` correctly via its public API
- AZ-186 encrypt format (IV + AES-CBC + PKCS7) is compatible with AZ-184's `decrypt_cbc_file()`
- AZ-186's `encryption_key` is hex-encoded; AZ-185's `_aes_key_from_encryption_field` handles hex decoding
- Self-update ordering (loader last) correctly implemented in AZ-185
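
For illustration, the consumer-side checks implied by the points above (hash over the ciphertext object, hex key decoding, IV prefix) might look like the following. These helpers are hypothetical and are not the actual `_aes_key_from_encryption_field` or `decrypt_cbc_file()`; the AES decryption itself is omitted.

```python
import hashlib
from typing import Tuple

IV_SIZE = 16      # AES block size; the IV is stored as the first 16 bytes
KEY_SIZE = 32     # AES-256


def aes_key_from_hex(encryption_key: str) -> bytes:
    """Decode the 64-character hex field into the raw 32-byte key."""
    key = bytes.fromhex(encryption_key)
    if len(key) != KEY_SIZE:
        raise ValueError("encryption_key must decode to 32 bytes")
    return key


def verify_sha256(blob: bytes, expected_hex: str) -> bool:
    """The hash covers the whole downloaded object, IV included."""
    return hashlib.sha256(blob).hexdigest() == expected_hex.lower()


def split_iv(blob: bytes) -> Tuple[bytes, bytes]:
    """Separate the IV prefix from the CBC ciphertext body."""
    body_len = len(blob) - IV_SIZE
    # With PKCS7 padding the body is always a non-empty multiple of 16 bytes.
    if body_len <= 0 or body_len % 16 != 0:
        raise ValueError("blob is not IV + whole AES blocks")
    return blob[:IV_SIZE], blob[IV_SIZE:]
```

Verifying the hash before splitting and decrypting matches the state order `verifying -> decrypting` in the AZ-184 spec.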
+5 -4
@@ -2,8 +2,9 @@
## Current Step
flow: existing-code
step: 7
name: Refactor
status: done
sub_step: 7 - Phase 7 Documentation (complete)
step: 9
name: Implement
status: in_progress
sub_step: 6 - Launch Batch 2 implementers
retry_count: 0
current_task: Batch 2 (AZ-185, AZ-186)