# TPM-Based Security Provider

**Task**: AZ-182_tpm_security_provider

**Name**: TPM Security Provider

**Description**: Introduce SecurityProvider abstraction with TPM detection and FAPI integration, wrapping existing security logic in LegacySecurityProvider for backward compatibility

**Complexity**: 5 points

**Dependencies**: None

**Component**: 02 Security

**Tracker**: AZ-182

**Epic**: AZ-181

## Problem

The loader's security code (key derivation, encryption, hardware fingerprinting) is hardcoded for the binary-split scheme. On fused Jetson Orin Nano devices with fTPM, this scheme is unnecessary — full-disk encryption protects data at rest, and the fleet update system (AZ-185) handles encrypted artifact delivery with per-artifact keys. However, the loader still needs a clean abstraction to:

1. Detect whether it's running on a TPM-equipped device or a legacy environment
2. Provide TPM seal/unseal capability as infrastructure for defense-in-depth (sealed credentials, future key wrapping)
3. Preserve the legacy code path for non-TPM deployments

## Outcome

- Loader detects TPM availability at startup and selects the appropriate security provider
- SecurityProvider abstraction cleanly separates TPM and legacy code paths
- TpmSecurityProvider establishes FAPI connection and provides seal/unseal operations
- LegacySecurityProvider wraps existing security.pyx unchanged
- Foundation in place for fTPM-sealed credentials (future) and per-artifact key decryption integration

## Scope

### Included

- SecurityProvider abstraction (ABC) with TpmSecurityProvider and LegacySecurityProvider
- Runtime TPM detection (/dev/tpm0 + SECURITY_PROVIDER env var override)
- tpm2-pytss FAPI integration: connect, create_seal, unseal
- LegacySecurityProvider wrapping existing security.pyx (encrypt, decrypt, key derivation)
- Auto-detection and provider selection at startup with logging
- Docker Compose device mounts for /dev/tpm0 and /dev/tpmrm0
- Dockerfile changes: install tpm2-tss native library + tpm2-pytss
- Tests using TPM simulator (swtpm)

### Excluded

- Resource download/upload changes (handled by AZ-185 Update Manager with per-artifact keys)
- Docker unlock flow changes (handled by AZ-185 Update Manager)
- fTPM provisioning pipeline (manufacturing-time, separate from code)
- Remote attestation via EK certificates
- fTPM-sealed device credentials (future enhancement, not v1)
- Changes to the Azaion admin API server
## Acceptance Criteria

**AC-1: SecurityProvider auto-detection**
Given a Jetson device with provisioned fTPM and /dev/tpm0 accessible
When the loader starts
Then TpmSecurityProvider is selected and logged

**AC-2: TPM seal/unseal round-trip**
Given TpmSecurityProvider is active
When data is sealed via FAPI create_seal and later unsealed
Then the unsealed data matches the original

**AC-3: Legacy path unchanged**
Given no TPM is available (/dev/tpm0 absent)
When the loader starts and processes resource requests
Then LegacySecurityProvider is selected and all behavior is identical to the current scheme

**AC-4: Env var override**
Given SECURITY_PROVIDER=legacy is set
When the loader starts on a device with /dev/tpm0 present
Then LegacySecurityProvider is selected regardless of TPM availability

**AC-5: Graceful fallback**
Given /dev/tpm0 exists but FAPI connection fails
When the loader starts
Then it falls back to LegacySecurityProvider with a warning log

**AC-6: Docker container TPM access**
Given docker-compose.yml with /dev/tpm0 and /dev/tpmrm0 device mounts
When the loader container starts on a fused Jetson
Then TpmSecurityProvider can connect to fTPM via FAPI
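The device mounts referenced in AC-6 might look like this in docker-compose.yml — a config sketch; the service name `loader` and the mapping style are assumptions:

```yaml
services:
  loader:
    # Expose both the raw TPM character device and the kernel's TPM
    # resource manager; FAPI typically talks to /dev/tpmrm0.
    devices:
      - /dev/tpm0:/dev/tpm0
      - /dev/tpmrm0:/dev/tpmrm0
```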
## Non-Functional Requirements

**Performance**
- TPM seal/unseal latency must be under 500 ms per operation

**Compatibility**
- Must work on ARM64 Jetson Orin Nano with JetPack 6.1+
- Must work inside Docker containers with --device mounts
- tpm2-pytss must be compatible with Python 3.11 and Cython compilation

**Reliability**
- Graceful fallback to LegacySecurityProvider on any TPM initialization failure
- No crash on /dev/tpm0 absence — clean detection and fallback
## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|--------------|------------------|
| AC-1 | SecurityProvider factory with /dev/tpm0 mock present | TpmSecurityProvider selected |
| AC-2 | FAPI create_seal + unseal via swtpm | Data matches round-trip |
| AC-3 | SecurityProvider factory without /dev/tpm0 | LegacySecurityProvider selected |
| AC-4 | SECURITY_PROVIDER=legacy env var with /dev/tpm0 present | LegacySecurityProvider selected |
| AC-5 | /dev/tpm0 exists but FAPI raises exception | LegacySecurityProvider selected, warning logged |
## Blackbox Tests

| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|------------------------|--------------|-------------------|----------------|
| AC-3 | No TPM device available | POST /load/(unknown) (split resource) | Existing binary-split behavior, all current tests pass | Compatibility |
| AC-6 | TPM simulator in Docker | Container starts with device mounts | FAPI connects, seal/unseal works | Compatibility |
## Constraints

- tpm2-pytss requires the tpm2-tss >= 2.4.0 native library in the Docker image
- Tests require swtpm (software TPM simulator) — must be added to test infrastructure
- fTPM provisioning is out of scope — this task assumes a provisioned TPM exists
- PCR-based policy binding intentionally not used (known persistence issues on Orin Nano)
## Risks & Mitigation

**Risk 1: fTPM FAPI stability on Jetson Orin Nano**
- *Risk*: FAPI seal/unseal may have undocumented issues on Orin Nano (similar to the known PCR/NV persistence bugs)
- *Mitigation*: The design intentionally avoids PCR policies and NV indexes and uses the SRK hierarchy only. Hardware validation is required before production deployment.

**Risk 2: swtpm test fidelity**
- *Risk*: A software TPM simulator may not reproduce all fTPM behaviors
- *Mitigation*: Run integration tests on actual Jetson hardware as part of acceptance testing (outside CI).

**Risk 3: tpm2-tss native library in the Docker image**
- *Risk*: tpm2-tss may not be available in the python:3.11-slim base image; the ARM64 build may need compilation
- *Mitigation*: Add tpm2-tss to the Dockerfile build step; verify ARM64 compatibility early.
# Resumable Download Manager

**Task**: AZ-184_resumable_download_manager

**Name**: Resumable Download Manager

**Description**: Implement a resumable HTTP download manager for the loader that handles intermittent Starlink connectivity

**Complexity**: 3 points

**Dependencies**: None

**Component**: Loader

**Tracker**: AZ-184

**Epic**: AZ-181

## Problem

Jetsons on UAVs have intermittent Starlink connectivity. Downloads of large artifacts (AI models ~500 MB, Docker images ~1 GB) must survive connection drops and resume from where they left off.

## Outcome

- Downloads resume from the last byte received after connectivity loss
- Completed downloads are verified with SHA-256 before use
- Downloaded artifacts are decrypted with per-artifact AES-256 keys
- State persists across loader restarts
## Scope

### Included

- Resumable HTTP downloads using Range headers (S3 supports these natively)
- JSON state file on disk tracking: url, expected_sha256, expected_size, bytes_downloaded, temp_file_path
- SHA-256 verification of completed downloads
- AES-256 decryption of downloaded artifacts using the per-artifact key from the /get-update response
- Retry with exponential backoff (1 min, 5 min, 15 min, 1 hr, max 4 hr)
- State machine: pending -> downloading -> paused -> verifying -> decrypting -> complete / failed
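The state file and backoff schedule above can be sketched as follows — the field names come from this spec, but the class shape and helper names are illustrative assumptions:

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path

# Backoff schedule from the spec: 1 min, 5 min, 15 min, 1 hr, then capped at 4 hr.
BACKOFF_SECONDS = [60, 300, 900, 3600]
MAX_BACKOFF_SECONDS = 4 * 3600


def backoff_delay(attempt: int) -> int:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if attempt < len(BACKOFF_SECONDS):
        return BACKOFF_SECONDS[attempt]
    return MAX_BACKOFF_SECONDS


@dataclass
class DownloadState:
    url: str
    expected_sha256: str
    expected_size: int
    bytes_downloaded: int
    temp_file_path: str
    # pending -> downloading -> paused -> verifying -> decrypting -> complete / failed
    status: str = "pending"

    def range_header(self) -> dict:
        # AC-1: resume from the last byte received, not from scratch.
        return {"Range": f"bytes={self.bytes_downloaded}-"}

    def save(self, path: Path) -> None:
        # AC-4: persisting to a volume-backed path lets state survive restarts.
        path.write_text(json.dumps(asdict(self)))

    @classmethod
    def load(cls, path: Path) -> "DownloadState":
        return cls(**json.loads(path.read_text()))
```
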
### Excluded

- Update check logic (AZ-185)
- Applying updates (AZ-185)
- CDN upload (AZ-186)
## Acceptance Criteria

**AC-1: Resume after connection drop**
Given a download is 60% complete and connectivity is lost
When connectivity returns
Then the download resumes from the 60% byte offset, not from scratch

**AC-2: SHA-256 mismatch triggers re-download**
Given a completed download with corrupted data
When SHA-256 verification fails
Then the corrupted file is deleted and the download restarts from scratch

**AC-3: Decryption produces correct output**
Given a completed and verified download
When decrypted with the per-artifact AES-256 key
Then the output matches the original unencrypted artifact

**AC-4: State survives restart**
Given a download is 40% complete and the loader container restarts
When the loader starts again
Then the download resumes from 40%, not from scratch

**AC-5: Exponential backoff on repeated failures**
Given multiple consecutive connection failures
When retrying
Then wait times follow the exponential backoff pattern
## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|--------------|------------------|
| AC-1 | Mock HTTP server drops connection mid-transfer | Resume with Range header from correct offset |
| AC-2 | Corrupt downloaded file | SHA-256 check fails, file deleted, retry flag set |
| AC-3 | Encrypt test file, download, decrypt | Round-trip matches original |
| AC-4 | Write state file, reload | State correctly restored |
| AC-5 | Track retry intervals | Backoff pattern matches spec |
## Constraints

- Must work inside a Docker container
- S3-compatible CDN (the current CDNManager already uses boto3)
- State file location must be on a volume that persists across container restarts
# Update Manager

**Task**: AZ-185_update_manager

**Name**: Update Manager

**Description**: Implement the loader's background update loop that checks for new versions and applies AI model and Docker image updates

**Complexity**: 5 points

**Dependencies**: AZ-183, AZ-184

**Component**: Loader

**Tracker**: AZ-185

**Epic**: AZ-181

## Problem

Jetsons need to automatically discover and install new AI models and Docker images without manual intervention. The update loop must handle version detection, server communication, and applying different update types.

## Outcome

- Loader automatically checks for updates every 5 minutes
- New AI models are downloaded, decrypted, and placed in the model directory
- New Docker images are loaded and services restarted with minimal downtime
- Loader can update itself (self-update, applied last)
## Scope

### Included

- Version collector: scan the model directory for .trt files (extract the date from the filename), query docker images for azaion/* tags, cache the results
- Background loop (configurable interval, default 5 min): collect versions, call POST /get-update, trigger downloads
- Apply AI model: move the decrypted .trt to the model directory (detection API scans it and picks the newest)
- Apply Docker image: docker load -i, then docker compose up -d {service}
- Self-update: loader updates itself last via docker compose up -d loader
- Integration with the AZ-184 Resumable Download Manager for all downloads
### Excluded

- Server-side /get-update endpoint (AZ-183)
- Download mechanics (AZ-184)
- CI/CD publish pipeline (AZ-186)
- Device provisioning (AZ-187)
## Acceptance Criteria

**AC-1: Version collector reads local state**
Given AI model azaion-2026-03-10.trt in the model directory and Docker image azaion/annotations:arm64_2026-03-01 loaded
When the version collector runs
Then it reports [{resource_name: "detection_model", version: "2026-03-10"}, {resource_name: "annotations", version: "arm64_2026-03-01"}]

**AC-2: Background loop polls on schedule**
Given the loader is running with the update interval set to 5 minutes
When 5 minutes elapse
Then POST /get-update is called with the current versions

**AC-3: AI model update applied**
Given /get-update returns a new detection_model version
When download and decryption complete
Then the new .trt file is in the model directory

**AC-4: Docker image update applied**
Given /get-update returns a new annotations version
When download and decryption complete
Then docker load succeeds and docker compose up -d annotations restarts the service

**AC-5: Self-update applied last**
Given /get-update returns updates for both annotations and loader
When applying updates
Then annotations is updated first and loader is updated last

**AC-6: Cached versions refresh after changes**
Given the version collector has cached its results
When a new model file appears in the directory or docker load completes
Then the cache is invalidated and the next collection reflects the new state
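The cache-invalidation behavior in AC-6 can be sketched as a small wrapper around the collector — the class name and shape are illustrative assumptions:

```python
from typing import Callable

class VersionCache:
    """Caches collected versions; callers invalidate it when a new model file
    lands in the directory or a docker load completes (AC-6)."""

    def __init__(self, collect: Callable[[], list[dict]]) -> None:
        self._collect = collect
        self._cached: list[dict] | None = None

    def get(self) -> list[dict]:
        # Collect at most once until the next invalidate().
        if self._cached is None:
            self._cached = self._collect()
        return self._cached

    def invalidate(self) -> None:
        self._cached = None
```
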
## Constraints

- Docker socket must be mounted in the loader container (already the case)
- docker compose file path must be configurable (env var)
- Model directory path must be configurable (env var)
- Self-update must be robust: a state file on disk ensures in-progress updates survive a container restart
# CI/CD Artifact Publish

**Task**: AZ-186_cicd_artifact_publish

**Name**: CI/CD Artifact Publish

**Description**: Add an encrypt-and-publish step to the Woodpecker CI/CD pipeline and create a shared publish script usable by both CI/CD and the training service

**Complexity**: 3 points

**Dependencies**: AZ-183

**Component**: DevOps

**Tracker**: AZ-186

**Epic**: AZ-181

## Problem

Both CI/CD (for Docker images) and the training service (for AI models) need to encrypt artifacts and publish them to the CDN and the Resources table. The encryption and publish logic should be shared.

## Outcome

- Shared Python publish script that any producer can call
- Woodpecker pipeline automatically publishes encrypted Docker archives after build
- Training service can publish AI models using the same script
- Every artifact gets its own random AES-256 key
## Scope

### Included

- Shared publish script (Python): generate random AES-256 key, compress (gzip), encrypt (AES-256), SHA-256 hash, upload to S3, write Resources row
- Woodpecker pipeline step in build-arm.yml: after docker build+push, also docker save -> publish script
- S3 bucket structure: {dev_stage}/{resource_name}-{architecture}-{version}.enc
- Documentation for training service integration
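The core of the publish script might look like this sketch. AES-256-GCM via the `cryptography` package is an assumption (the spec only says AES-256; the actual mode must match what the loader decrypts in AZ-184), the helper names are illustrative, and the S3 upload and Resources-row write are left as comments:

```python
import gzip
import hashlib
import os

# Assumption: the `cryptography` package provides the AES-256 primitive.
from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def s3_key(dev_stage: str, resource_name: str, architecture: str, version: str) -> str:
    """Bucket layout from the spec: {dev_stage}/{resource_name}-{architecture}-{version}.enc"""
    return f"{dev_stage}/{resource_name}-{architecture}-{version}.enc"


def encrypt_artifact(raw: bytes) -> tuple[bytes, bytes, str]:
    """Compress, encrypt with a fresh per-artifact key (AC-3), hash the ciphertext (AC-4).

    Returns (ciphertext, key, sha256_hex). Uploading to S3 (boto3) and writing
    the Resources row would follow, using credentials from environment variables.
    """
    key = AESGCM.generate_key(bit_length=256)  # unique random key per publish
    nonce = os.urandom(12)
    ciphertext = nonce + AESGCM(key).encrypt(nonce, gzip.compress(raw), None)
    return ciphertext, key, hashlib.sha256(ciphertext).hexdigest()
```

Hashing the ciphertext (rather than the plaintext) lets AC-4 be verified directly against the uploaded S3 object without decrypting it.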
### Excluded

- Server-side Resources table (AZ-183, must exist first)
- Loader-side download/decrypt (AZ-184)
- Training service code changes (their team integrates the script)
## Acceptance Criteria

**AC-1: Publish script works end-to-end**
Given a local file (Docker archive or AI model)
When the publish script is called with resource_name, dev_stage, architecture, version
Then the file is compressed, encrypted with a random key, uploaded to S3, and a Resources row is written

**AC-2: Woodpecker publishes after build**
Given a push to a dev/stage/main branch
When the Woodpecker build completes
Then the Docker image is also published as an encrypted archive to the CDN with a Resources row

**AC-3: Unique key per artifact**
Given two consecutive publishes of the same resource
When comparing encryption keys
Then each publish used a different random AES-256 key

**AC-4: SHA-256 consistency**
Given a published artifact
When the SHA-256 of the uploaded S3 object is computed
Then it matches the sha256 value in the Resources table

**AC-5: Training service can use the script**
Given the publish script installed as a package or available as a standalone script
When the training service calls it after producing a .trt model
Then the model is published to the CDN and the Resources table
## Constraints

- Woodpecker runner has access to the Docker socket and S3 credentials
- Publish script must work on both x86 (CI runner) and arm64 (training server, if needed)
- S3 credentials and the DB connection string are passed via environment variables
# Device Provisioning Script

**Task**: AZ-187_device_provisioning_script

**Name**: Device Provisioning Script

**Description**: Create a shell script that provisions a Jetson device identity (CompanionPC user) during the fuse/flash pipeline

**Complexity**: 2 points

**Dependencies**: None

**Component**: DevOps

**Tracker**: AZ-187

**Epic**: AZ-181

## Problem

Each Jetson needs a unique CompanionPC user account for API authentication. This must be automated as part of the manufacturing/flash process so that provisioning 50+ devices is not manual.

## Outcome

- A single script creates the device identity and embeds credentials in the rootfs
- Integrates into the fuse/flash pipeline between odmfuse.sh and flash.sh
- A provisioning runbook documents the full end-to-end flow
## Scope

### Included

- provision_device.sh: generate device email (azaion-jetson-{serial}@azaion.com), random 32-char password
- Call admin API POST /users to create a Users row with Role=CompanionPC
- Write credentials config file to the rootfs image (at a known path, e.g. /etc/azaion/device.conf)
- Idempotency: re-running for the same serial doesn't create a duplicate user
- Provisioning runbook: step-by-step from unboxing through fusing, flashing, and first boot
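A sketch of provision_device.sh follows. Note that AC-1 expects serial AZJN-0042 to yield azaion-jetson-0042@azaion.com, i.e. only the numeric suffix of the serial appears in the email, so this sketch strips the prefix; the admin API call and idempotency check are left as comments, and all helper names are assumptions:

```shell
#!/usr/bin/env bash
set -euo pipefail

device_email() {
    # AC-1 uses only the numeric suffix of the serial (AZJN-0042 -> 0042).
    local serial="$1"
    echo "azaion-jetson-${serial#*-}@azaion.com"
}

random_password() {
    # 32 random alphanumeric characters from the kernel CSPRNG.
    head -c 100 /dev/urandom | base64 | tr -dc 'A-Za-z0-9' | cut -c1-32
}

provision() {
    local serial="$1" rootfs="$2"
    local email password
    email="$(device_email "$serial")"
    password="$(random_password)"
    # AC-4 idempotency would be handled here: query the admin API for an
    # existing user before POST /users creates a new CompanionPC row.
    mkdir -p "$rootfs/etc/azaion"
    printf 'email=%s\npassword=%s\n' "$email" "$password" \
        > "$rootfs/etc/azaion/device.conf"  # AC-2: credentials in the rootfs
}
```
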
### Excluded

- fTPM provisioning (covered by NVIDIA's ftpm_provisioning.sh)
- Secure Boot fusing (covered by solution_draft02 Phases 1-2)
- OS hardening (covered by solution_draft02 Phase 3)
- Admin API user creation endpoint (assumed to exist)
## Acceptance Criteria

**AC-1: Script creates CompanionPC user**
Given a new device serial AZJN-0042
When provision_device.sh is run with serial AZJN-0042
Then the admin API has a new user azaion-jetson-0042@azaion.com with Role=CompanionPC

**AC-2: Credentials written to rootfs**
Given provision_device.sh completed successfully
When the rootfs image is inspected
Then /etc/azaion/device.conf contains the email and password

**AC-3: Device can log in after flash**
Given a provisioned and flashed device boots for the first time
When the loader reads /etc/azaion/device.conf and calls POST /login
Then a valid JWT is returned

**AC-4: Idempotent re-run**
Given provision_device.sh was already run for serial AZJN-0042
When it is run again for the same serial
Then no duplicate user is created (the existing user is reused or updated)

**AC-5: Runbook complete**
Given the provisioning runbook
When it is followed step-by-step on a new Jetson Orin Nano
Then the device is fully fused, flashed, provisioned, and can communicate with the admin API