[AZ-187] Rules & cleanup

Made-with: Cursor
Oleksandr Bezdieniezhnykh
2026-04-17 18:54:04 +03:00
parent cfed26ff8c
commit d883fdb3cc
33 changed files with 1917 additions and 515 deletions
@@ -0,0 +1,46 @@
# Acceptance Criteria Assessment
## Acceptance Criteria
| Criterion | Our Values | Researched Values | Cost/Timeline Impact | Status |
|-----------|-----------|-------------------|---------------------|--------|
| AC1: AI models not extractable | Binary-split: model split across API+CDN, requires both keys to reconstruct | TPM: models encrypted with device-sealed key, only decryptable on provisioned hardware. Industry standard for edge AI (SecEdge, NVIDIA Zero-Trust). Stronger guarantee than split-storage. | Medium — requires fTPM provisioning in manufacturing pipeline | Modified |
| AC2: Device authentication | Email/password → JWT → hardware-hashed key derivation | TPM attestation: device proves identity via EK certificate. Can coexist with existing JWT auth. Stronger — hardware fuse-derived, not software-computed. | Low — additive to existing auth | Modified |
| AC3: Keys bound to hardware | SHA-384(email+password+hw_hash+salt) from subprocess-collected CPU/GPU info | TPM-sealed keys bound to device fuses (MB2 bootloader seed). Significantly stronger — cannot be replicated by spoofing hardware strings. | Low — TPM key sealing replaces software key derivation | Modified |
| AC4: Existing API contracts preserved | F1-F6 flows must not break | Achievable — TPM changes are internal to the loader's security layer. API endpoints and contracts remain the same. | None | Unchanged |
| AC5: ARM64 Jetson Orin Nano support | Required | fTPM available on all Orin series (JetPack 6.1+). Python tooling (tpm2-pytss) supports ARM64. | None — natively supported | Unchanged |
| AC6: Works inside Docker containers | Docker socket mount | TPM accessible via --device /dev/tpm0 --device /dev/tpmrm0. No --privileged needed. | Low — add device mounts to docker-compose | Unchanged |
| AC7: Cython compilation remains | .pyx → .so for IP protection | tpm2-pytss is a Python binding that calls the native tpm2-tss libraries via CFFI. It can be called from Cython modules the same way as the existing crypto code. | Low | Unchanged |
| AC8: Migration path exists | N/A (new requirement) | TPM+standard download and legacy binary-split can coexist via feature flag. TPM-provisioned devices use sealed keys; non-provisioned use legacy scheme. | Medium — dual code path during transition | Added |
## Restrictions Assessment
| Restriction | Our Values | Researched Values | Cost/Timeline Impact | Status |
|-------------|-----------|-------------------|---------------------|--------|
| R1: ARM64 Jetson Orin Nano | Hard requirement | fTPM fully supported on Orin Nano (JetPack 6.1+) | None | Unchanged |
| R2: Docker container | Socket mount for Docker-in-Docker | TPM device mount is separate from Docker socket. Both can coexist. | None | Unchanged |
| R3: fTPM provisioning at manufacturing | N/A (new) | Only offline provisioning supported (per-device during manufacturing). Requires: KDK0 gen, fuse burn, EK cert via CA, EKB encoding. This is a significant operational requirement. | High — new manufacturing step | Added |
| R4: fTPM maturity concerns | N/A (new) | PCR persistence issues reported on forums (PCR7 not resetting, NV handles lost after reboot). Not production-hardened for all use cases yet. | Medium — risk of instability | Added |
| R5: SaaS + Edge dual deployment | Both SaaS web servers and Jetson edge | TPM is machine-specific. Works perfectly for fixed edge devices. For SaaS/cloud VMs, need vTPM or alternative key management. Dual strategy may be needed. | Medium — different security models per deployment type | Added |
## Key Findings
1. **fTPM on Jetson Orin Nano is real and capable** — JetPack 6.1+ provides TPM 2.0 with hardware root of trust from device fuses. The security guarantees are stronger than the current software-computed hash-based scheme.
2. **Binary-split can be simplified but not immediately eliminated** — TPM provides device-bound encryption (model only decryptable on provisioned hardware). This makes the split-storage model unnecessary for the anti-extraction threat. However, the CDN offloading benefit of big/small split (bandwidth optimization) is orthogonal to security.
3. **Manufacturing pipeline impact is significant** — fTPM provisioning requires per-device fuse burning and EK certificate enrollment during manufacturing. This is a business process change, not just a code change.
4. **Known stability issues** — Forum reports of PCR values and NV handles not persisting across reboots. This needs investigation before production reliance.
5. **Docker integration is straightforward** — Device mount, no privileged mode needed. Python tooling (tpm2-pytss) is mature and supports the required Python version.
6. **Dual deployment model needs consideration** — Jetson edge devices get TPM. SaaS web servers likely don't have TPM. Need a strategy that works for both.
## Sources
- NVIDIA Jetson Linux Developer Guide r36.4.4 (Tier L1)
- NVIDIA JetPack 6.1 Blog (Tier L2)
- NVIDIA Developer Forums: PCR/NV persistence issues (Tier L4)
- tpm2-pytss GitHub/PyPI (Tier L1)
- SecEdge/TCG: Edge AI Trusted Computing (Tier L3)
- DevOps StackExchange: Docker TPM access (Tier L4)
@@ -0,0 +1,73 @@
# Question Decomposition
## Original Question
Can TPM-based security on Jetson Orin Nano replace the binary-split resource scheme, simplifying the loader to a standard authenticated resource downloader?
## Active Mode
Mode A Phase 1 — AC & Restrictions Assessment
## Question Type
Decision Support — weighing trade-offs of TPM vs binary-split security models
## Problem Context Summary
The Azaion Loader uses a binary-split scheme (ADR-002) designed for untrusted end-user laptops. The deployment model shifted to SaaS/Jetson Orin Nano edge devices where TPM provides hardware-rooted trust. The question is whether TPM makes binary-split obsolete.
## Research Subject Boundary Definition
| Dimension | Boundary |
|-----------|----------|
| Population | Jetson Orin Nano edge devices running containerized AI workloads |
| Geography | Global (no geographic restriction) |
| Timeframe | JetPack 6.1+ (July 2024 onwards, when fTPM was introduced) |
| Level | Production deployment (not development/prototyping) |
## Sub-Questions
### SQ1: What are the fTPM capabilities on Jetson Orin Nano?
- "Jetson Orin Nano TPM capabilities security JetPack 6.1"
- "NVIDIA fTPM OP-TEE architecture Orin"
- "Jetson Orin TPM 2.0 key sealing PCR operations"
- "Jetson fTPM provisioning manufacturing process"
- "Jetson Orin fTPM limitations known issues forums"
### SQ2: Can TPM-sealed keys replace the current key derivation scheme?
- "TPM key sealing vs SHA-384 key derivation comparison"
- "tpm2-pytss seal unseal Python example"
- "TPM sealed key Docker container access /dev/tpm0"
- "TPM hardware-bound encryption key management edge AI"
### SQ3: Is the binary-split storage model still needed with TPM?
- "binary split key fragment security model vs TPM hardware root of trust"
- "AI model protection TPM-based vs split storage"
- "edge device model protection TPM encryption vs distributed key"
- "when is split-key security necessary vs hardware security module"
### SQ4: What's the migration path?
- "TPM security migration coexist legacy encryption"
- "gradual TPM adoption edge devices existing fleet"
### SQ5: What are the implementation requirements?
- "tpm2-pytss ARM64 Jetson Linux Docker"
- "Jetson Orin fTPM LUKS disk encryption Docker container"
- "TPM2 tools Cython integration"
## Chosen Perspectives
1. **Implementer/Engineer**: Technical integration complexity, library maturity, Docker constraints, Cython compatibility
2. **Domain Expert (Security)**: Threat model comparison, attack surface analysis, defense-in-depth considerations
3. **Practitioner**: Real-world fTPM experiences on Jetson, known issues, production readiness
## Timeliness Sensitivity Assessment
- **Research Topic**: fTPM on Jetson Orin Nano for AI model protection
- **Sensitivity Level**: High
- **Rationale**: NVIDIA Jetson ecosystem updates frequently; fTPM introduced in JetPack 6.1 (July 2024); PCR persistence issues reported
- **Source Time Window**: 12 months
- **Priority official sources**:
1. NVIDIA Jetson Linux Developer Guide (r36.4.4+)
2. TCG TPM 2.0 Specification
3. tpm2-software GitHub (tpm2-tss, tpm2-tools, tpm2-pytss)
- **Key version information to verify**:
- JetPack: 6.1+ (r36.4+)
- tpm2-pytss: latest (supports Python 3.10-3.14, including 3.11)
- tpm2-tss: 2.4.0+
@@ -0,0 +1,139 @@
# Source Registry
## Source #1
- **Title**: NVIDIA JetPack 6.1 — fTPM Introduction Blog
- **Link**: https://developer.nvidia.com/blog/nvidia-jetpack-6-1-boosts-performance-and-security-through-camera-stack-optimizations-and-introduction-of-firmware-tpm/
- **Tier**: L2
- **Publication Date**: 2024-07
- **Timeliness Status**: Currently valid
- **Version Info**: JetPack 6.1
- **Target Audience**: Jetson developers/OEMs
- **Research Boundary Match**: Full match
- **Summary**: fTPM introduced in JetPack 6.1 for Orin series; provides device attestation and secure key storage without discrete TPM hardware.
- **Related Sub-question**: SQ1
## Source #2
- **Title**: Firmware TPM — NVIDIA Jetson Linux Developer Guide (r36.4.4)
- **Link**: https://docs.nvidia.com/jetson/archives/r36.4.4/DeveloperGuide/SD/Security/FirmwareTPM.html
- **Tier**: L1
- **Publication Date**: 2025-06
- **Timeliness Status**: Currently valid
- **Version Info**: r36.4.4 / JetPack 6.1
- **Target Audience**: Jetson device manufacturers and developers
- **Research Boundary Match**: Full match
- **Summary**: Comprehensive fTPM docs: architecture (OP-TEE + TrustZone), provisioning (offline only), PCR measured boot, key derivation from hardware fuse, EK certificate management. Per-device unique seed from MB2 bootloader.
- **Related Sub-question**: SQ1, SQ2
## Source #3
- **Title**: Security — NVIDIA Jetson Linux Developer Guide (r36.4.3)
- **Link**: https://docs.nvidia.com/jetson/archives/r36.4.3/DeveloperGuide/SD/Security.html
- **Tier**: L1
- **Publication Date**: 2025
- **Timeliness Status**: Currently valid
- **Version Info**: r36.4.3
- **Target Audience**: Jetson device manufacturers and developers
- **Research Boundary Match**: Full match
- **Summary**: Overview of Jetson security: Secure Boot, Disk Encryption (LUKS), OP-TEE, fTPM. Chain of trust from BootROM through fuses.
- **Related Sub-question**: SQ1, SQ3
## Source #4
- **Title**: Access ftpm pcr registers — NVIDIA Developer Forums
- **Link**: https://forums.developer.nvidia.com/t/access-ftpm-pcr-registers/328636
- **Tier**: L4
- **Publication Date**: 2024-2025
- **Timeliness Status**: Currently valid
- **Version Info**: JetPack 6.x / Debian-based
- **Target Audience**: Jetson Orin Nano developers
- **Research Boundary Match**: Full match
- **Summary**: Users report PCR7 values not persisting/resetting across reboots when using fTPM for disk encryption. Issues with cryptsetup integration.
- **Related Sub-question**: SQ1, SQ5
## Source #5
- **Title**: fTPM handles don't persist after reboot — NVIDIA Developer Forums
- **Link**: https://forums.developer.nvidia.com/t/ftpm-handles-dont-persist-after-a-reboot/344424
- **Tier**: L4
- **Publication Date**: 2024-2025
- **Timeliness Status**: Currently valid
- **Target Audience**: Jetson Orin NX developers
- **Research Boundary Match**: Full match (same Orin family)
- **Summary**: fTPM NV handles not persisting across reboots on Orin NX. Suggests broader persistence issues across Orin variants.
- **Related Sub-question**: SQ1, SQ5
## Source #6
- **Title**: Accessing TPM from inside a Docker Container — DevOps StackExchange
- **Link**: https://devops.stackexchange.com/questions/8509/accessing-tpm-from-inside-a-docker-container
- **Tier**: L4
- **Publication Date**: Various
- **Timeliness Status**: Currently valid
- **Target Audience**: DevOps engineers
- **Research Boundary Match**: Partial overlap (general Docker, not Jetson-specific)
- **Summary**: Mount /dev/tpm0 and /dev/tpmrm0 via --device flag. TPM is for key wrapping, not storage. Machine-specific binding.
- **Related Sub-question**: SQ2, SQ5
## Source #7
- **Title**: Docker container accessing virtual TPM — Medium
- **Link**: https://medium.com/@eng.fernandosilva/docker-container-accessing-virtual-tpm-device-from-vm-running-on-windows-11-hyper-v-6c1bbb0f0c5d
- **Tier**: L3
- **Publication Date**: 2024
- **Timeliness Status**: Currently valid
- **Target Audience**: Docker/DevOps practitioners
- **Research Boundary Match**: Partial overlap (Windows vTPM, but Docker access patterns apply)
- **Summary**: Docker --device /dev/tpm0:/dev/tpm0 --device /dev/tpmrm0:/dev/tpmrm0 for TPM access. No --privileged needed for device-based access.
- **Related Sub-question**: SQ5
## Source #8
- **Title**: Securing Edge AI through Trusted Computing — SecEdge/TCG Blog
- **Link**: https://www.secedge.com/tcg-blog-securing-edge-ai-through-trusted-computing/
- **Tier**: L3
- **Publication Date**: 2024-2025
- **Timeliness Status**: Currently valid
- **Target Audience**: Edge AI security architects
- **Research Boundary Match**: Full match
- **Summary**: TPM-based device trust for edge AI: device-bound encryption, model binding to specific hardware, attestation. Addresses unauthorized copying, tampering, and cloning threats.
- **Related Sub-question**: SQ3
## Source #9
- **Title**: tpm2-software/tpm2-pytss — GitHub
- **Link**: https://github.com/tpm2-software/tpm2-pytss
- **Tier**: L1
- **Publication Date**: 2026-02 (last update)
- **Timeliness Status**: Currently valid
- **Version Info**: Latest, supports Python 3.10-3.14
- **Target Audience**: Python developers using TPM
- **Research Boundary Match**: Full match
- **Summary**: Python bindings for TPM2 TSS. ESAPI, FAPI, marshaling support. Requires tpm2-tss >= 2.4.0. Available on PyPI.
- **Related Sub-question**: SQ5
## Source #10
- **Title**: Building a Zero-Trust Architecture for Confidential AI Factories — NVIDIA Blog
- **Link**: https://developer.nvidia.com/blog/building-a-zero-trust-architecture-for-confidential-ai-factories/
- **Tier**: L2
- **Publication Date**: 2024-2025
- **Timeliness Status**: Currently valid
- **Target Audience**: AI infrastructure architects
- **Research Boundary Match**: Reference only (cloud/data center focus, not edge)
- **Summary**: Zero-trust with TEEs and attestation for AI model protection. Hardware-enforced trust, model binding, three-way trust dilemma. Industry direction for AI model security.
- **Related Sub-question**: SQ3
## Source #11
- **Title**: OP-TEE — NVIDIA Jetson Linux Developer Guide (r36.4.4)
- **Link**: https://docs.nvidia.com/jetson/archives/r36.4.4/DeveloperGuide/SD/Security/OpTee.html
- **Tier**: L1
- **Publication Date**: 2025
- **Timeliness Status**: Currently valid
- **Version Info**: r36.4.4
- **Target Audience**: Jetson developers building Trusted Applications
- **Research Boundary Match**: Full match
- **Summary**: OP-TEE on Jetson Orin: TrustZone-based TEE, Client Application ↔ Trusted Application communication via libteec, crypto services available. Custom TAs can be built.
- **Related Sub-question**: SQ1, SQ2
## Source #12
- **Title**: LUKS Full Disk Encryption on Jetson Orin Nano — Piveral
- **Link**: https://nvidia-jetson.piveral.com/jetson-orin-nano/implementing-password-protected-luks-full-disk-encryption-on-jetson-orin-nano/
- **Tier**: L3
- **Publication Date**: 2024-2025
- **Timeliness Status**: Currently valid
- **Target Audience**: Jetson Orin Nano practitioners
- **Research Boundary Match**: Full match
- **Summary**: LUKS encryption on Orin Nano. Default auto-decrypt on boot defeats purpose. Must modify LUKS service for password prompts. gen_luks_passphrase script for key generation.
- **Related Sub-question**: SQ2, SQ5
@@ -0,0 +1,161 @@
# Fact Cards
## Fact #1
- **Statement**: Jetson Orin Nano series has firmware TPM (fTPM) support, introduced in JetPack 6.1 (July 2024). It implements TPM 2.0 via the TCG reference implementation running in OP-TEE.
- **Source**: Source #1, #2
- **Phase**: Phase 1
- **Target Audience**: Jetson Orin Nano developers
- **Confidence**: ✅ High
- **Related Dimension**: TPM capability
## Fact #2
- **Statement**: The fTPM seed is derived from hardware fuses by the MB2 secure bootloader. It is a per-device, unique, secure value — establishing hardware root of trust.
- **Source**: Source #2
- **Phase**: Phase 1
- **Target Audience**: Jetson Orin Nano developers
- **Confidence**: ✅ High
- **Related Dimension**: Hardware binding strength
## Fact #3
- **Statement**: fTPM provisioning currently supports offline method only (per-device during manufacturing). Online provisioning "will be available in a future release" (as of r36.4.4).
- **Source**: Source #2
- **Phase**: Phase 1
- **Target Audience**: Jetson device manufacturers
- **Confidence**: ✅ High
- **Related Dimension**: Implementation complexity
## Fact #4
- **Statement**: fTPM provisioning requires: per-device KDK0 generation, fuse burning, EK certificate generation via CA server, EKB encoding. This is a manufacturing-time process.
- **Source**: Source #2
- **Phase**: Phase 1
- **Target Audience**: Jetson device manufacturers
- **Confidence**: ✅ High
- **Related Dimension**: Implementation complexity
## Fact #5
- **Statement**: Users report fTPM PCR register values (specifically PCR7) not persisting/resetting correctly across reboots on Jetson Orin Nano with Debian-based systems.
- **Source**: Source #4
- **Phase**: Phase 1
- **Target Audience**: Jetson Orin Nano users attempting disk encryption
- **Confidence**: ⚠️ Medium (forum reports, not officially confirmed as bug vs. misconfiguration)
- **Related Dimension**: Production readiness
## Fact #6
- **Statement**: fTPM NV handles don't persist after reboot on Jetson Orin NX, suggesting broader persistence issues across the Orin family.
- **Source**: Source #5
- **Phase**: Phase 1
- **Target Audience**: Jetson Orin developers
- **Confidence**: ⚠️ Medium (forum reports from multiple users)
- **Related Dimension**: Production readiness
## Fact #7
- **Statement**: Docker containers can access host TPM via --device /dev/tpm0:/dev/tpm0 --device /dev/tpmrm0:/dev/tpmrm0. No --privileged flag needed for device-based mount.
- **Source**: Source #6, #7
- **Phase**: Phase 1
- **Target Audience**: Docker/container developers
- **Confidence**: ✅ High
- **Related Dimension**: Docker integration
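
The device mapping in Fact #7 can be sketched as a small helper that assembles the `docker run` arguments. This is a minimal illustration only: the helper name and the image name are hypothetical, and the real deployment may use docker-compose instead.

```python
from typing import List, Sequence

# Device nodes named in Fact #7.
TPM_DEVICES = ("/dev/tpm0", "/dev/tpmrm0")


def docker_run_args(image: str, devices: Sequence[str] = TPM_DEVICES) -> List[str]:
    """Build a `docker run` argument list that maps each TPM device node 1:1."""
    args = ["docker", "run", "--rm"]
    for dev in devices:
        # host_path:container_path; note that --privileged is never added
        args += ["--device", f"{dev}:{dev}"]
    args.append(image)
    return args


if __name__ == "__main__":
    # "azaion/loader:latest" is a hypothetical image name used for illustration.
    print(" ".join(docker_run_args("azaion/loader:latest")))
```

Only the two device nodes are exposed to the container, which is what keeps the container unprivileged.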
## Fact #8
- **Statement**: TPM is a key wrapping/sealing device, not a storage device. It has minimal storage capacity and is slow. Proper pattern: seal encryption keys in TPM, store encrypted data elsewhere.
- **Source**: Source #6
- **Phase**: Phase 1
- **Target Audience**: General TPM users
- **Confidence**: ✅ High
- **Related Dimension**: Architecture pattern
## Fact #9
- **Statement**: tpm2-pytss (Python TPM2 bindings) is available on PyPI, supports Python 3.10-3.14, requires tpm2-tss >= 2.4.0. Provides ESAPI and FAPI interfaces.
- **Source**: Source #9
- **Phase**: Phase 1
- **Target Audience**: Python developers
- **Confidence**: ✅ High
- **Related Dimension**: Implementation tooling
## Fact #10
- **Statement**: Industry trend: hardware-enforced TEEs and attestation for AI model protection. Device-bound encryption ties models to specific devices, preventing unauthorized copying.
- **Source**: Source #8, #10
- **Phase**: Phase 1
- **Target Audience**: Edge AI security architects
- **Confidence**: ✅ High
- **Related Dimension**: Industry direction
## Fact #11
- **Statement**: TPM binding is machine-specific. If workloads migrate across hardware, TPM-sealed keys become inaccessible. This is a feature for edge devices (prevents extraction) but a constraint for SaaS/cloud deployments.
- **Source**: Source #6
- **Phase**: Phase 1
- **Target Audience**: Infrastructure architects
- **Confidence**: ✅ High
- **Related Dimension**: Deployment model compatibility
## Fact #12
- **Statement**: The current loader's binary-split scheme splits resources into small part (API, per-user/hw key) + big part (CDN, shared key). Designed to prevent model extraction on untrusted laptops.
- **Source**: Problem context (architecture.md, ADR-002)
- **Phase**: Phase 1
- **Target Audience**: Azaion team
- **Confidence**: ✅ High
- **Related Dimension**: Current architecture
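
The storage side of the split in Fact #12 can be illustrated with a byte-level sketch. It deliberately omits the two encryption layers (per-user/hw key and shared key), and the split size is a hypothetical value, since the source does not state the real split point.

```python
from typing import Tuple

# Hypothetical size of the small (API-held) part; not taken from the source.
SMALL_PART_BYTES = 1024


def split_resource(blob: bytes, small_size: int = SMALL_PART_BYTES) -> Tuple[bytes, bytes]:
    """Split a resource into a small part (served by the API) and a big part (served by the CDN)."""
    return blob[:small_size], blob[small_size:]


def merge_resource(small: bytes, big: bytes) -> bytes:
    """Reassemble the resource; both parts are required."""
    return small + big


if __name__ == "__main__":
    resource = bytes(range(256)) * 64
    small, big = split_resource(resource)
    assert merge_resource(small, big) == resource
    print(len(small), len(big))
```

The point of the sketch is the dependency it creates: neither part alone reconstructs the resource, so the loader needs both servers.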
## Fact #13
- **Statement**: The loader currently derives hardware-bound keys via SHA-384(email + password + hw_hash + salt). The hw_hash is SHA-384 of hardware fingerprint collected by HardwareService (CPU/GPU info via subprocess).
- **Source**: Problem context (architecture.md, security module docs)
- **Phase**: Phase 1
- **Target Audience**: Azaion team
- **Confidence**: ✅ High
- **Related Dimension**: Current key management
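
A minimal sketch of the derivation described in Fact #13, using only `hashlib`. The concatenation order, the UTF-8 encoding, and the fingerprint string format are assumptions; the loader's actual `security.pyx` may differ.

```python
import hashlib


def hw_hash(fingerprint: str) -> bytes:
    """SHA-384 of the hardware fingerprint string (assumed to be CPU/GPU info text)."""
    return hashlib.sha384(fingerprint.encode("utf-8")).digest()


def derive_key(email: str, password: str, fingerprint: str, salt: bytes) -> bytes:
    """SHA-384(email + password + hw_hash + salt); field order and encoding are assumed."""
    h = hashlib.sha384()
    for part in (email.encode("utf-8"), password.encode("utf-8"), hw_hash(fingerprint), salt):
        h.update(part)
    return h.digest()


if __name__ == "__main__":
    key = derive_key("user@example.com", "secret", "cpu:A;gpu:B", b"salt")
    print(len(key))  # SHA-384 digests are 48 bytes
```

Because every input is a string or byte sequence visible to software, anyone who replicates the fingerprint string on another machine derives the same key, which is the spoofing weakness the comparison points at.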
## Fact #14
- **Statement**: OP-TEE on Jetson Orin supports custom Trusted Applications that can perform cryptographic operations in the secure world (ARM TrustZone S-EL0).
- **Source**: Source #11
- **Phase**: Phase 1
- **Target Audience**: Jetson security developers
- **Confidence**: ✅ High
- **Related Dimension**: TPM capability
## Fact #15
- **Statement**: Jetson Orin LUKS disk encryption defaults to auto-decrypt on boot, which defeats its purpose. Password-protected operation requires modifying the LUKS service.
- **Source**: Source #12
- **Phase**: Phase 1
- **Target Audience**: Jetson Orin Nano practitioners
- **Confidence**: ✅ High
- **Related Dimension**: Disk encryption readiness
## Fact #16
- **Statement**: Orin Nano only supports REE FS for OP-TEE secure storage (file-system-based). RPMB (hardware replay-protected memory) is AGX Orin only. REE FS stores encrypted data at /data/tee/ on the normal world filesystem.
- **Source**: NVIDIA Jetson Linux Developer Guide — Secure Storage (r38.2)
- **Phase**: Phase 2
- **Target Audience**: Jetson Orin Nano developers
- **Confidence**: ✅ High
- **Related Dimension**: Storage security
## Fact #17
- **Statement**: tpm2-pytss FAPI provides create_seal(path, data), unseal(path), encrypt(path, plaintext), decrypt(path, ciphertext) — high-level Python API for TPM key operations.
- **Source**: tpm2-pytss documentation (readthedocs)
- **Phase**: Phase 2
- **Target Audience**: Python TPM developers
- **Confidence**: ✅ High
- **Related Dimension**: Implementation tooling
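
The seal/unseal pair from Fact #17 can be wrapped behind a small interface so that the loader code does not depend on a TPM being present. This is a sketch under assumptions: the object path is hypothetical, and `InMemoryStore` is a test stand-in that provides no hardware binding.

```python
from typing import Dict, Protocol


class SealStore(Protocol):
    """The two FAPI operations from Fact #17 that this sketch relies on."""

    def create_seal(self, path: str, data: bytes) -> object: ...

    def unseal(self, path: str) -> bytes: ...


# Hypothetical FAPI object path for the model decryption key.
KEY_PATH = "/HS/SRK/azaion_model_key"


def seal_key(store: SealStore, key: bytes, path: str = KEY_PATH) -> None:
    """Seal the model key under the given path."""
    store.create_seal(path, key)


def unseal_key(store: SealStore, path: str = KEY_PATH) -> bytes:
    """Recover the previously sealed model key."""
    return bytes(store.unseal(path))


class InMemoryStore:
    """Stand-in for development and tests; it offers no device binding at all."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def create_seal(self, path: str, data: bytes) -> bool:
        self._objects[path] = bytes(data)
        return True

    def unseal(self, path: str) -> bytes:
        return self._objects[path]


if __name__ == "__main__":
    store = InMemoryStore()
    seal_key(store, b"\x01" * 32)
    assert unseal_key(store) == b"\x01" * 32
```

On a provisioned device, a `tpm2_pytss.FAPI` instance would be expected to fill the `SealStore` role, though the exact call signatures should be confirmed against the installed tpm2-pytss version.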
## Fact #18
- **Statement**: Alternative AI model protection without TPM: signed manifests with payload hashes, asymmetric signature verification on-device, dm-verity for runtime integrity. These work on any hardware.
- **Source**: Thistle Technologies, Tinfoil Containers blogs
- **Phase**: Phase 2
- **Target Audience**: Edge AI security architects
- **Confidence**: ✅ High
- **Related Dimension**: Non-TPM alternatives
## Fact #19
- **Statement**: TPM key sealing workflow: tpm2_createprimary → tpm2_create (with optional PCR policy) → tpm2_load → tpm2_startauthsession → tpm2_policypcr → tpm2_unseal. Keys are bound to device and optionally to boot state.
- **Source**: tpm2-tools tutorial, GitHub issues
- **Phase**: Phase 2
- **Target Audience**: TPM developers
- **Confidence**: ✅ High
- **Related Dimension**: Implementation workflow
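
The command sequence in Fact #19 can be written down as argument lists for `subprocess`. This sketch covers only the device-bound case: the optional PCR policy steps (`tpm2_startauthsession`, `tpm2_policypcr`) are left out, the file names are hypothetical, and the flags should be checked against the installed tpm2-tools version before use.

```python
import subprocess
from typing import List


def seal_commands(secret_file: str = "secret.bin") -> List[List[str]]:
    """Commands that seal `secret_file` under a primary key in the owner hierarchy."""
    return [
        ["tpm2_createprimary", "-C", "o", "-c", "primary.ctx"],
        ["tpm2_create", "-C", "primary.ctx", "-i", secret_file,
         "-u", "seal.pub", "-r", "seal.priv"],
        ["tpm2_load", "-C", "primary.ctx", "-u", "seal.pub",
         "-r", "seal.priv", "-c", "seal.ctx"],
    ]


def unseal_commands() -> List[List[str]]:
    """Command that returns the sealed secret on stdout."""
    return [["tpm2_unseal", "-c", "seal.ctx"]]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in order and stop at the first failure (requires a TPM)."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for cmd in seal_commands() + unseal_commands():
        print(" ".join(cmd))
```

Printing the commands rather than running them keeps the sketch usable on machines without a TPM; `run_all` is only meaningful on a device or against a simulator.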
## Fact #20
- **Statement**: The binary-split CDN offloading (big part on CDN, small part on API) serves a bandwidth/cost purpose separate from its security purpose. Even if security is handled by TPM, CDN offloading for large models may still be valuable.
- **Source**: Architecture analysis (ADR-002 rationale)
- **Phase**: Phase 2
- **Target Audience**: Azaion team
- **Confidence**: ✅ High
- **Related Dimension**: Architecture separation of concerns
@@ -0,0 +1,34 @@
# Comparison Framework
## Selected Framework Type
Decision Support
## Selected Dimensions
1. Solution overview
2. Threat model coverage
3. Hardware binding strength
4. Implementation cost
5. Maintenance cost
6. Risk assessment
7. Migration difficulty
8. Applicable scenarios
## Compared Solutions
- **A: Current binary-split scheme** (status quo)
- **B: TPM-only** (full replacement — eliminate binary-split)
- **C: Hybrid** (TPM for device binding + simplified download without split)
## Initial Population
| Dimension | A: Binary-Split (current) | B: TPM-Only | C: Hybrid (recommended) | Factual Basis |
|-----------|--------------------------|-------------|------------------------|---------------|
| Solution overview | Encrypt resource, split small (API) + big (CDN), per-user+hw key + shared key | TPM-sealed master key, single encrypted download, device-bound decryption | TPM-sealed key for device binding; single authenticated download from API/CDN; no split | Fact #12, #2, #8 |
| Threat model | Prevents extraction by requiring two servers; hardware fingerprint (software hash) ties to device | Prevents extraction via hardware fuse-derived key; attestation proves device identity; tamper-evident boot chain | Combines TPM device binding with authenticated download; single download point acceptable because device itself is trusted | Fact #2, #10, #11 |
| Hardware binding | SHA-384(email+password+hw_hash+salt) — software-computed, spoofable if hw strings are replicated | fTPM seed from hardware fuses — per-device unique, not software-spoofable | Same as B for binding; key sealed in TPM | Fact #2, #13 |
| Implementation cost | Already implemented | High: fTPM provisioning pipeline, tpm2-pytss integration, new security module, Docker device mounts, dual-path for SaaS | Medium: same TPM integration as B, but simpler download logic (remove split/merge code) | Fact #3, #4, #7, #9 |
| Maintenance cost | Moderate: two download paths (API+CDN), split/merge logic, two key types | Lower: single download path, single key type, but TPM provisioning infrastructure | Lowest: single download, TPM key management; CDN used for bandwidth only (no security split) | Fact #20 |
| Risk | Low (proven, in production) | High: fTPM persistence bugs (#5,#6), offline-only provisioning, REE FS (no RPMB on Nano) | Medium: same TPM risks as B, but fallback to legacy scheme mitigates | Fact #5, #6, #16 |
| Migration difficulty | N/A | Very high: all devices must be re-provisioned; no backward compatibility | Medium: feature-flag based; TPM-provisioned devices use new path, others use legacy | Fact #11 |
| Applicable scenarios | All current: laptops, edge, SaaS | Jetson Orin Nano (with fTPM) only; SaaS needs separate solution | Jetson Orin Nano gets TPM path; SaaS/non-TPM devices get simplified authenticated download (no split needed if server is trusted) | Fact #11, #18 |
@@ -0,0 +1,111 @@
# Reasoning Chain
## Dimension 1: Is binary-split still necessary for security?
### Fact Confirmation
The binary-split scheme was designed for untrusted laptops (Fact #12): an attacker who compromises the CDN obtains the big part (about 99% of the model) but cannot reconstruct the model without the small part held by the API (about 1%). The underlying threat is an attacker with physical access to an untrusted device.
### Reference Comparison
On Jetson Orin Nano with fTPM (Fact #2): the encryption key is derived from hardware fuses. Even with full disk access, the attacker cannot extract the key without the specific TPM hardware. The device itself is the trust anchor, not the storage distribution.
### Conclusion
For TPM-equipped devices, split-storage adds complexity without adding security. The TPM hardware binding is strictly stronger than distributing fragments across servers. Binary-split's security purpose is obsolete on TPM devices.
### Confidence
✅ High — hardware-fuse-derived keys are fundamentally stronger than software-computed hashes.
---
## Dimension 2: Is CDN offloading still valuable without split?
### Fact Confirmation
ADR-002 lists two reasons for binary-split (Fact #20): (1) security (prevent single-point compromise) and (2) bandwidth/cost (large files on CDN, small metadata on API).
### Reference Comparison
If security is handled by TPM device binding, the CDN offloading benefit remains valid for large AI models. But the *splitting* mechanism (small+big parts) is unnecessary — a single encrypted file on CDN with an authenticated download URL achieves the same bandwidth benefit.
### Conclusion
CDN usage should remain for bandwidth optimization. But the split-and-merge encryption scheme can be replaced by a simpler pattern: encrypt the whole resource with a TPM-sealed key, store on CDN, download as single file.
### Confidence
✅ High — bandwidth and security are orthogonal concerns.
---
## Dimension 3: Can tpm2-pytss integrate with the Cython codebase?
### Fact Confirmation
tpm2-pytss (Fact #9, #17) is a Python library calling native tpm2-tss via CFFI. It provides FAPI with create_seal, unseal, encrypt, decrypt. The loader's security module is Cython (.pyx) calling Python cryptographic libraries.
### Reference Comparison
The current security.pyx already calls Python libraries (cryptography.hazmat). tpm2-pytss follows the same pattern — Python calls to a native library. Cython can call tpm2-pytss the same way.
### Conclusion
No architectural barrier. tpm2-pytss integrates naturally alongside existing cryptography library usage.
### Confidence
✅ High — same integration pattern as existing code.
---
## Dimension 4: What about SaaS/non-TPM deployments?
### Fact Confirmation
The loader now runs on both Jetson edge devices and SaaS web servers (Fact #11). TPM is machine-specific — works for fixed edge devices but SaaS VMs may not have TPM (or have vTPM with different trust properties).
### Reference Comparison
Alternative approaches exist for non-TPM environments (Fact #18): signed manifests, asymmetric signature verification, authenticated downloads. For SaaS servers that the company controls, the threat model is different — the server is trusted, so split-storage is unnecessary even without TPM.
### Conclusion
Two-tier strategy: (1) Jetson devices use TPM-sealed keys for strongest binding; (2) SaaS servers use standard authenticated download (no split needed since server is trusted infrastructure). The binary-split complexity is needed for neither scenario.
### Confidence
✅ High — different deployment contexts have different threat models.
---
## Dimension 5: fTPM production readiness
### Fact Confirmation
Forum reports (Fact #5, #6): PCR7 values not persisting across reboots; NV handles lost after reboot. RPMB not available on Orin Nano (Fact #16) — only REE FS.
### Reference Comparison
The proposed design does NOT rely on PCR-sealed keys or NV indexes. The key workflow uses FAPI create_seal/unseal with the Storage Root Key (SRK) hierarchy, which derives from the hardware fuse seed (Fact #2). This is independent of PCR persistence and NV storage issues.
### Conclusion
The PCR/NV persistence bugs are not blocking for this use case. FAPI seal/unseal under the SRK hierarchy uses the persistent primary key derived from fuses, not PCR-gated policies. However, this should be validated on actual hardware before committing.
### Confidence
⚠️ Medium — reasoning is sound but needs hardware validation.
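
One way to run the hardware validation called for above is a two-phase probe: seal a value on the first run, reboot, then unseal and compare on the next run. The sketch below assumes a store object exposing `create_seal(path, data)` and `unseal(path)` as listed in Fact #17; the function name, object path, and marker file are hypothetical.

```python
import hashlib
import json
import os


def persistence_check(store, path: str, marker_file: str, secret: bytes = b"probe") -> str:
    """Return 'sealed' on the first run, then 'ok', 'mismatch', or 'lost' on later runs."""
    if not os.path.exists(marker_file):
        # First run: seal the probe value and remember its digest on disk.
        store.create_seal(path, secret)
        with open(marker_file, "w") as f:
            json.dump({"sha256": hashlib.sha256(secret).hexdigest()}, f)
        return "sealed"

    with open(marker_file) as f:
        expected = json.load(f)["sha256"]
    try:
        recovered = bytes(store.unseal(path))
    except Exception:
        # The sealed object or its parent key did not survive.
        return "lost"
    actual = hashlib.sha256(recovered).hexdigest()
    return "ok" if actual == expected else "mismatch"
```

Running the probe before and after several reboots on a provisioned Orin Nano would show whether the sealed object survives, which is the open question behind the Medium confidence rating.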
---
## Dimension 6: Manufacturing pipeline impact
### Fact Confirmation
fTPM provisioning requires (Fact #3, #4): per-device KDK0 generation, fuse burning, EK certificate via CA, EKB encoding. Only offline provisioning supported.
### Reference Comparison
The current loader requires no manufacturing-time setup — credentials are provided at runtime. Adding fTPM provisioning is a significant operational change.
### Conclusion
fTPM provisioning is the biggest non-code cost. However, if Jetson devices are already manufactured by an OEM partner, fTPM provisioning can be integrated into the existing flashing pipeline. For development/testing, a simulated TPM (swtpm) can be used.
### Confidence
⚠️ Medium — depends on OEM manufacturing pipeline.
---
## Dimension 7: Migration path
### Fact Confirmation
Existing deployments use binary-split. New deployments can use TPM. Both must coexist during transition.
### Reference Comparison
Feature-flag pattern: detect at startup whether /dev/tpm0 exists and is provisioned. If yes, use TPM key path. If no, fall back to legacy binary-split. The API contracts (F1-F6) remain unchanged — the security layer is internal.
### Conclusion
A SecurityProvider abstraction (interface) with two implementations (LegacySecurityProvider, TpmSecurityProvider) enables clean coexistence. Detection is automatic. No API changes required.
### Confidence
✅ High — standard abstraction pattern, no external dependencies on migration.
@@ -0,0 +1,46 @@
# Validation Log
## Validation Scenario
A Jetson Orin Nano edge device with fTPM provisioned needs to download an AI model, decrypt it, and load it. A SaaS web server without TPM needs the same model.
## Expected Based on Conclusions
### Jetson Orin Nano (TPM path):
1. Loader starts, detects /dev/tpm0 → TpmSecurityProvider
2. POST /login → JWT auth (unchanged)
3. POST /load/{model} → single encrypted download from CDN via authenticated URL
4. TPM unseals the device-specific decryption key
5. Model decrypted and returned to caller
### SaaS web server (no-TPM path):
1. Loader starts, no /dev/tpm0 → LegacySecurityProvider (or SimplifiedSecurityProvider)
2. POST /login → JWT auth (unchanged)
3. POST /load/{model} → single authenticated download (no split needed — server is trusted)
4. Standard key derivation from credentials
5. Model decrypted and returned to caller
### Docker unlock (Jetson):
1. POST /unlock → authenticate
2. Download key → TPM-sealed key used instead of key fragment download
3. Decrypt archive → same as current but with TPM-derived key
4. docker load → unchanged
## Actual Validation Results
The scenario is consistent with the proposed architecture. Key observations:
- API endpoints remain identical (F1-F6 contracts preserved)
- The security layer change is internal — callers don't know which provider is active
- CDN is still used for bandwidth (large model storage) but serves single files, not split parts
- Upload flow (F3) simplifies: encrypt whole file, upload to CDN + register on API (no split)
## Counterexamples
1. **What if a device needs to be re-provisioned?** — fTPM provisioning is manufacturing-time. If a device's fTPM state is corrupted, it needs re-flashing. This is acceptable for edge devices (they're managed hardware) but must be documented.
2. **What if the same model needs to work across TPM and non-TPM devices?** — Models are encrypted per-deployment. TPM devices get a device-specific encrypted copy. Non-TPM devices get a credentials-encrypted copy. The API server handles the distinction.
## Review Checklist
- [x] Draft conclusions consistent with fact cards
- [x] No important dimensions missed
- [x] No over-extrapolation
- [x] Conclusions actionable/verifiable
## Conclusions Requiring Revision
None. The hybrid approach (Solution C) is validated as feasible and superior to both status quo and full-TPM-only.
@@ -0,0 +1,66 @@
# Security Analysis: TPM-Based Security Replacing Binary-Split
## Threat Model
### Asset Inventory
| Asset | Value | Current Protection | Proposed Protection (TPM) |
|-------|-------|--------------------|--------------------------|
| AI model files | High — core IP | AES-256-CBC, split storage (API+CDN), per-user+hw key | AES-256-CBC, TPM-sealed device key, single encrypted storage |
| Docker image archive | High — service IP | AES-256-CBC, key fragment from API | AES-256-CBC, TPM-sealed key (no network key download) |
| User credentials | Medium | In-memory only | In-memory only (unchanged) |
| JWT tokens | Medium | In-memory, no signature verification | In-memory (unchanged; signature verification is a separate concern) |
| CDN credentials | Medium | Encrypted cdn.yaml from API | Same (unchanged) |
| Encryption keys | Critical | SHA-384 derived, in memory | TPM-sealed, never in user-space memory in plaintext |
### Threat Actors
| Actor | Capability | Motivation |
|-------|-----------|-----------|
| Physical attacker (edge) | Physical access to Jetson device, can extract storage | Steal AI models |
| Network attacker | MITM, API/CDN compromise | Intercept models in transit |
| Insider (compromised server) | Access to API or CDN backend | Extract stored model fragments |
| Reverse engineer | Access to loader binary (.so files) | Extract key derivation logic, salts |
### Attack Vectors — Current vs Proposed
| Attack Vector | Current (Binary-Split) | Proposed (TPM) | Delta |
|--------------|----------------------|----------------|-------|
| **Extract model from disk** | Must obtain both CDN big part + API small part. If attacker has disk, big part is local. Need API access for small part. | Model encrypted with TPM-sealed key. Key cannot be extracted without the specific TPM hardware. | **Stronger** — hardware binding vs. server-side fragmentation |
| **Clone device** | Replicate hardware fingerprint strings (CPU model, GPU, etc.) → derive same SHA-384 key | Cannot clone fTPM — seed derived from hardware fuses, unique per chip | **Stronger** — fuse-based vs. string-based identity |
| **Compromise CDN** | Get big parts only — useless without small parts from API | Get encrypted files — useless without TPM-sealed key on target device | **Equivalent** — both require a second factor |
| **Compromise API** | Get small parts + key fragments. Combined with CDN data = full model | Get encrypted metadata. Key is TPM-sealed, not on API server | **Stronger** — API no longer holds key material |
| **Reverse-engineer loader binary** | Extract salt strings from .so → reconstruct SHA-384 key derivation → derive keys for any known email+password+hw combo | TPM key derivation is in hardware. Even with full .so source, keys are not reconstructable | **Stronger** — hardware vs. software key protection |
| **Memory dump at runtime** | Keys exist in Python process memory during encrypt/decrypt operations | With FAPI: encryption happens via TPM — key never enters user-space memory | **Stronger** — key stays in TPM |
| **Stolen credentials** | Attacker with email+password can derive all keys if they also know hw fingerprint | Credentials alone are insufficient — TPM-sealed key requires the physical device | **Stronger** — credentials are not sufficient |
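The "Clone device" and "Stolen credentials" rows both follow from the legacy key being a pure function of strings. A minimal sketch of that property is below. It assumes plain UTF-8 concatenation in the order email, password, hw_hash, salt; the real loader's ordering, encoding, and the function name `derive_legacy_key` are not taken from the source and are illustrative only.

```python
import hashlib


def derive_legacy_key(email: str, password: str, hw_hash: str, salt: str) -> bytes:
    """Hypothetical stand-in for the legacy SHA-384 key derivation.

    Assumption: inputs are concatenated as UTF-8 text in this order.
    """
    material = "".join([email, password, hw_hash, salt]).encode("utf-8")
    return hashlib.sha384(material).digest()


# The legitimate device and an attacker who has copied the same strings
# (credentials, spoofed hardware fingerprint, salt extracted from the .so)
# arrive at the same key, because nothing in the inputs is hardware-bound.
device_key = derive_legacy_key("user@example.com", "pw", "cpu-x|gpu-y", "salt-from-so")
attacker_key = derive_legacy_key("user@example.com", "pw", "cpu-x|gpu-y", "salt-from-so")
print(device_key == attacker_key)
```

Under the TPM path there is no equivalent function an attacker can re-run off-device, which is the difference the Delta column records.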
## Per-Component Security Requirements
| Component | Requirement | Risk Level | Proposed Control |
|-----------|------------|------------|-----------------|
| SecurityProvider detection | Must correctly identify TPM availability; false positive → crash; false negative → weaker security | Medium | Check /dev/tpm0 existence + attempt TPM connection; fall back to legacy on any failure |
| TPM key sealing | Sealed key must only be unsealable on the provisioned device | High | Use FAPI create_seal under SRK hierarchy; no PCR policy (avoids persistence bugs); auth password optional |
| Docker device mount | /dev/tpm0 and /dev/tpmrm0 must be accessible in container | Medium | docker-compose.yml `devices:` entries (the Compose equivalent of --device); no --privileged |
| Legacy fallback | Must remain fully functional for non-TPM devices | High | Existing security module unchanged; SecurityProvider delegates to it |
| Key rotation | TPM-sealed keys should be rotatable without re-provisioning | Medium | Seal a wrapping key in TPM; actual resource keys wrapped by it; rotate resource keys independently |
| CDN authenticated download | Single-file download must use authenticated URLs (not public) | High | Signed S3 URLs with expiration; existing CDN auth mechanism |
## Security Controls Summary
### Authentication
- **Unchanged**: JWT Bearer tokens from Azaion Resource API
- **Enhanced (TPM path)**: Device attestation possible via EK certificate (future enhancement, not in initial scope)
### Data Protection
- **At rest**: AES-256-CBC encrypted resources. Key sealed in TPM (Jetson) or derived from credentials (legacy).
- **In transit**: HTTPS for all API/CDN calls (unchanged)
- **In TPM**: Encryption key never enters user-space memory. FAPI handles encrypt/decrypt within TPM boundary.
### Key Management
- **TPM path**: Master key sealed at provisioning time → stored as a sealed blob in REE FS (TPM NV is avoided because of the reported NV handle loss after reboot) → unsealed at runtime via FAPI → used to derive/unwrap resource-specific keys
- **Legacy path**: SHA-384 key derivation from email+password+hw_hash+salt (unchanged)
- **Key rotation**: Wrap resource keys with TPM-sealed master key; rotate resource keys without re-provisioning TPM
### Logging & Monitoring
- **Unchanged**: Loguru file + stdout/stderr logging
- **Addition**: Log SecurityProvider selection at startup (which path was chosen and why)
@@ -0,0 +1,112 @@
# Solution Draft: TPM-Based Security Replacing Binary-Split
## Product Solution Description
Replace the binary-split resource scheme with a TPM-aware security architecture that uses hardware-rooted keys on Jetson Orin Nano devices and simplified authenticated downloads elsewhere. The loader gains a `SecurityProvider` abstraction with two implementations: `TpmSecurityProvider` (fTPM-based, for provisioned Jetson devices) and `LegacySecurityProvider` (current scheme, for backward compatibility). The binary-split upload/download logic is simplified to single-file encrypted resources stored on CDN, with the split mechanism retained only in the legacy path.
```
┌─────────────────────────────────────────────┐
│ Loader (FastAPI) │
│ ┌────────────┐ ┌─────────────────────┐ │
│ │ HTTP API │───▶│ SecurityProvider │ │
│ │ (F1-F6) │ │ (interface) │ │
│ └────────────┘ └──────┬──────────────┘ │
│ ┌─────┴──────┐ │
│ ┌──────┴──┐ ┌──────┴───────┐ │
│ │ TpmSec │ │ LegacySec │ │
│ │ Provider│ │ Provider │ │
│ └────┬────┘ └──────┬───────┘ │
│ │ │ │
│ /dev/tpm0 SHA-384 keys │
│ (fTPM) (current scheme) │
└─────────────────────────────────────────────┘
```
## Existing/Competitor Solutions Analysis
| Solution | Approach | Applicability |
|----------|----------|---------------|
| SecEdge SEC-TPM | Firmware TPM for edge AI device trust, model binding, attestation | Directly applicable — same problem space |
| Tinfoil Containers | TEE-based (Intel TDX / AMD SEV-SNP) with attestation | Cloud/data center focus; not applicable to Jetson ARM64 |
| Thistle OTA | Signed manifests + asymmetric verification, no hardware binding | Weaker than TPM but works without hardware support |
| Amulet (TEE-shielded inference) | OP-TEE based model obfuscation for ARM TrustZone | Interesting for inference protection; complementary to our approach |
| NVIDIA Confidential Computing | H200/B200 GPU TEEs | Data center only; not applicable to Orin Nano |
## Architecture
### Component: Security Provider Abstraction
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| Python ABC + runtime detection | abc module, os.path.exists("/dev/tpm0") | Simple, no deps, auto-selects at startup | Detection is binary (TPM or not) | None | N/A | Zero | Best |
| Config-file based selection | YAML/env var SECURITY_PROVIDER=tpm\|legacy | Explicit control, testable | Manual configuration per device | Config management | N/A | Zero | Good |
**Recommendation**: Runtime detection with config override. Check /dev/tpm0 by default; allow SECURITY_PROVIDER env var to force a specific provider.
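The recommended selection logic could look roughly like the sketch below. The class names and the `SECURITY_PROVIDER` variable come from this draft; the method `get_resource_key`, the `probe` hook, and the exact fallback rules are assumptions made for illustration, not the loader's actual interface.

```python
import os
from abc import ABC, abstractmethod


class SecurityProvider(ABC):
    """Common interface; the method name is illustrative."""

    @abstractmethod
    def get_resource_key(self, resource_name: str) -> bytes:
        """Return the key used to decrypt the named resource."""


class TpmSecurityProvider(SecurityProvider):
    def get_resource_key(self, resource_name: str) -> bytes:
        raise NotImplementedError("fTPM unseal would go here")


class LegacySecurityProvider(SecurityProvider):
    def get_resource_key(self, resource_name: str) -> bytes:
        raise NotImplementedError("existing SHA-384 derivation would go here")


def select_provider(env=None, tpm_device="/dev/tpm0", probe=None) -> SecurityProvider:
    """Pick a provider: explicit override first, then auto-detection.

    `probe` is an optional callable that tries to talk to the TPM and
    returns True on success; any exception counts as "not usable".
    """
    env = os.environ if env is None else env
    forced = env.get("SECURITY_PROVIDER", "").strip().lower()
    if forced == "tpm":
        return TpmSecurityProvider()
    if forced == "legacy":
        return LegacySecurityProvider()
    if forced:
        raise ValueError(f"unknown SECURITY_PROVIDER value: {forced!r}")

    # Auto-detection: the device node must exist and, if a probe is given,
    # the TPM must actually respond. Any failure falls back to legacy.
    if not os.path.exists(tpm_device):
        return LegacySecurityProvider()
    if probe is not None:
        try:
            if not probe():
                return LegacySecurityProvider()
        except Exception:
            return LegacySecurityProvider()
    return TpmSecurityProvider()


provider = select_provider(env={"SECURITY_PROVIDER": "legacy"})
print(type(provider).__name__)  # LegacySecurityProvider
```

Passing `env` and `tpm_device` as parameters keeps the function testable without a real TPM, which fits the auto-detection tests listed in the testing strategy.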
### Component: TPM Key Management
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| tpm2-pytss FAPI | tpm2-pytss (PyPI), tpm2-tss native lib | High-level Python API (create_seal, unseal, encrypt, decrypt); mature project | Requires tpm2-tss native lib installed; FAPI config needed | tpm2-tss >= 2.4.0, Python 3.11 | Hardware-rooted keys from device fuses | Low (open source) | Best |
| tpm2-tools via subprocess | tpm2-tools CLI, subprocess calls | No Python bindings needed; well-documented CLI | Subprocess overhead; harder to test; string parsing | tpm2-tools installed in container | Same | Low | Acceptable |
| Custom OP-TEE TA | C TA in OP-TEE, Python CA via libteec | Maximum control; no dependency on TPM stack | Very high development effort; C code in secure world | OP-TEE dev environment, ARM toolchain | Strongest (code runs in TrustZone) | High | Overkill |
**Recommendation**: tpm2-pytss FAPI. High-level API, Python-native, same pattern as existing cryptography library usage.
### Component: Resource Download (simplified)
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| Single encrypted file on CDN | boto3 (existing), CDN signed URLs | Removes split/merge complexity; single download | Larger download per request (no partial caching) | CDN config | Encrypted at rest + in transit | Same CDN cost | Best |
| Keep CDN big + API small (current) | Existing code | No migration needed | Unnecessary complexity for TPM path | Both API and CDN | Split-key defense | Same | Legacy only |
**Recommendation**: Single-file download for TPM path. Legacy path retains split for backward compatibility.
### Component: Docker Unlock (TPM-enhanced)
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| TPM-sealed archive key | fTPM, tpm2-pytss | Key never leaves TPM; no network download needed for key | Requires provisioned fTPM | fTPM provisioned with sealed key | Strongest — offline decryption possible | Low | Best |
| Key fragment from API (current) | HTTPS download | Works without TPM | Requires network; key fragment in memory | API reachable | Current level | Zero | Legacy only |
**Recommendation**: TPM-sealed archive key for provisioned devices. The key can be sealed into the TPM during device provisioning, eliminating the need to download a key fragment at unlock time.
### Component: Migration/Coexistence
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| Feature flag + SecurityProvider abstraction | ABC, env var, /dev/tpm0 detection | Clean separation; zero risk to existing deployments | Two code paths to maintain during transition | None | Both paths maintain security | Low | Best |
| Hard cutover | N/A | Simple (one path) | Breaks non-TPM devices | All devices must have TPM | N/A | High risk | Poor |
**Recommendation**: Feature flag with auto-detection. Gradual rollout.
## Testing Strategy
### Integration / Functional Tests
- SecurityProvider auto-detection: with and without /dev/tpm0
- TpmSecurityProvider: seal/unseal round-trip (requires TPM simulator — swtpm)
- LegacySecurityProvider: all existing tests pass unchanged
- Single-file download: encrypt → upload → download → decrypt round-trip
- Docker unlock with TPM-sealed key: decrypt archive without network key download
- Migration: same resource accessible via both providers (different encryption)
### Non-Functional Tests
- Performance: TPM seal/unseal latency vs current SHA-384 key derivation
- Performance: single-file download vs split download (expect improvement)
- Security: verify TPM-sealed key cannot be extracted without hardware
- Security: verify legacy path still works identically to current behavior
## References
- NVIDIA Jetson Linux Developer Guide r36.4.4 — Firmware TPM: https://docs.nvidia.com/jetson/archives/r36.4.4/DeveloperGuide/SD/Security/FirmwareTPM.html
- NVIDIA JetPack 6.1 Blog: https://developer.nvidia.com/blog/nvidia-jetpack-6-1-boosts-performance-and-security-through-camera-stack-optimizations-and-introduction-of-firmware-tpm/
- tpm2-pytss: https://github.com/tpm2-software/tpm2-pytss
- tpm2-pytss FAPI docs: https://tpm2-pytss.readthedocs.io/en/latest/fapi.html
- SecEdge — Securing Edge AI through Trusted Computing: https://www.secedge.com/tcg-blog-securing-edge-ai-through-trusted-computing/
- Thistle Technologies — Securing AI Models on Edge Devices: https://thistle.tech/blog/securing-ai-models-on-edge-devices
- NVIDIA Developer Forums — fTPM PCR issues: https://forums.developer.nvidia.com/t/access-ftpm-pcr-registers/328636
- Docker TPM access: https://devops.stackexchange.com/questions/8509/accessing-tpm-from-inside-a-docker-container
## Related Artifacts
- AC Assessment: `_docs/02_task_plans/tpm-replaces-binary-split/00_research/00_ac_assessment.md`
- Fact Cards: `_docs/02_task_plans/tpm-replaces-binary-split/00_research/02_fact_cards.md`
- Reasoning Chain: `_docs/02_task_plans/tpm-replaces-binary-split/00_research/04_reasoning_chain.md`
@@ -0,0 +1,798 @@
# Solution Draft 02: TPM Security Implementation Guide
## Overview
This document is a comprehensive implementation guide for replacing the binary-split resource scheme with TPM-based hardware-rooted security on Jetson Orin Nano devices. It covers fTPM provisioning, full-disk encryption, OS hardening, tamper-responsive enclosures, the simplified loader architecture, and a phased implementation plan.
Prerequisite reading: `solution_draft01.md` (architecture overview), `security_analysis.md` (threat model).
---
## 1. fTPM Fusing and Provisioning
### 1.1 Hardware Required
| Item | Purpose | Cost |
| --------------------------------------- | ----------------------------------------- | -------- |
| x86 Ubuntu host PC (20.04 or 22.04 LTS) | Runs NVIDIA flashing/fusing tools | Existing |
| USB-C cable (data-capable) | Connects host to Jetson in recovery mode | ~$10 |
| Jetson Orin Nano dev kit (expendable) | First fuse target; fusing is irreversible | ~$250 |
| Jetson Orin Nano dev kit (kept unfused) | Ongoing development and debugging | ~$250 |
No specialized lab equipment, JTAG probes, or custom tooling is required. The entire fusing and provisioning process runs on a standard PC.
### 1.2 Roles: ODM vs OEM
NVIDIA's fTPM docs describe two separate entities:
- **ODM (Original Design Manufacturer)**: Designs the fTPM integration, generates KDK0 per device, runs the CA server, signs EK certificates, creates firmware packages.
- **OEM (Original Equipment Manufacturer)**: Adds disk encryption keys, assembles hardware, burns fuses at the factory, ships the final product.
In large-scale manufacturing these are different companies with a formal key handoff. **In our case, we are both ODM and OEM** — we design, provision, flash, and deploy ourselves. NVIDIA covers this in their fTPM guide Appendix B with a **simplified single-entity flow** that eliminates the cross-company handoff and roughly halves the provisioning complexity.
### 1.3 Key Derivation Chain
The full derivation from hardware fuses to usable keys:
```
KDK0 (256-bit random, burned into SoC fuses at manufacturing)
├── Silicon_ID = KDF(key=KDK0, info=Device_SN)
│ Device_SN = OEM_ID || SN (unique per device)
├── fTPM_Seed = KDF(key=Silicon_ID, constant_str1)
│ Passed from MB2 bootloader to OP-TEE via encrypted TrustZone memory
├── fTPM_Root_Seed = KDF(key=fTPM_Seed, constant_str)
├── EPS = KDF(key=fTPM_Root_Seed, info=Device_SN, salt=EPS_Seed)
│ EPS_Seed is a 256-bit random number from odm_ekb_gen.py, stored in EKB
│ EPS (Endorsement Primary Seed) is the root identity of the fTPM entity
├── SRK = TPM2_CreatePrimary(EPS)
│ Deterministic — re-derived from EPS on every boot
│ Never stored persistently, never leaves the secure world
└── Sealed blobs (your encryption keys)
Encrypted under SRK, stored as files on disk
Only unsealable on this specific device
```
Every KDF step is one-way. Knowing a derived value does not reveal its parent. Two devices with different KDK0 values produce entirely different key trees.
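To make the shape of the chain concrete, here is a toy model of it. It uses HMAC-SHA256 as a stand-in KDF and literal placeholder labels for `constant_str1` and `constant_str`; NVIDIA's actual KDF construction and constants are not given in this document, so nothing below reproduces real device values. The SRK step is omitted because it is a TPM operation, not a plain KDF.

```python
import hashlib
import hmac


def kdf(key: bytes, info: bytes, salt: bytes = b"") -> bytes:
    """Stand-in one-way KDF (HMAC-SHA256). Not NVIDIA's real construction."""
    return hmac.new(key, salt + info, hashlib.sha256).digest()


def derive_tree(kdk0: bytes, oem_id: bytes, sn: bytes, eps_seed: bytes) -> dict:
    """Follow the chain KDK0 -> Silicon_ID -> fTPM_Seed -> fTPM_Root_Seed -> EPS."""
    device_sn = oem_id + sn                            # Device_SN = OEM_ID || SN
    silicon_id = kdf(kdk0, device_sn)
    ftpm_seed = kdf(silicon_id, b"constant_str1")      # placeholder label
    ftpm_root_seed = kdf(ftpm_seed, b"constant_str")   # placeholder label
    eps = kdf(ftpm_root_seed, device_sn, salt=eps_seed)
    return {
        "silicon_id": silicon_id,
        "ftpm_seed": ftpm_seed,
        "ftpm_root_seed": ftpm_root_seed,
        "eps": eps,
    }


# Two devices that differ only in KDK0 share no derived value.
tree_a = derive_tree(b"\x01" * 32, b"OEM1", b"0001", b"\xaa" * 32)
tree_b = derive_tree(b"\x02" * 32, b"OEM1", b"0001", b"\xaa" * 32)
print(all(tree_a[name] != tree_b[name] for name in tree_a))
```

The same inputs always give the same tree, which is why the SRK can be re-derived on every boot instead of being stored.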
### 1.4 Provisioning Process (Single-Entity / ODM+OEM Flow)
#### Step 1: Install BSP and FSKP Packages
```
mkdir ${BSP_TOP} && cd ${BSP_TOP}
tar jvxf jetson_linux_${rel_ver}_aarch64.tbz2
tar jvxf public_sources.tbz2
cd Linux_for_Tegra/rootfs
sudo tar jvxpf tegra_linux_sample-root-filesystem_${rel_ver}_aarch64.tbz2
cd ${BSP_TOP}/Linux_for_Tegra
sudo ./apply_binaries.sh
cd ${BSP_TOP}
tar jvxf fskp_partner_t234_${rel_ver}_aarch64.tbz2
```
#### Step 2: Generate PKC and SBK Keys (Secure Boot)
```
openssl genrsa -out pkc.pem 3072
python3 gen_sbk_key.py --out sbk.key
```
PKC (Public Key Cryptography) key signs all boot chain images. SBK (Secure Boot Key) encrypts them. Both are burned into fuses and used for every subsequent flash.
#### Step 3: Generate Per-Device KDK0 and Silicon_ID
```
python3 kdk_gen.py \
--oem-id ${OEM_ID} \
--sn ${DEVICE_SN} \
--output-dir ${KDK_DB}
```
Outputs per device: KDK0 (256-bit), Device_SN, Silicon_ID public key. **KDK0 must be discarded after the fuseblob and EKB are generated** — keeping it in storage risks leaks.
#### Step 4: Generate Fuseblob
```
python3 fskp_fuseburn.py \
--kdk-db ${KDK_DB} \
--pkc-key pkc.pem \
--sbk-key sbk.key \
--fuse-xml fuse_config.xml \
--output-dir ${FUSEBLOB_DB}
```
The fuse config XML specifies which fuses to burn: KDK0, PKC hash, SBK, OEM_K1, SECURITY_MODE, ARM_JTAG_DISABLE, etc.
#### Step 5: Generate fTPM EKB (EK Certificates + EPS Seed)
```
python3 odm_ekb_gen.py \
--kdk-db ${KDK_DB} \
--output-dir ${EKB_FTPM_DB}
```
This generates EK CSRs, signs them with your CA, and packages the EPS Seed + EK certificates into per-device EKB images. In the single-entity flow, you run your own CA:
```
bash ftpm_manufacturer_ca_simulator.sh # Replace with real CA in production
```
Then merge with disk encryption keys:
```
python3 oem_ekb_gen.py \
--ekb-ftpm-db ${EKB_FTPM_DB} \
--user-keys sym2_t234.key \
--oem-k1 oem_k1.key \
--output-dir ${EKB_FINAL_DB}
```
#### Step 6: Burn Fuses (IRREVERSIBLE)
Put the device in USB recovery mode:
- If powered off: connect DC power (device enters recovery automatically on some carrier boards)
- If powered on: `sudo reboot --force forced-recovery`
- Verify: `lsusb` shows NVIDIA device
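In a scripted pipeline the `lsusb` check can be automated before any fuse command runs. The sketch below filters `lsusb` output by NVIDIA's USB vendor ID (`0955`); the function names are hypothetical. A match only shows that an NVIDIA USB device is attached, so treat it as a necessary condition rather than proof of recovery mode.

```python
import re
import subprocess

NVIDIA_USB_VENDOR_ID = "0955"
_ID_PATTERN = re.compile(r"ID\s+([0-9a-fA-F]{4}):([0-9a-fA-F]{4})")


def nvidia_usb_devices(lsusb_output: str) -> list[str]:
    """Return the lsusb lines whose vendor ID is NVIDIA's."""
    matches = []
    for line in lsusb_output.splitlines():
        found = _ID_PATTERN.search(line)
        if found and found.group(1).lower() == NVIDIA_USB_VENDOR_ID:
            matches.append(line.strip())
    return matches


def nvidia_device_attached() -> bool:
    """Run lsusb on the host and report whether any NVIDIA device shows up."""
    result = subprocess.run(["lsusb"], capture_output=True, text=True, check=True)
    return bool(nvidia_usb_devices(result.stdout))
```

A factory script could call `nvidia_device_attached()` and abort early when it returns False, before attempting the dry run.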
Test first (dry run):
```
sudo ./odmfuse.sh --test -X fuse_config.xml -i 0x23 jetson-orin-nano-devkit
```
Burn for real:
```
sudo ./odmfuse.sh -X fuse_config.xml -i 0x23 jetson-orin-nano-devkit
```
After the `SECURITY_MODE` fuse is burned (value 0x1), **all further fuse writes are blocked permanently** (except a few ODM-reserved fuses).
#### Step 7: Flash Signed + Encrypted Images
```
sudo ROOTFS_ENC=1 ./flash.sh \
-u pkc.pem \
-v sbk.key \
-i ./sym2_t234.key \
--ekb ${EKB_FINAL_DB}/ekb-${DEVICE_SN}.signed \
jetson-orin-nano-devkit \
nvme0n1p1
```
#### Step 8: On-Device fTPM Provisioning (One-Time)
After first boot, run the provisioning script on the device:
```
sudo ./ftpm_provisioning.sh
```
This queries EK certificates from the EKB, stores them in fTPM NV memory, takes fTPM ownership, and creates EK handles. Only needs to run once per device.
### 1.5 Difficulty Assessment
| Aspect | Difficulty | Notes |
| ----------------------------- | ------------------- | ---------------------------------------------------- |
| First device (learning curve) | Medium-High | NVIDIA docs are detailed but dense. Budget 2-3 days. |
| Subsequent devices (scripted) | Low | Same pipeline, different KDK0/SN per device. |
| Risk | High (irreversible) | Always test on expendable dev board first. |
| Automation potential | High | Entire pipeline is scriptable for factory floor. |
### 1.6 Known Issues
- `odmfuseread.sh` has a Python 3 compatibility bug: it calls `getiterator()`, which was removed from `xml.etree.ElementTree` in Python 3.9. Fix: replace the call on line 1946 of `tegraflash_impl_t234.py` with `xml_tree.iter('file')`.
- Forum reports of PCR7 values not persisting across reboots. Our design deliberately avoids PCR-sealed keys — we use FAPI seal/unseal under SRK hierarchy only.
- Forum reports of NV handle loss after reboot on some Orin devices. Not blocking for our use case (SRK is re-derived from fuses, not stored in NV).
---
## 2. Storage Encryption
### 2.1 Recommendation: Full-Disk Encryption
Encrypt the entire NVMe rootfs partition, not just selected model files.
**Why full disk instead of selective encryption:**
| Approach | Protects models | Protects logs/config/temp files | Custom code needed | Performance |
| ---------------------------- | --------------- | ------------------------------------------------ | --------------------------------------- | ---------------------------- |
| Selective (model files only) | Yes | No — metadata, logs, decrypted artifacts exposed | Yes — application-level encrypt/decrypt | Minimal |
| Full disk (LUKS) | Yes | Yes — everything on disk is ciphertext | No — kernel handles it transparently | Minimal (HW-accelerated AES) |
Full-disk encryption is built into NVIDIA's Jetson Linux stack. No application code changes needed for the disk layer.
### 2.2 How Full-Disk Encryption Works
```
Flashing (host PC):
gen_ekb → sym2_t234.key (DEK) + eks_t234.img (EKB image)
ROOTFS_ENC=1 flash.sh → rootfs encrypted with DEK, DEK packaged in EKB
Boot (on device):
MB2 reads KDK0 from fuses
→ derives K1
→ decrypts EKB
→ extracts DEK
→ passes DEK to dm-crypt kernel module
dm-crypt + LUKS mounts rootfs transparently
Application sees a normal filesystem — encryption is invisible
```
The application never touches the disk encryption key. The kernel's dm-crypt layer handles it, and the volume is unlocked during early boot, before the root filesystem is mounted.
### 2.3 Double Encryption (Defense in Depth)
For AI model files, two independent encryption layers:
1. **Layer 1 — Full Disk LUKS** (kernel): Protects everything on disk. Key derived from fuses via EKB. Transparent to applications.
2. **Layer 2 — Application-level TPM-sealed encryption**: Model files encrypted with a key sealed in the fTPM. Decrypted by the loader at runtime.
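An attacker who somehow bypasses disk encryption (e.g., cold boot while the filesystem is mounted) still faces the application-level encryption. Conversely, an attacker who defeats the application-level encryption still has to get past disk encryption to reach the files at all.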
### 2.4 Setup Steps
1. Generate encryption keys from OP-TEE source:
```
cd ${BSP_TOP}/Linux_for_Tegra/source/nvidia-jetson-optee-source
cd optee/samples/hwkey-agent/host/tool/gen_ekb/
sudo chmod +x example.sh && ./example.sh
```
Outputs: `sym2_t234.key` (DEK) and `eks_t234.img` (EKB image).
2. Place keys:
```
cp sym2_t234.key ${BSP_TOP}/Linux_for_Tegra/
cp eks_t234.img ${BSP_TOP}/Linux_for_Tegra/bootloader/
```
3. Verify EKB integrity:
```
hexdump -C -n 4 -s 0x24 eks_t234.img
# Must show magic bytes "EEKB"
```
4. Configure NVMe partition size in `flash_l4t_t234_nvme_rootfs_enc.xml`:
- Set `NUM_SECTORS` based on NVMe capacity (e.g., 900000000 for 500GB)
- Set `encrypted="true"` for the rootfs partition
5. Flash with encryption:
```
sudo ROOTFS_ENC=1 ./tools/kernel_flash/l4t_initrd_flash.sh \
--external-device nvme0n1p1 \
-c flash_l4t_t234_nvme_rootfs_enc.xml \
-i ./sym2_t234.key \
-u pkc.pem -v sbk.key \
jetson-orin-nano-devkit \
nvme0n1p1
```
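The EKB integrity check in step 3 can also be scripted, for example as a gate before flashing. This sketch is a Python equivalent of the `hexdump` command and relies only on what that step states (the magic bytes `EEKB` at offset `0x24`); the function name is hypothetical.

```python
EKB_MAGIC = b"EEKB"
EKB_MAGIC_OFFSET = 0x24


def has_ekb_magic(image_path: str) -> bool:
    """Return True if the file carries the EKB magic at the expected offset."""
    with open(image_path, "rb") as image:
        image.seek(EKB_MAGIC_OFFSET)
        # A short or truncated file yields fewer than 4 bytes and fails the check.
        return image.read(len(EKB_MAGIC)) == EKB_MAGIC
```

Calling `has_ekb_magic("eks_t234.img")` before step 5 catches a missing or truncated image early. It does not validate the EKB contents beyond the magic bytes.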
---
## 3. Debug Access Strategy
### 3.1 The Problem
After Secure Boot fusing, JTAG disabling, and OS hardening, the device has no interactive access. How do you develop, debug, and perform field maintenance?
### 3.2 Solution: Dual-Image Approach
Standard embedded Linux practice: maintain two OS images, both signed with the same PKC key.
| Property | Development Image | Production Image |
| ---------------------------------- | --------------------------------- | ----------------------- |
| Secure Boot signature | Signed with PKC key | Signed with PKC key |
| Boots on fused device | Yes | Yes |
| SSH access | Yes (key-based only, no password) | No (sshd not installed) |
| Serial console | Enabled | Disabled |
| ptrace / /dev/mem | Allowed | Blocked (lockdown mode) |
| Debug tools (gdb, strace, tcpdump) | Installed | Not present |
| Getty on TTY | Running | Not spawned |
| Desktop environment | Optional | Not installed |
| Application | Your loader + inference | Your loader + inference |
Secure Boot verifies the **signature**, not the **contents** of the image. Both images are valid as long as they're signed with your PKC key. An attacker cannot create either image without the private key.
### 3.3 Workflow
**During development:**
1. Flash the dev image to a fused device
2. SSH in via key-based authentication
3. Develop, debug, iterate
4. When done, flash the prod image for deployment
**Production deployment:**
1. Flash the prod image at the factory
2. Device boots directly into your application
3. No shell, no SSH, no serial — only your FastAPI endpoints
**Field debug (emergency):**
1. Connect host PC via USB-C
2. Put device in USB recovery mode (silicon ROM, always available)
3. Reflash with the dev image (requires PKC private key to sign)
4. SSH in, diagnose, fix
5. Reflash with prod image, redeploy
USB recovery mode is hardwired in silicon. It always works regardless of what OS is installed. But after Secure Boot fusing, it **only accepts images signed with your PKC key**. An attacker who enters recovery mode but lacks the signing key is stuck.
### 3.4 Optional: Hardware Debug Jumper
A physical GPIO pin on the carrier board that, when shorted at boot, tells the init system to start SSH:
```
Boot → systemd reads GPIO pin → if HIGH: start sshd.service
→ if LOW: sshd not started (production behavior)
```
Opening the case to access the jumper triggers the tamper enclosure → keys are zeroized. So this is only useful during controlled maintenance with the tamper system temporarily disarmed.
### 3.5 PKC Key Security
The PKC private key is the crown jewel. Whoever holds it can create signed images that boot on any of your fused devices. Protect it accordingly:
- Store on an air-gapped machine or HSM (Hardware Security Module)
- Never store in git, CI/CD pipelines, or cloud storage
- Limit access to 1-2 people
- Consider splitting with Shamir's Secret Sharing for key ceremonies
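For readers unfamiliar with Shamir's Secret Sharing, the sketch below shows the core idea: a secret becomes the constant term of a random polynomial over a prime field, and any `threshold` shares recover it by interpolation at zero. It assumes the value being split is a symmetric key of at most 512 bits (for instance one that encrypts `pkc.pem`, which is itself an assumption); the 2-of-3 parameters and function names are illustrative. A real key ceremony should use a vetted implementation.

```python
import secrets

PRIME = 2**521 - 1  # Mersenne prime; the field is large enough for a 512-bit secret


def split_secret(secret: int, threshold: int, shares: int) -> list[tuple[int, int]]:
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range for the chosen field")
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")
    # Random polynomial of degree threshold-1 with the secret as constant term.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        acc = 0
        for coeff in reversed(coeffs):  # Horner's rule
            acc = (acc * x + coeff) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


key = int.from_bytes(secrets.token_bytes(32), "big")
parts = split_secret(key, threshold=2, shares=3)
print(recover_secret(parts[:2]) == key)
```

With a 2-of-3 split, losing one share does not lock the team out, and a single stolen share reveals nothing about the key.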
---
## 4. Tamper Enclosure
### 4.1 Threat Model for Physical Access
| Attack | Without enclosure | With tamper-responsive enclosure |
| ---------------------------------- | ----------------------------- | ------------------------------------------------ |
| Unscrew case, desolder eMMC/NVMe | Easy (minutes) | Mesh breaks → key destroyed → data irrecoverable |
| Probe DRAM bus with logic analyzer | Moderate (requires soldering) | Case opening triggers zeroization first |
| Cold boot (freeze RAM) | Moderate | Temperature sensor triggers zeroization |
| Connect to board debug headers | Easy | Case must be opened → zeroization |
### 4.2 Option A: Zymkey HSM4 + Custom Enclosure (~$150-250/unit)
**Recommended for initial production runs (up to ~500 units).**
**Bill of Materials:**
| Component | Unit Cost | Source |
| -------------------------------------- | ------------- | ----------------------------- |
| Zymkey HSM4 (I2C security module) | ~$71 | zymbit.com |
| Custom aluminum enclosure | ~$30-80 | CNC shop / Alibaba at volume |
| Flex PCB tamper mesh panels (set of 6) | ~$10-30 | JLCPCB / PCBWay |
| CR2032 coin cell battery | ~$2 | Standard electronics supplier |
| 30 AWG perimeter wire (~2 ft) | ~$1 | Standard electronics supplier |
| Assembly labor + connectors | ~$20-40 | — |
| **Total** | **~$134-224** | — |
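The total row can be re-derived from the line items, which is useful when a supplier quote changes. The figures below are copied from the table; fixed-price items are written as a range with equal bounds, and the helper name is hypothetical.

```python
# (low, high) unit cost in USD, taken from the bill of materials above.
BOM_OPTION_A = {
    "Zymkey HSM4": (71, 71),
    "Custom aluminum enclosure": (30, 80),
    "Flex PCB tamper mesh panels": (10, 30),
    "CR2032 coin cell battery": (2, 2),
    "30 AWG perimeter wire": (1, 1),
    "Assembly labor + connectors": (20, 40),
}


def total_range(bom: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low and high bounds of every line item."""
    low = sum(item_low for item_low, _ in bom.values())
    high = sum(item_high for _, item_high in bom.values())
    return low, high


print(total_range(BOM_OPTION_A))  # (134, 224)
```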
**How it works:**
```
┌──────── Aluminum Enclosure ────────┐
│ │
│ All inner walls lined with flex │
│ PCB tamper mesh (conductive traces │
│ in space-filling curve pattern) │
│ │
│ Mesh traces connect to Zymkey │
│ HSM4's 2 perimeter circuits │
│ │
│ ┌───────────┐ ┌───────────────┐ │
│ │ Zymkey │ │ Jetson Orin │ │
│ │ HSM4 │ │ Nano │ │
│ │ │ │ │ │
│ │ I2C ◄─────┤ │ GPIO header │ │
│ │ GPIO4 ◄───┤ │ │ │
│ │ │ │ │ │
│ │ [CR2032] │ │ │ │
│ │ (battery │ │ │ │
│ │ backup) │ │ │ │
│ └───────────┘ └───────────────┘ │
│ │
│ Tamper event (mesh broken, │
│ temperature anomaly, power loss │
│ without battery): │
│ → Zymkey destroys stored keys │
│ → Master encryption key is gone │
│ → Encrypted disk is permanently │
│ unrecoverable │
└─────────────────────────────────────┘
```
**Zymkey HSM4 features:**
- 2 independent perimeter breach detection circuits (connect to mesh)
- Accelerometer (shock/orientation tamper detection)
- Main power monitor
- Battery-backed RTC (36-60 months on CR2032)
- Secure key storage (ECC P-256, AES-256, SHA-256)
- I2C interface (fits Jetson's 40-pin GPIO header)
- Configurable tamper response: notify host, or destroy keys on breach
**Flex PCB tamper mesh design:**
- Use the KiCad anti-tamper mesh plugin to generate space-filling curve trace patterns
- Order from JLCPCB or PCBWay as flex PCBs (~$5-15 per panel)
- Attach to enclosure inner walls with adhesive
- Wire to Zymkey's perimeter circuit connectors (Hirose DF40HC)
- Any cut, drill, or peel that breaks a trace triggers the tamper event
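To illustrate what a space-filling trace looks like, the sketch below generates a Hilbert curve, one well-known space-filling curve. This is an assumption for illustration: the KiCad plugin may use a different pattern generator, and `mesh_path` and its `pitch_mm` parameter are hypothetical names.

```python
def hilbert_d2xy(order, d):
    """Map distance d along a Hilbert curve to (x, y) on a 2**order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the quadrant so consecutive segments stay connected.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def mesh_path(order, pitch_mm):
    """Return trace waypoints (in mm) covering a square panel without gaps."""
    cells = (1 << order) ** 2
    return [
        (x * pitch_mm, y * pitch_mm)
        for x, y in (hilbert_d2xy(order, d) for d in range(cells))
    ]


if __name__ == "__main__":
    # An 8x8 grid at 0.5 mm pitch: one continuous trace visiting every cell.
    for point in mesh_path(3, 0.5)[:8]:
        print(point)
```

Because the trace visits every grid cell exactly once, a hole wider than the pitch has to cut it somewhere, which is the property the tamper mesh relies on.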
### 4.3 Option B: Full DIY (~$80-150/unit)
**For higher volumes (500+ units) where per-unit cost matters.**
| Component | Unit Cost |
| ------------------------------------------------- | ------------ |
| STM32G4 microcontroller | ~$5 |
| Flex PCB tamper mesh (KiCad plugin) | ~$10-30 |
| Battery-backed SRAM (Cypress CY14B101 or similar) | ~$5 |
| Custom PCB for STM32 monitor circuit | ~$10-20 |
| Aluminum enclosure | ~$30-80 |
| Coin cell + holder | ~$3 |
| **Total** | **~$63-143** |
The STM32G4's high-resolution timer (sub-200 ps resolution) enables Time-Domain Reflectometry (TDR) monitoring of the mesh: the MCU sends pulses into the trace and detects the echoes that appear when the trace is damaged. This is more sensitive than simple resistance monitoring.
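As a rough worked example of what the timer resolution means for locating damage, the sketch below uses the standard TDR relation (distance = propagation speed × echo delay / 2). The velocity factor of 0.5 is an assumed illustrative value for a flex PCB trace, not a measured one.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def tdr_distance_m(echo_delay_s, velocity_factor):
    """Distance to a fault given the round-trip echo delay.

    The pulse travels to the fault and back, hence the division by 2.
    """
    return SPEED_OF_LIGHT_M_PER_S * velocity_factor * echo_delay_s / 2


if __name__ == "__main__":
    timer_resolution_s = 200e-12   # "sub-200ps" timer step from the text
    velocity_factor = 0.5          # ASSUMPTION: depends on the flex PCB stack-up
    step_mm = tdr_distance_m(timer_resolution_s, velocity_factor) * 1000
    print(f"One timer step corresponds to about {step_mm:.1f} mm of trace")
```

Under these assumptions one timer step maps to roughly 15 mm of trace length, so the fault position can be estimated to within a few centimetres.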
The master encryption key is stored in battery-backed SRAM (not in the Jetson's fTPM). On tamper detection, the STM32 cuts power to the SRAM and the key vanishes within microseconds.
This option requires more engineering effort upfront (STM32 firmware, PCB design, integration testing) but has a lower per-unit BOM.
### 4.4 Option C: Epoxy Potting (~$30-50/unit)
**Minimum viable physical protection.**
- Encapsulate the entire Jetson board + carrier in hardened epoxy resin
- Physical extraction requires grinding/dissolving the epoxy, which destroys the board and traces
- No active zeroization — if the attacker is patient and skilled enough, they can extract components
- Best combined with Option A or B: epoxy plus an active tamper mesh
### 4.5 Recommendation
| Production volume | Recommendation | Per-unit cost |
| -------------------- | ----------------------------------------- | ------------- |
| Prototype / first 10 | Option A (Zymkey HSM4) + Option C (epoxy) | ~$180-300     |
| 10-500 units | Option A (Zymkey HSM4) | ~$150-250 |
| 500+ units | Option B (custom STM32) | ~$80-150 |
All options fit within the $300/unit budget.
---
## 5. Simplified Loader Architecture
### 5.1 Current Architecture
```
main.py (FastAPI)
├── POST /login
│ → api_client.pyx: set_credentials, login()
│ → credentials.pyx: email, password
│ → security.pyx: get_hw_hash(hardware_info)
│ → hardware_service.pyx: CPU/GPU/RAM/serial strings
├── POST /load/{filename}
│ → api_client.pyx: load_big_small_resource(filename, folder)
│ 1. Fetch SMALL part from API (POST /resources/get/{folder})
│ → Decrypt with get_api_encryption_key(email+password+hw_hash+salt)
│ 2. Fetch BIG part from CDN (S3 download) or local cache
│ 3. Concatenate small + big
│ 4. Decrypt merged blob with get_resource_encryption_key() (fixed internal string)
│ → Return decrypted bytes
├── POST /upload/{filename}
│ → api_client.pyx: upload_big_small_resource(file, folder)
│ 1. Encrypt full resource with get_resource_encryption_key()
│ 2. Split at min(3KB, 30% of ciphertext)
│ 3. Upload big part to CDN
│ 4. Upload small part to API
└── POST /unlock
→ binary_split.py:
1. download_key_fragment(RESOURCE_API_URL, token) — HTTP GET from API
2. decrypt_archive(images.enc, SHA256(key_fragment)) — AES-CBC stream
3. docker load -i result.tar
```
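The split step of the upload flow can be sketched as follows. This is a minimal illustration, assuming "3KB" means 3 × 1024 bytes and that the small part is the leading slice of the ciphertext; the function names are hypothetical and not taken from `api_client.pyx`.

```python
from typing import Tuple

SMALL_PART_MAX_BYTES = 3 * 1024   # ASSUMPTION: "3KB" read as 3 * 1024 bytes
SMALL_PART_MAX_PERCENT = 30


def split_resource(ciphertext: bytes) -> Tuple[bytes, bytes]:
    """Split at min(3KB, 30% of ciphertext) into (small, big) parts."""
    split_at = min(SMALL_PART_MAX_BYTES, len(ciphertext) * SMALL_PART_MAX_PERCENT // 100)
    return ciphertext[:split_at], ciphertext[split_at:]


def merge_resource(small: bytes, big: bytes) -> bytes:
    """Reverse of split_resource: concatenate small + big before decryption."""
    return small + big


if __name__ == "__main__":
    blob = bytes(100_000)
    small, big = split_resource(blob)
    print(len(small), len(big))
```

For large resources the small part is capped at 3KB; the 30% rule only matters for ciphertexts under roughly 10KB.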
**Security dependencies in current architecture:**
- `security.pyx`: SHA-384 key derivation from `email + password + hw_hash + salt`
- `hardware_service.pyx`: String-based hardware fingerprint (spoofable)
- `binary_split.py`: Key fragment downloaded from API server
- Split storage: security depends on attacker not having both API and CDN access
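The weakness of the string-based binding can be seen in a small sketch of the derivation. The exact concatenation order and encoding used in `security.pyx` are not shown in this document, so the function below is an assumption for illustration only.

```python
import hashlib


def derive_api_key(email: str, password: str, hw_hash: str, salt: str) -> bytes:
    """SHA-384 over email + password + hw_hash + salt (48-byte digest).

    ASSUMPTION: plain string concatenation encoded as UTF-8.
    """
    material = (email + password + hw_hash + salt).encode("utf-8")
    return hashlib.sha384(material).digest()


if __name__ == "__main__":
    # Anyone who learns the input strings can recompute the key on any machine:
    # nothing in the derivation depends on a hardware secret.
    original = derive_api_key("user@example.com", "pw", "cpu:X;gpu:Y", "salt")
    replayed = derive_api_key("user@example.com", "pw", "cpu:X;gpu:Y", "salt")
    print(original == replayed)
```

A TPM-sealed key has no such software-visible inputs, which is the core difference the proposed architecture relies on.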
### 5.2 Proposed TPM Architecture
```
main.py (FastAPI) — routes and request/response contracts unchanged
├── POST /login
│ → api_client.pyx: set_credentials, login()
│ → credentials.pyx: email, password (unchanged — still needed for API auth)
│ → security_provider.pyx: auto-detect TPM or legacy
├── POST /load/{filename}
│ → api_client.pyx: load_resource(filename, folder)
│ [TPM path]:
│ 1. Fetch single encrypted file from CDN (S3 download)
│ 2. security_provider.decrypt(data)
│ → tpm_security_provider.pyx: FAPI.unseal() → master key → AES decrypt
│ 3. Return decrypted bytes
│ [Legacy path]:
│ (unchanged — load_big_small_resource as before)
├── POST /upload/{filename}
│ → api_client.pyx: upload_resource(file, folder)
│ [TPM path]:
│ 1. security_provider.encrypt(data)
│ → tpm_security_provider.pyx: AES encrypt with TPM-derived key
│ 2. Upload single file to CDN
│ [Legacy path]:
│ (unchanged — upload_big_small_resource as before)
└── POST /unlock
[TPM path]:
1. security_provider.get_archive_key()
→ tpm_security_provider.pyx: FAPI.unseal() → archive key (no network call)
2. decrypt_archive(images.enc, archive_key)
3. docker load -i result.tar
[Legacy path]:
(unchanged — download_key_fragment from API)
```
### 5.3 SecurityProvider Interface
```python
from abc import ABC, abstractmethod


class SecurityProvider(ABC):
    """Common interface for the TPM and legacy key backends."""

    @abstractmethod
    def encrypt(self, data: bytes) -> bytes: ...

    @abstractmethod
    def decrypt(self, data: bytes) -> bytes: ...

    @abstractmethod
    def get_archive_key(self) -> bytes: ...
```
Two implementations:
- **TpmSecurityProvider**: Calls `tpm2-pytss` FAPI to unseal master key from TPM. Uses master key for AES-256-CBC encrypt/decrypt. Archive key is also TPM-sealed (no network download).
- **LegacySecurityProvider**: Wraps existing `security.pyx` logic unchanged. Key derivation from `email+password+hw_hash+salt`. Archive key downloaded from API.
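A minimal sketch of how the TPM-side key retrieval could be structured so it is testable without hardware. It assumes a FAPI-like object exposing `create_seal` and `unseal` (the two calls named in this plan); `SealedKeyStore`, `FakeFapi`, and the key paths are hypothetical names, and the AES step is left out.

```python
class SealedKeyStore:
    """Fetches keys from a FAPI-like object exposing unseal(path) -> bytes."""

    def __init__(self, fapi, master_key_path, archive_key_path):
        self._fapi = fapi
        self._master_key_path = master_key_path
        self._archive_key_path = archive_key_path

    def get_master_key(self) -> bytes:
        return bytes(self._fapi.unseal(self._master_key_path))

    def get_archive_key(self) -> bytes:
        # No network call: the archive key is unsealed locally.
        return bytes(self._fapi.unseal(self._archive_key_path))


class FakeFapi:
    """In-memory stand-in for tests; production code would pass a real FAPI object."""

    def __init__(self):
        self._objects = {}

    def create_seal(self, path, data=None):
        self._objects[path] = bytes(data)
        return True

    def unseal(self, path):
        return self._objects[path]


if __name__ == "__main__":
    fapi = FakeFapi()
    # HYPOTHETICAL key paths, chosen only for this example.
    fapi.create_seal("/HS/SRK/example_master_key", data=b"\x01" * 32)
    fapi.create_seal("/HS/SRK/example_archive_key", data=b"\x02" * 32)
    store = SealedKeyStore(fapi, "/HS/SRK/example_master_key", "/HS/SRK/example_archive_key")
    print(len(store.get_master_key()), len(store.get_archive_key()))
```

Injecting the FAPI object keeps the provider logic independent of whether a real TPM, `swtpm`, or a fake is behind it.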
### 5.4 Auto-Detection Logic
At startup:
```
1. Check env var SECURITY_PROVIDER
→ if "tpm": use TpmSecurityProvider (fail hard if TPM unavailable)
→ if "legacy": use LegacySecurityProvider
→ if unset: auto-detect (step 2)
2. Check os.path.exists("/dev/tpm0")
→ if True: attempt TPM connection via FAPI
→ if success: use TpmSecurityProvider
→ if failure: log warning, fall back to LegacySecurityProvider
→ if False: use LegacySecurityProvider
3. Log which provider was selected and why
```
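The selection steps above can be sketched in Python as follows. The function name and its injected callables are hypothetical, and rejecting an unrecognized `SECURITY_PROVIDER` value is an assumption (the plan does not say how such values are handled).

```python
import logging
import os
from typing import Callable, Mapping, Tuple

log = logging.getLogger("security_provider")


def select_provider(
    env: Mapping[str, str],
    device_exists: Callable[[str], bool],
    try_tpm_connect: Callable[[], bool],
) -> Tuple[str, str]:
    """Return (provider_name, reason) following the startup steps."""
    choice = env.get("SECURITY_PROVIDER")
    if choice == "tpm":
        if not try_tpm_connect():
            # Explicit override: fail hard instead of silently downgrading.
            raise RuntimeError("SECURITY_PROVIDER=tpm but the TPM is unavailable")
        result = ("tpm", "forced by SECURITY_PROVIDER")
    elif choice == "legacy":
        result = ("legacy", "forced by SECURITY_PROVIDER")
    elif choice:
        # ASSUMPTION: unknown values are treated as a configuration error.
        raise ValueError(f"unknown SECURITY_PROVIDER value: {choice!r}")
    elif device_exists("/dev/tpm0"):
        if try_tpm_connect():
            result = ("tpm", "auto-detected /dev/tpm0")
        else:
            log.warning("TPM device present but FAPI connection failed; using legacy")
            result = ("legacy", "TPM connection failed")
    else:
        result = ("legacy", "no /dev/tpm0")
    log.info("security provider: %s (%s)", *result)
    return result


if __name__ == "__main__":
    # try_tpm_connect is stubbed here; real code would attempt a FAPI call.
    print(select_provider(os.environ, os.path.exists, lambda: False))
```

Passing the environment and probes as arguments makes each branch easy to exercise in unit tests without a TPM.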
### 5.5 What Changes, What Stays
| Component | TPM path | Legacy path | Notes |
| ------------------------------ | --------------------------------- | ------------------------- | ------------------------------------ |
| `main.py` routes | Unchanged | Unchanged | F1-F6 API contract preserved |
| JWT authentication | Unchanged | Unchanged | Still needed for API access |
| CDN download | Single file | Big/small split | CDN still used for bandwidth |
| AES-256-CBC encryption | Unchanged algorithm | Unchanged | Only the key source changes |
| Key source | TPM-sealed master key | SHA-384(email+pw+hw+salt) | Core difference |
| `hardware_service.pyx` | Not used | Used | TPM replaces string fingerprinting |
| `binary_split.py` key download | Eliminated | Used | TPM-sealed key is local |
| `security.pyx` | Wrapped in LegacySecurityProvider | Active | Not deleted — legacy devices need it |
### 5.6 Docker Container Changes
The loader runs in Docker. For TPM access:
```yaml
# docker-compose.yml additions for TPM path
services:
loader:
devices:
- /dev/tpm0:/dev/tpm0
- /dev/tpmrm0:/dev/tpmrm0
environment:
- SECURITY_PROVIDER=tpm # or leave unset for auto-detect
```
No `--privileged` flag needed. Device mounts are sufficient.
Container image needs additional packages:
- `tpm2-tss` (native library, >= 2.4.0)
- `tpm2-pytss` (Python bindings from PyPI)
- FAPI configuration file (`/etc/tpm2-tss/fapi-config.json`)
---
## 6. Implementation Phases
### Phase 0: Preparation (1 week)
| Task | Details |
| ------------------------ | ------------------------------------------------------------------------------------------------ |
| Order hardware | Second Jetson Orin Nano dev kit (expendable for fusing experiments) |
| Order Zymkey HSM4 | For tamper enclosure evaluation |
| Download NVIDIA packages | BSP (`jetson_linux_*_aarch64.tbz2`), sample rootfs, public sources, FSKP partner package |
| Set up host | Ubuntu 22.04 LTS on x86 machine, install `libftdi-dev`, `openssh-server`, `python3-cryptography` |
| Study NVIDIA docs | `r36.4.3` Security section: Secure Boot, Disk Encryption, Firmware TPM, FSKP |
### Phase 1: Secure Boot + Disk Encryption (1-2 weeks)
| Task | Details | Validation |
| ----------------------------- | ---------------------------------------------------------- | ------------------------------------------------ |
| Generate PKC + SBK keys | `openssl genrsa` + `gen_sbk_key.py` | Keys exist, correct format |
| Dry-run fuse burning | `odmfuse.sh --test` on expendable dev board | No errors, fuse values logged |
| Burn Secure Boot fuses | `odmfuse.sh` for real (PKC, SBK, SECURITY_MODE) | Device only boots signed images |
| Generate disk encryption keys | `gen_ekb/example.sh` | `sym2_t234.key` + `eks_t234.img` with EEKB magic |
| Flash encrypted rootfs | `ROOTFS_ENC=1 l4t_initrd_flash.sh` | Device boots, `lsblk` shows LUKS partition |
| Validate Secure Boot | Attempt to flash unsigned image → must fail | Unsigned flash rejected |
| Validate disk encryption | Remove NVMe, mount on another machine → must be ciphertext | Cannot read filesystem |
### Phase 2: fTPM Provisioning (1-2 weeks)
| Task | Details | Validation |
| ----------------------------------------- | ---------------------------------------- | ------------------------------------------- |
| Generate KDK0 + Silicon_ID | `kdk_gen.py` per device | KDK_DB populated |
| Generate fuseblob | `fskp_fuseburn.py` | Signed fuseblob files |
| Generate fTPM EKB | `odm_ekb_gen.py` + `oem_ekb_gen.py` | Per-device EKB images |
| Burn fTPM fuses | `odmfuse.sh` with KDK0 fuses | Fuses burned |
| Flash with fTPM EKB | `flash.sh` with EKB | Device boots with fTPM |
| On-device provisioning | `ftpm_provisioning.sh` | EK certificates in NV memory |
| Validate fTPM | `tpm2_getcap properties-fixed` | Shows manufacturer, firmware version |
| Test seal/unseal | `tpm2_create` + `tpm2_unseal` round-trip | Data sealed → unsealed correctly |
| Test seal on device A, unseal on device B | Copy sealed blob between devices | Unseal fails on device B (correct behavior) |
### Phase 3: OS Hardening (1 week)
| Task | Details | Validation |
| ---------------------------- | --------------------------------------------------------------- | --------------------------------------- |
| Create dev image recipe | SSH (key-only), serial console, ptrace allowed, debug tools | Can SSH in, run gdb |
| Create prod image recipe | No SSH, no serial, no ptrace, no shell, no desktop | No interactive access possible |
| Kernel config: lockdown mode | `CONFIG_SECURITY_LOCKDOWN_LSM=y`, `lockdown=confidentiality` | `/dev/mem` access denied, kexec blocked |
| Kernel config: disable debug | `CONFIG_STRICT_DEVMEM=y`, no `/dev/kmem` | Cannot read physical memory |
| Sysctl hardening             | `kernel.yama.ptrace_scope=3`, `kernel.core_pattern=\|/bin/false` | ptrace attach fails, no core dumps     |
| Disable serial console | Remove `console=ttyTCU0` from kernel cmdline | No output on serial |
| Disable getty | Mask `getty@.service`, `serial-getty@.service` | No login prompt on any TTY |
| Sign both images | `flash.sh -u pkc.pem` for dev and prod images | Both boot on fused device |
| Validate prod image | Plug in keyboard, monitor, USB, Ethernet → no access | Device is a black box |
| Validate dev image | Flash dev image → SSH works | Can debug on fused device |
### Phase 4: Loader Code Changes (2-3 weeks)
| Task | Details | Tests |
| -------------------------------------------- | ---------------------------------------------------- | ------------------------------------------------ |
| Add `tpm2-tss`, `tpm2-pytss` to requirements | Match versions available in Jetson BSP | Imports work |
| Add `swtpm` to dev dependencies | TPM simulator for CI/testing | Simulator starts, `/dev/tpm0` available |
| Implement `SecurityProvider` ABC | `security_provider.pxd` + `.pyx` | Interface compiles |
| Implement `TpmSecurityProvider` | FAPI `create_seal`, `unseal`, AES encrypt/decrypt | Seal/unseal round-trip with swtpm |
| Implement `LegacySecurityProvider` | Wrap existing `security.pyx` | All existing tests pass unchanged |
| Add auto-detection logic | `/dev/tpm0` check + env var override | Correct provider selected in both cases |
| Refactor `load_resource` (TPM path) | Single file download + TPM decrypt | Download → decrypt → correct bytes |
| Refactor `upload_resource` (TPM path) | TPM encrypt + single file upload | Encrypt → upload → download → decrypt round-trip |
| Refactor Docker unlock (TPM path) | TPM unseal archive key, no API download | Unlock works without network key fragment |
| Update `docker-compose.yml` | Add `/dev/tpm0`, `/dev/tpmrm0` device mounts | Container can access TPM |
| Update `Dockerfile` | Install `tpm2-tss` native lib + `tpm2-pytss` | Build succeeds |
| Integration tests | Full flow with swtpm: login → load → upload → unlock | All paths work |
| Legacy regression tests | All existing e2e tests pass without TPM | No regression |
### Phase 5: Tamper Enclosure (2-4 weeks, parallel with Phase 4)
| Task | Details | Validation |
| ------------------------- | --------------------------------------------------------------- | --------------------------- |
| Evaluate Zymkey HSM4 | Connect to Orin Nano GPIO header, test I2C communication | Zymkey detected, LED blinks |
| Test perimeter circuits | Wire perimeter inputs, break wire → verify detection | Tamper event logged |
| Test key zeroization | Enable production mode, trigger tamper → verify key destruction | Key gone, device bricked |
| Design tamper mesh panels | KiCad anti-tamper mesh plugin, space-filling curves | Gerber files ready |
| Order flex PCBs | JLCPCB or PCBWay | Panels received |
| Design/source enclosure | Aluminum case, dimensions for Jetson + Zymkey + mesh panels | Enclosure received |
| Assemble prototype | Mount boards, wire mesh to Zymkey perimeter circuits | Physical prototype complete |
| Test tamper scenarios | Open case, drill, probe → all trigger zeroization | All breach paths detected |
| Temperature test | Cool enclosure below threshold → verify trigger | Cold boot attack prevented |
### Phase 6: Integration Testing (1-2 weeks)
| Test Scenario | Expected Result |
| --------------------------------------------------------------------------------- | -------------------------------------------------------- |
| Full stack: fused device + encrypted disk + fTPM + hardened OS + tamper enclosure | Device boots, runs inference, all security layers active |
| Attempt USB boot | Rejected (Secure Boot) |
| Attempt JTAG | No response (fused off) |
| Attempt SSH on prod image | Connection refused (no sshd) |
| Attempt serial console | No output |
| Remove NVMe, read on another machine | Ciphertext only |
| Copy sealed blob to different device | Unseal fails |
| Open tamper enclosure | Keys destroyed, device permanently bricked |
| Legacy device (no TPM) loads resources | Works via LegacySecurityProvider |
| Fused device loads resources | Works via TpmSecurityProvider |
| Docker unlock on TPM device | Works without network key download |
| Docker unlock on legacy device | Works via API key fragment (unchanged) |
### Timeline Summary
```
Week 1 Phase 0: Preparation (order hardware, download BSP)
Week 2-3 Phase 1: Secure Boot + Disk Encryption
Week 4-5 Phase 2: fTPM Provisioning
Week 6 Phase 3: OS Hardening
Week 7-9 Phase 4: Loader Code Changes
Week 7-10 Phase 5: Tamper Enclosure (parallel with Phase 4)
Week 11-12 Phase 6: Integration Testing
```
Total estimated duration: **10-12 weeks** (Phases 4 and 5 overlap).
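The effect of the Phase 4/5 overlap can be checked with a short calculation over the week ranges in the chart above. This is only arithmetic on the listed ranges and covers the upper end of the estimate; the names are illustrative.

```python
# (first_week, last_week) per phase, copied from the timeline chart.
TIMELINE = {
    "Phase 0": (1, 1),
    "Phase 1": (2, 3),
    "Phase 2": (4, 5),
    "Phase 3": (6, 6),
    "Phase 4": (7, 9),
    "Phase 5": (7, 10),
    "Phase 6": (11, 12),
}


def calendar_weeks(timeline):
    """Elapsed calendar time: the last week any phase is still running."""
    return max(end for _, end in timeline.values())


def effort_weeks(timeline):
    """Sum of phase lengths, as if nothing ran in parallel."""
    return sum(end - start + 1 for start, end in timeline.values())


if __name__ == "__main__":
    print("calendar weeks:", calendar_weeks(TIMELINE))
    print("sequential weeks:", effort_weeks(TIMELINE))
```

Run back to back, the phases would take 15 weeks; running Phases 4 and 5 in parallel brings the calendar length down to 12.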
---
## References
- NVIDIA Jetson Linux Developer Guide r36.4.3 — Firmware TPM: [https://docs.nvidia.com/jetson/archives/r36.4.3/DeveloperGuide/SD/Security/FirmwareTPM.html](https://docs.nvidia.com/jetson/archives/r36.4.3/DeveloperGuide/SD/Security/FirmwareTPM.html)
- NVIDIA Jetson Linux Developer Guide — Secure Boot: [https://docs.nvidia.com/jetson/archives/r36.2/DeveloperGuide/SD/Security/SecureBoot.html](https://docs.nvidia.com/jetson/archives/r36.2/DeveloperGuide/SD/Security/SecureBoot.html)
- NVIDIA Jetson Linux Developer Guide — Disk Encryption: [https://docs.nvidia.com/jetson/archives/r38.2.1/DeveloperGuide/SD/Security/DiskEncryption.html](https://docs.nvidia.com/jetson/archives/r38.2.1/DeveloperGuide/SD/Security/DiskEncryption.html)
- NVIDIA Jetson Linux Developer Guide — FSKP: [https://docs.nvidia.com/jetson/archives/r38.4/DeveloperGuide/SD/Security/FSKP.html](https://docs.nvidia.com/jetson/archives/r38.4/DeveloperGuide/SD/Security/FSKP.html)
- tpm2-pytss: [https://github.com/tpm2-software/tpm2-pytss](https://github.com/tpm2-software/tpm2-pytss)
- tpm2-pytss FAPI docs: [https://tpm2-pytss.readthedocs.io/en/latest/fapi.html](https://tpm2-pytss.readthedocs.io/en/latest/fapi.html)
- Zymbit HSM4: [https://www.zymbit.com/HSM4/](https://www.zymbit.com/HSM4/)
- Zymbit HSM4 perimeter detect: [https://docs.zymbit.com/tutorials/perimeter-detect/hsm4](https://docs.zymbit.com/tutorials/perimeter-detect/hsm4)
- KiCad anti-tamper mesh plugin: [https://hackaday.com/2021/03/14/an-anti-tamper-mesh-plugin-for-kicad/](https://hackaday.com/2021/03/14/an-anti-tamper-mesh-plugin-for-kicad/)
- Microchip PolarFire security mesh: [https://www.microchip.com/en-us/about/media-center/blog/2026/security-mesh-distributed-defense-across-your-design](https://www.microchip.com/en-us/about/media-center/blog/2026/security-mesh-distributed-defense-across-your-design)
- DoD GUARD Secure GPU Module: [https://www.cto.mil/wp-content/uploads/2025/04/Secure-Edge.pdf](https://www.cto.mil/wp-content/uploads/2025/04/Secure-Edge.pdf)
- Forecr MILBOX-ORNX (rugged enclosure): [https://forecr.io/products/jetson-orin-nx-orin-nano-rugged-compact-pc-milbox-ornx](https://forecr.io/products/jetson-orin-nx-orin-nano-rugged-compact-pc-milbox-ornx)
## Related Artifacts
- Solution Draft 01: `_docs/02_task_plans/tpm-replaces-binary-split/01_solution/solution_draft01.md`
- Security Analysis: `_docs/02_task_plans/tpm-replaces-binary-split/01_solution/security_analysis.md`
- Fact Cards: `_docs/02_task_plans/tpm-replaces-binary-split/00_research/02_fact_cards.md`
- Reasoning Chain: `_docs/02_task_plans/tpm-replaces-binary-split/00_research/04_reasoning_chain.md`
- Problem Statement: `_docs/02_task_plans/tpm-replaces-binary-split/problem.md`
@@ -0,0 +1,39 @@
# Problem: TPM-Based Security to Replace Binary-Split Resource Scheme
## Context
The Azaion Loader uses a binary-split resource scheme (ADR-002) where encrypted resources are split into a small part (uploaded to the authenticated API) and a large part (uploaded to CDN). Decryption requires both parts. This was designed for distributing AI models to **end-user laptops** where the device is untrusted — the loader shipped 99% of the model in the installer, and the remaining 1% (first 3KB) was downloaded at runtime to prevent extraction.
The distribution model has shifted to **SaaS** — services now run on web servers or **Jetson Orin Nano** edge devices. The Jetson Orin Nano includes a **TPM (Trusted Platform Module)** that can provide hardware-rooted security, potentially making the binary-split mechanism unnecessary overhead.
## Current Security Architecture
- **Binary-split scheme**: Resources encrypted with AES-256-CBC, split into a small part (the smaller of 3KB or 30% of the ciphertext) and a big part, stored on separate servers (API + CDN)
- **Key derivation**: SHA-384 hashes combining email, password, hardware fingerprint, and salt
- **Docker unlock**: Key fragment downloaded from API, used to decrypt encrypted Docker image archive
- **Hardware binding**: SHA-384 hash of hardware fingerprint ties decryption to specific hardware
- **Cython compilation**: Core modules compiled to .so for IP protection
## Questions to Investigate
1. **TPM capabilities on Jetson Orin Nano**: What TPM version is available? What crypto operations does it support (key sealing, attestation, secure storage)? How does NVIDIA's security stack integrate with standard TPM APIs?
2. **TPM-based key management**: Can TPM replace the current key derivation scheme (SHA-384 of email+password+hw_hash+salt)? Can keys be sealed to TPM PCR values so they're only accessible on the intended device?
3. **Eliminating binary-split**: If TPM provides hardware-rooted trust (device can prove it's authentic), is the split-storage security model still necessary? Could the loader become a standard authenticated resource downloader with TPM-backed decryption?
4. **Docker image protection**: Can TPM-based disk encryption or sealed storage replace the current encrypted-archive-plus-key-fragment approach for Docker images?
5. **Migration path**: How would the transition work for existing deployments? Can both models (binary-split for legacy, TPM for new) coexist?
6. **Threat model comparison**: What threats does binary-split protect against that TPM doesn't (and vice versa)? Are there attack vectors specific to Jetson Orin Nano that need consideration?
7. **Implementation complexity**: What libraries/tools are available for TPM on ARM64/Jetson (tpm2-tools, tpm2-pytss, etc.)? What is the integration effort?
## Constraints
- Must support ARM64 (Jetson Orin Nano specifically)
- Must work within Docker containers (loader runs as a container with Docker socket mount)
- Cannot break existing API contracts (F1-F6 flows)
- Cython compilation requirement remains for IP protection
- Need to consider both SaaS web server and Jetson edge device deployments
@@ -2,9 +2,9 @@
**Task**: AZ-187_device_provisioning_script
**Name**: Device Provisioning Script
**Description**: Create a shell script that provisions a Jetson device identity (CompanionPC user) during the fuse/flash pipeline
**Description**: Interactive shell script that provisions Jetson device identities (CompanionPC users) during the fuse/flash pipeline
**Complexity**: 2 points
**Dependencies**: None
**Dependencies**: AZ-196 (POST /devices endpoint)
**Component**: DevOps
**Tracker**: AZ-187
**Epic**: AZ-181
@@ -15,48 +15,47 @@ Each Jetson needs a unique CompanionPC user account for API authentication. This
## Outcome
- Single script creates device identity and embeds credentials in the rootfs
- Integrates into the fuse/flash pipeline between odmfuse.sh and flash.sh
- Interactive `provision_devices.sh` detects connected Jetsons, registers identities via admin API, and runs fuse/flash pipeline
- Serial numbers are auto-assigned server-side (azj-0000, azj-0001, ...)
- Provisioning runbook documents the full end-to-end flow
## Scope
### Included
- provision_device.sh: generate device email (azaion-jetson-{serial}@azaion.com), random 32-char password
- Call admin API POST /users to create Users row with Role=CompanionPC
- Write credentials config file to rootfs image (at known path, e.g., /etc/azaion/device.conf)
- Idempotency: re-running for same serial doesn't create duplicate user
- Provisioning runbook: step-by-step from unboxing through fusing, flashing, and first boot
- `provision_devices.sh`: scan USB for Jetsons in recovery mode, interactive device selection, call admin API `POST /devices` for auto-generated serial/email/password, write credentials to rootfs, fuse, flash
- Configuration via `scripts/.env` (git-ignored), template at `scripts/.env.example`
- Dependency checks at startup (lsusb, curl, jq, L4T tools, sudo)
- Provisioning runbook: step-by-step for multi-device manufacturing flow
### Excluded
- fTPM provisioning (covered by NVIDIA's ftpm_provisioning.sh)
- Secure Boot fusing (covered by solution_draft02 Phase 1-2)
- OS hardening (covered by solution_draft02 Phase 3)
- Admin API user creation endpoint (assumed to exist)
- Admin API POST /devices endpoint implementation (AZ-196)
## Acceptance Criteria
**AC-1: Script creates CompanionPC user**
Given a new device serial AZJN-0042
When provision_device.sh is run with serial AZJN-0042
Then admin API has a new user azaion-jetson-0042@azaion.com with Role=CompanionPC
**AC-1: Script registers device via POST /devices**
Given the admin API has the POST /devices endpoint deployed
When provision_devices.sh is run and a device is selected
Then the admin API creates a new user with auto-assigned serial (e.g. azj-0000) and Role=CompanionPC
**AC-2: Credentials written to rootfs**
Given provision_device.sh completed successfully
When the rootfs image is inspected
Then /etc/azaion/device.conf contains the email and password
Given POST /devices returned serial, email, and password
When the provisioning step completes for a device
Then `$ROOTFS_DIR/etc/azaion/device.conf` contains the email and password with mode 600
**AC-3: Device can log in after flash**
Given a provisioned and flashed device boots for the first time
When the loader reads /etc/azaion/device.conf and calls POST /login
Then a valid JWT is returned
**AC-4: Idempotent re-run**
Given provision_device.sh was already run for serial AZJN-0042
When it is run again for the same serial
Then no duplicate user is created (existing user is reused or updated)
**AC-4: Multi-device support**
Given multiple Jetsons connected in recovery mode
When provision_devices.sh is run
Then the user can select individual devices or all, and each is provisioned sequentially
**AC-5: Runbook complete**
Given the provisioning runbook
When followed step-by-step on a new Jetson Orin Nano
Then the device is fully fused, flashed, provisioned, and can communicate with the admin API
When followed step-by-step on new Jetson Orin Nano devices
Then the devices are fully fused, flashed, provisioned, and can communicate with the admin API
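The credential-writing step in AC-2 can be sketched as below. The real script is a shell script; Python is used here only for illustration, and the `key=value` layout of `device.conf` is an assumption (the task does not specify the file format).

```python
import os


def write_device_conf(rootfs_dir: str, email: str, password: str) -> str:
    """Write etc/azaion/device.conf under rootfs_dir with mode 600."""
    conf_dir = os.path.join(rootfs_dir, "etc", "azaion")
    os.makedirs(conf_dir, exist_ok=True)
    conf_path = os.path.join(conf_dir, "device.conf")
    # Create with restrictive permissions from the start, not after writing.
    fd = os.open(conf_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as conf:
        # ASSUMPTION: simple key=value lines.
        conf.write(f"email={email}\npassword={password}\n")
    # Enforce the mode even if the file already existed with looser bits.
    os.chmod(conf_path, 0o600)
    return conf_path


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as rootfs:
        path = write_device_conf(rootfs, "device@example.com", "example-password")
        print(path, oct(os.stat(path).st_mode & 0o777))
```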
@@ -1,66 +0,0 @@
# Resources Table & Update Check API
**Task**: AZ-183_resources_table_update_api
**Name**: Resources Table & Update Check API
**Description**: Add Resources table to admin API PostgreSQL DB and implement POST /get-update endpoint for fleet OTA updates
**Complexity**: 3 points
**Dependencies**: None
**Component**: Admin API
**Tracker**: AZ-183
**Epic**: AZ-181
## Problem
The fleet update system needs a server-side component that tracks published artifact versions and tells devices what needs updating. CI/CD publishes encrypted artifacts to CDN; the server must store metadata (version, URL, hash, encryption key) and serve it to devices on request.
## Outcome
- Resources table stores per-artifact metadata populated by CI/CD
- Devices call POST /get-update with their current versions and get back only what's newer
- Server-side memory cache handles 2000+ devices polling every 5 minutes without DB pressure
## Scope
### Included
- Resources table migration (resource_name, dev_stage, architecture, version, cdn_url, sha256, encryption_key, size_bytes, created_at)
- POST /get-update endpoint: accepts device's current versions + architecture + dev_stage, returns only newer resources
- Server-side memory cache invalidated on CI/CD publish
- Internal endpoint or direct DB write for CI/CD to publish new resource versions
### Excluded
- CI/CD pipeline changes (AZ-186)
- Loader-side update logic (AZ-185)
- Device provisioning (AZ-187)
## Acceptance Criteria
**AC-1: Resources table created**
Given the admin API database
When the migration runs
Then the Resources table exists with all required columns
**AC-2: Update check returns newer resources**
Given Resources table has annotations version 2026-04-13
When device sends POST /get-update with annotations version 2026-02-25
Then response includes annotations with version, cdn_url, sha256, encryption_key, size_bytes
**AC-3: Current device gets empty response**
Given device already has the latest version of all resources
When POST /get-update is called
Then response is an empty array
**AC-4: Memory cache avoids repeated DB queries**
Given 2000 devices polling every 5 minutes
When POST /get-update is called repeatedly
Then the latest versions are served from memory cache, not from DB on every request
**AC-5: Cache invalidated on publish**
Given a new resource version is published via CI/CD
When the publish endpoint/function completes
Then the next POST /get-update call returns the new version
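The "return only what's newer" rule from AC-2 and AC-3 can be sketched as follows. It assumes versions are ISO-style date strings as in the AC-2 example (so plain string comparison orders them correctly) and that a resource the device does not report counts as needing an update; the function name and the sample URL are hypothetical.

```python
def newer_resources(published, device_versions):
    """Return published resources whose version is newer than the device's.

    published: list of dicts with at least "resource_name" and "version".
    device_versions: dict mapping resource_name -> version held by the device.
    """
    updates = []
    for resource in published:
        current = device_versions.get(resource["resource_name"])
        # ASSUMPTION: ISO dates (YYYY-MM-DD) compare correctly as strings.
        if current is None or resource["version"] > current:
            updates.append(resource)
    return updates


if __name__ == "__main__":
    published = [
        {
            "resource_name": "annotations",
            "version": "2026-04-13",
            "cdn_url": "https://cdn.example.invalid/annotations",  # placeholder
        }
    ]
    print(newer_resources(published, {"annotations": "2026-02-25"}))
    print(newer_resources(published, {"annotations": "2026-04-13"}))
```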
## Constraints
- Must integrate with existing admin API (linq2db + PostgreSQL)
- encryption_key column must be stored securely (encrypted at rest in DB or via application-level encryption)
- Response must include encryption_key only over HTTPS with valid JWT
@@ -12,7 +12,7 @@ Implemented the loader's security modernization features across 2 batches:
### Batch 1 (10 points)
- **AZ-182** TPM Security Provider — SecurityProvider ABC with TPM/legacy detection, FAPI seal/unseal, graceful fallback
- **AZ-184** Resumable Download Manager — HTTP Range resume, SHA-256 verify, AES-256 decrypt, exponential backoff
- **AZ-187** Device Provisioning Script — provision_device.sh + runbook
- **AZ-187** Device Provisioning Script — provision_devices.sh + runbook
### Batch 2 (8 points)
- **AZ-185** Update Manager — background update loop, version collector, model + Docker image apply, self-update last