[AZ-187] Rules & cleanup

Made-with: Cursor
Oleksandr Bezdieniezhnykh
2026-04-17 18:54:04 +03:00
parent cfed26ff8c
commit d883fdb3cc
33 changed files with 1917 additions and 515 deletions
+10
@@ -0,0 +1,10 @@
---
description: Rules for installation and provisioning scripts
globs: scripts/**/*.sh
alwaysApply: false
---
# Automation Scripts
- Automate everything that can be automated. If a dependency can be downloaded and installed, do it automatically — never require the user to manually download and set up prerequisites.
- Use sensible defaults for paths and configuration (e.g. `/opt/` for system-wide tools). Allow overrides via environment variables for users who need non-standard locations.
+20 -11
@@ -1,17 +1,17 @@
---
description: "Enforces concise, comment-free, environment-aware coding standards with strict scope discipline and test verification"
description: "Enforces readable, environment-aware coding standards with scope discipline, meaningful comments, and test verification"
alwaysApply: true
---
# Coding preferences
- Always prefer simple solution
- Prefer the simplest solution that satisfies all requirements, including maintainability. When in doubt between two approaches, choose the one with fewer moving parts — but never sacrifice correctness, error handling, or readability for brevity.
- Follow the Single Responsibility Principle — a class or method should have one reason to change:
- If a method is hard to name precisely from the caller's perspective, its responsibility is misplaced. Vague names like "candidate", "data", or "item" are a signal — fix the design, not just the name.
- Logic specific to a platform, variant, or environment belongs in the class that owns that variant, not in the general coordinator. Passing a dependency through is preferable to leaking variant-specific concepts into shared code.
- Only use static methods for pure, self-contained computations (constants, simple math, stateless lookups). If a static method involves resource access, side effects, OS interaction, or logic that varies across subclasses or environments — use an instance method or factory class instead. Before implementing a non-trivial static method, ask the user.
- Generate concise code
- Avoid boilerplate and unnecessary indirection, but never sacrifice readability for brevity.
- Never suppress errors silently — no `2>/dev/null`, empty `catch` blocks, bare `except: pass`, or discarded error returns. These hide the information you need most when something breaks. If an error is truly safe to ignore, log it or comment why.
- Do not put comments in the code, except in tests: every test must use the Arrange / Act / Assert pattern with language-appropriate comment syntax (`# Arrange` for Python, `// Arrange` for C#/Rust/JS/TS). Omit any section that is not needed (e.g. if there is no setup, skip Arrange; if act and assert are the same line, keep only Assert)
- Do not put logs unless it is an exception, or was asked specifically
- Do not add comments that merely narrate what the code does. Comments are appropriate for: non-obvious business rules, workarounds with references to issues/bugs, safety invariants, and public API contracts. Make comments as short and concise as possible. Exception: every test must use the Arrange / Act / Assert pattern with language-appropriate comment syntax (`# Arrange` for Python, `// Arrange` for C#/Rust/JS/TS). Omit any section that is not needed (e.g. if there is no setup, skip Arrange; if act and assert are the same line, keep only Assert)
- Do not add verbose debug/trace logs by default. Log exceptions, security events (auth failures, permission denials), and business-critical state transitions. Add debug-level logging only when asked.
- Do not put code annotations unless it was asked specifically
- Write code that takes into account the different environments: development, production
- Only make changes that are requested, or that you are confident are well understood and related to the requested change
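As a minimal sketch of the error-handling and test-comment rules above, the Python example below logs a rejected value instead of swallowing it, and lays out its tests with Arrange / Act / Assert comments. The `parse_port` helper, the `DEFAULT_PORT` value, and the logger are assumptions made for illustration; they do not come from the rules file.

```python
import logging

logger = logging.getLogger(__name__)

DEFAULT_PORT = 8080  # hypothetical default, for illustration only


def parse_port(raw: str) -> int:
    """Return the port in `raw`, or DEFAULT_PORT when `raw` is not a number."""
    try:
        port = int(raw)
    except ValueError:
        # Falling back is safe here, but the bad value is logged, not hidden.
        logger.warning("Invalid port %r, falling back to %d", raw, DEFAULT_PORT)
        return DEFAULT_PORT
    if not 0 < port < 65536:
        raise ValueError(f"Port out of range: {port}")
    return port


def test_parse_port_invalid_text_returns_default():
    # Arrange
    raw = "not-a-number"

    # Act
    port = parse_port(raw)

    # Assert
    assert port == DEFAULT_PORT


def test_parse_port_valid_text_returns_number():
    # Assert
    assert parse_port("5432") == 5432
```

The second test has no setup and its act and assert share one line, so only the Assert comment is kept.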
@@ -22,16 +22,25 @@ alwaysApply: true
- When a test fails due to a missing dependency, install it — do not fake or stub the module system. For normal packages, add them to the project's dependency file (requirements-test.txt, package.json devDependencies, test csproj, etc.) and install. Only consider stubbing if the dependency is heavy (e.g. hardware-specific SDK, large native toolchain) — and even then, ask the user first before choosing to stub.
- Do not solve environment or infrastructure problems (dependency resolution, import paths, service discovery, connection config) by hardcoding workarounds in source code. Fix them at the environment/configuration level.
- Before writing new infrastructure or workaround code, check how the existing codebase already handles the same concern. Follow established project patterns.
- If a file, class, or function has no remaining usages — delete it. Do not keep dead code "just in case"; git history preserves everything. Dead code rots: its dependencies drift, it misleads readers, and it breaks when the code it depends on evolves.
- If a file, class, or function has no remaining usages — delete it. Dead code rots: its dependencies drift, it misleads readers, and it breaks when the code it depends on evolves. However, before deletion verify that the symbol is not used via any of the following. If any applies, do NOT delete — leave it or ASK the user:
- Public API surface exported from the package and potentially consumed outside the workspace (see `workspace-boundary.mdc`)
- Reflection, dependency injection, or service registration (scan DI container registrations, `appsettings.json` / equivalent config, attribute-based discovery, plugin manifests)
- Dynamic dispatch from config/data (YAML/JSON references, string-based class lookups, route tables, command dispatchers)
- Test fixtures used only by currently-skipped tests — temporary skips may become active again
- Cross-repo references — if this workspace is part of a multi-repo system, grep sibling repos for shared contracts before deleting
- Focus on the areas of code relevant to the task
- Do not touch code that is unrelated to the task
- Always think about what other methods and areas of code might be affected by the code changes
- When you think you are done with changes, run the full test suite. Every failure — including pre-existing ones, collection errors, and import errors — is a **blocking gate**. Never silently ignore, skip, or proceed past a failing test. On any failure, stop and ask the user to choose one of:
- **Scope discipline**: focus edits on the task scope. The "scope" is:
- Files the task explicitly names
- Files that define interfaces the task changes
- Files that directly call, implement, or test the changed code
- **Adjacent hygiene is permitted** without asking: fixing imports you caused to break, updating obvious stale references within a file you already modify, deleting code that became dead because of your change.
- **Unrelated issues elsewhere**: do not silently fix them as part of this task. Either note them to the user at end of turn and ASK before expanding scope, or record in `_docs/_process_leftovers/` for later handling.
- Always think about what other methods and areas of code might be affected by the code changes, and surface the list to the user before modifying.
- When you think you are done with changes, run the full test suite. Every failure in tests that cover code you modified or that depend on code you modified is a **blocking gate**. For pre-existing failures in unrelated areas, report them to the user but do not block on them. Never silently ignore or skip a failure without reporting it. On any blocking failure, stop and ask the user to choose one of:
- **Investigate and fix** the failing test or source code
- **Remove the test** if it is obsolete or no longer relevant
- Do not rename any databases or tables or table columns without confirmation. Avoid such renaming if possible.
- Do not commit binaries: create a `.gitignore`, keep it up to date, and delete generated binaries when the task is done
- Never force-push to main or dev branches
- Place all source code under the `src/` directory; keep project-level config, tests, and tooling at the repo root
- For new projects, place source code under `src/` (this works for all stacks including .NET). For existing projects, follow the established directory structure. Keep project-level config, tests, and tooling at the repo root.
+14
@@ -23,3 +23,17 @@ globs: [".cursor/**"]
## Security
- All `.cursor/` files must be scanned for hidden Unicode before committing (see cursor-security.mdc)
## Quality Thresholds (canonical reference)
All rules and skills must reference the single source of truth below. Do NOT restate different numeric thresholds in individual rule or skill files.
| Concern | Threshold | Enforcement |
|---------|-----------|-------------|
| Test coverage on business logic | 75% | Aim (warn below); 100% on critical paths |
| Test scenario coverage (vs AC + restrictions) | 75% | Blocking in test-spec Phase 1 and Phase 3 |
| CI coverage gate | 75% | Fail build below |
| Lint errors (Critical/High) | 0 | Blocking pre-commit |
| Code-review auto-fix | Low + Medium (Style/Maint/Perf) + High (Style/Scope) | Critical and Security always escalate |
When a skill or rule needs to cite a threshold, link to this table instead of hardcoding a different number.
+3 -2
@@ -5,6 +5,7 @@ alwaysApply: true
# Git Workflow
- Work on the `dev` branch
- Commit message format: `[TRACKER-ID-1] [TRACKER-ID-2] Summary of changes`
- Commit message total length must not exceed 30 characters
- Commit message subject line format: `[TRACKER-ID-1] [TRACKER-ID-2] Summary of changes`
- Subject line must not exceed 72 characters (standard Git convention for the first line). The 72-char limit applies to the subject ONLY, not the full commit message.
- A commit message body is optional. Add one when the subject alone cannot convey the why of the change. Wrap the body at 72 chars per line.
- Do NOT push or merge unless the user explicitly asks you to. Always ask first if there is a need.
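The subject-line and body rules above could be checked mechanically. The following Python sketch is one possible checker, not part of the repository: the `check_commit_message` name and the tracker-ID regular expression (uppercase project key, dash, number) are assumptions.

```python
import re

SUBJECT_LIMIT = 72
BODY_WRAP = 72
# Assumed ID shape: one or more bracketed IDs such as [AZ-187], then a summary.
SUBJECT_PATTERN = re.compile(r"^(\[[A-Z][A-Z0-9]*-\d+\] )+\S.*$")


def check_commit_message(message: str) -> list[str]:
    """Return a list of rule violations; an empty list means the message passes."""
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    problems = []
    if not SUBJECT_PATTERN.match(subject):
        problems.append("subject must look like '[TRACKER-ID] Summary of changes'")
    if len(subject) > SUBJECT_LIMIT:
        problems.append(f"subject is {len(subject)} chars, limit is {SUBJECT_LIMIT}")
    # The 72-char subject limit and the 72-char body wrap are checked separately.
    for number, line in enumerate(lines[1:], start=2):
        if len(line) > BODY_WRAP:
            problems.append(f"line {number} is {len(line)} chars, wrap at {BODY_WRAP}")
    return problems
```

For example, `check_commit_message("[AZ-187] Rules & cleanup")` returns an empty list, while a subject without a bracketed ID returns one violation.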
+33 -11
@@ -4,21 +4,43 @@ alwaysApply: true
---
# Sound Notification on Human Input
Whenever you are about to ask the user a question, request confirmation, present options for a decision, or otherwise pause and wait for human input, you MUST first run the appropriate shell command for the current OS:
## Sound commands per OS
Detect the OS from user system info or `uname -s`:
- **macOS**: `afplay /System/Library/Sounds/Glass.aiff &`
- **Linux**: `paplay /usr/share/sounds/freedesktop/stereo/bell.oga 2>/dev/null || aplay /usr/share/sounds/freedesktop/stereo/bell.oga 2>/dev/null || echo -e '\a' &`
- **Windows (PowerShell)**: `[System.Media.SystemSounds]::Exclamation.Play()`
Detect the OS from the user's system info or by running `uname -s` if unknown.
## When to play (play exactly once per trigger)
This applies to:
- Asking clarifying questions
- Presenting choices (e.g. via AskQuestion tool)
- Requesting approval for destructive actions
- Reporting that you are blocked and need guidance
- Any situation where the conversation will stall without user response
- Completing a task (final answer / deliverable ready for review)
Play the sound when your turn will end in one of these states:
Do NOT play the sound when:
- You are in the middle of executing a multi-step task and just providing a status update
1. You are about to call the AskQuestion tool — sound BEFORE the AskQuestion call
2. Your text ends with a direct question to the user that cannot be answered without their input (e.g., "Which option do you prefer?", "What is the database name?", "Confirm before I push?")
3. You are reporting that you are BLOCKED and cannot continue without user input (missing credentials, conflicting requirements, external approval required)
4. You have just completed a destructive or irreversible action the user asked to review (commit, push, deploy, data migration, file deletion)
## When NOT to play
- You are mid-execution and returning a progress update (the conversation is not stalling)
- You are answering a purely informational or factual question and no follow-up is required
- You have already played the sound once this turn for the same pause point
- Your response only contains text describing what you did or found, with no question, no block, no irreversible action
## "Trivial" definition
A response is trivial (no sound) when ALL of the following are true:
- No explicit question to the user
- No "I am blocked" report
- No destructive/irreversible action that needs review
If any one of those is present, the response is non-trivial — play the sound.
## Ordering
The sound command is a normal Shell tool call. Place it:
- **Immediately before an AskQuestion tool call** in the same message, or
- **As the last Shell call of the turn** if ending with a text-based question, block report, or post-destructive-action review
Do not play the sound as part of routine command execution — only at the pause points listed under "When to play".
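A rough Python sketch of the decision described above: pick the command for the detected OS, and play only when the response is non-trivial and the sound has not already been played for this pause point. The command strings are copied from the list above; the function names and boolean parameters are assumptions made for illustration.

```python
import platform
from typing import Optional

# Keys match the values returned by platform.system().
SOUND_COMMANDS = {
    "Darwin": "afplay /System/Library/Sounds/Glass.aiff &",
    "Linux": (
        "paplay /usr/share/sounds/freedesktop/stereo/bell.oga 2>/dev/null"
        " || aplay /usr/share/sounds/freedesktop/stereo/bell.oga 2>/dev/null"
        " || echo -e '\\a' &"
    ),
    "Windows": "[System.Media.SystemSounds]::Exclamation.Play()",
}


def sound_command(system: Optional[str] = None) -> Optional[str]:
    """Return the shell command for `system`, or None for an unknown OS."""
    return SOUND_COMMANDS.get(system or platform.system())


def should_play(
    has_question: bool,
    is_blocked: bool,
    needs_review: bool,
    already_played: bool = False,
) -> bool:
    """A response is trivial (no sound) only when all three triggers are absent."""
    if already_played:
        return False
    return bool(has_question or is_blocked or needs_review)
```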
+22 -10
@@ -5,7 +5,7 @@ alwaysApply: true
# Agent Meta Rules
## Execution Safety
- Never run test suites, builds, Docker commands, or other long-running/resource-heavy/security-risky operations without asking the user first — unless it is explicitly stated in a skill or agent, or the user already asked to do so.
- Run the full test suite automatically when you believe code changes are complete (as required by coderule.mdc). For other long-running/resource-heavy/security-risky operations (builds, Docker commands, deployments, performance tests), ask the user first — unless explicitly stated in a skill or the user already asked to do so.
## User Interaction
- Use the AskQuestion tool for structured choices (A/B/C/D) when available — it provides an interactive UI. Fall back to plain-text questions if the tool is unavailable.
@@ -33,18 +33,30 @@ When the user reacts negatively to generated code ("WTF", "what the hell", "why
- "Before writing new infrastructure or workaround code, check how the existing codebase already handles the same concern. Follow established project patterns."
## Debugging Over Contemplation
When the root cause of a bug is not clear after ~5 minutes of reasoning, analysis, and assumption-making — **stop speculating and add debugging logs**. Observe actual runtime behavior before forming another theory. The pattern to follow:
Agents cannot measure wall-clock time between turns. Use observable counts from your own transcript instead.
**Trigger: stop speculating and instrument.** When you've formed **3 or more distinct hypotheses** about a bug without confirming any against runtime evidence (logs, stderr, debugger state, actual test failure messages) — stop and add debugging output. Re-reading the same code hoping to "spot it this time" counts as a new hypothesis that still has zero evidence.
Steps:
1. Identify the last known-good boundary (e.g., "request enters handler") and the known-bad result (e.g., "callback never fires").
2. Add targeted `print(..., flush=True)` or log statements at each intermediate step to narrow the gap.
3. Read the output. Let evidence drive the next step — not inference chains built on unverified assumptions.
2. Add targeted `print(..., flush=True)`, `console.error`, or logger statements at each intermediate step to narrow the gap.
3. Run the instrumented code. Read the output. Let evidence drive the next hypothesis — not inference chains.
Prolonged mental contemplation without evidence is a time sink. A 15-minute instrumented run beats 45 minutes of "could it be X? but then Y... unless Z..." reasoning.
An instrumented run producing real output beats any amount of "could it be X? but then Y..." reasoning.
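To illustrate the steps above, here is a small Python sketch that places a flushed checkpoint at each intermediate step between a known-good boundary and a known-bad result. The `handle_request` function, its payload keys, and the `trace` helper are hypothetical; only the `print(..., flush=True)` technique comes from the rule.

```python
import sys


def trace(step: str, **state) -> None:
    """Print one checkpoint to stderr immediately, so it survives a crash or hang."""
    details = " ".join(f"{key}={value!r}" for key, value in state.items())
    print(f"[trace] {step} {details}", file=sys.stderr, flush=True)


def handle_request(payload: dict) -> str:
    # Last known-good boundary: the request reaches the handler.
    trace("request enters handler", keys=sorted(payload))
    user_id = payload.get("user_id")
    trace("user id extracted", user_id=user_id)
    callback = payload.get("on_done")
    trace("callback resolved", has_callback=callback is not None)
    if callback is None:
        return "skipped"
    callback(user_id)
    # If this line never prints, the gap is between the previous checkpoint and here.
    trace("callback fired", user_id=user_id)
    return "done"
```

The last checkpoint that appears in the output marks where the gap begins, which gives the next hypothesis something concrete to start from.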
## Long Investigation Retrospective
When a problem takes significantly longer than expected (>30 minutes), perform a post-mortem before closing out:
1. **Identify the bottleneck**: Was the delay caused by assumptions that turned out wrong? Missing visibility into runtime state? Incorrect mental model of a framework or language boundary?
2. **Extract the general lesson**: What category of mistake was this? (e.g., "Python cannot call Cython `cdef` methods", "engine errors silently swallowed", "wrong layer to fix the problem")
3. **Propose a preventive rule**: Formulate it as a short, actionable statement. Present it to the user for approval.
4. **Write it down**: Add the approved rule to the appropriate `.mdc` file so it applies to all future sessions.
Trigger a post-mortem when ANY of the following is true (all are observable in your own transcript):
- **10+ tool calls** were used to diagnose a single issue
- **Same file modified 3+ times** without tests going green
- **3+ distinct approaches** attempted before arriving at the fix
- Any phrase like "let me try X instead" appeared **more than twice**
- A fix was eventually found by reading docs/source the agent had dismissed earlier
Post-mortem steps:
1. **Identify the bottleneck**: wrong assumption? missing runtime visibility? incorrect mental model of a framework/language boundary? ignored evidence?
2. **Extract the general lesson**: what category of mistake was this? (e.g., "Python cannot call Cython `cdef` methods", "engine errors silently swallowed", "wrong layer to fix the problem")
3. **Propose a preventive rule**: short, actionable. Present to user for approval.
4. **Write it down**: add approved rule to the appropriate `.mdc` so it applies to future sessions.
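Because every trigger above is a count observable in the transcript, the check can be expressed as a simple predicate. The Python sketch below is illustrative only; the `InvestigationStats` field names are assumptions, while the thresholds mirror the trigger list.

```python
from dataclasses import dataclass


@dataclass
class InvestigationStats:
    diagnostic_tool_calls: int = 0
    edits_to_same_file_without_green: int = 0
    distinct_approaches: int = 0
    try_instead_phrases: int = 0
    fix_found_in_dismissed_source: bool = False


def needs_post_mortem(stats: InvestigationStats) -> bool:
    """Return True when ANY trigger from the list fires."""
    return any(
        [
            stats.diagnostic_tool_calls >= 10,
            stats.edits_to_same_file_without_green >= 3,
            stats.distinct_approaches >= 3,
            stats.try_instead_phrases > 2,  # "more than twice"
            stats.fix_found_in_dismissed_source,
        ]
    )
```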
+1 -1
@@ -8,7 +8,7 @@ globs: ["**/*test*", "**/*spec*", "**/*Test*", "**/tests/**", "**/test/**"]
- One assertion per test when practical; name tests descriptively: `MethodName_Scenario_ExpectedResult`
- Test boundary conditions, error paths, and happy paths
- Use mocks only for external dependencies; prefer real implementations for internal code
- Aim for 80%+ coverage on business logic; 100% on critical paths
- Aim for 75%+ coverage on business logic; 100% on critical paths (code paths where a bug would cause data loss, security breaches, financial errors, or system outages — identify from acceptance criteria marked as must-have or from security_approach.md). The 75% threshold is canonical — see `cursor-meta.mdc` Quality Thresholds.
- Integration tests use real database (Postgres testcontainers or dedicated test DB)
- Never use `Thread.Sleep` or fixed delays in tests; use polling or async waits
- Keep test data factories/builders for reusable test setup
+36
@@ -12,3 +12,39 @@ alwaysApply: true
- Project name: AZAION
- All task IDs follow the format `AZ-<number>`
- Issue types: Epic, Story, Task, Bug, Subtask
## Tracker Availability Gate
- If Jira MCP returns **Unauthorized**, **errored**, **connection refused**, or any non-success response: **STOP** tracker operations and notify the user.
- The user must fix the Jira MCP connection before any further ticket creation/transition/query is attempted.
- Do NOT silently create local-only tasks, skip Jira steps, or pretend the write succeeded. The tracker is the source of truth — if a status transition is lost, the team loses visibility.
## Leftovers Mechanism (non-user-input blockers only)
When a **non-user** blocker prevents a tracker write (MCP down, network error, transient failure, ticket linkage recoverable later), record the deferred write in `_docs/_process_leftovers/<YYYY-MM-DD>_<topic>.md` and continue non-tracker work. Each entry must include:
- Timestamp (ISO 8601)
- What was blocked (ticket creation, status transition, comment, link)
- Full payload that would have been written (summary, description, story points, epic, target status) — so the write can be replayed later
- Reason for the blockage (MCP unavailable, auth expired, unknown epic ID pending user clarification, etc.)
### Hard gates that CANNOT be deferred to leftovers
Anything requiring user input MUST still block:
- Clarifications about requirements, scope, or priority
- Approval for destructive actions or irreversible changes
- Choice between alternatives (A/B/C decisions)
- Confirmation of assumptions that change task outcome
If a blocker of this kind appears, STOP and ASK — do not write to leftovers.
### Replay obligation
At the start of every `/autopilot` invocation, and before any new tracker write in any skill, check `_docs/_process_leftovers/` for pending entries. For each entry:
1. Attempt to replay the deferred write against the tracker
2. If replay succeeds → delete the leftover entry
3. If replay still fails → update the entry's timestamp and reason, continue
4. If the blocker now requires user input (e.g., MCP still down after N retries) → surface to the user
Autopilot must not progress past its own step 0 until all leftovers that CAN be replayed have been replayed.
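The record-and-replay cycle described above might look roughly like the Python sketch below. It is a sketch under stated assumptions: the rule fixes the directory, the file-name pattern, and the required fields, but storing each entry as JSON text with one entry per file, the function names, and the `write_to_tracker` callable are all hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable

LEFTOVERS_DIR = Path("_docs/_process_leftovers")


def record_leftover(topic: str, blocked: str, payload: dict, reason: str,
                    directory: Path = LEFTOVERS_DIR) -> Path:
    """Write one deferred tracker write so it can be replayed later."""
    now = datetime.now(timezone.utc)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{now:%Y-%m-%d}_{topic}.md"
    entry = {
        "timestamp": now.isoformat(),  # ISO 8601
        "blocked": blocked,
        "payload": payload,            # full payload, needed for replay
        "reason": reason,
    }
    path.write_text(json.dumps(entry, indent=2), encoding="utf-8")
    return path


def replay_leftovers(write_to_tracker: Callable[[dict], None],
                     directory: Path = LEFTOVERS_DIR) -> list[Path]:
    """Replay every pending entry; return the entries that are still pending."""
    still_pending: list[Path] = []
    if not directory.exists():
        return still_pending
    for path in sorted(directory.glob("*.md")):
        entry = json.loads(path.read_text(encoding="utf-8"))
        try:
            write_to_tracker(entry["payload"])
        except Exception as error:
            # Replay failed: refresh timestamp and reason, keep the entry.
            entry["timestamp"] = datetime.now(timezone.utc).isoformat()
            entry["reason"] = str(error)
            path.write_text(json.dumps(entry, indent=2), encoding="utf-8")
            still_pending.append(path)
        else:
            path.unlink()  # replay succeeded, so the leftover is deleted
    return still_pending
```

Deciding when a still-pending entry must be surfaced to the user is left out, since the rule does not fix the retry count.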
+5
@@ -55,6 +55,11 @@ After selecting the flow, apply its detection rules (first match wins) to determ
Every invocation follows this sequence:
```
0. Process leftovers (see `.cursor/rules/tracker.mdc` → Leftovers Mechanism):
- Read _docs/_process_leftovers/ if it exists
- For each entry, attempt replay against the tracker
- Delete successful replays, update failed ones with new timestamp + reason
- If any leftover still blocked AND requires user input → STOP and ASK
1. Read _docs/_autopilot_state.md (if exists)
2. Read all File Index files above
3. Cross-check state file against _docs/ folder structure (rules in state.md)
+26 -13
@@ -28,7 +28,7 @@ The `implementer` agent is the specialist that writes all the code — it receiv
- **Integrated review**: `/code-review` skill runs automatically after each batch
- **Auto-start**: batches launch immediately — no user confirmation before a batch
- **Gate on failure**: user confirmation is required only when code review returns FAIL
- **Commit and push per batch**: after each batch is confirmed, commit and push to remote
- **Commit per batch**: after each batch is confirmed, commit. Ask the user whether to push to remote unless the user previously opted into auto-push for this session.
## Context Resolution
@@ -134,25 +134,38 @@ Only proceed to Step 9 when every AC has a corresponding test.
### 10. Auto-Fix Gate
Auto-fix loop with bounded retries (max 2 attempts) before escalating to user:
Bounded auto-fix loop — only applies to **mechanical** findings. Critical and Security findings are never auto-fixed.
1. If verdict is **PASS** or **PASS_WITH_WARNINGS**: show findings as info, continue automatically to step 11
2. If verdict is **FAIL** (attempt 1 or 2):
- Parse the code review findings (Critical and High severity items)
- For each finding, attempt an automated fix using the finding's location, description, and suggestion
- Re-run `/code-review` on the modified files
- If now PASS or PASS_WITH_WARNINGS → continue to step 11
- If still FAIL → increment retry counter, repeat from (2) up to max 2 attempts
3. If still **FAIL** after 2 auto-fix attempts: present all findings to user (**BLOCKING**). User must confirm fixes or accept before proceeding.
**Auto-fix eligibility matrix:**
Track `auto_fix_attempts` count in the batch report for retrospective analysis.
| Severity | Category | Auto-fix? |
|----------|----------|-----------|
| Low | any | yes |
| Medium | Style, Maintainability, Performance | yes |
| Medium | Bug, Spec-Gap, Security | escalate |
| High | Style, Scope | yes |
| High | Bug, Spec-Gap, Performance, Maintainability | escalate |
| Critical | any | escalate |
| any | Security | escalate |
### 11. Commit and Push
Flow:
1. If verdict is **PASS** or **PASS_WITH_WARNINGS**: show findings as info, continue to step 11
2. If verdict is **FAIL**:
- Partition findings into auto-fix-eligible and escalate (using the matrix above)
- For eligible findings, attempt fixes using location/description/suggestion, then re-run `/code-review` on modified files (max 2 rounds)
- If all remaining findings are auto-fix-eligible and re-review now passes → continue to step 11
- If any non-eligible finding exists at any point → stop auto-fixing, present the full list to the user (**BLOCKING**)
3. User must explicitly approve each non-auto-fix finding (accept, request manual fix, mark as out-of-scope) before proceeding.
Track `auto_fix_attempts` and `escalated_findings` in the batch report for retrospective analysis.
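The eligibility matrix above can be read as a small decision function. The Python sketch below is one way to encode it; the shape of a finding (a dict with `severity` and `category` keys) and the function names are assumptions, and any combination the matrix does not list (for example Medium + Scope) is treated as escalate.

```python
AUTO_FIX_MEDIUM = {"Style", "Maintainability", "Performance"}
AUTO_FIX_HIGH = {"Style", "Scope"}


def is_auto_fixable(severity: str, category: str) -> bool:
    """Return True only for combinations the matrix marks as auto-fixable."""
    # Critical and Security always escalate, whatever the other field says.
    if severity == "Critical" or category == "Security":
        return False
    if severity == "Low":
        return True
    if severity == "Medium":
        return category in AUTO_FIX_MEDIUM
    if severity == "High":
        return category in AUTO_FIX_HIGH
    return False  # unknown severity: escalate rather than guess


def partition_findings(findings: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split findings into (auto-fix-eligible, escalated)."""
    eligible = [f for f in findings if is_auto_fixable(f["severity"], f["category"])]
    escalated = [f for f in findings if not is_auto_fixable(f["severity"], f["category"])]
    return eligible, escalated
```

If the escalated list is non-empty at any point, the flow above stops auto-fixing and presents the full list to the user.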
### 11. Commit (and optionally Push)
- After user confirms the batch (explicitly for FAIL, implicitly for PASS/PASS_WITH_WARNINGS):
- `git add` all changed files from the batch
- `git commit` with a message that includes ALL task IDs (tracker IDs or numeric prefixes) of tasks implemented in the batch, followed by a summary of what was implemented. Format: `[TASK-ID-1] [TASK-ID-2] ... Summary of changes`
- `git push` to the remote branch
- Ask the user whether to push to remote, unless the user previously opted into auto-push for this session
### 12. Update Tracker Status → In Testing
+1 -1
@@ -119,7 +119,7 @@ Read and follow `steps/07_quality-checklist.md`.
|-----------|--------|
| Missing acceptance_criteria.md, restrictions.md, or input_data/ | **STOP** — planning cannot proceed |
| Ambiguous requirements | ASK user |
| Input data coverage below 70% | Search internet for supplementary data, ASK user to validate |
| Input data coverage below 75% | Search internet for supplementary data, ASK user to validate |
| Technology choice with multiple valid options | ASK user |
| Component naming | PROCEED, confirm at next BLOCKING gate |
| File structure within templates | PROCEED |
@@ -32,3 +32,17 @@
6. Applicable scenarios
7. Team capability requirements
8. Migration difficulty
## Decomposition Completeness Probes (Completeness Audit Reference)
Used during Step 1's Decomposition Completeness Audit. After generating sub-questions, ask each probe against the current decomposition. If a probe reveals an uncovered area, add a sub-question for it.
| Probe | What it catches |
|-------|-----------------|
| **What does this cost — in money, time, resources, or trade-offs?** | Budget, pricing, licensing, tax, opportunity cost, maintenance burden |
| **What are the hard constraints — physical, legal, regulatory, environmental?** | Regulations, certifications, spectrum/frequency rules, export controls, physics limits, IP restrictions |
| **What are the dependencies and assumptions that could break?** | Supply chain, vendor lock-in, API stability, single points of failure, standards evolution |
| **What does the operating environment actually look like?** | Terrain, weather, connectivity, infrastructure, power, latency, user skill level |
| **What failure modes exist and what happens when they trigger?** | Degraded operation, fallback, safety margins, blast radius, recovery time |
| **What do practitioners who solved similar problems say matters most?** | Field-tested priorities that don't appear in specs or papers |
| **What changes over time — and what looks stable now but isn't?** | Technology roadmaps, regulatory shifts, deprecation risk, scaling effects |
@@ -10,6 +10,12 @@
- [ ] Every citation can be directly verified by the user (source verifiability)
- [ ] Structure hierarchy is clear; executives can quickly locate information
## Decomposition Completeness
- [ ] Domain discovery search executed: searched "key factors when [problem domain]" before starting research
- [ ] Completeness probes applied: every probe from `references/comparison-frameworks.md` checked against sub-questions
- [ ] No uncovered areas remain: all gaps filled with sub-questions or justified as not applicable
## Internet Search Depth
- [ ] Every sub-question was searched with at least 3-5 different query variants
@@ -97,6 +97,16 @@ When decomposing questions, you must explicitly define the **boundaries of the r
**Common mistake**: User asks about "university classroom issues" but sources include policies targeting "K-12 students" — mismatched target populations will invalidate the entire research.
#### Decomposition Completeness Audit (MANDATORY)
After generating sub-questions, verify the decomposition covers all major dimensions of the problem — not just the ones that came to mind first.
1. **Domain discovery search**: Search the web for "key factors when [problem domain]" / "what to consider when [problem domain]" (e.g., "key factors GPS-denied navigation", "what to consider when choosing an edge deployment strategy"). Extract dimensions that practitioners and domain experts consider important but are absent from the current sub-questions.
2. **Run completeness probes**: Walk through each probe in `references/comparison-frameworks.md` → "Decomposition Completeness Probes" against the current sub-question list. For each probe, note whether it is covered, not applicable (state why), or missing.
3. **Fill gaps**: Add sub-questions (with search query variants) for any uncovered area. Do this before proceeding to Step 2.
Record the audit result in `00_question_decomposition.md` as a "Completeness Audit" section.
**Save action**:
1. Read all files from INPUT_DIR to ground the research in the project context
2. Create working directory `RESEARCH_DIR/`
@@ -109,6 +119,7 @@ When decomposing questions, you must explicitly define the **boundaries of the r
- List of decomposed sub-questions
- **Chosen perspectives** (at least 3 from the Perspective Rotation table) with rationale
- **Search query variants** for each sub-question (at least 3-5 per sub-question)
- **Completeness audit** (completeness probe results + domain discovery results)
4. Write TodoWrite to track progress
---
+35 -21
@@ -102,32 +102,46 @@ After investigating, present:
- If user picks A → apply fixes, then re-run (loop back to step 2)
- If user picks B → return failure to the autopilot
**Any test skipped**: this is also a **blocking gate**. Skipped tests mean something is wrong with the test, the environment, or the test design. **Never blindly remove a skipped test.** Always investigate the root cause first.
**Any skipped test**: classify as legitimate or illegitimate before deciding whether to block.
#### Investigation Protocol for Skipped Tests
#### Legitimate skips (accept and proceed)
For each skipped test:
The code path genuinely cannot execute on this runner. Acceptable reasons:
1. **Read the test code** — understand what the test is supposed to verify and why it skips.
2. **Determine the root cause** — why did the skip condition fire?
- Is the test environment misconfigured? (e.g., wrong ports, missing env vars, service not started correctly)
- Is the test ordering wrong? (e.g., a fixture in an earlier test mutates shared state)
- Is a dependency missing? (e.g., package not installed, fixture file absent)
- Is the skip condition outdated? (e.g., code was refactored but the skip guard still checks the old behavior)
- Is the test fundamentally untestable in the current setup? (e.g., requires Docker restart, different OS, special hardware)
3. **Try to fix the root cause first** — the goal is to make the test run, not to delete it:
- Fix the environment or configuration
- Reorder tests or isolate shared state
- Install the missing dependency
- Update the skip condition to match current behavior
4. **Only remove as last resort** — if the test truly cannot run in any realistic test environment (e.g., requires hardware not available, duplicates another test with identical assertions), then removal is justified. Document the reasoning.
- Hardware not physically present (GPU, Apple Neural Engine, sensor, serial device)
- Operating system mismatch (Darwin-only test on Linux CI, Windows-only test on macOS)
- Feature-flag-gated test whose feature is intentionally disabled in this environment
- External service the project deliberately does not control (e.g., a third-party API with no sandbox, and the project has a documented contract test instead)
#### Categorization
For legitimate skips: verify the skip condition is accurate (the test would run if the hardware/OS were present), verify it has a clear reason string, and proceed.
- **explicit skip (dead code)**: Has `@pytest.mark.skip` — investigate whether the reason in the decorator is still valid. Often these are temporary skips that became permanent by accident.
- **runtime skip (unreachable)**: `pytest.skip()` fires inside the test body — investigate why the condition always triggers. Often fixable by adjusting test order, environment, or the condition itself.
- **environment mismatch**: Test assumes a different environment — investigate whether the test environment setup can be fixed.
- **missing fixture/data**: Data or service not available — investigate whether it can be provided.
#### Illegitimate skips (BLOCKING — must resolve)
The skip is a workaround for something we can and should fix. NOT acceptable reasons:
- Required service not running (database, message broker, downstream API we control) → fix: bring the service up, add a docker-compose dependency, or add a mock
- Missing test fixture, seed data, or sample file → fix: provide the data, generate it, or ASK the user for it
- Missing environment variable or credential → fix: add to `.env.example`, document, ASK user for the value
- Flaky-test quarantine with no tracking ticket → fix: create the ticket (or replay via leftovers if tracker is down)
- Inherited skip from a prior refactor that was never cleaned up → fix: clean it up now
- Test ordering mutates shared state → fix: isolate the state
**Rule of thumb**: if the reason for skipping is "we didn't set something up," that's not a valid skip — set it up. If the reason is "this hardware/OS isn't here," that's valid.
#### Resolution steps for illegitimate skips
1. Classify the skip (read the skip reason and test body)
2. If the fix is **mechanical** — start a container, install a dep, add a mock, reorder fixtures — attempt it automatically and re-run
3. If the fix requires **user input** — credentials, sample data, a business decision — BLOCK and ASK
4. Never silently mark the skip as "accepted" — every illegitimate skip must either be fixed or escalated
5. Removal is a last resort and requires explicit user approval with documented reasoning
#### Categorization cheatsheet
- **explicit skip (e.g. `@pytest.mark.skip`)**: check whether the reason in the decorator is still valid
- **conditional skip (e.g. `@pytest.mark.skipif`)**: check whether the condition is accurate and whether we can change the environment to make it false
- **runtime skip (e.g. `pytest.skip()` in body)**: check why the condition fires — often an ordering or environment bug
- **missing fixture/data**: treated as illegitimate unless user confirms the data is unavailable
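
A minimal sketch of what a legitimate skip from the list above might look like: the condition is an accurate OS check and the reason string is explicit. It uses the standard-library `unittest` so it runs without extra dependencies; the pytest counterpart is `@pytest.mark.skipif(condition, reason="...")`. The test class and helper names are hypothetical and only for illustration.

```python
import sys
import unittest


class DarwinOnlyTest(unittest.TestCase):
    # Legitimate skip: OS mismatch, accurate condition, clear reason string.
    @unittest.skipUnless(sys.platform == "darwin", "requires macOS (Darwin-only API)")
    def test_darwin_feature(self):
        self.assertEqual(sys.platform, "darwin")


def run_and_collect_skips(test_case):
    """Run a TestCase class and return (result, list of skip reasons)."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case)
    result = unittest.TestResult()
    suite.run(result)
    return result, [reason for _test, reason in result.skipped]
```

Collecting the reason strings this way makes it easy to review each skip against the legitimate/illegitimate lists instead of accepting it silently.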
After investigating, present findings:
+31 -20
@@ -27,8 +27,11 @@ Analyze input data completeness and produce detailed black-box test specificatio
- **Save immediately**: write artifacts to disk after each phase; never accumulate unsaved work
- **Ask, don't assume**: when requirements are ambiguous, ask the user before proceeding
- **Spec, don't code**: this workflow produces test specifications, never test implementation code
- **No test without data**: every test scenario MUST have concrete test data; tests without data are removed
- **No test without expected result**: every test scenario MUST pair input data with a quantifiable expected result; a test that cannot compare actual output against a known-correct answer is not verifiable and must be removed
- **Every test must have a pass/fail criterion**. Two acceptable shapes:
- **Input/output shape**: concrete input data paired with a quantifiable expected result (exact value, tolerance, threshold, pattern, reference file). Typical for functional blackbox tests, performance tests with load data, data-processing pipelines.
- **Behavioral shape**: a trigger condition + observable system behavior + quantifiable pass/fail criterion, with no input data required. Typical for startup/shutdown tests, retry/backoff policies, state transitions, logging/metrics emission, resilience scenarios. Example criteria: "startup logs `service ready` within 5s", "retry emits 3 attempts with exponential backoff (base 100ms ± 20ms)", "on SIGTERM, service drains in-flight requests within 30s grace period", "health endpoint returns 503 while migrations run".
- For behavioral tests the observable (log line, metric value, state transition, emitted event, elapsed time) must still be quantifiable — the test must programmatically decide pass/fail.
- A test that cannot produce a pass/fail verdict through either shape is not verifiable and must be removed.
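
To illustrate what "programmatically decide pass/fail" could mean for a behavioral test, here is a hedged sketch that evaluates the retry/backoff example criterion. It assumes the ±20ms tolerance applies to every delay and that the backoff doubles each time (factor 2); neither is stated in the rule, and the function name is hypothetical. It is meant only to show that the criterion is quantifiable; the skill itself still produces specifications rather than test code.

```python
def check_retry_backoff(attempt_times_ms, expected_attempts=3,
                        base_ms=100.0, tolerance_ms=20.0, factor=2.0):
    """Return True if the observed attempts match the expected backoff schedule.

    attempt_times_ms: timestamps (in ms) at which each attempt was observed.
    Assumption: delay i is expected to be base_ms * factor**i, within tolerance_ms.
    """
    if len(attempt_times_ms) != expected_attempts:
        return False
    # Gaps between consecutive attempts are the observed backoff delays.
    gaps = [later - earlier
            for earlier, later in zip(attempt_times_ms, attempt_times_ms[1:])]
    return all(abs(gap - base_ms * factor ** i) <= tolerance_ms
               for i, gap in enumerate(gaps))
```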
## Context Resolution
@@ -177,7 +180,7 @@ At the start of execution, create a TodoWrite with all four phases. Update statu
|------------|--------------------------|---------------|----------------|
| [file/data] | Yes/No | Yes/No | [missing, vague, no tolerance, etc.] |
9. Threshold: at least 70% coverage of scenarios AND every covered scenario has a quantifiable expected result (see `.cursor/rules/cursor-meta.mdc` Quality Thresholds table)
9. Threshold: at least 75% coverage of scenarios AND every covered scenario has a quantifiable expected result (see `.cursor/rules/cursor-meta.mdc` Quality Thresholds table)
10. If coverage is low, search the internet for supplementary data, assess quality with user, and if user agrees, add to `input_data/` and update `input_data/expected_results/results_report.md`
11. If expected results are missing or not quantifiable, ask user to provide them before proceeding
@@ -232,18 +235,26 @@ Capture any new questions, findings, or insights that arise during test specific
### Phase 3: Test Data Validation Gate (HARD GATE)
**Role**: Professional Quality Assurance Engineer
**Goal**: Ensure every test scenario produced in Phase 2 has concrete, sufficient test data. Remove tests that lack data. Verify final coverage stays above 70%.
**Goal**: Ensure every test scenario produced in Phase 2 has concrete, sufficient test data. Remove tests that lack data. Verify final coverage stays above 75%.
**Constraints**: This phase is MANDATORY and cannot be skipped.
#### Step 1 — Build the test-data and expected-result requirements checklist
#### Step 1 — Build the requirements checklist
Scan `blackbox-tests.md`, `performance-tests.md`, `resilience-tests.md`, `security-tests.md`, and `resource-limit-tests.md`. For every test scenario, extract:
Scan `blackbox-tests.md`, `performance-tests.md`, `resilience-tests.md`, `security-tests.md`, and `resource-limit-tests.md`. For every test scenario, classify its shape (input/output or behavioral) and extract:
**Input/output tests:**
| # | Test Scenario ID | Test Name | Required Input Data | Required Expected Result | Result Quantifiable? | Comparison Method | Input Provided? | Expected Result Provided? |
|---|-----------------|-----------|---------------------|-------------------------|---------------------|-------------------|----------------|--------------------------|
| 1 | [ID] | [name] | [data description] | [what system should output] | [Yes/No] | [exact/tolerance/pattern/threshold] | [Yes/No] | [Yes/No] |
Present this table to the user.
**Behavioral tests:**
| # | Test Scenario ID | Test Name | Trigger Condition | Observable Behavior | Pass/Fail Criterion | Quantifiable? |
|---|-----------------|-----------|-------------------|--------------------|--------------------|---------------|
| 1 | [ID] | [name] | [e.g., service receives SIGTERM] | [e.g., drain logs emitted, port closed] | [e.g., drain completes ≤30s] | [Yes/No] |
Present both tables to the user.
#### Step 2 — Ask user to provide missing test data AND expected results
@@ -315,20 +326,20 @@ After all removals, recalculate coverage:
**Decision**:
- **Coverage ≥ 70%** → Phase 3 **PASSED**. Present final summary to user.
- **Coverage < 70%** → Phase 3 **FAILED**. Report:
> ❌ Test coverage dropped to **X%** (minimum 70% required). The removed test scenarios left gaps in the following acceptance criteria / restrictions:
- **Coverage ≥ 75%** → Phase 3 **PASSED**. Present final summary to user.
- **Coverage < 75%** → Phase 3 **FAILED**. Report:
> ❌ Test coverage dropped to **X%** (minimum 75% required). The removed test scenarios left gaps in the following acceptance criteria / restrictions:
>
> | Uncovered Item | Type (AC/Restriction) | Missing Test Data Needed |
> |---|---|---|
>
> **Action required**: Provide the missing test data for the items above, or add alternative test scenarios that cover these items with data you can supply.
**BLOCKING**: Loop back to Step 2 with the uncovered items. Do NOT finalize until coverage ≥ 70%.
**BLOCKING**: Loop back to Step 2 with the uncovered items. Do NOT finalize until coverage ≥ 75%.
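
The gate decision can be sketched as a small function. The coverage formula is not shown in this hunk, so the sketch assumes coverage is the share of acceptance criteria and restrictions that still have at least one valid test; the function name and parameters are hypothetical.

```python
def coverage_gate(total_items, covered_items, threshold=0.75):
    """Return (coverage, passed) for the Phase 3 coverage check.

    Assumption: coverage = covered AC/restriction items / total items.
    """
    if total_items <= 0:
        raise ValueError("total_items must be positive")
    coverage = covered_items / total_items
    return coverage, coverage >= threshold
```

With 20 items, 15 covered items pass exactly at the threshold, while 14 fail and trigger the loop back to Step 2.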
#### Phase 3 Completion
When coverage ≥ 70% and all remaining tests have validated data AND quantifiable expected results:
When coverage ≥ 75% and all remaining tests have validated data AND quantifiable expected results:
1. Present the final coverage report
2. List all removed tests (if any) with reasons
@@ -479,23 +490,23 @@ Create `scripts/run-performance-tests.sh` at the project root. The script must:
| Missing acceptance_criteria.md, restrictions.md, or input_data/ | **STOP** — specification cannot proceed |
| Missing input_data/expected_results/results_report.md | **STOP** — ask user to provide expected results mapping using the template |
| Ambiguous requirements | ASK user |
| Input data coverage below 70% (Phase 1) | Search internet for supplementary data, ASK user to validate |
| Input data coverage below 75% (Phase 1) | Search internet for supplementary data, ASK user to validate |
| Expected results missing or not quantifiable (Phase 1) | ASK user to provide quantifiable expected results before proceeding |
| Test scenario conflicts with restrictions | ASK user to clarify intent |
| System interfaces unclear (no architecture.md) | ASK user or derive from solution.md |
| Test data or expected result not provided for a test scenario (Phase 3) | WARN user and REMOVE the test |
| Final coverage below 70% after removals (Phase 3) | BLOCK — require user to supply data or accept reduced spec |
| Final coverage below 75% after removals (Phase 3) | BLOCK — require user to supply data or accept reduced spec |
## Common Mistakes
- **Referencing internals**: tests must be black-box — no internal module names, no direct DB queries against the system under test
- **Vague expected outcomes**: "works correctly" is not a test outcome; use specific measurable values
- **Missing expected results**: input data without a paired expected result is useless — the test cannot determine pass/fail without knowing what "correct" looks like
- **Non-quantifiable expected results**: "should return good results" is not verifiable; expected results must have exact values, tolerances, thresholds, or pattern matches that code can evaluate
- **Missing pass/fail criterion**: input/output tests without an expected result, OR behavioral tests without a measurable observable — both are unverifiable and must be removed
- **Non-quantifiable criteria**: "should return good results", "works correctly", "behaves properly" — not verifiable. Use exact values, tolerances, thresholds, pattern matches, or timing bounds that code can evaluate.
- **Forcing the wrong shape**: do not invent fake input data for a behavioral test (e.g., "input: SIGTERM signal") just to fit the input/output shape. Classify the test correctly and use the matching checklist.
- **Missing negative scenarios**: every positive scenario category should have corresponding negative/edge-case tests
- **Untraceable tests**: every test should trace to at least one AC or restriction
- **Writing test code**: this skill produces specifications, never implementation code
- **Tests without data**: every test scenario MUST have concrete test data AND a quantifiable expected result; a test spec without either is not executable and must be removed
## Trigger Conditions
@@ -516,7 +527,7 @@ When the user wants to:
│ → verify AC, restrictions, input_data (incl. expected_results.md) │
│ │
│ Phase 1: Input Data & Expected Results Completeness Analysis │
│ → assess input_data/ coverage vs AC scenarios (≥70%) │
│ → assess input_data/ coverage vs AC scenarios (≥75%) │
│ → verify every input has a quantifiable expected result │
│ → present input→expected-result pairing assessment │
│ [BLOCKING: user confirms input data + expected results coverage] │
@@ -538,8 +549,8 @@ When the user wants to:
│ → validate input data (quality + quantity) │
│ → validate expected results (quantifiable + comparison method) │
│ → remove tests without data or expected result, warn user │
│ → final coverage check (≥70% or FAIL + loop back) │
│ [BLOCKING: coverage ≥ 70% required to pass] │
│ → final coverage check (≥75% or FAIL + loop back) │
│ [BLOCKING: coverage ≥ 75% required to pass] │
│ │
│ Phase 4: Test Runner Script Generation │
│ → detect test runner + docker-compose + load tool │
+1
@@ -13,3 +13,4 @@ test-results/
Logs/
*.enc
*.o
scripts/.env
@@ -0,0 +1,46 @@
# Acceptance Criteria Assessment
## Acceptance Criteria
| Criterion | Our Values | Researched Values | Cost/Timeline Impact | Status |
|-----------|-----------|-------------------|---------------------|--------|
| AC1: AI models not extractable | Binary-split: model split across API+CDN, requires both keys to reconstruct | TPM: models encrypted with device-sealed key, only decryptable on provisioned hardware. Industry standard for edge AI (SecEdge, NVIDIA Zero-Trust). Stronger guarantee than split-storage. | Medium — requires fTPM provisioning in manufacturing pipeline | Modified |
| AC2: Device authentication | Email/password → JWT → hardware-hashed key derivation | TPM attestation: device proves identity via EK certificate. Can coexist with existing JWT auth. Stronger — hardware fuse-derived, not software-computed. | Low — additive to existing auth | Modified |
| AC3: Keys bound to hardware | SHA-384(email+password+hw_hash+salt) from subprocess-collected CPU/GPU info | TPM-sealed keys bound to device fuses (MB2 bootloader seed). Significantly stronger — cannot be replicated by spoofing hardware strings. | Low — TPM key sealing replaces software key derivation | Modified |
| AC4: Existing API contracts preserved | F1-F6 flows must not break | Achievable — TPM changes are internal to the loader's security layer. API endpoints and contracts remain the same. | None | Unchanged |
| AC5: ARM64 Jetson Orin Nano support | Required | fTPM available on all Orin series (JetPack 6.1+). Python tooling (tpm2-pytss) supports ARM64. | None — natively supported | Unchanged |
| AC6: Works inside Docker containers | Docker socket mount | TPM accessible via --device /dev/tpm0 --device /dev/tpmrm0. No --privileged needed. | Low — add device mounts to docker-compose | Unchanged |
| AC7: Cython compilation remains | .pyx → .so for IP protection | tpm2-pytss is pure Python calling native tpm2-tss. Can be wrapped in Cython modules same as existing crypto code. | Low | Unchanged |
| AC8: Migration path exists | N/A (new requirement) | TPM+standard download and legacy binary-split can coexist via feature flag. TPM-provisioned devices use sealed keys; non-provisioned use legacy scheme. | Medium — dual code path during transition | Added |
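
A minimal sketch of the dual code path described for AC8, assuming a boolean feature flag and a separate "device is provisioned" check. Both inputs and the scheme names are hypothetical; the assessment does not specify how either is determined.

```python
def select_key_scheme(tpm_flag_enabled: bool, tpm_provisioned: bool) -> str:
    """Pick the key scheme for this device during the migration period.

    TPM-sealed keys are used only when the feature flag is on AND the device
    has been provisioned; everything else stays on the legacy scheme.
    """
    if tpm_flag_enabled and tpm_provisioned:
        return "tpm_sealed"
    return "legacy_binary_split"
```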
## Restrictions Assessment
| Restriction | Our Values | Researched Values | Cost/Timeline Impact | Status |
|-------------|-----------|-------------------|---------------------|--------|
| R1: ARM64 Jetson Orin Nano | Hard requirement | fTPM fully supported on Orin Nano (JetPack 6.1+) | None | Unchanged |
| R2: Docker container | Socket mount for Docker-in-Docker | TPM device mount is separate from Docker socket. Both can coexist. | None | Unchanged |
| R3: fTPM provisioning at manufacturing | N/A (new) | Only offline provisioning supported (per-device during manufacturing). Requires: KDK0 gen, fuse burn, EK cert via CA, EKB encoding. This is a significant operational requirement. | High — new manufacturing step | Added |
| R4: fTPM maturity concerns | N/A (new) | PCR persistence issues reported on forums (PCR7 not resetting, NV handles lost after reboot). Not production-hardened for all use cases yet. | Medium — risk of instability | Added |
| R5: SaaS + Edge dual deployment | Both SaaS web servers and Jetson edge | TPM is machine-specific. Works perfectly for fixed edge devices. For SaaS/cloud VMs, need vTPM or alternative key management. Dual strategy may be needed. | Medium — different security models per deployment type | Added |
## Key Findings
1. **fTPM on Jetson Orin Nano is real and capable** — JetPack 6.1+ provides TPM 2.0 with hardware root of trust from device fuses. The security guarantees are stronger than the current software-computed hash-based scheme.
2. **Binary-split can be simplified but not immediately eliminated** — TPM provides device-bound encryption (model only decryptable on provisioned hardware). This makes the split-storage model unnecessary for the anti-extraction threat. However, the CDN offloading benefit of big/small split (bandwidth optimization) is orthogonal to security.
3. **Manufacturing pipeline impact is significant** — fTPM provisioning requires per-device fuse burning and EK certificate enrollment during manufacturing. This is a business process change, not just a code change.
4. **Known stability issues** — Forum reports of PCR values and NV handles not persisting across reboots. This needs investigation before production reliance.
5. **Docker integration is straightforward** — Device mount, no privileged mode needed. Python tooling (tpm2-pytss) is mature and supports the required Python version.
6. **Dual deployment model needs consideration** — Jetson edge devices get TPM. SaaS web servers likely don't have TPM. Need a strategy that works for both.
## Sources
- NVIDIA Jetson Linux Developer Guide r36.4.4 (L1)
- NVIDIA JetPack 6.1 Blog (L2)
- NVIDIA Developer Forums — PCR/NV persistence issues (L4)
- tpm2-pytss GitHub/PyPI (L1)
- SecEdge/TCG — Edge AI Trusted Computing (L3)
- DevOps StackExchange — Docker TPM access (L4)
@@ -0,0 +1,73 @@
# Question Decomposition
## Original Question
Can TPM-based security on Jetson Orin Nano replace the binary-split resource scheme, simplifying the loader to a standard authenticated resource downloader?
## Active Mode
Mode A Phase 1 — AC & Restrictions Assessment
## Question Type
Decision Support — weighing trade-offs of TPM vs binary-split security models
## Problem Context Summary
The Azaion Loader uses a binary-split scheme (ADR-002) designed for untrusted end-user laptops. The deployment model shifted to SaaS/Jetson Orin Nano edge devices where TPM provides hardware-rooted trust. The question is whether TPM makes binary-split obsolete.
## Research Subject Boundary Definition
| Dimension | Boundary |
|-----------|----------|
| Population | Jetson Orin Nano edge devices running containerized AI workloads |
| Geography | Global (no geographic restriction) |
| Timeframe | JetPack 6.1+ (July 2024 onwards, when fTPM was introduced) |
| Level | Production deployment (not development/prototyping) |
## Sub-Questions
### SQ1: What are the fTPM capabilities on Jetson Orin Nano?
- "Jetson Orin Nano TPM capabilities security JetPack 6.1"
- "NVIDIA fTPM OP-TEE architecture Orin"
- "Jetson Orin TPM 2.0 key sealing PCR operations"
- "Jetson fTPM provisioning manufacturing process"
- "Jetson Orin fTPM limitations known issues forums"
### SQ2: Can TPM-sealed keys replace the current key derivation scheme?
- "TPM key sealing vs SHA-384 key derivation comparison"
- "tpm2-pytss seal unseal Python example"
- "TPM sealed key Docker container access /dev/tpm0"
- "TPM hardware-bound encryption key management edge AI"
### SQ3: Is the binary-split storage model still needed with TPM?
- "binary split key fragment security model vs TPM hardware root of trust"
- "AI model protection TPM-based vs split storage"
- "edge device model protection TPM encryption vs distributed key"
- "when is split-key security necessary vs hardware security module"
### SQ4: What's the migration path?
- "TPM security migration coexist legacy encryption"
- "gradual TPM adoption edge devices existing fleet"
### SQ5: What are the implementation requirements?
- "tpm2-pytss ARM64 Jetson Linux Docker"
- "Jetson Orin fTPM LUKS disk encryption Docker container"
- "TPM2 tools Cython integration"
## Chosen Perspectives
1. **Implementer/Engineer**: Technical integration complexity, library maturity, Docker constraints, Cython compatibility
2. **Domain Expert (Security)**: Threat model comparison, attack surface analysis, defense-in-depth considerations
3. **Practitioner**: Real-world fTPM experiences on Jetson, known issues, production readiness
## Timeliness Sensitivity Assessment
- **Research Topic**: fTPM on Jetson Orin Nano for AI model protection
- **Sensitivity Level**: High
- **Rationale**: NVIDIA Jetson ecosystem updates frequently; fTPM introduced in JetPack 6.1 (July 2024); PCR persistence issues reported
- **Source Time Window**: 12 months
- **Priority official sources**:
1. NVIDIA Jetson Linux Developer Guide (r36.4.4+)
2. TCG TPM 2.0 Specification
3. tpm2-software GitHub (tpm2-tss, tpm2-tools, tpm2-pytss)
- **Key version information to verify**:
- JetPack: 6.1+ (r36.4+)
- tpm2-pytss: latest (supports Python 3.11)
- tpm2-tss: 2.4.0+
@@ -0,0 +1,139 @@
# Source Registry
## Source #1
- **Title**: NVIDIA JetPack 6.1 — fTPM Introduction Blog
- **Link**: https://developer.nvidia.com/blog/nvidia-jetpack-6-1-boosts-performance-and-security-through-camera-stack-optimizations-and-introduction-of-firmware-tpm/
- **Tier**: L2
- **Publication Date**: 2024-07
- **Timeliness Status**: Currently valid
- **Version Info**: JetPack 6.1
- **Target Audience**: Jetson developers/OEMs
- **Research Boundary Match**: Full match
- **Summary**: fTPM introduced in JetPack 6.1 for Orin series; provides device attestation and secure key storage without discrete TPM hardware.
- **Related Sub-question**: SQ1
## Source #2
- **Title**: Firmware TPM — NVIDIA Jetson Linux Developer Guide (r36.4.4)
- **Link**: https://docs.nvidia.com/jetson/archives/r36.4.4/DeveloperGuide/SD/Security/FirmwareTPM.html
- **Tier**: L1
- **Publication Date**: 2025-06
- **Timeliness Status**: Currently valid
- **Version Info**: r36.4.4 / JetPack 6.1
- **Target Audience**: Jetson device manufacturers and developers
- **Research Boundary Match**: Full match
- **Summary**: Comprehensive fTPM docs: architecture (OP-TEE + TrustZone), provisioning (offline only), PCR measured boot, key derivation from hardware fuse, EK certificate management. Per-device unique seed from MB2 bootloader.
- **Related Sub-question**: SQ1, SQ2
## Source #3
- **Title**: Security — NVIDIA Jetson Linux Developer Guide (r36.4.3)
- **Link**: https://docs.nvidia.com/jetson/archives/r36.4.3/DeveloperGuide/SD/Security.html
- **Tier**: L1
- **Publication Date**: 2025
- **Timeliness Status**: Currently valid
- **Version Info**: r36.4.3
- **Target Audience**: Jetson device manufacturers and developers
- **Research Boundary Match**: Full match
- **Summary**: Overview of Jetson security: Secure Boot, Disk Encryption (LUKS), OP-TEE, fTPM. Chain of trust from BootROM through fuses.
- **Related Sub-question**: SQ1, SQ3
## Source #4
- **Title**: Access ftpm pcr registers — NVIDIA Developer Forums
- **Link**: https://forums.developer.nvidia.com/t/access-ftpm-pcr-registers/328636
- **Tier**: L4
- **Publication Date**: 2024-2025
- **Timeliness Status**: Currently valid
- **Version Info**: JetPack 6.x / Debian-based
- **Target Audience**: Jetson Orin Nano developers
- **Research Boundary Match**: Full match
- **Summary**: Users report PCR7 values not persisting/resetting across reboots when using fTPM for disk encryption. Issues with cryptsetup integration.
- **Related Sub-question**: SQ1, SQ5
## Source #5
- **Title**: fTPM handles don't persist after reboot — NVIDIA Developer Forums
- **Link**: https://forums.developer.nvidia.com/t/ftpm-handles-dont-persist-after-a-reboot/344424
- **Tier**: L4
- **Publication Date**: 2024-2025
- **Timeliness Status**: Currently valid
- **Target Audience**: Jetson Orin NX developers
- **Research Boundary Match**: Full match (same Orin family)
- **Summary**: fTPM NV handles not persisting across reboots on Orin NX. Suggests broader persistence issues across Orin variants.
- **Related Sub-question**: SQ1, SQ5
## Source #6
- **Title**: Accessing TPM from inside a Docker Container — DevOps StackExchange
- **Link**: https://devops.stackexchange.com/questions/8509/accessing-tpm-from-inside-a-docker-container
- **Tier**: L4
- **Publication Date**: Various
- **Timeliness Status**: Currently valid
- **Target Audience**: DevOps engineers
- **Research Boundary Match**: Partial overlap (general Docker, not Jetson-specific)
- **Summary**: Mount /dev/tpm0 and /dev/tpmrm0 via --device flag. TPM is for key wrapping, not storage. Machine-specific binding.
- **Related Sub-question**: SQ2, SQ5
## Source #7
- **Title**: Docker container accessing virtual TPM — Medium
- **Link**: https://medium.com/@eng.fernandosilva/docker-container-accessing-virtual-tpm-device-from-vm-running-on-windows-11-hyper-v-6c1bbb0f0c5d
- **Tier**: L3
- **Publication Date**: 2024
- **Timeliness Status**: Currently valid
- **Target Audience**: Docker/DevOps practitioners
- **Research Boundary Match**: Partial overlap (Windows vTPM, but Docker access patterns apply)
- **Summary**: Docker --device /dev/tpm0:/dev/tpm0 --device /dev/tpmrm0:/dev/tpmrm0 for TPM access. No --privileged needed for device-based access.
- **Related Sub-question**: SQ5
## Source #8
- **Title**: Securing Edge AI through Trusted Computing — SecEdge/TCG Blog
- **Link**: https://www.secedge.com/tcg-blog-securing-edge-ai-through-trusted-computing/
- **Tier**: L3
- **Publication Date**: 2024-2025
- **Timeliness Status**: Currently valid
- **Target Audience**: Edge AI security architects
- **Research Boundary Match**: Full match
- **Summary**: TPM-based device trust for edge AI: device-bound encryption, model binding to specific hardware, attestation. Addresses unauthorized copying, tampering, and cloning threats.
- **Related Sub-question**: SQ3
## Source #9
- **Title**: tpm2-software/tpm2-pytss — GitHub
- **Link**: https://github.com/tpm2-software/tpm2-pytss
- **Tier**: L1
- **Publication Date**: 2026-02 (last update)
- **Timeliness Status**: Currently valid
- **Version Info**: Latest, supports Python 3.10-3.14
- **Target Audience**: Python developers using TPM
- **Research Boundary Match**: Full match
- **Summary**: Python bindings for TPM2 TSS. ESAPI, FAPI, marshaling support. Requires tpm2-tss >= 2.4.0. Available on PyPI.
- **Related Sub-question**: SQ5
## Source #10
- **Title**: Building a Zero-Trust Architecture for Confidential AI Factories — NVIDIA Blog
- **Link**: https://developer.nvidia.com/blog/building-a-zero-trust-architecture-for-confidential-ai-factories/
- **Tier**: L2
- **Publication Date**: 2024-2025
- **Timeliness Status**: Currently valid
- **Target Audience**: AI infrastructure architects
- **Research Boundary Match**: Reference only (cloud/data center focus, not edge)
- **Summary**: Zero-trust with TEEs and attestation for AI model protection. Hardware-enforced trust, model binding, three-way trust dilemma. Industry direction for AI model security.
- **Related Sub-question**: SQ3
## Source #11
- **Title**: OP-TEE — NVIDIA Jetson Linux Developer Guide (r36.4.4)
- **Link**: https://docs.nvidia.com/jetson/archives/r36.4.4/DeveloperGuide/SD/Security/OpTee.html
- **Tier**: L1
- **Publication Date**: 2025
- **Timeliness Status**: Currently valid
- **Version Info**: r36.4.4
- **Target Audience**: Jetson developers building Trusted Applications
- **Research Boundary Match**: Full match
- **Summary**: OP-TEE on Jetson Orin: TrustZone-based TEE, Client Application ↔ Trusted Application communication via libteec, crypto services available. Custom TAs can be built.
- **Related Sub-question**: SQ1, SQ2
## Source #12
- **Title**: LUKS Full Disk Encryption on Jetson Orin Nano — Piveral
- **Link**: https://nvidia-jetson.piveral.com/jetson-orin-nano/implementing-password-protected-luks-full-disk-encryption-on-jetson-orin-nano/
- **Tier**: L3
- **Publication Date**: 2024-2025
- **Timeliness Status**: Currently valid
- **Target Audience**: Jetson Orin Nano practitioners
- **Research Boundary Match**: Full match
- **Summary**: LUKS encryption on Orin Nano. Default auto-decrypt on boot defeats purpose. Must modify LUKS service for password prompts. gen_luks_passphrase script for key generation.
- **Related Sub-question**: SQ2, SQ5
@@ -0,0 +1,161 @@
# Fact Cards
## Fact #1
- **Statement**: Jetson Orin Nano series has firmware TPM (fTPM) support, introduced in JetPack 6.1 (July 2024). It implements TPM 2.0 via the TCG reference implementation running in OP-TEE.
- **Source**: Source #1, #2
- **Phase**: Phase 1
- **Target Audience**: Jetson Orin Nano developers
- **Confidence**: ✅ High
- **Related Dimension**: TPM capability
## Fact #2
- **Statement**: The fTPM seed is derived from hardware fuses by the MB2 secure bootloader. It is a per-device, unique, secure value — establishing hardware root of trust.
- **Source**: Source #2
- **Phase**: Phase 1
- **Target Audience**: Jetson Orin Nano developers
- **Confidence**: ✅ High
- **Related Dimension**: Hardware binding strength
## Fact #3
- **Statement**: fTPM provisioning currently supports offline method only (per-device during manufacturing). Online provisioning "will be available in a future release" (as of r36.4.4).
- **Source**: Source #2
- **Phase**: Phase 1
- **Target Audience**: Jetson device manufacturers
- **Confidence**: ✅ High
- **Related Dimension**: Implementation complexity
## Fact #4
- **Statement**: fTPM provisioning requires: per-device KDK0 generation, fuse burning, EK certificate generation via CA server, EKB encoding. This is a manufacturing-time process.
- **Source**: Source #2
- **Phase**: Phase 1
- **Target Audience**: Jetson device manufacturers
- **Confidence**: ✅ High
- **Related Dimension**: Implementation complexity
## Fact #5
- **Statement**: Users report fTPM PCR register values (specifically PCR7) not persisting/resetting correctly across reboots on Jetson Orin Nano with Debian-based systems.
- **Source**: Source #4
- **Phase**: Phase 1
- **Target Audience**: Jetson Orin Nano users attempting disk encryption
- **Confidence**: ⚠️ Medium (forum reports, not officially confirmed as bug vs. misconfiguration)
- **Related Dimension**: Production readiness
## Fact #6
- **Statement**: fTPM NV handles don't persist after reboot on Jetson Orin NX, suggesting broader persistence issues across the Orin family.
- **Source**: Source #5
- **Phase**: Phase 1
- **Target Audience**: Jetson Orin developers
- **Confidence**: ⚠️ Medium (forum reports from multiple users)
- **Related Dimension**: Production readiness
## Fact #7
- **Statement**: Docker containers can access host TPM via --device /dev/tpm0:/dev/tpm0 --device /dev/tpmrm0:/dev/tpmrm0. No --privileged flag needed for device-based mount.
- **Source**: Source #6, #7
- **Phase**: Phase 1
- **Target Audience**: Docker/container developers
- **Confidence**: ✅ High
- **Related Dimension**: Docker integration
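
As a hedged illustration of Fact #7, a container entrypoint could verify that both device nodes were actually mounted before trying to use the TPM. This is a sketch under the assumption that the paths inside the container match the host paths; the helper name is hypothetical.

```python
import os
import stat

# Device nodes expected inside the container when started with
# --device /dev/tpm0:/dev/tpm0 --device /dev/tpmrm0:/dev/tpmrm0
TPM_DEVICE_PATHS = ("/dev/tpm0", "/dev/tpmrm0")


def missing_tpm_devices(paths=TPM_DEVICE_PATHS):
    """Return the paths that are absent or are not character devices."""
    missing = []
    for path in paths:
        try:
            mode = os.stat(path).st_mode
        except OSError:
            missing.append(path)
            continue
        if not stat.S_ISCHR(mode):
            missing.append(path)
    return missing
```

An empty result means both nodes are visible; it does not prove the fTPM is provisioned or that the process has permission to open them.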
## Fact #8
- **Statement**: TPM is a key wrapping/sealing device, not a storage device. It has minimal storage capacity and is slow. Proper pattern: seal encryption keys in TPM, store encrypted data elsewhere.
- **Source**: Source #6
- **Phase**: Phase 1
- **Target Audience**: General TPM users
- **Confidence**: ✅ High
- **Related Dimension**: Architecture pattern
## Fact #9
- **Statement**: tpm2-pytss (Python TPM2 bindings) is available on PyPI, supports Python 3.10-3.14, requires tpm2-tss >= 2.4.0. Provides ESAPI and FAPI interfaces.
- **Source**: Source #9
- **Phase**: Phase 1
- **Target Audience**: Python developers
- **Confidence**: ✅ High
- **Related Dimension**: Implementation tooling
## Fact #10
- **Statement**: Industry trend: hardware-enforced TEEs and attestation for AI model protection. Device-bound encryption ties models to specific devices, preventing unauthorized copying.
- **Source**: Source #8, #10
- **Phase**: Phase 1
- **Target Audience**: Edge AI security architects
- **Confidence**: ✅ High
- **Related Dimension**: Industry direction
## Fact #11
- **Statement**: TPM binding is machine-specific. If workloads migrate across hardware, TPM-sealed keys become inaccessible. This is a feature for edge devices (prevents extraction) but a constraint for SaaS/cloud deployments.
- **Source**: Source #6
- **Phase**: Phase 1
- **Target Audience**: Infrastructure architects
- **Confidence**: ✅ High
- **Related Dimension**: Deployment model compatibility
## Fact #12
- **Statement**: The current loader's binary-split scheme splits each resource into a small part (API, per-user/hw key) and a big part (CDN, shared key). It was designed to prevent model extraction on untrusted laptops.
- **Source**: Problem context (architecture.md, ADR-002)
- **Phase**: Phase 1
- **Target Audience**: Azaion team
- **Confidence**: ✅ High
- **Related Dimension**: Current architecture
## Fact #13
- **Statement**: The loader currently derives hardware-bound keys via SHA-384(email + password + hw_hash + salt). The hw_hash is SHA-384 of hardware fingerprint collected by HardwareService (CPU/GPU info via subprocess).
- **Source**: Problem context (architecture.md, security module docs)
- **Phase**: Phase 1
- **Target Audience**: Azaion team
- **Confidence**: ✅ High
- **Related Dimension**: Current key management
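
The derivation in Fact #13 can be sketched in a few lines. This is a minimal illustration, assuming plain string concatenation and UTF-8 encoding; the field order, the encoding, the sample fingerprint string, and the salt value are hypothetical and are not taken from the loader's source. It also shows why the binding is software-only: replicating the fingerprint strings reproduces the key.

```python
import hashlib


def derive_hw_hash(hw_fingerprint: str) -> str:
    """SHA-384 of the hardware fingerprint string (CPU/GPU info), hex-encoded."""
    return hashlib.sha384(hw_fingerprint.encode("utf-8")).hexdigest()


def derive_key(email: str, password: str, hw_hash: str, salt: str) -> bytes:
    """SHA-384(email + password + hw_hash + salt).

    Assumption: simple concatenation in this order, UTF-8 encoded.
    """
    material = (email + password + hw_hash + salt).encode("utf-8")
    return hashlib.sha384(material).digest()


# Hypothetical inputs for illustration only.
hw_original = derive_hw_hash("CPU: example-cpu; GPU: example-gpu")
hw_replica = derive_hw_hash("CPU: example-cpu; GPU: example-gpu")  # copied strings

key_original = derive_key("user@example.com", "secret", hw_original, "hypothetical-salt")
key_replica = derive_key("user@example.com", "secret", hw_replica, "hypothetical-salt")

# Identical inputs give an identical key, regardless of the physical machine.
print(key_original == key_replica)
```

Because every input is a string that software can read or replay, nothing in this derivation depends on a secret held in hardware.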
## Fact #14
- **Statement**: OP-TEE on Jetson Orin supports custom Trusted Applications that can perform cryptographic operations in the secure world (ARM TrustZone S-EL0).
- **Source**: Source #11
- **Phase**: Phase 1
- **Target Audience**: Jetson security developers
- **Confidence**: ✅ High
- **Related Dimension**: TPM capability
## Fact #15
- **Statement**: Jetson Orin LUKS disk encryption defaults to auto-decrypt on boot (defeating the purpose of encryption). Password-protected operation requires modifying the LUKS service.
- **Source**: Source #12
- **Phase**: Phase 1
- **Target Audience**: Jetson Orin Nano practitioners
- **Confidence**: ✅ High
- **Related Dimension**: Disk encryption readiness
## Fact #16
- **Statement**: Orin Nano only supports REE FS for OP-TEE secure storage (file-system-based). RPMB (hardware replay-protected memory) is AGX Orin only. REE FS stores encrypted data at /data/tee/ on the normal world filesystem.
- **Source**: NVIDIA Jetson Linux Developer Guide — Secure Storage (r38.2)
- **Phase**: Phase 2
- **Target Audience**: Jetson Orin Nano developers
- **Confidence**: ✅ High
- **Related Dimension**: Storage security
## Fact #17
- **Statement**: tpm2-pytss FAPI provides create_seal(path, data), unseal(path), encrypt(path, plaintext), decrypt(path, ciphertext) — high-level Python API for TPM key operations.
- **Source**: tpm2-pytss documentation (readthedocs)
- **Phase**: Phase 2
- **Target Audience**: Python TPM developers
- **Confidence**: ✅ High
- **Related Dimension**: Implementation tooling
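
A seal/unseal round-trip built on the two FAPI calls named in Fact #17 might look like the sketch below. It is written against a small duck-typed interface so that it runs without a TPM; the object path, the helper names, and the in-memory test double are hypothetical. On a provisioned device, an instance of the tpm2-pytss `FAPI` class would presumably be passed instead, which has not been verified here.

```python
from typing import Dict, Protocol


class SealingBackend(Protocol):
    """The two calls this sketch relies on (names from Fact #17)."""

    def create_seal(self, path: str, data: bytes) -> object: ...

    def unseal(self, path: str) -> bytes: ...


# Hypothetical FAPI object path; the real layout depends on the FAPI profile.
MASTER_KEY_PATH = "/HS/SRK/azaion_master_key"


def seal_master_key(fapi: SealingBackend, key: bytes, path: str = MASTER_KEY_PATH) -> None:
    """Seal the key under the storage hierarchy, with no PCR policy attached."""
    fapi.create_seal(path, data=key)


def unseal_master_key(fapi: SealingBackend, path: str = MASTER_KEY_PATH) -> bytes:
    """Return the sealed key; from here on it lives in process memory."""
    return bytes(fapi.unseal(path))


class InMemoryBackend:
    """Test double only. It stores plaintext and provides no security."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def create_seal(self, path: str, data: bytes) -> bool:
        self._store[path] = bytes(data)
        return True

    def unseal(self, path: str) -> bytes:
        return self._store[path]


backend = InMemoryBackend()
seal_master_key(backend, b"\x01" * 32)
print(unseal_master_key(backend) == b"\x01" * 32)
```

Keeping the backend behind an interface also lets the same round-trip test run against a simulator such as swtpm or against real hardware.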
## Fact #18
- **Statement**: Alternative AI model protection without TPM: signed manifests with payload hashes, asymmetric signature verification on-device, dm-verity for runtime integrity. These work on any hardware.
- **Source**: Thistle Technologies, Tinfoil Containers blogs
- **Phase**: Phase 2
- **Target Audience**: Edge AI security architects
- **Confidence**: ✅ High
- **Related Dimension**: Non-TPM alternatives
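
The payload-hash part of the signed-manifest approach in Fact #18 can be illustrated with the standard library alone. This is a sketch under assumptions: the manifest layout (`name`, `sha256`) and the function names are hypothetical, and the asymmetric signature check on the manifest itself is deliberately omitted. In a real flow the manifest's signature would have to be verified before its hashes are trusted.

```python
import hashlib
import hmac
import json


def payload_digest(payload: bytes) -> str:
    """Hex SHA-256 digest of the payload bytes."""
    return hashlib.sha256(payload).hexdigest()


def build_manifest(name: str, payload: bytes) -> str:
    """Create a manifest (hypothetical format) that records the payload hash."""
    return json.dumps({"name": name, "sha256": payload_digest(payload)}, sort_keys=True)


def payload_matches_manifest(manifest_json: str, payload: bytes) -> bool:
    """Compare the recorded hash with the actual payload in constant time.

    Assumption: the manifest signature was already verified by the caller.
    """
    manifest = json.loads(manifest_json)
    return hmac.compare_digest(manifest["sha256"], payload_digest(payload))


model = b"example model bytes"
manifest = build_manifest("example-model", model)

print(payload_matches_manifest(manifest, model))
print(payload_matches_manifest(manifest, model + b"tampered"))
```

This detects modification of the payload but does not bind the model to a device, which is why Fact #18 lists it as an alternative rather than an equivalent of TPM sealing.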
## Fact #19
- **Statement**: TPM key sealing workflow: tpm2_createprimary → tpm2_create (with optional PCR policy) → tpm2_load → tpm2_startauthsession → tpm2_policypcr → tpm2_unseal. Keys are bound to device and optionally to boot state.
- **Source**: tpm2-tools tutorial, GitHub issues
- **Phase**: Phase 2
- **Target Audience**: TPM developers
- **Confidence**: ✅ High
- **Related Dimension**: Implementation workflow
## Fact #20
- **Statement**: The binary-split CDN offloading (big part on CDN, small part on API) serves a bandwidth/cost purpose separate from its security purpose. Even if security is handled by TPM, CDN offloading for large models may still be valuable.
- **Source**: Architecture analysis (ADR-002 rationale)
- **Phase**: Phase 2
- **Target Audience**: Azaion team
- **Confidence**: ✅ High
- **Related Dimension**: Architecture separation of concerns
@@ -0,0 +1,34 @@
# Comparison Framework
## Selected Framework Type
Decision Support
## Selected Dimensions
1. Solution overview
2. Threat model coverage
3. Hardware binding strength
4. Implementation cost
5. Maintenance cost
6. Risk assessment
7. Migration difficulty
8. Applicable scenarios
## Compared Solutions
- **A: Current binary-split scheme** (status quo)
- **B: TPM-only** (full replacement — eliminate binary-split)
- **C: Hybrid** (TPM for device binding + simplified download without split)
## Initial Population
| Dimension | A: Binary-Split (current) | B: TPM-Only | C: Hybrid (recommended) | Factual Basis |
|-----------|--------------------------|-------------|------------------------|---------------|
| Solution overview | Encrypt resource, split small (API) + big (CDN), per-user+hw key + shared key | TPM-sealed master key, single encrypted download, device-bound decryption | TPM-sealed key for device binding; single authenticated download from API/CDN; no split | Fact #12, #2, #8 |
| Threat model | Prevents extraction by requiring two servers; hardware fingerprint (software hash) ties to device | Prevents extraction via hardware fuse-derived key; attestation proves device identity; tamper-evident boot chain | Combines TPM device binding with authenticated download; single download point acceptable because device itself is trusted | Fact #2, #10, #11 |
| Hardware binding | SHA-384(email+password+hw_hash+salt) — software-computed, spoofable if hw strings are replicated | fTPM seed from hardware fuses — per-device unique, not software-spoofable | Same as B for binding; key sealed in TPM | Fact #2, #13 |
| Implementation cost | Already implemented | High: fTPM provisioning pipeline, tpm2-pytss integration, new security module, Docker device mounts, dual-path for SaaS | Medium: same TPM integration as B, but simpler download logic (remove split/merge code) | Fact #3, #4, #7, #9 |
| Maintenance cost | Moderate: two download paths (API+CDN), split/merge logic, two key types | Lower: single download path, single key type, but TPM provisioning infrastructure | Lowest: single download, TPM key management; CDN used for bandwidth only (no security split) | Fact #20 |
| Risk | Low (proven, in production) | High: fTPM persistence bugs (#5,#6), offline-only provisioning, REE FS (no RPMB on Nano) | Medium: same TPM risks as B, but fallback to legacy scheme mitigates | Fact #5, #6, #16 |
| Migration difficulty | N/A | Very high: all devices must be re-provisioned; no backward compatibility | Medium: feature-flag based; TPM-provisioned devices use new path, others use legacy | Fact #11 |
| Applicable scenarios | All current: laptops, edge, SaaS | Jetson Orin Nano (with fTPM) only; SaaS needs separate solution | Jetson Orin Nano gets TPM path; SaaS/non-TPM devices get simplified authenticated download (no split needed if server is trusted) | Fact #11, #18 |
@@ -0,0 +1,111 @@
# Reasoning Chain
## Dimension 1: Is binary-split still necessary for security?
### Fact Confirmation
The binary-split was designed for untrusted laptops (Fact #12): if an attacker compromises the CDN, they get 99% of the model but cannot reconstruct it without the API-held 1%. The threat is physical access to an untrusted device.
### Reference Comparison
On Jetson Orin Nano with fTPM (Fact #2): the encryption key is derived from hardware fuses. Even with full disk access, the attacker cannot extract the key without the specific TPM hardware. The device itself is the trust anchor, not the storage distribution.
### Conclusion
For TPM-equipped devices, split-storage adds complexity without adding security. The TPM hardware binding is strictly stronger than distributing fragments across servers. Binary-split's security purpose is obsolete on TPM devices.
### Confidence
✅ High — hardware-fuse-derived keys are fundamentally stronger than software-computed hashes.
---
## Dimension 2: Is CDN offloading still valuable without split?
### Fact Confirmation
ADR-002 lists two reasons for binary-split (Fact #20): (1) security (prevent single-point compromise) and (2) bandwidth/cost (large files on CDN, small metadata on API).
### Reference Comparison
If security is handled by TPM device binding, the CDN offloading benefit remains valid for large AI models. But the *splitting* mechanism (small+big parts) is unnecessary — a single encrypted file on CDN with an authenticated download URL achieves the same bandwidth benefit.
### Conclusion
CDN usage should remain for bandwidth optimization. But the split-and-merge encryption scheme can be replaced by a simpler pattern: encrypt the whole resource with a TPM-sealed key, store on CDN, download as single file.
### Confidence
✅ High — bandwidth and security are orthogonal concerns.
---
## Dimension 3: Can tpm2-pytss integrate with the Cython codebase?
### Fact Confirmation
tpm2-pytss (Fact #9, #17) is a Python library calling native tpm2-tss via CFFI. It provides FAPI with create_seal, unseal, encrypt, decrypt. The loader's security module is Cython (.pyx) calling Python cryptographic libraries.
### Reference Comparison
The current security.pyx already calls Python libraries (cryptography.hazmat). tpm2-pytss follows the same pattern — Python calls to a native library. Cython can call tpm2-pytss the same way.
### Conclusion
No architectural barrier. tpm2-pytss integrates naturally alongside existing cryptography library usage.
### Confidence
✅ High — same integration pattern as existing code.
---
## Dimension 4: What about SaaS/non-TPM deployments?
### Fact Confirmation
The loader now runs on both Jetson edge devices and SaaS web servers (Fact #11). TPM is machine-specific — works for fixed edge devices but SaaS VMs may not have TPM (or have vTPM with different trust properties).
### Reference Comparison
Alternative approaches exist for non-TPM environments (Fact #18): signed manifests, asymmetric signature verification, authenticated downloads. For SaaS servers that the company controls, the threat model is different — the server is trusted, so split-storage is unnecessary even without TPM.
### Conclusion
Two-tier strategy: (1) Jetson devices use TPM-sealed keys for strongest binding; (2) SaaS servers use standard authenticated download (no split needed since server is trusted infrastructure). The binary-split complexity is needed for neither scenario.
### Confidence
✅ High — different deployment contexts have different threat models.
---
## Dimension 5: fTPM production readiness
### Fact Confirmation
Forum reports (Fact #5, #6): PCR7 values not persisting across reboots; NV handles lost after reboot. RPMB not available on Orin Nano (Fact #16) — only REE FS.
### Reference Comparison
The proposed design does NOT rely on PCR-sealed keys or NV indexes. The key workflow uses FAPI create_seal/unseal with the Storage Root Key (SRK) hierarchy, which derives from the hardware fuse seed (Fact #2). This is independent of PCR persistence and NV storage issues.
### Conclusion
The PCR/NV persistence bugs are not blocking for this use case. FAPI seal/unseal under the SRK hierarchy uses the persistent primary key derived from fuses, not PCR-gated policies. However, this should be validated on actual hardware before committing.
### Confidence
⚠️ Medium — reasoning is sound but needs hardware validation.
---
## Dimension 6: Manufacturing pipeline impact
### Fact Confirmation
fTPM provisioning requires (Fact #3, #4): per-device KDK0 generation, fuse burning, EK certificate via CA, EKB encoding. Only offline provisioning supported.
### Reference Comparison
The current loader requires no manufacturing-time setup — credentials are provided at runtime. Adding fTPM provisioning is a significant operational change.
### Conclusion
fTPM provisioning is the biggest non-code cost. However, if Jetson devices are already manufactured by an OEM partner, fTPM provisioning can be integrated into the existing flashing pipeline. For development/testing, a simulated TPM (swtpm) can be used.
### Confidence
⚠️ Medium — depends on OEM manufacturing pipeline.
---
## Dimension 7: Migration path
### Fact Confirmation
Existing deployments use binary-split. New deployments can use TPM. Both must coexist during transition.
### Reference Comparison
Feature-flag pattern: detect at startup whether /dev/tpm0 exists and is provisioned. If yes, use TPM key path. If no, fall back to legacy binary-split. The API contracts (F1-F6) remain unchanged — the security layer is internal.
### Conclusion
A SecurityProvider abstraction (interface) with two implementations (LegacySecurityProvider, TpmSecurityProvider) enables clean coexistence. Detection is automatic. No API changes required.
### Confidence
✅ High — standard abstraction pattern, no external dependencies on migration.
@@ -0,0 +1,46 @@
# Validation Log
## Validation Scenario
A Jetson Orin Nano edge device with fTPM provisioned needs to download an AI model, decrypt it, and load it. A SaaS web server without TPM needs the same model.
## Expected Based on Conclusions
### Jetson Orin Nano (TPM path):
1. Loader starts, detects /dev/tpm0 → TpmSecurityProvider
2. POST /login → JWT auth (unchanged)
3. POST /load/{model} → single encrypted download from CDN via authenticated URL
4. TPM unseals the device-specific decryption key
5. Model decrypted and returned to caller
### SaaS web server (no-TPM path):
1. Loader starts, no /dev/tpm0 → LegacySecurityProvider (or SimplifiedSecurityProvider)
2. POST /login → JWT auth (unchanged)
3. POST /load/{model} → single authenticated download (no split needed — server is trusted)
4. Standard key derivation from credentials
5. Model decrypted and returned to caller
### Docker unlock (Jetson):
1. POST /unlock → authenticate
2. Download key → TPM-sealed key used instead of key fragment download
3. Decrypt archive → same as current but with TPM-derived key
4. docker load → unchanged
## Actual Validation Results
The scenario is consistent with the proposed architecture. Key observations:
- API endpoints remain identical (F1-F6 contracts preserved)
- The security layer change is internal — callers don't know which provider is active
- CDN is still used for bandwidth (large model storage) but serves single files, not split parts
- Upload flow (F3) simplifies: encrypt whole file, upload to CDN + register on API (no split)
## Counterexamples
1. **What if a device needs to be re-provisioned?** — fTPM provisioning is manufacturing-time. If a device's fTPM state is corrupted, it needs re-flashing. This is acceptable for edge devices (they're managed hardware) but must be documented.
2. **What if the same model needs to work across TPM and non-TPM devices?** — Models are encrypted per-deployment. TPM devices get a device-specific encrypted copy. Non-TPM devices get a credentials-encrypted copy. The API server handles the distinction.
## Review Checklist
- [x] Draft conclusions consistent with fact cards
- [x] No important dimensions missed
- [x] No over-extrapolation
- [x] Conclusions actionable/verifiable
## Conclusions Requiring Revision
None. The hybrid approach (Solution C) is validated as feasible and superior to both status quo and full-TPM-only.
@@ -0,0 +1,66 @@
# Security Analysis: TPM-Based Security Replacing Binary-Split
## Threat Model
### Asset Inventory
| Asset | Value | Current Protection | Proposed Protection (TPM) |
|-------|-------|--------------------|--------------------------|
| AI model files | High — core IP | AES-256-CBC, split storage (API+CDN), per-user+hw key | AES-256-CBC, TPM-sealed device key, single encrypted storage |
| Docker image archive | High — service IP | AES-256-CBC, key fragment from API | AES-256-CBC, TPM-sealed key (no network key download) |
| User credentials | Medium | In-memory only | In-memory only (unchanged) |
| JWT tokens | Medium | In-memory, no signature verification | In-memory (unchanged; signature verification is a separate concern) |
| CDN credentials | Medium | Encrypted cdn.yaml from API | Same (unchanged) |
| Encryption keys | Critical | SHA-384 derived, in memory | TPM-sealed, never in user-space memory in plaintext |
### Threat Actors
| Actor | Capability | Motivation |
|-------|-----------|-----------|
| Physical attacker (edge) | Physical access to Jetson device, can extract storage | Steal AI models |
| Network attacker | MITM, API/CDN compromise | Intercept models in transit |
| Insider (compromised server) | Access to API or CDN backend | Extract stored model fragments |
| Reverse engineer | Access to loader binary (.so files) | Extract key derivation logic, salts |
### Attack Vectors — Current vs Proposed
| Attack Vector | Current (Binary-Split) | Proposed (TPM) | Delta |
|--------------|----------------------|----------------|-------|
| **Extract model from disk** | Must obtain both CDN big part + API small part. If attacker has disk, big part is local. Need API access for small part. | Model encrypted with TPM-sealed key. Key cannot be extracted without the specific TPM hardware. | **Stronger** — hardware binding vs. server-side fragmentation |
| **Clone device** | Replicate hardware fingerprint strings (CPU model, GPU, etc.) → derive same SHA-384 key | Cannot clone fTPM — seed derived from hardware fuses, unique per chip | **Stronger** — fuse-based vs. string-based identity |
| **Compromise CDN** | Get big parts only — useless without small parts from API | Get encrypted files — useless without TPM-sealed key on target device | **Equivalent** — both require a second factor |
| **Compromise API** | Get small parts + key fragments. Combined with CDN data = full model | Get encrypted metadata. Key is TPM-sealed, not on API server | **Stronger** — API no longer holds key material |
| **Reverse-engineer loader binary** | Extract salt strings from .so → reconstruct SHA-384 key derivation → derive keys for any known email+password+hw combo | TPM key derivation is in hardware. Even with full .so source, keys are not reconstructable | **Stronger** — hardware vs. software key protection |
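| **Memory dump at runtime** | Keys exist in Python process memory during encrypt/decrypt operations | With FAPI encrypt/decrypt, the operation runs inside the TPM and the key never enters user-space memory; a key released by FAPI unseal is present in process memory while in use | **Stronger** when the key stays in the TPM (FAPI encrypt/decrypt) |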
| **Stolen credentials** | Attacker with email+password can derive all keys if they also know hw fingerprint | Credentials alone are insufficient — TPM-sealed key requires the physical device | **Stronger** — credentials are not sufficient |
## Per-Component Security Requirements
| Component | Requirement | Risk Level | Proposed Control |
|-----------|------------|------------|-----------------|
| SecurityProvider detection | Must correctly identify TPM availability; false positive → crash; false negative → weaker security | Medium | Check /dev/tpm0 existence + attempt TPM connection; fall back to legacy on any failure |
| TPM key sealing | Sealed key must only be unsealable on the provisioned device | High | Use FAPI create_seal under SRK hierarchy; no PCR policy (avoids persistence bugs); auth password optional |
| Docker device mount | /dev/tpm0 and /dev/tpmrm0 must be accessible in container | Medium | `devices:` entries in docker-compose.yml (the Compose equivalent of `docker run --device`); no privileged mode |
| Legacy fallback | Must remain fully functional for non-TPM devices | High | Existing security module unchanged; SecurityProvider delegates to it |
| Key rotation | TPM-sealed keys should be rotatable without re-provisioning | Medium | Seal a wrapping key in TPM; actual resource keys wrapped by it; rotate resource keys independently |
| CDN authenticated download | Single-file download must use authenticated URLs (not public) | High | Signed S3 URLs with expiration; existing CDN auth mechanism |
## Security Controls Summary
### Authentication
- **Unchanged**: JWT Bearer tokens from Azaion Resource API
- **Enhanced (TPM path)**: Device attestation possible via EK certificate (future enhancement, not in initial scope)
### Data Protection
- **At rest**: AES-256-CBC encrypted resources. Key sealed in TPM (Jetson) or derived from credentials (legacy).
- **In transit**: HTTPS for all API/CDN calls (unchanged)
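- **In TPM**: Keys used through FAPI encrypt/decrypt stay within the TPM boundary and never enter user-space memory. A key released by FAPI unseal (such as the master key in the TPM path under Key Management) is held in process memory while in use.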
### Key Management
- **TPM path**: Master key sealed at provisioning time → stored in TPM NV or as sealed blob in REE FS → unsealed at runtime via FAPI → used to derive/unwrap resource-specific keys
- **Legacy path**: SHA-384 key derivation from email+password+hw_hash+salt (unchanged)
- **Key rotation**: Wrap resource keys with TPM-sealed master key; rotate resource keys without re-provisioning TPM
### Logging & Monitoring
- **Unchanged**: Loguru file + stdout/stderr logging
- **Addition**: Log SecurityProvider selection at startup (which path was chosen and why)
@@ -0,0 +1,112 @@
# Solution Draft: TPM-Based Security Replacing Binary-Split
## Product Solution Description
Replace the binary-split resource scheme with a TPM-aware security architecture that uses hardware-rooted keys on Jetson Orin Nano devices and simplified authenticated downloads elsewhere. The loader gains a `SecurityProvider` abstraction with two implementations: `TpmSecurityProvider` (fTPM-based, for provisioned Jetson devices) and `LegacySecurityProvider` (current scheme, for backward compatibility). The binary-split upload/download logic is simplified to single-file encrypted resources stored on CDN, with the split mechanism retained only in the legacy path.
```
┌─────────────────────────────────────────────┐
│ Loader (FastAPI) │
│ ┌────────────┐ ┌─────────────────────┐ │
│ │ HTTP API │───▶│ SecurityProvider │ │
│ │ (F1-F6) │ │ (interface) │ │
│ └────────────┘ └──────┬──────────────┘ │
│ ┌─────┴──────┐ │
│ ┌──────┴──┐ ┌──────┴───────┐ │
│ │ TpmSec │ │ LegacySec │ │
│ │ Provider│ │ Provider │ │
│ └────┬────┘ └──────┬───────┘ │
│ │ │ │
│ /dev/tpm0 SHA-384 keys │
│ (fTPM) (current scheme) │
└─────────────────────────────────────────────┘
```
## Existing/Competitor Solutions Analysis
| Solution | Approach | Applicability |
|----------|----------|---------------|
| SecEdge SEC-TPM | Firmware TPM for edge AI device trust, model binding, attestation | Directly applicable — same problem space |
| Tinfoil Containers | TEE-based (Intel TDX / AMD SEV-SNP) with attestation | Cloud/data center focus; not applicable to Jetson ARM64 |
| Thistle OTA | Signed manifests + asymmetric verification, no hardware binding | Weaker than TPM but works without hardware support |
| Amulet (TEE-shielded inference) | OP-TEE based model obfuscation for ARM TrustZone | Interesting for inference protection; complementary to our approach |
| NVIDIA Confidential Computing | H200/B200 GPU TEEs | Data center only; not applicable to Orin Nano |
## Architecture
### Component: Security Provider Abstraction
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| Python ABC + runtime detection | abc module, os.path.exists("/dev/tpm0") | Simple, no deps, auto-selects at startup | Detection is binary (TPM or not) | None | N/A | Zero | Best |
| Config-file based selection | YAML/env var SECURITY_PROVIDER=tpm\|legacy | Explicit control, testable | Manual configuration per device | Config management | N/A | Zero | Good |
**Recommendation**: Runtime detection with config override. Check /dev/tpm0 by default; allow SECURITY_PROVIDER env var to force a specific provider.
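
The recommended selection logic can be sketched as follows. The class names and the `SECURITY_PROVIDER` variable come from this draft; the `name()` method, the `select_provider` function, and its parameters are hypothetical, and the provider bodies are left empty. A fuller version would also attempt a TPM connection and fall back to legacy on failure.

```python
import os
from abc import ABC, abstractmethod
from typing import Mapping, Optional


class SecurityProvider(ABC):
    """Common interface; real providers would also expose key operations."""

    @abstractmethod
    def name(self) -> str: ...


class TpmSecurityProvider(SecurityProvider):
    def name(self) -> str:
        return "tpm"


class LegacySecurityProvider(SecurityProvider):
    def name(self) -> str:
        return "legacy"


def select_provider(
    env: Optional[Mapping[str, str]] = None,
    tpm_device: str = "/dev/tpm0",
) -> SecurityProvider:
    """Pick a provider: explicit env override first, then device detection."""
    env = os.environ if env is None else env
    forced = env.get("SECURITY_PROVIDER", "").strip().lower()
    if forced == "tpm":
        return TpmSecurityProvider()
    if forced == "legacy":
        return LegacySecurityProvider()
    if forced:
        raise ValueError(f"Unknown SECURITY_PROVIDER value: {forced!r}")
    # No override: fall back to checking for the TPM device node.
    if os.path.exists(tpm_device):
        return TpmSecurityProvider()
    return LegacySecurityProvider()


provider = select_provider()
print(f"SecurityProvider selected: {provider.name()}")
```

Passing `env` and `tpm_device` as parameters keeps the function testable on machines without a TPM, which matches the auto-detection tests listed in the testing strategy.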
### Component: TPM Key Management
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| tpm2-pytss FAPI | tpm2-pytss (PyPI), tpm2-tss native lib | High-level Python API (create_seal, unseal, encrypt, decrypt); mature project | Requires tpm2-tss native lib installed; FAPI config needed | tpm2-tss >= 2.4.0, Python 3.11 | Hardware-rooted keys from device fuses | Low (open source) | Best |
| tpm2-tools via subprocess | tpm2-tools CLI, subprocess calls | No Python bindings needed; well-documented CLI | Subprocess overhead; harder to test; string parsing | tpm2-tools installed in container | Same | Low | Acceptable |
| Custom OP-TEE TA | C TA in OP-TEE, Python CA via libteec | Maximum control; no dependency on TPM stack | Very high development effort; C code in secure world | OP-TEE dev environment, ARM toolchain | Strongest (code runs in TrustZone) | High | Overkill |
**Recommendation**: tpm2-pytss FAPI. High-level API, Python-native, same pattern as existing cryptography library usage.
### Component: Resource Download (simplified)
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| Single encrypted file on CDN | boto3 (existing), CDN signed URLs | Removes split/merge complexity; single download | Larger download per request (no partial caching) | CDN config | Encrypted at rest + in transit | Same CDN cost | Best |
| Keep CDN big + API small (current) | Existing code | No migration needed | Unnecessary complexity for TPM path | Both API and CDN | Split-key defense | Same | Legacy only |
**Recommendation**: Single-file download for TPM path. Legacy path retains split for backward compatibility.
### Component: Docker Unlock (TPM-enhanced)
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| TPM-sealed archive key | fTPM, tpm2-pytss | Key never leaves TPM; no network download needed for key | Requires provisioned fTPM | fTPM provisioned with sealed key | Strongest — offline decryption possible | Low | Best |
| Key fragment from API (current) | HTTPS download | Works without TPM | Requires network; key fragment in memory | API reachable | Current level | Zero | Legacy only |
**Recommendation**: TPM-sealed archive key for provisioned devices. The key can be sealed into the TPM during device provisioning, eliminating the need to download a key fragment at unlock time.
### Component: Migration/Coexistence
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| Feature flag + SecurityProvider abstraction | ABC, env var, /dev/tpm0 detection | Clean separation; zero risk to existing deployments | Two code paths to maintain during transition | None | Both paths maintain security | Low | Best |
| Hard cutover | N/A | Simple (one path) | Breaks non-TPM devices | All devices must have TPM | N/A | High risk | Poor |
**Recommendation**: Feature flag with auto-detection. Gradual rollout.
## Testing Strategy
### Integration / Functional Tests
- SecurityProvider auto-detection: with and without /dev/tpm0
- TpmSecurityProvider: seal/unseal round-trip (requires TPM simulator — swtpm)
- LegacySecurityProvider: all existing tests pass unchanged
- Single-file download: encrypt → upload → download → decrypt round-trip
- Docker unlock with TPM-sealed key: decrypt archive without network key download
- Migration: same resource accessible via both providers (different encryption)
### Non-Functional Tests
- Performance: TPM seal/unseal latency vs current SHA-384 key derivation
- Performance: single-file download vs split download (expect improvement)
- Security: verify TPM-sealed key cannot be extracted without hardware
- Security: verify legacy path still works identically to current behavior
## References
- NVIDIA Jetson Linux Developer Guide r36.4.4 — Firmware TPM: https://docs.nvidia.com/jetson/archives/r36.4.4/DeveloperGuide/SD/Security/FirmwareTPM.html
- NVIDIA JetPack 6.1 Blog: https://developer.nvidia.com/blog/nvidia-jetpack-6-1-boosts-performance-and-security-through-camera-stack-optimizations-and-introduction-of-firmware-tpm/
- tpm2-pytss: https://github.com/tpm2-software/tpm2-pytss
- tpm2-pytss FAPI docs: https://tpm2-pytss.readthedocs.io/en/latest/fapi.html
- SecEdge — Securing Edge AI through Trusted Computing: https://www.secedge.com/tcg-blog-securing-edge-ai-through-trusted-computing/
- Thistle Technologies — Securing AI Models on Edge Devices: https://thistle.tech/blog/securing-ai-models-on-edge-devices
- NVIDIA Developer Forums — fTPM PCR issues: https://forums.developer.nvidia.com/t/access-ftpm-pcr-registers/328636
- Docker TPM access: https://devops.stackexchange.com/questions/8509/accessing-tpm-from-inside-a-docker-container
## Related Artifacts
- AC Assessment: `_docs/02_task_plans/tpm-replaces-binary-split/00_research/00_ac_assessment.md`
- Fact Cards: `_docs/02_task_plans/tpm-replaces-binary-split/00_research/02_fact_cards.md`
- Reasoning Chain: `_docs/02_task_plans/tpm-replaces-binary-split/00_research/04_reasoning_chain.md`
@@ -0,0 +1,798 @@
# Solution Draft 02: TPM Security Implementation Guide
## Overview
This document is a comprehensive implementation guide for replacing the binary-split resource scheme with TPM-based hardware-rooted security on Jetson Orin Nano devices. It covers fTPM provisioning, full-disk encryption, OS hardening, tamper-responsive enclosures, the simplified loader architecture, and a phased implementation plan.
Prerequisite reading: `solution_draft01.md` (architecture overview), `security_analysis.md` (threat model).
---
## 1. fTPM Fusing and Provisioning
### 1.1 Hardware Required
| Item | Purpose | Cost |
| --------------------------------------- | ----------------------------------------- | -------- |
| x86 Ubuntu host PC (20.04 or 22.04 LTS) | Runs NVIDIA flashing/fusing tools         | Existing |
| USB-C cable (data-capable) | Connects host to Jetson in recovery mode | ~$10 |
| Jetson Orin Nano dev kit (expendable) | First fuse target; fusing is irreversible | ~$250 |
| Jetson Orin Nano dev kit (kept unfused) | Ongoing development and debugging | ~$250 |
No specialized lab equipment, JTAG probes, or custom tooling is required. The entire fusing and provisioning process runs on a standard PC.
### 1.2 Roles: ODM vs OEM
NVIDIA's fTPM docs describe two separate entities:
- **ODM (Original Design Manufacturer)**: Designs the fTPM integration, generates KDK0 per device, runs the CA server, signs EK certificates, creates firmware packages.
- **OEM (Original Equipment Manufacturer)**: Adds disk encryption keys, assembles hardware, burns fuses at the factory, ships the final product.
In large-scale manufacturing these are different companies with a formal key handoff. **In our case, we are both ODM and OEM** — we design, provision, flash, and deploy ourselves. NVIDIA covers this in their fTPM guide Appendix B with a **simplified single-entity flow** that eliminates the cross-company handoff and roughly halves the provisioning complexity.
### 1.3 Key Derivation Chain
The full derivation from hardware fuses to usable keys:
```
KDK0 (256-bit random, burned into SoC fuses at manufacturing)
├── Silicon_ID = KDF(key=KDK0, info=Device_SN)
│ Device_SN = OEM_ID || SN (unique per device)
├── fTPM_Seed = KDF(key=Silicon_ID, constant_str1)
│ Passed from MB2 bootloader to OP-TEE via encrypted TrustZone memory
├── fTPM_Root_Seed = KDF(key=fTPM_Seed, constant_str)
├── EPS = KDF(key=fTPM_Root_Seed, info=Device_SN, salt=EPS_Seed)
│ EPS_Seed is a 256-bit random number from odm_ekb_gen.py, stored in EKB
│ EPS (Endorsement Primary Seed) is the root identity of the fTPM entity
├── SRK = TPM2_CreatePrimary(EPS)
│ Deterministic — re-derived from EPS on every boot
│ Never stored persistently, never leaves the secure world
└── Sealed blobs (your encryption keys)
Encrypted under SRK, stored as files on disk
Only unsealable on this specific device
```
Every KDF step is one-way. Knowing a derived value does not reveal its parent. Two devices with different KDK0 values produce entirely different key trees.
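The one-way chaining can be illustrated with a short sketch. It assumes HMAC-SHA256 as a stand-in KDF: NVIDIA's actual KDF and constant strings are not given in this document, so `kdf`, `derive_chain`, and the label bytes are hypothetical, and the SRK and sealing steps (which happen inside the fTPM) are left out.

```python
import hashlib
import hmac


def kdf(key: bytes, info: bytes) -> bytes:
    """Stand-in one-way KDF (HMAC-SHA256). Not NVIDIA's actual function."""
    return hmac.new(key, info, hashlib.sha256).digest()


def derive_chain(kdk0: bytes, oem_id: bytes, sn: bytes, eps_seed: bytes) -> dict:
    """Mirror the shape of the derivation tree using placeholder labels."""
    device_sn = oem_id + sn  # Device_SN = OEM_ID || SN
    silicon_id = kdf(kdk0, device_sn)
    ftpm_seed = kdf(silicon_id, b"constant_str1")        # placeholder label
    ftpm_root_seed = kdf(ftpm_seed, b"constant_str")     # placeholder label
    # The salt is simply appended to the info field here, for illustration only.
    eps = kdf(ftpm_root_seed, device_sn + eps_seed)
    return {
        "silicon_id": silicon_id,
        "ftpm_seed": ftpm_seed,
        "ftpm_root_seed": ftpm_root_seed,
        "eps": eps,
    }
```

Calling `derive_chain` twice with the same inputs returns the same tree, which is why the SRK can be re-derived on every boot. Changing only `kdk0` changes every derived value.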
### 1.4 Provisioning Process (Single-Entity / ODM+OEM Flow)
#### Step 1: Install BSP and FSKP Packages
```
mkdir ${BSP_TOP} && cd ${BSP_TOP}
tar jvxf jetson_linux_${rel_ver}_aarch64.tbz2
tar jvxf public_sources.tbz2
cd Linux_for_Tegra/rootfs
sudo tar jvxpf tegra_linux_sample-root-filesystem_${rel_ver}_aarch64.tbz2
cd ${BSP_TOP}/Linux_for_Tegra
sudo ./apply_binaries.sh
cd ${BSP_TOP}
tar jvxf fskp_partner_t234_${rel_ver}_aarch64.tbz2
```
#### Step 2: Generate PKC and SBK Keys (Secure Boot)
```
openssl genrsa -out pkc.pem 3072
python3 gen_sbk_key.py --out sbk.key
```
The PKC (Public Key Cryptography) key signs all boot chain images, and the SBK (Secure Boot Key) encrypts them. The PKC hash and the SBK are burned into fuses, and both keys are needed for every subsequent flash.
#### Step 3: Generate Per-Device KDK0 and Silicon_ID
```
python3 kdk_gen.py \
--oem-id ${OEM_ID} \
--sn ${DEVICE_SN} \
--output-dir ${KDK_DB}
```
Outputs per device: KDK0 (256-bit), Device_SN, Silicon_ID public key. **KDK0 must be discarded after the fuseblob and EKB are generated** — keeping it in storage risks leaks.
#### Step 4: Generate Fuseblob
```
python3 fskp_fuseburn.py \
--kdk-db ${KDK_DB} \
--pkc-key pkc.pem \
--sbk-key sbk.key \
--fuse-xml fuse_config.xml \
--output-dir ${FUSEBLOB_DB}
```
The fuse config XML specifies which fuses to burn: KDK0, PKC hash, SBK, OEM_K1, SECURITY_MODE, ARM_JTAG_DISABLE, etc.
#### Step 5: Generate fTPM EKB (EK Certificates + EPS Seed)
```
python3 odm_ekb_gen.py \
--kdk-db ${KDK_DB} \
--output-dir ${EKB_FTPM_DB}
```
This generates EK CSRs, signs them with your CA, and packages the EPS Seed + EK certificates into per-device EKB images. In the single-entity flow, you run your own CA:
```
./ftpm_manufacturer_ca_simulator.sh  # Replace with real CA in production
```
Then merge with disk encryption keys:
```
python3 oem_ekb_gen.py \
--ekb-ftpm-db ${EKB_FTPM_DB} \
--user-keys sym2_t234.key \
--oem-k1 oem_k1.key \
--output-dir ${EKB_FINAL_DB}
```
#### Step 6: Burn Fuses (IRREVERSIBLE)
Put the device in USB recovery mode:
- If powered off: connect DC power (device enters recovery automatically on some carrier boards)
- If powered on: `sudo reboot --force forced-recovery`
- Verify: `lsusb` shows NVIDIA device
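The `lsusb` check can be scripted so a provisioning pipeline stops early when no device is in recovery mode. A minimal sketch, assuming the helper receives the text output of `lsusb`; the function name is hypothetical, and `0955` is NVIDIA's USB vendor ID.

```python
import re

NVIDIA_USB_VENDOR_ID = "0955"

# Matches the "ID vvvv:pppp" field that lsusb prints for every device.
_ID_FIELD = re.compile(r"\bID\s+([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\b")


def find_nvidia_devices(lsusb_output: str) -> list[str]:
    """Return the lsusb lines whose vendor ID belongs to NVIDIA."""
    matches = []
    for line in lsusb_output.splitlines():
        found = _ID_FIELD.search(line)
        if found and found.group(1).lower() == NVIDIA_USB_VENDOR_ID:
            matches.append(line.strip())
    return matches
```

A wrapper script would pass in the output of `subprocess.run(["lsusb"], capture_output=True, text=True).stdout` and abort before `odmfuse.sh` if the returned list is empty.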
Test first (dry run):
```
sudo ./odmfuse.sh --test -X fuse_config.xml -i 0x23 jetson-orin-nano-devkit
```
Burn for real:
```
sudo ./odmfuse.sh -X fuse_config.xml -i 0x23 jetson-orin-nano-devkit
```
After the `SECURITY_MODE` fuse is burned (value 0x1), **all further fuse writes are blocked permanently** (except a few ODM-reserved fuses).
#### Step 7: Flash Signed + Encrypted Images
```
sudo ROOTFS_ENC=1 ./flash.sh \
-u pkc.pem \
-v sbk.key \
-i ./sym2_t234.key \
--ekb ${EKB_FINAL_DB}/ekb-${DEVICE_SN}.signed \
jetson-orin-nano-devkit \
nvme0n1p1
```
#### Step 8: On-Device fTPM Provisioning (One-Time)
After first boot, run the provisioning script on the device:
```
sudo ./ftpm_provisioning.sh
```
This queries EK certificates from the EKB, stores them in fTPM NV memory, takes fTPM ownership, and creates EK handles. It only needs to run once per device.
### 1.5 Difficulty Assessment
| Aspect | Difficulty | Notes |
| ----------------------------- | ------------------- | ---------------------------------------------------- |
| First device (learning curve) | Medium-High | NVIDIA docs are detailed but dense. Budget 2-3 days. |
| Subsequent devices (scripted) | Low | Same pipeline, different KDK0/SN per device. |
| Risk | High (irreversible) | Always test on expendable dev board first. |
| Automation potential | High | Entire pipeline is scriptable for factory floor. |
### 1.6 Known Issues
- `odmfuseread.sh` has a Python 3 compatibility bug: it calls `getiterator()`, which was removed from `xml.etree.ElementTree` in Python 3.9. Fix: replace the call on line 1946 of `tegraflash_impl_t234.py` with `xml_tree.iter('file')`.
- Forum reports of PCR7 values not persisting across reboots. Our design deliberately avoids PCR-sealed keys — we use FAPI seal/unseal under SRK hierarchy only.
- Forum reports of NV handle loss after reboot on some Orin devices. Not blocking for our use case (SRK is re-derived from fuses, not stored in NV).
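The `getiterator()` issue from the first item can be reproduced outside the NVIDIA tooling. A minimal sketch, assuming a made-up XML layout (the real file parsed by `tegraflash_impl_t234.py` is not shown in this document); `list_file_names` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Illustrative XML only; not the real NVIDIA layout file.
SAMPLE_XML = """<partition_layout>
  <file name="a.bin"/>
  <group><file name="b.bin"/></group>
</partition_layout>"""


def list_file_names(xml_text: str) -> list[str]:
    """Collect the name attribute of every <file> element, at any depth."""
    xml_tree = ET.ElementTree(ET.fromstring(xml_text))
    # The old spelling, xml_tree.getiterator("file"), raises AttributeError
    # on Python 3.9 and later; iter() is the supported replacement.
    return [node.get("name") for node in xml_tree.iter("file")]
```

`iter()` walks the tree in document order, so the behavior matches what the removed method provided.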
---
## 2. Storage Encryption
### 2.1 Recommendation: Full-Disk Encryption
Encrypt the entire NVMe rootfs partition, not just selected model files.
**Why full disk instead of selective encryption:**
| Approach | Protects models | Protects logs/config/temp files | Custom code needed | Performance |
| ---------------------------- | --------------- | ------------------------------------------------ | --------------------------------------- | ---------------------------- |
| Selective (model files only) | Yes | No — metadata, logs, decrypted artifacts exposed | Yes — application-level encrypt/decrypt | Minimal |
| Full disk (LUKS) | Yes | Yes — everything on disk is ciphertext | No — kernel handles it transparently | Minimal (HW-accelerated AES) |
Full-disk encryption is built into NVIDIA's Jetson Linux stack. No application code changes needed for the disk layer.
### 2.2 How Full-Disk Encryption Works
```
Flashing (host PC):
gen_ekb → sym2_t234.key (DEK) + eks_t234.img (EKB image)
ROOTFS_ENC=1 flash.sh → rootfs encrypted with DEK, DEK packaged in EKB
Boot (on device):
MB2 reads KDK0 from fuses
→ derives K1
→ decrypts EKB
→ extracts DEK
→ passes DEK to dm-crypt kernel module
dm-crypt + LUKS mounts rootfs transparently
Application sees a normal filesystem — encryption is invisible
```
The application never touches the disk encryption key. Key extraction and dm-crypt setup happen during early boot, before the root filesystem is mounted.
### 2.3 Double Encryption (Defense in Depth)
For AI model files, two independent encryption layers:
1. **Layer 1 — Full Disk LUKS** (kernel): Protects everything on disk. Key derived from fuses via EKB. Transparent to applications.
2. **Layer 2 — Application-level TPM-sealed encryption**: Model files encrypted with a key sealed in the fTPM. Decrypted by the loader at runtime.
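An attacker who somehow bypasses disk encryption (e.g., cold boot while the filesystem is mounted) still faces the application-level encryption. The reverse also holds: an attacker who defeats the application-level layer still has to get past disk encryption to reach the files.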
### 2.4 Setup Steps
1. Generate encryption keys from OP-TEE source:
```
cd ${BSP_TOP}/Linux_for_Tegra/source/nvidia-jetson-optee-source
cd optee/samples/hwkey-agent/host/tool/gen_ekb/
sudo chmod +x example.sh && ./example.sh
```
Outputs: `sym2_t234.key` (DEK) and `eks_t234.img` (EKB image).
2. Place keys:
```
cp sym2_t234.key ${BSP_TOP}/Linux_for_Tegra/
cp eks_t234.img ${BSP_TOP}/Linux_for_Tegra/bootloader/
```
3. Verify EKB integrity:
```
hexdump -C -n 4 -s 0x24 eks_t234.img
# Must show magic bytes "EEKB"
```
4. Configure NVMe partition size in `flash_l4t_t234_nvme_rootfs_enc.xml`:
- Set `NUM_SECTORS` based on NVMe capacity (e.g., 900000000 for 500GB)
- Set `encrypted="true"` for the rootfs partition
5. Flash with encryption:
```
sudo ROOTFS_ENC=1 ./tools/kernel_flash/l4t_initrd_flash.sh \
--external-device nvme0n1p1 \
-c flash_l4t_t234_nvme_rootfs_enc.xml \
-i ./sym2_t234.key \
-u pkc.pem -v sbk.key \
jetson-orin-nano-devkit \
nvme0n1p1
```
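The EKB integrity check in step 3 can also run as a scripted gate before flashing. A minimal sketch that reads the same 4 bytes at offset `0x24` that the `hexdump` command inspects; the function name is hypothetical.

```python
EKB_MAGIC = b"EEKB"
EKB_MAGIC_OFFSET = 0x24  # same offset as `hexdump -s 0x24`


def has_ekb_magic(image_path: str) -> bool:
    """Return True if the image carries the EEKB magic at offset 0x24."""
    with open(image_path, "rb") as image:
        image.seek(EKB_MAGIC_OFFSET)
        # A truncated file yields fewer than 4 bytes and fails the comparison.
        return image.read(len(EKB_MAGIC)) == EKB_MAGIC
```

A flashing wrapper could call `has_ekb_magic("eks_t234.img")` and refuse to continue when it returns `False`.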
---
## 3. Debug Access Strategy
### 3.1 The Problem
After Secure Boot fusing, JTAG disabling, and OS hardening, the device has no interactive access. How do you develop, debug, and perform field maintenance?
### 3.2 Solution: Dual-Image Approach
Standard embedded Linux practice: maintain two OS images, both signed with the same PKC key.
| Property | Development Image | Production Image |
| ---------------------------------- | --------------------------------- | ----------------------- |
| Secure Boot signature | Signed with PKC key | Signed with PKC key |
| Boots on fused device | Yes | Yes |
| SSH access | Yes (key-based only, no password) | No (sshd not installed) |
| Serial console | Enabled | Disabled |
| ptrace / /dev/mem | Allowed | Blocked (lockdown mode) |
| Debug tools (gdb, strace, tcpdump) | Installed | Not present |
| Getty on TTY | Running | Not spawned |
| Desktop environment | Optional | Not installed |
| Application | Your loader + inference | Your loader + inference |
Secure Boot verifies the **signature**, not the **contents** of the image. Both images are valid as long as they're signed with your PKC key. An attacker cannot create either image without the private key.
### 3.3 Workflow
**During development:**
1. Flash the dev image to a fused device
2. SSH in via key-based authentication
3. Develop, debug, iterate
4. When done, flash the prod image for deployment
**Production deployment:**
1. Flash the prod image at the factory
2. Device boots directly into your application
3. No shell, no SSH, no serial — only your FastAPI endpoints
**Field debug (emergency):**
1. Connect host PC via USB-C
2. Put device in USB recovery mode (silicon ROM, always available)
3. Reflash with the dev image (requires PKC private key to sign)
4. SSH in, diagnose, fix
5. Reflash with prod image, redeploy
USB recovery mode is hardwired in silicon. It always works regardless of what OS is installed. But after Secure Boot fusing, it **only accepts images signed with your PKC key**. An attacker who enters recovery mode but lacks the signing key is stuck.
### 3.4 Optional: Hardware Debug Jumper
A physical GPIO pin on the carrier board that, when shorted at boot, tells the init system to start SSH:
```
Boot → systemd reads GPIO pin → if HIGH: start sshd.service
→ if LOW: sshd not started (production behavior)
```
Opening the case to access the jumper triggers the tamper enclosure → keys are zeroized. So this is only useful during controlled maintenance with the tamper system temporarily disarmed.
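The jumper decision can be sketched as a small helper that a boot-time unit might call. This assumes the pin level is exposed as a text file containing `0` or `1`; the path, the mechanism for exposing the pin, and both function names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def debug_jumper_set(value_path: str) -> bool:
    """Return True only if the GPIO value file reads '1' (HIGH)."""
    try:
        return Path(value_path).read_text().strip() == "1"
    except OSError:
        # Fail closed: an unreadable pin means production behavior.
        return False


def service_to_start(value_path: str) -> Optional[str]:
    """Name the unit to start, or None when the jumper is not set."""
    return "sshd.service" if debug_jumper_set(value_path) else None
```

Failing closed matters here: a missing or unreadable pin must never enable SSH on a production image.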
### 3.5 PKC Key Security
The PKC private key is the crown jewel. Whoever holds it can create signed images that boot on any of your fused devices. Protect it accordingly:
- Store on an air-gapped machine or HSM (Hardware Security Module)
- Never store in git, CI/CD pipelines, or cloud storage
- Limit access to 1-2 people
- Consider splitting with Shamir's Secret Sharing for key ceremonies
---
## 4. Tamper Enclosure
### 4.1 Threat Model for Physical Access
| Attack | Without enclosure | With tamper-responsive enclosure |
| ---------------------------------- | ----------------------------- | ------------------------------------------------ |
| Unscrew case, desolder eMMC/NVMe | Easy (minutes) | Mesh breaks → key destroyed → data irrecoverable |
| Probe DRAM bus with logic analyzer | Moderate (requires soldering) | Case opening triggers zeroization first |
| Cold boot (freeze RAM) | Moderate | Temperature sensor triggers zeroization |
| Connect to board debug headers | Easy | Case must be opened → zeroization |
### 4.2 Option A: Zymkey HSM4 + Custom Enclosure (~$150-250/unit)
**Recommended for initial production runs (up to ~500 units).**
**Bill of Materials:**
| Component | Unit Cost | Source |
| -------------------------------------- | ------------- | ----------------------------- |
| Zymkey HSM4 (I2C security module) | ~$71 | zymbit.com |
| Custom aluminum enclosure | ~$30-80 | CNC shop / Alibaba at volume |
| Flex PCB tamper mesh panels (set of 6) | ~$10-30 | JLCPCB / PCBWay |
| CR2032 coin cell battery | ~$2 | Standard electronics supplier |
| 30 AWG perimeter wire (~2 ft) | ~$1 | Standard electronics supplier |
| Assembly labor + connectors | ~$20-40 | — |
| **Total** | **~$134-224** | — |
**How it works:**
```
┌──────── Aluminum Enclosure ────────┐
│ │
│ All inner walls lined with flex │
│ PCB tamper mesh (conductive traces │
│ in space-filling curve pattern) │
│ │
│ Mesh traces connect to Zymkey │
│ HSM4's 2 perimeter circuits │
│ │
│ ┌───────────┐ ┌───────────────┐ │
│ │ Zymkey │ │ Jetson Orin │ │
│ │ HSM4 │ │ Nano │ │
│ │ │ │ │ │
│ │ I2C ◄─────┤ │ GPIO header │ │
│ │ GPIO4 ◄───┤ │ │ │
│ │ │ │ │ │
│ │ [CR2032] │ │ │ │
│ │ (battery │ │ │ │
│ │ backup) │ │ │ │
│ └───────────┘ └───────────────┘ │
│ │
│ Tamper event (mesh broken, │
│ temperature anomaly, power loss │
│ without battery): │
│ → Zymkey destroys stored keys │
│ → Master encryption key is gone │
│ → Encrypted disk is permanently │
│ unrecoverable │
└─────────────────────────────────────┘
```
**Zymkey HSM4 features:**
- 2 independent perimeter breach detection circuits (connect to mesh)
- Accelerometer (shock/orientation tamper detection)
- Main power monitor
- Battery-backed RTC (36-60 months on CR2032)
- Secure key storage (ECC P-256, AES-256, SHA-256)
- I2C interface (fits Jetson's 40-pin GPIO header)
- Configurable tamper response: notify host, or destroy keys on breach
**Flex PCB tamper mesh design:**
- Use the KiCad anti-tamper mesh plugin to generate space-filling curve trace patterns
- Order from JLCPCB or PCBWay as flex PCBs (~$5-15 per panel)
- Attach to enclosure inner walls with adhesive
- Wire to Zymkey's perimeter circuit connectors (Hirose DF40HC)
- Any cut, drill, or peel that breaks a trace triggers the tamper event
### 4.3 Option B: Full DIY (~$80-150/unit)
**For higher volumes (500+ units) where per-unit cost matters.**
| Component | Unit Cost |
| ------------------------------------------------- | ------------ |
| STM32G4 microcontroller | ~$5 |
| Flex PCB tamper mesh (KiCad plugin) | ~$10-30 |
| Battery-backed SRAM (Cypress CY14B101 or similar) | ~$5 |
| Custom PCB for STM32 monitor circuit | ~$10-20 |
| Aluminum enclosure | ~$30-80 |
| Coin cell + holder | ~$3 |
| **Total** | **~$63-143** |
The STM32G4's high-resolution timer (sub-200ps) enables Time-Domain Reflectometry (TDR) monitoring of the mesh — sending pulses into the trace and detecting echoes when damage occurs. More sensitive than simple resistance monitoring.
The master encryption key is stored in battery-backed SRAM (not in the Jetson's fTPM). On tamper detection, the STM32 cuts power to the SRAM — key vanishes in microseconds.
More engineering effort upfront (firmware for STM32, PCB design, integration testing) but lower per-unit BOM.
### 4.4 Option C: Epoxy Potting (~$30-50/unit)
**Minimum viable physical protection.**
- Encapsulate the entire Jetson board + carrier in hardened epoxy resin
- Physical extraction requires grinding/dissolving the epoxy, which destroys the board and traces
- No active zeroization — if the attacker is patient and skilled enough, they can extract components
- Best combined with Options A or B: epoxy + active tamper mesh
### 4.5 Recommendation
| Production volume | Recommendation | Per-unit cost |
| -------------------- | ----------------------------------------- | ------------- |
| Prototype / first 10 | Option A (Zymkey HSM4) + Option C (epoxy) | ~$180-270 |
| 10-500 units | Option A (Zymkey HSM4) | ~$150-250 |
| 500+ units | Option B (custom STM32) | ~$80-150 |
All options fit within the $300/unit budget.
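The budget claim can be checked against the itemized BOM tables in 4.2 and 4.3. A minimal sketch: each entry is a (low, high) cost pair in USD copied from those tables, with the Option C range taken from its section heading. The itemized sums come out somewhat lower than the headline per-unit figures quoted above.

```python
BUDGET_USD = 300

# (low, high) unit costs in USD, in the order the BOM tables list them.
OPTION_A = [(71, 71), (30, 80), (10, 30), (2, 2), (1, 1), (20, 40)]
OPTION_B = [(5, 5), (10, 30), (5, 5), (10, 20), (30, 80), (3, 3)]
OPTION_C = [(30, 50)]


def total(items: list[tuple[int, int]]) -> tuple[int, int]:
    """Sum the low and high ends of every line item."""
    return sum(low for low, _ in items), sum(high for _, high in items)


def within_budget(items: list[tuple[int, int]], budget: int = BUDGET_USD) -> bool:
    """Compare the worst-case (high) total against the budget."""
    return total(items)[1] <= budget
```

`total(OPTION_A)` gives `(134, 224)`, `total(OPTION_B)` gives `(63, 143)`, and `total(OPTION_A + OPTION_C)` gives `(164, 274)`.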
---
## 5. Simplified Loader Architecture
### 5.1 Current Architecture
```
main.py (FastAPI)
├── POST /login
│ → api_client.pyx: set_credentials, login()
│ → credentials.pyx: email, password
│ → security.pyx: get_hw_hash(hardware_info)
│ → hardware_service.pyx: CPU/GPU/RAM/serial strings
├── POST /load/{filename}
│ → api_client.pyx: load_big_small_resource(filename, folder)
│ 1. Fetch SMALL part from API (POST /resources/get/{folder})
│ → Decrypt with get_api_encryption_key(email+password+hw_hash+salt)
│ 2. Fetch BIG part from CDN (S3 download) or local cache
│ 3. Concatenate small + big
│ 4. Decrypt merged blob with get_resource_encryption_key() (fixed internal string)
│ → Return decrypted bytes
├── POST /upload/{filename}
│ → api_client.pyx: upload_big_small_resource(file, folder)
│ 1. Encrypt full resource with get_resource_encryption_key()
│ 2. Split at min(3KB, 30% of ciphertext)
│ 3. Upload big part to CDN
│ 4. Upload small part to API
└── POST /unlock
→ binary_split.py:
1. download_key_fragment(RESOURCE_API_URL, token) — HTTP GET from API
2. decrypt_archive(images.enc, SHA256(key_fragment)) — AES-CBC stream
3. docker load -i result.tar
```
**Security dependencies in current architecture:**
- `security.pyx`: SHA-384 key derivation from `email + password + hw_hash + salt`
- `hardware_service.pyx`: String-based hardware fingerprint (spoofable)
- `binary_split.py`: Key fragment downloaded from API server
- Split storage: security depends on attacker not having both API and CDN access
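The split rule used by the current upload path, `min(3KB, 30% of ciphertext)`, can be sketched as follows. This assumes "3KB" means 3072 bytes and that the small part is the leading slice of the ciphertext; the function names are hypothetical and do not reflect the actual `api_client.pyx` code.

```python
SMALL_PART_MAX_BYTES = 3 * 1024  # assumption: "3KB" means 3072 bytes
SMALL_PART_MAX_PERCENT = 30


def split_resource(ciphertext: bytes) -> tuple[bytes, bytes]:
    """Split ciphertext into (small, big) at min(3KB, 30% of its length)."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    cut = min(SMALL_PART_MAX_BYTES, len(ciphertext) * SMALL_PART_MAX_PERCENT // 100)
    return ciphertext[:cut], ciphertext[cut:]


def merge_resource(small: bytes, big: bytes) -> bytes:
    """Reverse the split: concatenate small + big before decryption."""
    return small + big
```

For a 25,600-byte ciphertext the 3KB cap applies, so the small part is 3072 bytes. For a 1000-byte ciphertext the 30% rule applies, so the small part is 300 bytes.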
### 5.2 Proposed TPM Architecture
```
main.py (FastAPI) — routes and request/response contracts unchanged
├── POST /login
│ → api_client.pyx: set_credentials, login()
│ → credentials.pyx: email, password (unchanged — still needed for API auth)
│ → security_provider.pyx: auto-detect TPM or legacy
├── POST /load/{filename}
│ → api_client.pyx: load_resource(filename, folder)
│ [TPM path]:
│ 1. Fetch single encrypted file from CDN (S3 download)
│ 2. security_provider.decrypt(data)
│ → tpm_security_provider.pyx: FAPI.unseal() → master key → AES decrypt
│ 3. Return decrypted bytes
│ [Legacy path]:
│ (unchanged — load_big_small_resource as before)
├── POST /upload/{filename}
│ → api_client.pyx: upload_resource(file, folder)
│ [TPM path]:
│ 1. security_provider.encrypt(data)
│ → tpm_security_provider.pyx: AES encrypt with TPM-derived key
│ 2. Upload single file to CDN
│ [Legacy path]:
│ (unchanged — upload_big_small_resource as before)
└── POST /unlock
[TPM path]:
1. security_provider.get_archive_key()
→ tpm_security_provider.pyx: FAPI.unseal() → archive key (no network call)
2. decrypt_archive(images.enc, archive_key)
3. docker load -i result.tar
[Legacy path]:
(unchanged — download_key_fragment from API)
```
### 5.3 SecurityProvider Interface
```python
from abc import ABC, abstractmethod
class SecurityProvider(ABC):
@abstractmethod
def encrypt(self, data: bytes) -> bytes: ...
@abstractmethod
def decrypt(self, data: bytes) -> bytes: ...
@abstractmethod
def get_archive_key(self) -> bytes: ...
```
Two implementations:
- **TpmSecurityProvider**: Calls `tpm2-pytss` FAPI to unseal master key from TPM. Uses master key for AES-256-CBC encrypt/decrypt. Archive key is also TPM-sealed (no network download).
- **LegacySecurityProvider**: Wraps existing `security.pyx` logic unchanged. Key derivation from `email+password+hw_hash+salt`. Archive key downloaded from API.
### 5.4 Auto-Detection Logic
At startup:
```
1. Check env var SECURITY_PROVIDER
→ if "tpm": use TpmSecurityProvider (fail hard if TPM unavailable)
→ if "legacy": use LegacySecurityProvider
→ if unset: auto-detect (step 2)
2. Check os.path.exists("/dev/tpm0")
→ if True: attempt TPM connection via FAPI
→ if success: use TpmSecurityProvider
→ if failure: log warning, fall back to LegacySecurityProvider
→ if False: use LegacySecurityProvider
3. Log which provider was selected and why
```
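The selection steps above can be sketched as a single function. The TPM probe is passed in as a callable so the logic can be tested without hardware. `probe_tpm` is hypothetical: it stands for whatever code opens a FAPI connection and raises on failure. Rejecting unknown `SECURITY_PROVIDER` values is an assumption, since the steps only define `tpm`, `legacy`, and unset.

```python
import logging
import os
from typing import Callable, Mapping, Optional

log = logging.getLogger("security_provider")
TPM_DEVICE = "/dev/tpm0"


def select_provider(
    probe_tpm: Callable[[], None],
    env: Optional[Mapping[str, str]] = None,
    device_exists: Callable[[str], bool] = os.path.exists,
) -> str:
    """Return "tpm" or "legacy" following the startup rules."""
    env = os.environ if env is None else env
    choice = env.get("SECURITY_PROVIDER")

    if choice == "tpm":
        probe_tpm()  # fail hard: any exception propagates to the caller
        log.info("provider=tpm (forced by SECURITY_PROVIDER)")
        return "tpm"
    if choice == "legacy":
        log.info("provider=legacy (forced by SECURITY_PROVIDER)")
        return "legacy"
    if choice:
        # Assumption: unknown values are a configuration error.
        raise ValueError(f"unsupported SECURITY_PROVIDER: {choice!r}")

    if device_exists(TPM_DEVICE):
        try:
            probe_tpm()
        except Exception as exc:
            log.warning("TPM present but unusable (%s); using legacy", exc)
            return "legacy"
        log.info("provider=tpm (auto-detected %s)", TPM_DEVICE)
        return "tpm"

    log.info("provider=legacy (%s not found)", TPM_DEVICE)
    return "legacy"
```

The forced `tpm` path deliberately skips the fallback, so a misconfigured production device fails at startup instead of silently using the weaker scheme.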
### 5.5 What Changes, What Stays
| Component | TPM path | Legacy path | Notes |
| ------------------------------ | --------------------------------- | ------------------------- | ------------------------------------ |
| `main.py` routes | Unchanged | Unchanged | F1-F6 API contract preserved |
| JWT authentication | Unchanged | Unchanged | Still needed for API access |
| CDN download | Single file | Big/small split | CDN still used for bandwidth |
| AES-256-CBC encryption | Unchanged algorithm | Unchanged | Only the key source changes |
| Key source | TPM-sealed master key | SHA-384(email+pw+hw+salt) | Core difference |
| `hardware_service.pyx` | Not used | Used | TPM replaces string fingerprinting |
| `binary_split.py` key download | Eliminated | Used | TPM-sealed key is local |
| `security.pyx` | Wrapped in LegacySecurityProvider | Active | Not deleted — legacy devices need it |
### 5.6 Docker Container Changes
The loader runs in Docker. For TPM access:
```yaml
# docker-compose.yml additions for TPM path
services:
loader:
devices:
- /dev/tpm0:/dev/tpm0
- /dev/tpmrm0:/dev/tpmrm0
environment:
- SECURITY_PROVIDER=tpm # or leave unset for auto-detect
```
No `--privileged` flag needed. Device mounts are sufficient.
Container image needs additional packages:
- `tpm2-tss` (native library, >= 2.4.0)
- `tpm2-pytss` (Python bindings from PyPI)
- FAPI configuration file (`/etc/tpm2-tss/fapi-config.json`)
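A container entrypoint could confirm these prerequisites before the loader starts. A minimal sketch, assuming the three paths named above are the only file-level requirements; the function name is hypothetical, and the check does not verify that the `tpm2-tss` library itself is installed.

```python
import os
from typing import Callable, Sequence

REQUIRED_PATHS = (
    "/dev/tpm0",
    "/dev/tpmrm0",
    "/etc/tpm2-tss/fapi-config.json",
)


def missing_tpm_prerequisites(
    paths: Sequence[str] = REQUIRED_PATHS,
    exists: Callable[[str], bool] = os.path.exists,
) -> list[str]:
    """Return the required paths that are not visible in this container."""
    return [path for path in paths if not exists(path)]
```

An empty list means the device mounts and FAPI config are in place. A non-empty list names exactly what the compose file or image is missing.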
---
## 6. Implementation Phases
### Phase 0: Preparation (1 week)
| Task | Details |
| ------------------------ | ------------------------------------------------------------------------------------------------ |
| Order hardware | Second Jetson Orin Nano dev kit (expendable for fusing experiments) |
| Order Zymkey HSM4 | For tamper enclosure evaluation |
| Download NVIDIA packages | BSP (`jetson_linux_*_aarch64.tbz2`), sample rootfs, public sources, FSKP partner package |
| Set up host | Ubuntu 22.04 LTS on x86 machine, install `libftdi-dev`, `openssh-server`, `python3-cryptography` |
| Study NVIDIA docs | `r36.4.3` Security section: Secure Boot, Disk Encryption, Firmware TPM, FSKP |
### Phase 1: Secure Boot + Disk Encryption (1-2 weeks)
| Task | Details | Validation |
| ----------------------------- | ---------------------------------------------------------- | ------------------------------------------------ |
| Generate PKC + SBK keys | `openssl genrsa` + `gen_sbk_key.py` | Keys exist, correct format |
| Dry-run fuse burning | `odmfuse.sh --test` on expendable dev board | No errors, fuse values logged |
| Burn Secure Boot fuses | `odmfuse.sh` for real (PKC, SBK, SECURITY_MODE) | Device only boots signed images |
| Generate disk encryption keys | `gen_ekb/example.sh` | `sym2_t234.key` + `eks_t234.img` with EEKB magic |
| Flash encrypted rootfs | `ROOTFS_ENC=1 l4t_initrd_flash.sh` | Device boots, `lsblk` shows LUKS partition |
| Validate Secure Boot | Attempt to flash unsigned image → must fail | Unsigned flash rejected |
| Validate disk encryption | Remove NVMe, mount on another machine → must be ciphertext | Cannot read filesystem |
### Phase 2: fTPM Provisioning (1-2 weeks)
| Task | Details | Validation |
| ----------------------------------------- | ---------------------------------------- | ------------------------------------------- |
| Generate KDK0 + Silicon_ID | `kdk_gen.py` per device | KDK_DB populated |
| Generate fuseblob | `fskp_fuseburn.py` | Signed fuseblob files |
| Generate fTPM EKB | `odm_ekb_gen.py` + `oem_ekb_gen.py` | Per-device EKB images |
| Burn fTPM fuses | `odmfuse.sh` with KDK0 fuses | Fuses burned |
| Flash with fTPM EKB | `flash.sh` with EKB | Device boots with fTPM |
| On-device provisioning | `ftpm_provisioning.sh` | EK certificates in NV memory |
| Validate fTPM | `tpm2_getcap properties-fixed` | Shows manufacturer, firmware version |
| Test seal/unseal | `tpm2_create` + `tpm2_unseal` round-trip | Data sealed → unsealed correctly |
| Test seal on device A, unseal on device B | Copy sealed blob between devices | Unseal fails on device B (correct behavior) |
### Phase 3: OS Hardening (1 week)
| Task | Details | Validation |
| ---------------------------- | --------------------------------------------------------------- | --------------------------------------- |
| Create dev image recipe | SSH (key-only), serial console, ptrace allowed, debug tools | Can SSH in, run gdb |
| Create prod image recipe | No SSH, no serial, no ptrace, no shell, no desktop | No interactive access possible |
| Kernel config: lockdown mode | `CONFIG_SECURITY_LOCKDOWN_LSM=y`, `lockdown=confidentiality` | `/dev/mem` access denied, kexec blocked |
| Kernel config: disable debug | `CONFIG_STRICT_DEVMEM=y`, no `/dev/kmem` | Cannot read physical memory |
| Sysctl hardening             | `kernel.yama.ptrace_scope=3`, `kernel.core_pattern=\|/bin/false` | ptrace attach fails, no core dumps      |
| Disable serial console | Remove `console=ttyTCU0` from kernel cmdline | No output on serial |
| Disable getty | Mask `getty@.service`, `serial-getty@.service` | No login prompt on any TTY |
| Sign both images | `flash.sh -u pkc.pem` for dev and prod images | Both boot on fused device |
| Validate prod image | Plug in keyboard, monitor, USB, Ethernet → no access | Device is a black box |
| Validate dev image | Flash dev image → SSH works | Can debug on fused device |
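Some of the validation column can be automated on the device. A minimal sketch that reads the standard procfs and securityfs entries for the sysctl and lockdown settings listed above. `check_hardening` is a hypothetical helper, and the `root` parameter exists only so the check can run against a fake tree in tests.

```python
from pathlib import Path


def _read(path: Path) -> str:
    """Return stripped file contents, or an empty string if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return ""


def check_hardening(root: str = "/") -> dict[str, bool]:
    """Report whether each hardening setting has its expected value."""
    base = Path(root)
    ptrace_scope = _read(base / "proc/sys/kernel/yama/ptrace_scope")
    core_pattern = _read(base / "proc/sys/kernel/core_pattern")
    # The lockdown file lists all modes and brackets the active one.
    lockdown = _read(base / "sys/kernel/security/lockdown")
    return {
        "ptrace_scope_3": ptrace_scope == "3",
        "core_dumps_piped_to_false": core_pattern == "|/bin/false",
        "lockdown_confidentiality": "[confidentiality]" in lockdown,
    }
```

Because unreadable files count as failures, a prod image that hides one of these entries is reported as not hardened, not silently passed.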
### Phase 4: Loader Code Changes (2-3 weeks)
| Task | Details | Tests |
| -------------------------------------------- | ---------------------------------------------------- | ------------------------------------------------ |
| Add `tpm2-tss`, `tpm2-pytss` to requirements | Match versions available in Jetson BSP | Imports work |
| Add `swtpm` to dev dependencies | TPM simulator for CI/testing | Simulator starts, `/dev/tpm0` available |
| Implement `SecurityProvider` ABC | `security_provider.pxd` + `.pyx` | Interface compiles |
| Implement `TpmSecurityProvider` | FAPI `create_seal`, `unseal`, AES encrypt/decrypt | Seal/unseal round-trip with swtpm |
| Implement `LegacySecurityProvider` | Wrap existing `security.pyx` | All existing tests pass unchanged |
| Add auto-detection logic | `/dev/tpm0` check + env var override | Correct provider selected in both cases |
| Refactor `load_resource` (TPM path) | Single file download + TPM decrypt | Download → decrypt → correct bytes |
| Refactor `upload_resource` (TPM path) | TPM encrypt + single file upload | Encrypt → upload → download → decrypt round-trip |
| Refactor Docker unlock (TPM path) | TPM unseal archive key, no API download | Unlock works without network key fragment |
| Update `docker-compose.yml` | Add `/dev/tpm0`, `/dev/tpmrm0` device mounts | Container can access TPM |
| Update `Dockerfile` | Install `tpm2-tss` native lib + `tpm2-pytss` | Build succeeds |
| Integration tests | Full flow with swtpm: login → load → upload → unlock | All paths work |
| Legacy regression tests | All existing e2e tests pass without TPM | No regression |
### Phase 5: Tamper Enclosure (2-4 weeks, parallel with Phase 4)
| Task | Details | Validation |
| ------------------------- | --------------------------------------------------------------- | --------------------------- |
| Evaluate Zymkey HSM4 | Connect to Orin Nano GPIO header, test I2C communication | Zymkey detected, LED blinks |
| Test perimeter circuits | Wire perimeter inputs, break wire → verify detection | Tamper event logged |
| Test key zeroization | Enable production mode, trigger tamper → verify key destruction | Key gone, device bricked |
| Design tamper mesh panels | KiCad anti-tamper mesh plugin, space-filling curves | Gerber files ready |
| Order flex PCBs | JLCPCB or PCBWay | Panels received |
| Design/source enclosure | Aluminum case, dimensions for Jetson + Zymkey + mesh panels | Enclosure received |
| Assemble prototype | Mount boards, wire mesh to Zymkey perimeter circuits | Physical prototype complete |
| Test tamper scenarios | Open case, drill, probe → all trigger zeroization | All breach paths detected |
| Temperature test | Cool enclosure below threshold → verify trigger | Cold boot attack prevented |
### Phase 6: Integration Testing (1-2 weeks)
| Test Scenario | Expected Result |
| --------------------------------------------------------------------------------- | -------------------------------------------------------- |
| Full stack: fused device + encrypted disk + fTPM + hardened OS + tamper enclosure | Device boots, runs inference, all security layers active |
| Attempt USB boot | Rejected (Secure Boot) |
| Attempt JTAG | No response (fused off) |
| Attempt SSH on prod image | Connection refused (no sshd) |
| Attempt serial console | No output |
| Remove NVMe, read on another machine | Ciphertext only |
| Copy sealed blob to different device | Unseal fails |
| Open tamper enclosure | Keys destroyed, device permanently bricked |
| Legacy device (no TPM) loads resources | Works via LegacySecurityProvider |
| Fused device loads resources | Works via TpmSecurityProvider |
| Docker unlock on TPM device | Works without network key download |
| Docker unlock on legacy device | Works via API key fragment (unchanged) |
### Timeline Summary
```
Week 1 Phase 0: Preparation (order hardware, download BSP)
Week 2-3 Phase 1: Secure Boot + Disk Encryption
Week 4-5 Phase 2: fTPM Provisioning
Week 6 Phase 3: OS Hardening
Week 7-9 Phase 4: Loader Code Changes
Week 7-10 Phase 5: Tamper Enclosure (parallel with Phase 4)
Week 11-12 Phase 6: Integration Testing
```
Total estimated duration: **10-12 weeks** (Phases 4 and 5 overlap).
---
## References
- NVIDIA Jetson Linux Developer Guide r36.4.3 — Firmware TPM: [https://docs.nvidia.com/jetson/archives/r36.4.3/DeveloperGuide/SD/Security/FirmwareTPM.html](https://docs.nvidia.com/jetson/archives/r36.4.3/DeveloperGuide/SD/Security/FirmwareTPM.html)
- NVIDIA Jetson Linux Developer Guide — Secure Boot: [https://docs.nvidia.com/jetson/archives/r36.2/DeveloperGuide/SD/Security/SecureBoot.html](https://docs.nvidia.com/jetson/archives/r36.2/DeveloperGuide/SD/Security/SecureBoot.html)
- NVIDIA Jetson Linux Developer Guide — Disk Encryption: [https://docs.nvidia.com/jetson/archives/r38.2.1/DeveloperGuide/SD/Security/DiskEncryption.html](https://docs.nvidia.com/jetson/archives/r38.2.1/DeveloperGuide/SD/Security/DiskEncryption.html)
- NVIDIA Jetson Linux Developer Guide — FSKP: [https://docs.nvidia.com/jetson/archives/r38.4/DeveloperGuide/SD/Security/FSKP.html](https://docs.nvidia.com/jetson/archives/r38.4/DeveloperGuide/SD/Security/FSKP.html)
- tpm2-pytss: [https://github.com/tpm2-software/tpm2-pytss](https://github.com/tpm2-software/tpm2-pytss)
- tpm2-pytss FAPI docs: [https://tpm2-pytss.readthedocs.io/en/latest/fapi.html](https://tpm2-pytss.readthedocs.io/en/latest/fapi.html)
- Zymbit HSM4: [https://www.zymbit.com/HSM4/](https://www.zymbit.com/HSM4/)
- Zymbit HSM4 perimeter detect: [https://docs.zymbit.com/tutorials/perimeter-detect/hsm4](https://docs.zymbit.com/tutorials/perimeter-detect/hsm4)
- KiCad anti-tamper mesh plugin: [https://hackaday.com/2021/03/14/an-anti-tamper-mesh-plugin-for-kicad/](https://hackaday.com/2021/03/14/an-anti-tamper-mesh-plugin-for-kicad/)
- Microchip PolarFire security mesh: [https://www.microchip.com/en-us/about/media-center/blog/2026/security-mesh-distributed-defense-across-your-design](https://www.microchip.com/en-us/about/media-center/blog/2026/security-mesh-distributed-defense-across-your-design)
- DoD GUARD Secure GPU Module: [https://www.cto.mil/wp-content/uploads/2025/04/Secure-Edge.pdf](https://www.cto.mil/wp-content/uploads/2025/04/Secure-Edge.pdf)
- Forecr MILBOX-ORNX (rugged enclosure): [https://forecr.io/products/jetson-orin-nx-orin-nano-rugged-compact-pc-milbox-ornx](https://forecr.io/products/jetson-orin-nx-orin-nano-rugged-compact-pc-milbox-ornx)
## Related Artifacts
- Solution Draft 01: `_docs/02_task_plans/tpm-replaces-binary-split/01_solution/solution_draft01.md`
- Security Analysis: `_docs/02_task_plans/tpm-replaces-binary-split/01_solution/security_analysis.md`
- Fact Cards: `_docs/02_task_plans/tpm-replaces-binary-split/00_research/02_fact_cards.md`
- Reasoning Chain: `_docs/02_task_plans/tpm-replaces-binary-split/00_research/04_reasoning_chain.md`
- Problem Statement: `_docs/02_task_plans/tpm-replaces-binary-split/problem.md`
@@ -0,0 +1,39 @@
# Problem: TPM-Based Security to Replace Binary-Split Resource Scheme
## Context
The Azaion Loader uses a binary-split resource scheme (ADR-002) where encrypted resources are split into a small part (uploaded to the authenticated API) and a large part (uploaded to CDN). Decryption requires both parts. This was designed for distributing AI models to **end-user laptops** where the device is untrusted — the loader shipped 99% of the model in the installer, and the remaining 1% (first 3KB) was downloaded at runtime to prevent extraction.
The distribution model has shifted to **SaaS** — services now run on web servers or **Jetson Orin Nano** edge devices. The Jetson Orin Nano includes a **TPM (Trusted Platform Module)** that can provide hardware-rooted security, potentially making the binary-split mechanism unnecessary overhead.
## Current Security Architecture
- **Binary-split scheme**: Resources encrypted with AES-256-CBC, split into small (≤3KB or 30%) + big parts, stored on separate servers (API + CDN)
- **Key derivation**: SHA-384 hashes combining email, password, hardware fingerprint, and salt
- **Docker unlock**: Key fragment downloaded from API, used to decrypt encrypted Docker image archive
- **Hardware binding**: SHA-384 hash of hardware fingerprint ties decryption to specific hardware
- **Cython compilation**: Core modules compiled to .so for IP protection
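The key derivation and split steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the Loader's actual code: the `|` separator, the input order, the truncation of the SHA-384 digest to 32 bytes, and the reading of "≤3KB or 30%" as "the smaller of 3 KB and 30% of the payload" are all assumptions, and `derive_key` and `split_resource` are hypothetical names.

```python
import hashlib

SMALL_PART_MAX = 3 * 1024  # 3 KB cap taken from the scheme description


def derive_key(email: str, password: str, hw_fingerprint: str, salt: str) -> bytes:
    # Assumption: inputs are joined with "|" in this order before hashing.
    material = "|".join([email, password, hw_fingerprint, salt]).encode("utf-8")
    digest = hashlib.sha384(material).digest()  # SHA-384 yields 48 bytes
    # Assumption: the first 32 bytes serve as the AES-256 key.
    return digest[:32]


def split_resource(blob: bytes) -> tuple[bytes, bytes]:
    # Assumption: the small part is the smaller of 3 KB and 30% of the blob.
    small_len = min(SMALL_PART_MAX, (len(blob) * 30) // 100)
    return blob[:small_len], blob[small_len:]
```

Because the hardware fingerprint is one of the hash inputs, the same credentials on different hardware produce a different key, which is the hardware binding described above.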
## Questions to Investigate
1. **TPM capabilities on Jetson Orin Nano**: What TPM version is available? What crypto operations does it support (key sealing, attestation, secure storage)? How does NVIDIA's security stack integrate with standard TPM APIs?
2. **TPM-based key management**: Can TPM replace the current key derivation scheme (SHA-384 of email+password+hw_hash+salt)? Can keys be sealed to TPM PCR values so they're only accessible on the intended device?
3. **Eliminating binary-split**: If TPM provides hardware-rooted trust (device can prove it's authentic), is the split-storage security model still necessary? Could the loader become a standard authenticated resource downloader with TPM-backed decryption?
4. **Docker image protection**: Can TPM-based disk encryption or sealed storage replace the current encrypted-archive-plus-key-fragment approach for Docker images?
5. **Migration path**: How would the transition work for existing deployments? Can both models (binary-split for legacy, TPM for new) coexist?
6. **Threat model comparison**: What threats does binary-split protect against that TPM doesn't (and vice versa)? Are there attack vectors specific to Jetson Orin Nano that need consideration?
7. **Implementation complexity**: What libraries/tools are available for TPM on ARM64/Jetson (tpm2-tools, tpm2-pytss, etc.)? What's the integration effort?
## Constraints
- Must support ARM64 (Jetson Orin Nano specifically)
- Must work within Docker containers (loader runs as a container with Docker socket mount)
- Cannot break existing API contracts (F1-F6 flows)
- Cython compilation requirement remains for IP protection
- Need to consider both SaaS web server and Jetson edge device deployments
@@ -2,9 +2,9 @@
**Task**: AZ-187_device_provisioning_script
**Name**: Device Provisioning Script
**Description**: Create a shell script that provisions a Jetson device identity (CompanionPC user) during the fuse/flash pipeline
**Description**: Interactive shell script that provisions Jetson device identities (CompanionPC users) during the fuse/flash pipeline
**Complexity**: 2 points
**Dependencies**: None
**Dependencies**: AZ-196 (POST /devices endpoint)
**Component**: DevOps
**Tracker**: AZ-187
**Epic**: AZ-181
@@ -15,48 +15,47 @@ Each Jetson needs a unique CompanionPC user account for API authentication. This
## Outcome
- Single script creates device identity and embeds credentials in the rootfs
- Integrates into the fuse/flash pipeline between odmfuse.sh and flash.sh
- Interactive `provision_devices.sh` detects connected Jetsons, registers their identities via the admin API, and runs the fuse/flash pipeline
- Serial numbers are auto-assigned server-side (azj-0000, azj-0001, ...)
- Provisioning runbook documents the full end-to-end flow
## Scope
### Included
- provision_device.sh: generate device email (azaion-jetson-{serial}@azaion.com), random 32-char password
- Call admin API POST /users to create Users row with Role=CompanionPC
- Write credentials config file to rootfs image (at known path, e.g., /etc/azaion/device.conf)
- Idempotency: re-running for same serial doesn't create duplicate user
- Provisioning runbook: step-by-step from unboxing through fusing, flashing, and first boot
- `provision_devices.sh`: scan USB for Jetsons in recovery mode, interactive device selection, call admin API `POST /devices` for auto-generated serial/email/password, write credentials to rootfs, fuse, flash
- Configuration via `scripts/.env` (git-ignored), template at `scripts/.env.example`
- Dependency checks at startup (lsusb, curl, jq, L4T tools, sudo)
- Provisioning runbook: step-by-step for multi-device manufacturing flow
### Excluded
- fTPM provisioning (covered by NVIDIA's ftpm_provisioning.sh)
- Secure Boot fusing (covered by solution_draft02 Phase 1-2)
- OS hardening (covered by solution_draft02 Phase 3)
- Admin API user creation endpoint (assumed to exist)
- Admin API POST /devices endpoint implementation (AZ-196)
## Acceptance Criteria
**AC-1: Script creates CompanionPC user**
Given a new device serial AZJN-0042
When provision_device.sh is run with serial AZJN-0042
Then admin API has a new user azaion-jetson-0042@azaion.com with Role=CompanionPC
**AC-1: Script registers device via POST /devices**
Given the admin API has the POST /devices endpoint deployed
When provision_devices.sh is run and a device is selected
Then the admin API creates a new user with auto-assigned serial (e.g. azj-0000) and Role=CompanionPC
**AC-2: Credentials written to rootfs**
Given provision_device.sh completed successfully
When the rootfs image is inspected
Then /etc/azaion/device.conf contains the email and password
Given POST /devices returned serial, email, and password
When the provisioning step completes for a device
Then `$ROOTFS_DIR/etc/azaion/device.conf` contains the email and password with mode 600
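As an illustration of AC-2, the following Python sketch writes the credentials file with mode 600 under a staging rootfs. The key names `AZAION_DEVICE_EMAIL` and `AZAION_DEVICE_PASSWORD` come from the earlier `provision_device.sh`; the function name `write_device_conf` is hypothetical, and the real script is a shell script, so this only mirrors the intended behavior.

```python
import os
from pathlib import Path


def write_device_conf(rootfs_dir: Path, email: str, password: str) -> Path:
    conf_dir = rootfs_dir / "etc" / "azaion"
    conf_dir.mkdir(parents=True, exist_ok=True)
    conf_path = conf_dir / "device.conf"
    # Create the file with 0600 from the start so the password is never
    # readable by other users, even briefly.
    fd = os.open(conf_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(f"AZAION_DEVICE_EMAIL={email}\n")
        fh.write(f"AZAION_DEVICE_PASSWORD={password}\n")
    # Tighten permissions again in case the file already existed with a wider mode.
    os.chmod(conf_path, 0o600)
    return conf_path
```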
**AC-3: Device can log in after flash**
Given a provisioned and flashed device boots for the first time
When the loader reads /etc/azaion/device.conf and calls POST /login
Then a valid JWT is returned
**AC-4: Idempotent re-run**
Given provision_device.sh was already run for serial AZJN-0042
When it is run again for the same serial
Then no duplicate user is created (existing user is reused or updated)
**AC-4: Multi-device support**
Given multiple Jetsons connected in recovery mode
When provision_devices.sh is run
Then the user can select individual devices or all, and each is provisioned sequentially
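The device selection in AC-4 could work along these lines. The prompt format (the word `all`, or 1-based indices separated by spaces or commas) and the function name `parse_selection` are assumptions made for this sketch; the task description does not specify them.

```python
def parse_selection(answer: str, device_count: int) -> list[int]:
    """Turn the operator's answer into zero-based device indices, in order."""
    answer = answer.strip().lower()
    if answer == "all":
        return list(range(device_count))
    indices: list[int] = []
    for token in answer.replace(",", " ").split():
        idx = int(token) - 1  # the prompt is assumed to number devices from 1
        if not 0 <= idx < device_count:
            raise ValueError(f"device {token} is out of range 1..{device_count}")
        if idx not in indices:  # ignore repeated entries
            indices.append(idx)
    return indices
```

Returning the indices in the order given keeps provisioning sequential and predictable, which matches the "each is provisioned sequentially" requirement.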
**AC-5: Runbook complete**
Given the provisioning runbook
When followed step-by-step on a new Jetson Orin Nano
Then the device is fully fused, flashed, provisioned, and can communicate with the admin API
When followed step-by-step on new Jetson Orin Nano devices
Then the devices are fully fused, flashed, provisioned, and can communicate with the admin API
@@ -1,66 +0,0 @@
# Resources Table & Update Check API
**Task**: AZ-183_resources_table_update_api
**Name**: Resources Table & Update Check API
**Description**: Add Resources table to admin API PostgreSQL DB and implement POST /get-update endpoint for fleet OTA updates
**Complexity**: 3 points
**Dependencies**: None
**Component**: Admin API
**Tracker**: AZ-183
**Epic**: AZ-181
## Problem
The fleet update system needs a server-side component that tracks published artifact versions and tells devices what needs updating. CI/CD publishes encrypted artifacts to CDN; the server must store metadata (version, URL, hash, encryption key) and serve it to devices on request.
## Outcome
- Resources table stores per-artifact metadata populated by CI/CD
- Devices call POST /get-update with their current versions and get back only what's newer
- Server-side memory cache handles 2000+ devices polling every 5 minutes without DB pressure
## Scope
### Included
- Resources table migration (resource_name, dev_stage, architecture, version, cdn_url, sha256, encryption_key, size_bytes, created_at)
- POST /get-update endpoint: accepts device's current versions + architecture + dev_stage, returns only newer resources
- Server-side memory cache invalidated on CI/CD publish
- Internal endpoint or direct DB write for CI/CD to publish new resource versions
### Excluded
- CI/CD pipeline changes (AZ-186)
- Loader-side update logic (AZ-185)
- Device provisioning (AZ-187)
## Acceptance Criteria
**AC-1: Resources table created**
Given the admin API database
When the migration runs
Then the Resources table exists with all required columns
**AC-2: Update check returns newer resources**
Given Resources table has annotations version 2026-04-13
When device sends POST /get-update with annotations version 2026-02-25
Then response includes annotations with version, cdn_url, sha256, encryption_key, size_bytes
**AC-3: Current device gets empty response**
Given device already has the latest version of all resources
When POST /get-update is called
Then response is an empty array
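AC-2 and AC-3 amount to a per-resource version comparison. The sketch below shows that logic in Python for illustration only, since the admin API itself is built on linq2db. It assumes versions are ISO dates (`YYYY-MM-DD`, as in the example above) so that plain string comparison orders them, and it assumes a resource the device has never reported is treated as needing an update. `newer_resources` is a hypothetical name.

```python
def newer_resources(latest: dict[str, dict], device_versions: dict[str, str]) -> list[dict]:
    """Return metadata only for resources newer than what the device reports."""
    updates = []
    for name, meta in latest.items():
        current = device_versions.get(name)
        # Assumption: ISO date strings compare correctly as plain strings.
        if current is None or meta["version"] > current:
            updates.append({"resource_name": name, **meta})
    return updates
```

A device that is already current gets an empty list, which corresponds to the empty array in AC-3.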
**AC-4: Memory cache avoids repeated DB queries**
Given 2000 devices polling every 5 minutes
When POST /get-update is called repeatedly
Then the latest versions are served from memory cache, not from DB on every request
**AC-5: Cache invalidated on publish**
Given a new resource version is published via CI/CD
When the publish endpoint/function completes
Then the next POST /get-update call returns the new version
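The caching behavior in AC-4 and AC-5 can be sketched as a lazily loaded value that is dropped on publish. This is a Python illustration under assumptions: `LatestVersionsCache` and its `loader` callback are hypothetical names, and the real implementation would live in the admin API.

```python
import threading
from typing import Callable, Optional


class LatestVersionsCache:
    """Holds the latest resource versions in memory until invalidated."""

    def __init__(self, loader: Callable[[], dict]):
        self._loader = loader  # stands in for the DB query
        self._lock = threading.Lock()
        self._value: Optional[dict] = None

    def get(self) -> dict:
        with self._lock:
            if self._value is None:
                # Only the first call after an invalidation reaches the DB.
                self._value = self._loader()
            return self._value

    def invalidate(self) -> None:
        # Called by the publish path so the next get() reloads fresh data.
        with self._lock:
            self._value = None
```

With 2000 devices polling every 5 minutes, this reduces DB reads to one per publish rather than one per request.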
## Constraints
- Must integrate with existing admin API (linq2db + PostgreSQL)
- encryption_key column must be stored securely (encrypted at rest in DB or via application-level encryption)
- Response must include encryption_key only over HTTPS with valid JWT
@@ -12,7 +12,7 @@ Implemented the loader's security modernization features across 2 batches:
### Batch 1 (10 points)
- **AZ-182** TPM Security Provider — SecurityProvider ABC with TPM/legacy detection, FAPI seal/unseal, graceful fallback
- **AZ-184** Resumable Download Manager — HTTP Range resume, SHA-256 verify, AES-256 decrypt, exponential backoff
- **AZ-187** Device Provisioning Script — provision_device.sh + runbook
- **AZ-187** Device Provisioning Script — provision_devices.sh + runbook
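For readers unfamiliar with the AZ-184 building blocks, the three mechanisms named above (HTTP Range resume, SHA-256 verification, exponential backoff) can be sketched as small helpers. The function names, the 1-second base, and the 60-second cap are assumptions for illustration and are not taken from the Download Manager itself.

```python
import hashlib


def range_header(bytes_on_disk: int) -> dict[str, str]:
    # Ask the server to continue from the first byte that is still missing.
    return {"Range": f"bytes={bytes_on_disk}-"} if bytes_on_disk > 0 else {}


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    # The delay doubles on every retry and is capped so waits stay bounded.
    return [min(cap, base * 2 ** i) for i in range(attempts)]


def sha256_matches(data: bytes, expected_hex: str) -> bool:
    # Compare the digest of the downloaded bytes with the published hash.
    return hashlib.sha256(data).hexdigest() == expected_hex.lower()
```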
### Batch 2 (8 points)
- **AZ-185** Update Manager — background update loop, version collector, model + Docker image apply, self-update last
-111
@@ -1,111 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
SERIAL=""
API_URL=""
ROOTFS_DIR=""
usage() {
    echo "Usage: provision_device.sh --serial <SERIAL> --api-url <ADMIN_API_BASE_URL> --rootfs-dir <STAGING_ROOTFS_PATH>" >&2
}
while [[ $# -gt 0 ]]; do
    case "$1" in
        --serial)
            SERIAL="${2:-}"
            shift 2
            ;;
        --api-url)
            API_URL="${2:-}"
            shift 2
            ;;
        --rootfs-dir)
            ROOTFS_DIR="${2:-}"
            shift 2
            ;;
        --help|-h)
            usage
            exit 0
            ;;
        *)
            echo "Unknown option: $1" >&2
            usage
            exit 1
            ;;
    esac
done
if [[ -z "$SERIAL" || -z "$API_URL" || -z "$ROOTFS_DIR" ]]; then
    echo "Missing required arguments." >&2
    usage
    exit 1
fi
API_URL="${API_URL%/}"
normalize_serial_suffix() {
    local s
    s="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
    if [[ "$s" == *-* ]]; then
        printf '%s' "${s##*-}"
    else
        printf '%s' "${s//-/}"
    fi
}
EMAIL_SUFFIX="$(normalize_serial_suffix "$SERIAL")"
EMAIL="azaion-jetson-${EMAIL_SUFFIX}@azaion.com"
PASSWORD="$(openssl rand -hex 16)"
echo "Provisioning device identity for serial: $SERIAL"
echo "Target admin API: $API_URL"
echo "Device email: $EMAIL"
build_post_json() {
    python3 -c 'import json,sys; print(json.dumps({"email":sys.argv[1],"password":sys.argv[2],"role":"CompanionPC"}))' "$1" "$2"
}
POST_JSON="$(build_post_json "$EMAIL" "$PASSWORD")"
TMP_BODY="$(mktemp)"
trap 'rm -f "$TMP_BODY"' EXIT
HTTP_CODE="$(
    curl -sS -o "$TMP_BODY" -w "%{http_code}" \
        -X POST "${API_URL}/users" \
        -H "Content-Type: application/json" \
        -d "$POST_JSON"
)"
if [[ "$HTTP_CODE" == "409" ]]; then
    echo "User already exists; updating password for re-provision"
    PATCH_JSON="$(build_post_json "$EMAIL" "$PASSWORD")"
    HTTP_CODE="$(
        curl -sS -o "$TMP_BODY" -w "%{http_code}" \
            -X PATCH "${API_URL}/users/password" \
            -H "Content-Type: application/json" \
            -d "$PATCH_JSON"
    )"
fi
if [[ "$HTTP_CODE" != "200" && "$HTTP_CODE" != "201" ]]; then
    echo "Admin API error HTTP $HTTP_CODE" >&2
    cat "$TMP_BODY" >&2
    echo >&2
    exit 1
fi
CONF_DIR="${ROOTFS_DIR}/etc/azaion"
mkdir -p "$CONF_DIR"
CONF_PATH="${CONF_DIR}/device.conf"
{
    printf 'AZAION_DEVICE_EMAIL=%s\n' "$EMAIL"
    printf 'AZAION_DEVICE_PASSWORD=%s\n' "$PASSWORD"
} > "$CONF_PATH"
chmod 600 "$CONF_PATH"
echo "Wrote $CONF_PATH"
echo "Provisioning finished successfully"
-224
@@ -1,224 +0,0 @@
import json
import subprocess
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
from urllib.parse import urlparse
import pytest
import requests
REPO_ROOT = Path(__file__).resolve().parents[1]
PROVISION_SCRIPT = REPO_ROOT / "scripts" / "provision_device.sh"
class _ProvisionTestState:
    lock = threading.Lock()
    users: dict[str, dict] = {}


def _read_json_body(handler: BaseHTTPRequestHandler) -> dict:
    length = int(handler.headers.get("Content-Length", "0"))
    raw = handler.rfile.read(length) if length else b"{}"
    return json.loads(raw.decode("utf-8"))


def _send_json(handler: BaseHTTPRequestHandler, code: int, payload: dict | None = None):
    body = b""
    if payload is not None:
        body = json.dumps(payload).encode("utf-8")
    handler.send_response(code)
    handler.send_header("Content-Type", "application/json")
    handler.send_header("Content-Length", str(len(body)))
    handler.end_headers()
    if body:
        handler.wfile.write(body)
class _AdminMockHandler(BaseHTTPRequestHandler):
    def log_message(self, _format, *_args):
        return

    def do_POST(self):
        parsed = urlparse(self.path)
        if parsed.path != "/users":
            self.send_error(404)
            return
        body = _read_json_body(self)
        email = body.get("email", "")
        password = body.get("password", "")
        role = body.get("role", "")
        with _ProvisionTestState.lock:
            if email in _ProvisionTestState.users:
                _send_json(self, 409, {"detail": "exists"})
                return
            _ProvisionTestState.users[email] = {"password": password, "role": role}
        _send_json(self, 201, {"email": email, "role": role})

    def do_PATCH(self):
        parsed = urlparse(self.path)
        if parsed.path != "/users/password":
            self.send_error(404)
            return
        body = _read_json_body(self)
        email = body.get("email", "")
        password = body.get("password", "")
        with _ProvisionTestState.lock:
            if email not in _ProvisionTestState.users:
                self.send_error(404)
                return
            _ProvisionTestState.users[email]["password"] = password
        _send_json(self, 200, {"status": "ok"})

    def handle_login_post(self):
        body = _read_json_body(self)
        email = body.get("email", "")
        password = body.get("password", "")
        with _ProvisionTestState.lock:
            row = _ProvisionTestState.users.get(email)
        if not row or row["password"] != password or row["role"] != "CompanionPC":
            _send_json(self, 401, {"detail": "invalid"})
            return
        _send_json(self, 200, {"token": "provision-test-jwt"})


def _handler_factory():
    class H(_AdminMockHandler):
        def do_POST(self):
            parsed = urlparse(self.path)
            if parsed.path == "/login":
                self.handle_login_post()
                return
            super().do_POST()

    return H
@pytest.fixture
def mock_admin_server():
    # Arrange
    with _ProvisionTestState.lock:
        _ProvisionTestState.users.clear()
    server = HTTPServer(("127.0.0.1", 0), _handler_factory())
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    host, port = server.server_address
    base = f"http://{host}:{port}"
    yield base
    server.shutdown()
    server.server_close()
    thread.join(timeout=5)


def _run_provision(serial: str, api_url: str, rootfs: Path) -> subprocess.CompletedProcess:
    return subprocess.run(
        [str(PROVISION_SCRIPT), "--serial", serial, "--api-url", api_url, "--rootfs-dir", str(rootfs)],
        capture_output=True,
        text=True,
        check=False,
    )


def _parse_device_conf(path: Path) -> dict[str, str]:
    out: dict[str, str] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        if "=" not in line:
            continue
        key, _, val = line.partition("=")
        out[key.strip()] = val.strip()
    return out
def test_provision_creates_companionpc_user(mock_admin_server, tmp_path):
    # Arrange
    rootfs = tmp_path / "rootfs"
    serial = "AZJN-0042"
    expected_email = "azaion-jetson-0042@azaion.com"
    # Act
    result = _run_provision(serial, mock_admin_server, rootfs)
    # Assert
    assert result.returncode == 0, result.stderr + result.stdout
    with _ProvisionTestState.lock:
        row = _ProvisionTestState.users.get(expected_email)
    assert row is not None
    assert row["role"] == "CompanionPC"
    assert len(row["password"]) == 32


def test_provision_writes_device_conf(mock_admin_server, tmp_path):
    # Arrange
    rootfs = tmp_path / "rootfs"
    serial = "AZJN-0042"
    conf_path = rootfs / "etc" / "azaion" / "device.conf"
    # Act
    result = _run_provision(serial, mock_admin_server, rootfs)
    # Assert
    assert result.returncode == 0, result.stderr + result.stdout
    assert conf_path.is_file()
    data = _parse_device_conf(conf_path)
    assert data["AZAION_DEVICE_EMAIL"] == "azaion-jetson-0042@azaion.com"
    assert len(data["AZAION_DEVICE_PASSWORD"]) == 32
    assert data["AZAION_DEVICE_PASSWORD"].isalnum()


def test_credentials_allow_login_after_provision(mock_admin_server, tmp_path):
    # Arrange
    rootfs = tmp_path / "rootfs"
    serial = "AZJN-0042"
    conf_path = rootfs / "etc" / "azaion" / "device.conf"
    # Act
    prov = _run_provision(serial, mock_admin_server, rootfs)
    assert prov.returncode == 0, prov.stderr + prov.stdout
    creds = _parse_device_conf(conf_path)
    login_resp = requests.post(
        f"{mock_admin_server}/login",
        json={"email": creds["AZAION_DEVICE_EMAIL"], "password": creds["AZAION_DEVICE_PASSWORD"]},
        timeout=5,
    )
    # Assert
    assert login_resp.status_code == 200
    assert login_resp.json().get("token") == "provision-test-jwt"


def test_provision_idempotent_no_duplicate_user(mock_admin_server, tmp_path):
    # Arrange
    rootfs = tmp_path / "rootfs"
    serial = "AZJN-0042"
    expected_email = "azaion-jetson-0042@azaion.com"
    # Act
    first = _run_provision(serial, mock_admin_server, rootfs)
    second = _run_provision(serial, mock_admin_server, rootfs)
    # Assert
    assert first.returncode == 0, first.stderr + first.stdout
    assert second.returncode == 0, second.stderr + second.stdout
    with _ProvisionTestState.lock:
        assert expected_email in _ProvisionTestState.users
        assert len(_ProvisionTestState.users) == 1
def test_runbook_documents_end_to_end_flow():
    # Arrange
    runbook = REPO_ROOT / "_docs" / "02_document" / "deployment" / "provisioning_runbook.md"
    text = runbook.read_text(encoding="utf-8")
    # Act
    markers = [
        "prerequisites" in text.lower(),
        "provision_device.sh" in text,
        "device.conf" in text,
        "POST" in text and "/users" in text,
        "flash" in text.lower(),
        "login" in text.lower(),
    ]
    # Assert
    assert runbook.is_file()
    assert all(markers)