Update demo replay validation and testing documentation
ci/woodpecker/push/02-build-push Pipeline failed

- Modified the autodev state to reflect the current testing phase and details of the new `jetson-e2e` tests.
- Enhanced the "How to Test" documentation to provide clearer instructions on the demo replay validation process, including video and tlog alignment steps.
- Updated architectural documentation to include the new demo replay operator flow and its dependencies.
- Documented the removal of deprecated auto-sync features and clarified the operator-facing UI for replay validation.
- Added new entries in the dependencies table for upcoming tasks related to the demo replay flow.

These changes improve clarity and usability for operators and developers working with the demo replay system.
This commit is contained in:
Oleksandr Bezdieniezhnykh
2026-06-20 11:24:43 +03:00
parent 12d0008763
commit 1f634c2604
175 changed files with 20701 additions and 41 deletions
@@ -0,0 +1,48 @@
# Comparison & Analysis Frameworks — Reference
## General Dimensions (select as needed)
1. Goal / What problem does it solve
2. Working mechanism / Process
3. Input / Output / Boundaries
4. Advantages / Disadvantages / Trade-offs
5. Applicable scenarios / Boundary conditions
6. Cost / Benefit / Risk
7. Historical evolution / Future trends
8. Security / Permissions / Controllability
## Concept Comparison Specific Dimensions
1. Definition & essence
2. Trigger / invocation method
3. Execution agent
4. Input/output & type constraints
5. Determinism & repeatability
6. Resource & context management
7. Composition & reuse patterns
8. Security boundaries & permission control
## Decision Support Specific Dimensions
1. Solution overview
2. Implementation cost
3. Maintenance cost
4. Risk assessment
5. Expected benefit
6. Applicable scenarios
7. Team capability requirements
8. Migration difficulty
## Decomposition Completeness Probes (Completeness Audit Reference)
Used during Step 1's Decomposition Completeness Audit. After generating sub-questions, ask each probe against the current decomposition. If a probe reveals an uncovered area, add a sub-question for it.
| Probe | What it catches |
|-------|-----------------|
| **What does this cost — in money, time, resources, or trade-offs?** | Budget, pricing, licensing, tax, opportunity cost, maintenance burden |
| **What are the hard constraints — physical, legal, regulatory, environmental?** | Regulations, certifications, spectrum/frequency rules, export controls, physics limits, IP restrictions |
| **What are the dependencies and assumptions that could break?** | Supply chain, vendor lock-in, API stability, single points of failure, standards evolution |
| **What does the operating environment actually look like?** | Terrain, weather, connectivity, infrastructure, power, latency, user skill level |
| **What failure modes exist and what happens when they trigger?** | Degraded operation, fallback, safety margins, blast radius, recovery time |
| **What do practitioners who solved similar problems say matters most?** | Field-tested priorities that don't appear in specs or papers |
| **What changes over time — and what looks stable now but isn't?** | Technology roadmaps, regulatory shifts, deprecation risk, scaling effects |
@@ -0,0 +1,75 @@
# Novelty Sensitivity Assessment — Reference
## Novelty Sensitivity Classification
| Sensitivity Level | Typical Domains | Source Time Window | Description |
|-------------------|-----------------|-------------------|-------------|
| **Critical** | AI/LLMs, blockchain, cryptocurrency | 3-6 months | Technology iterates extremely fast; info from months ago may be completely outdated |
| **High** | Cloud services, frontend frameworks, API interfaces | 6-12 months | Frequent version updates; must confirm current version |
| **Medium** | Programming languages, databases, operating systems | 1-2 years | Relatively stable but still evolving |
| **Low** | Algorithm fundamentals, design patterns, theoretical concepts | No limit | Core principles change slowly |
## Critical Sensitivity Domain Special Rules
When the research topic involves the following domains, special rules must be enforced:
**Trigger word identification**:
- AI-related: LLM, GPT, Claude, Gemini, AI Agent, RAG, vector database, prompt engineering
- Cloud-native: Kubernetes new versions, Serverless, container runtimes
- Cutting-edge tech: Web3, quantum computing, AR/VR
**Mandatory rules**:
1. **Search with time constraints**:
- Use `time_range: "month"` or `time_range: "week"` to limit search results
- Prefer `start_date: "YYYY-MM-DD"` set to within the last 3 months
2. **Elevate official source priority**:
- Must first consult official documentation, official blogs, official Changelogs
- GitHub Release Notes, official X/Twitter announcements
- Academic papers (arXiv and other preprint platforms)
3. **Mandatory version number annotation**:
- Any technical description must annotate the current version number
- Example: "Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) supports..."
- Prohibit vague statements like "the latest version supports..."
4. **Outdated information handling**:
- Technical blogs/tutorials older than 6 months -> historical reference only, cannot serve as factual evidence
- Version inconsistency found -> must verify current version before using
- Obviously outdated descriptions (e.g., "will support in the future" but now already supported) -> discard directly
5. **Cross-validation**:
- Highly sensitive information must be confirmed from at least 2 independent sources
- Priority: Official docs > Official blogs > Authoritative tech media > Personal blogs
6. **Official download/release page direct verification (BLOCKING)**:
- Must directly visit official download pages to verify platform support (don't rely on search engine caches)
- Use `WebFetch` to directly extract download page content
- Search results about "coming soon" or "planned support" may be outdated; must verify in real time
- Platform support is frequently changing information; cannot infer from old sources
7. **Product-specific protocol/feature name search (BLOCKING)**:
- Beyond searching the product name, must additionally search protocol/standard names the product supports
- Common protocols/standards to search:
- AI tools: MCP, ACP (Agent Client Protocol), LSP, DAP
- Cloud services: OAuth, OIDC, SAML
- Data exchange: GraphQL, gRPC, REST
- Search format: `"<product_name> <protocol_name> support"` or `"<product_name> <protocol_name> integration"`
## Timeliness Assessment Output Template
```markdown
## Timeliness Sensitivity Assessment
- **Research Topic**: [topic]
- **Sensitivity Level**: Critical / High / Medium / Low
- **Rationale**: [why this level]
- **Source Time Window**: [X months/years]
- **Priority official sources to consult**:
1. [Official source 1]
2. [Official source 2]
- **Key version information to verify**:
- [Product/technology 1]: Current version ____
- [Product/technology 2]: Current version ____
```
@@ -0,0 +1,124 @@
# Quality Checklists — Reference
## General Quality
- [ ] All core conclusions have L1/L2 tier factual support
- [ ] No use of vague words like "possibly", "probably" without annotating uncertainty
- [ ] Comparison dimensions are complete with no key differences missed
- [ ] At least one real use case validates conclusions
- [ ] References are complete with accessible links
- [ ] Every citation can be directly verified by the user (source verifiability)
- [ ] Structure hierarchy is clear; executives can quickly locate information
## Decomposition Completeness
- [ ] Domain discovery search executed: searched "key factors when [problem domain]" before starting research
- [ ] Completeness probes applied: every probe from `references/comparison-frameworks.md` checked against sub-questions
- [ ] No uncovered areas remain: all gaps filled with sub-questions or justified as not applicable
## Internet Search Depth
- [ ] Every sub-question was searched with at least 3-5 different query variants
- [ ] At least 3 perspectives from the Perspective Rotation were applied and searched
- [ ] Search saturation reached: last searches stopped producing new substantive information
- [ ] Adjacent fields and analogous problems were searched, not just direct matches
- [ ] Contrarian viewpoints were actively sought ("why not X", "X criticism", "X failure")
- [ ] Practitioner experience was searched (production use, real-world results, lessons learned)
- [ ] Iterative deepening completed: follow-up questions from initial findings were searched
- [ ] No sub-question relies solely on training data without web verification
## Component Option Breadth
- [ ] `00_question_decomposition.md` contains a Component Option Search Plan
- [ ] Every component area was searched across simple baseline, established production, open-source, commercial/vendor, current SOTA, adjacent-domain, no-build/defer, and known-bad options where applicable
- [ ] Every component area has at least 3 realistic candidates, or a documented explanation of why broad searches found fewer
- [ ] Each lead candidate has official/source-of-truth evidence plus independent validation when available
- [ ] Each component area includes at least one baseline/fallback option and at least one rejected or experimental option when possible
- [ ] Alternative names, synonyms, and neighboring-domain terms were searched before declaring the option landscape complete
- [ ] Licensing, runtime, platform, maintenance, and unsupported-scenario searches were performed for every lead, fallback, and rejected candidate
## Mode A Specific
- [ ] Phase 1 completed: AC assessment was presented to and confirmed by user
- [ ] AC assessment consistent: Solution draft respects the (possibly adjusted) acceptance criteria and restrictions
- [ ] Competitor analysis included: Existing solutions were researched
- [ ] All components have comparison tables: Each component lists alternatives with tools, advantages, limitations, security, cost
- [ ] Component options are broad: component tables include baseline, production, open-source, commercial/vendor, SOTA/research, adjacent-domain, defer/no-build, and disqualified options where applicable
- [ ] Tools/libraries verified: Suggested tools actually exist and work as described
- [ ] Component fit matrix completed: `06_component_fit_matrix.md` (or `06_component_fit_matrix/` if split) exists and every selected component/tool/pattern is marked `Selected`
- [ ] No field-adjacent substitution: no selected candidate is chosen only because it solves a similar class of problem while failing the project's explicit constraints
- [ ] Testing strategy covers AC: Tests map to acceptance criteria
- [ ] Tech stack documented (if Phase 3 ran): `tech_stack.md` has evaluation tables, risk assessment, and learning requirements
- [ ] Security analysis documented (if Phase 4 ran): `security_analysis.md` has threat model and per-component controls
## Mode B Specific
- [ ] Findings table complete: All identified weak points documented with solutions
- [ ] Weak point categories covered: Functional, security, and performance assessed
- [ ] New draft is self-contained: Written as if from scratch, no "updated" markers
- [ ] Performance column included: Mode B comparison tables include performance characteristics
- [ ] Previous draft issues addressed: Every finding in the table is resolved in the new draft
- [ ] Existing selected components were challenged against a broad alternative landscape before being kept
- [ ] Existing component fit audited: every old and new component/tool/pattern was checked against `restrictions.md`, `acceptance_criteria.md`, and the Project Constraint Matrix
- [ ] Rejected/experimental candidates are not lead recommendations unless the user explicitly accepted the risk
## Timeliness Check (High-Sensitivity Domain BLOCKING)
When the research topic has Critical or High sensitivity level:
- [ ] Timeliness sensitivity assessment completed: `00_question_decomposition.md` contains a timeliness assessment section
- [ ] Source timeliness annotated: Every source has publication date, timeliness status, version info
- [ ] No outdated sources used as factual evidence (Critical: within 6 months; High: within 1 year)
- [ ] Version numbers explicitly annotated for all technical products/APIs/SDKs
- [ ] Official sources prioritized: Core conclusions have support from official documentation/blogs
- [ ] Cross-validation completed: Key technical information confirmed from at least 2 independent sources
- [ ] Download page directly verified: Platform support info comes from real-time extraction of official download pages
- [ ] Protocol/feature names searched: Searched for product-supported protocol names (MCP, ACP, etc.)
- [ ] GitHub Issues mined: Reviewed product's GitHub Issues popular discussions
- [ ] Community hotspots identified: Identified and recorded feature points users care most about
## Target Audience Consistency Check (BLOCKING)
- [ ] Research boundary clearly defined: `00_question_decomposition.md` has clear population/geography/timeframe/level boundaries
- [ ] Every source has target audience annotated in `01_source_registry.md` (or category files under `01_source_registry/` if split)
- [ ] Mismatched sources properly handled (excluded, annotated, or marked reference-only)
- [ ] No audience confusion in fact cards: Every fact has target audience consistent with research boundary
- [ ] No audience confusion in the report: Policies/research/data cited have consistent target audiences
## Source Verifiability
- [ ] All cited links are publicly accessible (annotate `[login required]` if not)
- [ ] Citations include exact section/page/timestamp for long documents
- [ ] Cited facts have corresponding statements in the original text (no over-interpretation)
- [ ] Source publication/update dates annotated; technical docs include version numbers
- [ ] Unverifiable information annotated `[limited source]` and not sole support for core conclusions
## Exact-Fit Validation (BLOCKING)
- [ ] Project Constraint Matrix extracted from problem context before component selection
- [ ] Component fit matrix includes `Component Area`, `Option Family`, and `Pinned Mode/Config` columns
- [ ] Every selected component/tool/library/service/pattern/algorithm has evidence for required inputs/outputs and integration boundaries
- [ ] Every selected candidate has evidence for the operating context and lifecycle assumptions it must support
- [ ] Every selected candidate has evidence for non-functional targets that are binding for the project
- [ ] Known unsupported scenarios and failure reports were searched for every selected candidate
- [ ] Mismatches are recorded as disqualifiers, not softened into generic limitations
- [ ] Any candidate with unproven fit is marked `Experimental only` or escalated for user decision
- [ ] Any candidate with documented constraint conflict is marked `Rejected`
## API Capability Verification (BLOCKING)
**Applicability**: this checklist applies only when the run is classified as **Technical-component selection** (see SKILL.md → Research Output Class). For non-technical research (concept comparison, market/policy investigation, root-cause analysis, knowledge organization), skip this checklist entirely and note the skip in `05_validation_log.md`. For mixed runs, apply only to technical component areas.
For every lead candidate that is a library/SDK/framework/service:
- [ ] The exact mode/configuration the project will use is pinned in one explicit sentence (inputs, outputs, runtime); no vague "supports X" language
- [ ] `context7` (or equivalent docs lookup) was run for the candidate, with at least 3 queries: mode enumeration, project's exact mode, disqualifier probe
- [ ] All consulted URLs from context7 / official docs are appended to `01_source_registry.md` (or files under `01_source_registry/` if split)
- [ ] A Minimum Viable Example (MVE) was saved for the pinned mode in `02_fact_cards.md` / `02_fact_cards/` (or `02_mve_evidence.md`) with: source, inputs in example, outputs in example, project inputs, project outputs required, match assessment ✅/⚠️/❌
- [ ] When the MVE inputs or outputs do not exactly match the project's, the mismatch is cited from the official docs (not inferred), and the candidate is `Experimental only` or `Rejected`
- [ ] When a library has multiple modes, each project-relevant mode appears as its own candidate row (not a single library row that softens across modes)
- [ ] Restrictions × Candidate-Modes sub-matrix in `06_component_fit_matrix.md` (or files under `06_component_fit_matrix/` if split) is filled for every lead candidate, with one row per numbered restriction and per numbered acceptance criterion
- [ ] Sub-matrix uses ✅ / ❌ / ❓ / N/A only — no free-form prose substitutes
- [ ] No `Selected` candidate has any ❌ or ❓ cell in its sub-matrix
- [ ] "Validation gate required" footnotes are explicitly classified as either *API capability* (must be resolved here) or *runtime quality* (may be carried forward)
- [ ] Paraphrased capability claims in fact cards have been cross-checked against the literal mode-enumeration evidence (no `mono, inertial → mono-inertial` style conflation)
@@ -0,0 +1,121 @@
# Source Tiering & Authority Anchoring — Reference
## Source Tiers
| Tier | Source Type | Purpose | Credibility |
|------|------------|---------|-------------|
| **L1** | Official docs, papers, specs, RFCs | Definitions, mechanisms, verifiable facts | High |
| **L2** | Official blogs, tech talks, white papers | Design intent, architectural thinking | High |
| **L3** | Authoritative media, expert commentary, tutorials | Supplementary intuition, case studies | Medium |
| **L4** | Community discussions, personal blogs, forums | Discover blind spots, validate understanding | Low |
## L4 Community Source Specifics (mandatory for product comparison research)
| Source Type | Access Method | Value |
|------------|---------------|-------|
| **GitHub Issues** | Visit `github.com/<org>/<repo>/issues` | Real user pain points, feature requests, bug reports |
| **GitHub Discussions** | Visit `github.com/<org>/<repo>/discussions` | Feature discussions, usage insights, community consensus |
| **Reddit** | Search `site:reddit.com "<product_name>"` | Authentic user reviews, comparison discussions |
| **Hacker News** | Search `site:news.ycombinator.com "<product_name>"` | In-depth technical community discussions |
| **Discord/Telegram** | Product's official community channels | Active user feedback (must annotate [limited source]) |
## Principles
- Conclusions must be traceable to L1/L2
- L3/L4 serve only as supplementary and validation
- L4 community discussions are used to discover "what users truly care about"
- Record all information sources
- **Search broadly before searching deeply** — cast a wide net with multiple query variants before diving deep into any single source
- **Cross-domain search** — when direct results are sparse, search adjacent fields, analogous problems, and related industries
- **Never rely on a single search** — each sub-question requires multiple searches from different angles (synonyms, negations, practitioner language, academic language)
## Timeliness Filtering Rules (execute based on Step 0.5 sensitivity level)
| Sensitivity Level | Source Filtering Rule | Suggested Search Parameters |
|-------------------|----------------------|-----------------------------|
| Critical | Only accept sources within 6 months as factual evidence | `time_range: "month"` or `start_date` set to last 3 months |
| High | Prefer sources within 1 year; annotate if older than 1 year | `time_range: "year"` |
| Medium | Sources within 2 years used normally; older ones need validity check | Default search |
| Low | No time limit | Default search |
## High-Sensitivity Domain Search Strategy
```
1. Round 1: Targeted official source search
- Use include_domains to restrict to official domains
- Example: include_domains: ["anthropic.com", "openai.com", "docs.xxx.com"]
2. Round 2: Official download/release page direct verification (BLOCKING)
- Directly visit official download pages; don't rely on search caches
- Use tavily-extract or WebFetch to extract page content
- Verify: platform support, current version number, release date
3. Round 3: Product-specific protocol/feature search (BLOCKING)
- Search protocol names the product supports (MCP, ACP, LSP, etc.)
- Format: "<product_name> <protocol_name>" site:official_domain
4. Round 4: Time-limited broad search
- time_range: "month" or start_date set to recent
- Exclude obviously outdated sources
5. Round 5: Version verification
- Cross-validate version numbers from search results
- If inconsistency found, immediately consult official Changelog
6. Round 6: Community voice mining (BLOCKING - mandatory for product comparison research)
- Visit the product's GitHub Issues page, review popular/pinned issues
- Search Issues for key feature terms (e.g., "MCP", "plugin", "integration")
- Review discussion trends from the last 3-6 months
- Identify the feature points and differentiating characteristics users care most about
```
## Community Voice Mining Detailed Steps
```
GitHub Issues Mining Steps:
1. Visit github.com/<org>/<repo>/issues
2. Sort by "Most commented" to view popular discussions
3. Search keywords:
- Feature-related: feature request, enhancement, MCP, plugin, API
- Comparison-related: vs, compared to, alternative, migrate from
4. Review issue labels: enhancement, feature, discussion
5. Record frequently occurring feature demands and user pain points
Value Translation:
- Frequently discussed features -> likely differentiating highlights
- User complaints/requests -> likely product weaknesses
- Comparison discussions -> directly obtain user-perspective difference analysis
```
## Source Registry Entry Template
For each source consulted, immediately append to `01_source_registry.md` (or the appropriate category file under `01_source_registry/` if the artifact has been split — see splittable-artifacts convention in `steps/00_project-integration.md`):
```markdown
## Source #[number]
- **Title**: [source title]
- **Link**: [URL]
- **Tier**: L1/L2/L3/L4
- **Publication Date**: [YYYY-MM-DD]
- **Timeliness Status**: Currently valid / Needs verification / Outdated (reference only)
- **Version Info**: [If involving a specific version, must annotate]
- **Target Audience**: [Explicitly annotate the group/geography/level this source targets]
- **Research Boundary Match**: Full match / Partial overlap / Reference only
- **Summary**: [1-2 sentence key content]
- **Related Sub-question**: [which sub-question this corresponds to]
```
## Target Audience Verification (BLOCKING)
Before including each source, verify that its target audience matches the research boundary:
| Source Type | Target audience to verify | Verification method |
|------------|---------------------------|---------------------|
| **Policy/Regulation** | Who is it for? (K-12/university/all) | Check document title, scope clauses |
| **Academic Research** | Who are the subjects? (vocational/undergraduate/graduate) | Check methodology/sample description sections |
| **Statistical Data** | Which population is measured? | Check data source description |
| **Case Reports** | What type of institution is involved? | Confirm institution type |
Handling mismatched sources:
- Target audience completely mismatched -> do not include
- Partially overlapping -> include but annotate applicable scope
- Usable as analogous reference -> include but explicitly annotate "reference only"
@@ -0,0 +1,56 @@
# Usage Examples — Reference
## Example 1: Initial Research (Mode A)
```
User: Research this problem and find the best solution
```
Execution flow:
1. Context resolution: no explicit file -> project mode (INPUT_DIR=`_docs/00_problem/`, OUTPUT_DIR=`_docs/01_solution/`)
2. Guardrails: verify INPUT_DIR exists with required files
3. Mode detection: no `solution_draft*.md` -> Mode A
4. Phase 1: Assess acceptance criteria and restrictions, ask user about unclear parts
5. BLOCKING: present AC assessment, wait for user confirmation
6. Phase 2: Full 8-step research — competitors, components, state-of-the-art solutions
7. Output: `OUTPUT_DIR/solution_draft01.md`
8. (Optional) Phase 3: Tech stack consolidation -> `tech_stack.md`
9. (Optional) Phase 4: Security deep dive -> `security_analysis.md`
## Example 2: Solution Assessment (Mode B)
```
User: Assess the current solution draft
```
Execution flow:
1. Context resolution: no explicit file -> project mode
2. Guardrails: verify INPUT_DIR exists
3. Mode detection: `solution_draft03.md` found in OUTPUT_DIR -> Mode B, read it as input
4. Full 8-step research — weak points, security, performance, solutions
5. Output: `OUTPUT_DIR/solution_draft04.md` with findings table + revised draft
## Example 3: Standalone Research
```
User: /research @my_problem.md
```
Execution flow:
1. Context resolution: explicit file -> standalone mode (INPUT_FILE=`my_problem.md`, OUTPUT_DIR=`_standalone/my_problem/01_solution/`)
2. Guardrails: verify INPUT_FILE exists and is non-empty, warn about missing restrictions/AC
3. Mode detection + full research flow as in Example 1, scoped to standalone paths
4. Output: `_standalone/my_problem/01_solution/solution_draft01.md`
5. Move `my_problem.md` into `_standalone/my_problem/`
## Example 4: Force Initial Research (Override)
```
User: Research from scratch, ignore existing drafts
```
Execution flow:
1. Context resolution: no explicit file -> project mode
2. Mode detection: drafts exist, but user explicitly requested initial research -> Mode A
3. Phase 1 + Phase 2 as in Example 1
4. Output: `OUTPUT_DIR/solution_draft##.md` (incremented from highest existing)