# Containerization Plan

## Container Architecture

| Container | Base Image | Purpose | GPU Access |
|-----------|-----------|---------|------------|
| semantic-detection | nvcr.io/nvidia/l4t-tensorrt:r36.x (JetPack 6.2) | Main detection service (Cython + TRT + scan controller + gimbal + recorder) | Yes (TRT inference) |
| vlm-service | dustynv/nanollm:r36 (NanoLLM for JetPack 6) | VLM inference (VILA1.5-3B, 4-bit MLC) | Yes (GPU inference) |

## Dockerfile: semantic-detection

```dockerfile
# Outline for planning purposes; not runnable as written
FROM nvcr.io/nvidia/l4t-tensorrt:r36.x

# System dependencies
# Note: pip3 and python3 below use the image's default interpreter,
# which may not be python3.11; confirm they match before building.
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3.11 python3-pip libopencv-dev && \
    rm -rf /var/lib/apt/lists/*

# Python dependencies (pyserial, crcmod, scikit-image, pyyaml)
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Cython build
COPY src/ /app/src/
RUN cd /app/src && python3 setup.py build_ext --inplace

# Config and models mounted as volumes
VOLUME ["/models", "/etc/semantic-detection", "/data/output"]

ENTRYPOINT ["python3", "/app/src/main.py"]
```

## Dockerfile: vlm-service

This service uses the pre-built NanoLLM Docker image. No custom Dockerfile is needed; it is configured through environment variables and volume mounts.

```yaml
# docker-compose snippet
vlm-service:
  image: dustynv/nanollm:r36
  runtime: nvidia
  environment:
    - MODEL=VILA1.5-3B
    - QUANTIZATION=w4a16
  volumes:
    - vlm-models:/models
    - vlm-socket:/tmp
  ipc: host
  shm_size: 8g
```

## Volume Strategy

| Volume | Mount Point | Contents | Persistence |
|--------|-----------|----------|-------------|
| models | /models | TRT FP16 engines (yoloe-11s-seg.engine, yoloe-26s-seg.engine, mobilenetv3.engine) | Persistent on NVMe |
| config | /etc/semantic-detection | config.yaml, class definitions | Persistent on NVMe |
| output | /data/output | Detection logs, recorded frames, gimbal logs | Persistent on NVMe (circular buffer) |
| vlm-models | /models (vlm-service) | VILA1.5-3B MLC weights | Persistent on NVMe |
| vlm-socket | /tmp (both containers) | Unix domain socket for IPC | Ephemeral |

## GPU Sharing

Both containers share the same GPU. Sequential scheduling is enforced at the application level:

- During Level 1: only semantic-detection uses the GPU (YOLOE inference)
- During Level 2 Tier 3: semantic-detection pauses YOLOE and vlm-service runs VLM inference
- `--runtime=nvidia` is set on both containers, but application logic prevents concurrent GPU access

## Resource Limits

| Container | Memory Limit | CPU Limit | GPU |
|-----------|-------------|-----------|-----|
| semantic-detection | 4 GB | No limit (all 6 cores available) | Shared |
| vlm-service | 4 GB | No limit | Shared |

Note: These limits are soft. Because the LPDDR5 memory is shared, actual allocation is dynamic. Application-level monitoring (HealthMonitor) tracks actual usage.

## Development Environment

```yaml
# docker-compose.dev.yaml
services:
  semantic-detection:
    build: .
    environment:
      - ENV=development
      - GIMBAL_MODE=mock_tcp
      - INFERENCE_ENGINE=onnxruntime
    volumes:
      - ./src:/app/src
      - ./config/config.dev.yaml:/etc/semantic-detection/config.yaml
      - vlm-socket:/tmp  # shared with vlm-stub for the Unix domain socket
    ports:
      - "8080:8080"
  vlm-stub:
    build: ./tests/vlm_stub
    volumes:
      - vlm-socket:/tmp
  mock-gimbal:
    build: ./tests/mock_gimbal
    ports:
      - "9090:9090"

# Named volumes must be declared at the top level
volumes:
  vlm-socket:
```
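
The GPU Sharing section states that application logic prevents concurrent GPU access, but the plan does not say how the two containers coordinate. One possible approach, shown here only as a minimal sketch, is an advisory file lock (`fcntl.flock`) on a file in the shared `vlm-socket` volume. The lock path `/tmp/gpu.lock` and the helper names `gpu_lease` and `gpu_is_busy` are hypothetical and do not come from the plan; the actual mechanism may instead be a message over the Unix domain socket.

```python
import fcntl
import os
from contextlib import contextmanager

# Assumption: a lock file on the shared vlm-socket volume (mounted at /tmp
# in both containers). This path is hypothetical, not part of the plan.
DEFAULT_LOCK_PATH = "/tmp/gpu.lock"


@contextmanager
def gpu_lease(owner: str, lock_path: str = DEFAULT_LOCK_PATH):
    """Hold an exclusive advisory lock for the duration of one GPU phase."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Blocks until any other holder releases the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        # Record the current holder in the file, for debugging only.
        os.ftruncate(fd, 0)
        os.write(fd, owner.encode())
        yield
    finally:
        # Unlocking an fd that never acquired the lock is harmless.
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def gpu_is_busy(lock_path: str = DEFAULT_LOCK_PATH) -> bool:
    """Return True if another holder currently owns the GPU lease."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Non-blocking attempt: fails immediately if the lock is held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

Under this sketch, semantic-detection would wrap each YOLOE inference phase in `with gpu_lease("semantic-detection"):`, and vlm-service would wrap each VLM request in `with gpu_lease("vlm-service"):`. The lock is advisory: it serializes access only if both services take the lease before touching the GPU.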