GPU Concurrent Workloads Research: VO + Satellite Matching on Single NVIDIA GPU

Context: UAV image processing pipeline with two GPU-intensive stages on RTX 2060 (6GB VRAM):

  • Visual Odometry (VO): SuperPoint (~80ms) + LightGlue (~100ms) ≈ 180ms/frame
  • Satellite matching: DINOv2 (~50ms) + LiteSAM (~200ms) ≈ 250ms/frame

Goal: Determine whether satellite matching for frame N can overlap with VO for frame N+1 on a single GPU, and which architecture to use.


1. Can Two PyTorch Models Run Truly Concurrently on a Single GPU?

Short Answer: Usually No for compute-bound workloads like yours.

What happens with CUDA streams:

  • CUDA streams allow asynchronous submission of operations; they do not guarantee parallel execution.
  • PyTorch's torch.cuda.Stream() API submits work to different streams, but the GPU scheduler decides actual concurrency.
  • When kernels fully saturate the GPU (e.g., large matmuls, feature extraction, matching), the GPU runs them sequentially because there is no spare compute capacity for overlap.

PyTorch maintainer (ngimel) on pytorch/pytorch#59692:

"If operations on a single stream are big enough to saturate the GPU [...], using multiple streams is not expected to help. The situations where multi-stream execution actually helps are pretty limited - you want kernels that don't utilize all the GPU resources (e.g. don't launch many threadblocks), and also you want kernels that run long enough so that overhead of launching and synchronizing streams does not increase total time."

When overlap can occur:

  • Kernels that do not saturate the GPU (small thread counts, low occupancy).
  • Data transfer + compute overlap: DMA transfers (CPU↔GPU) run on a separate hardware path and can overlap with kernel execution. This is the most reliable overlap pattern.
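To make the stream behavior concrete, here is a minimal sketch of the torch.cuda.Stream API. The Linear layers are stand-ins for the real models (an assumption; any kernels that saturate the GPU behave the same way): both submissions return immediately, but timing this on an RTX 2060 shows little or no speedup over sequential execution.

```python
import torch

# Two "models" submitted on separate CUDA streams. The large matmuls
# are stand-ins for SuperPoint/LightGlue and DINOv2/LiteSAM.
vo_stream = torch.cuda.Stream()
sat_stream = torch.cuda.Stream()

vo_model = torch.nn.Linear(4096, 4096).cuda().eval()
sat_model = torch.nn.Linear(4096, 4096).cuda().eval()
x_vo = torch.randn(1024, 4096, device="cuda")
x_sat = torch.randn(1024, 4096, device="cuda")

torch.cuda.synchronize()
with torch.no_grad():
    with torch.cuda.stream(vo_stream):
        out_vo = vo_model(x_vo)      # enqueued on vo_stream
    with torch.cuda.stream(sat_stream):
        out_sat = sat_model(x_sat)   # enqueued on sat_stream
torch.cuda.synchronize()
# Each matmul leaves no spare SMs, so the GPU runs the two workloads
# back to back: wall time is roughly the same as sequential execution.
```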


2. Actual Throughput: Overlapping vs Sequential

Compute Overlap (VO + Satellite on Same GPU)

| Scenario | Expected Result |
|---|---|
| Sequential | ~180 ms + ~250 ms = ~430 ms per frame pair |
| "Overlapping" with streams | ~430 ms or worse; kernels serialize, and stream overhead can add 10–20% |
| ONNX Runtime, 2 models in parallel | Issue #15202: ~30 ms in parallel vs ~10 ms sequential, i.e. ~3× slower |

Conclusion: True compute overlap for two GPU-bound models on one GPU is not achievable in practice. Throughput is at best equal to sequential; often worse due to context switching and contention.

Data Transfer + Compute Overlap

This does work and is the main optimization lever; a minimal prefetch sketch follows.
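A minimal sketch, assuming frames arrive as CPU tensors; `model` and the frame iterator are placeholders for the real pipeline. Pinned (page-locked) host memory plus `non_blocking=True` lets the host-to-device copy run on the DMA copy engine while the previous frame's kernels execute.

```python
import torch

copy_stream = torch.cuda.Stream()

def prefetch(frame_cpu):
    """Begin an async host-to-device copy; returns (gpu_tensor, ready_event)."""
    pinned = frame_cpu.pin_memory()                 # page-locked, so the copy can be async
    with torch.cuda.stream(copy_stream):
        gpu = pinned.to("cuda", non_blocking=True)  # runs on the DMA copy engine
        ready = torch.cuda.Event()
        ready.record()                              # marks when this copy finishes
    return gpu, ready

def run(frames, model):
    it = iter(frames)
    current, ready = prefetch(next(it))
    for nxt in it:
        upcoming, up_ready = prefetch(nxt)              # frame N+1's copy starts now
        torch.cuda.current_stream().wait_event(ready)   # wait only for frame N's copy
        current.record_stream(torch.cuda.current_stream())  # inform the allocator
        with torch.no_grad():
            model(current)                              # compute on N overlaps N+1's copy
        current, ready = upcoming, up_ready
    # (final frame's compute omitted for brevity)
```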


3. Options for Pipeline Parallelism on Single GPU

Option A: CUDA Streams (torch.cuda.Stream)

| Aspect | Assessment |
|---|---|
| True overlap | No for compute-bound models (SuperPoint, LightGlue, DINOv2, LiteSAM) |
| Best for | Overlapping data transfer with compute; small, non-saturating kernels |
| Pros | No process overhead; same address space; simple to try |
| Cons | No benefit for this workload; can be 20%+ slower than sequential |
| Source | "Concurrently test several Pytorch models on single GPU slower than iterative" |

Option B: CUDA Multi-Process Service (MPS)

| Aspect | Assessment |
|---|---|
| True overlap | Yes for multi-process apps; kernels from different processes can run concurrently via Hyper-Q |
| Requirements | Linux or QNX only; compute capability ≥ 3.5; `nvidia-smi -c 3` (exclusive mode); `nvidia-cuda-mps-control -d` |
| RTX 2060 | Compute capability 7.5 (Turing); supported on Linux |
| Memory overhead | Without MPS: ~500 MB VRAM per process. With MPS: shared context, reduced overhead |
| Pros | Real kernel overlap across processes; no code changes; better utilization when each process underutilizes the GPU |
| Cons | Linux-only; requires exclusive GPU mode; designed for MPI/cooperative processes; pre-Volta: 16 clients max, Volta+: 48 |
| Source | NVIDIA MPS docs, TorchServe MPS |
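A sketch of what the two-process layout could look like, assuming the MPS daemon is already running (`nvidia-smi -c 3`, then `nvidia-cuda-mps-control -d`, as in the table above). `load_vo_model` and `load_sat_model` are hypothetical loaders standing in for the real pipelines.

```python
import torch
import torch.multiprocessing as mp

def vo_worker(frames, poses):
    model = load_vo_model().cuda().eval()      # placeholder: SuperPoint + LightGlue
    while (frame := frames.get()) is not None:
        with torch.no_grad():
            poses.put(model(frame.cuda()).cpu())

def sat_worker(frames, fixes):
    model = load_sat_model().cuda().eval()     # placeholder: DINOv2 + LiteSAM
    while (frame := frames.get()) is not None:
        with torch.no_grad():
            fixes.put(model(frame.cuda()).cpu())

if __name__ == "__main__":
    mp.set_start_method("spawn")               # required for CUDA in subprocesses
    vo_in, vo_out = mp.Queue(), mp.Queue()
    sat_in, sat_out = mp.Queue(), mp.Queue()
    for target, args in [(vo_worker, (vo_in, vo_out)),
                         (sat_worker, (sat_in, sat_out))]:
        mp.Process(target=target, args=args, daemon=True).start()
    # With MPS active, kernels from the two processes can genuinely overlap,
    # but only to the extent that neither saturates the GPU on its own.
```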

Option C: Sequential Execution + Async Python

| Aspect | Assessment |
|---|---|
| True overlap | No GPU overlap; logical overlap via async/threading |
| Pattern | Run VO synchronously (latency-critical). Submit satellite matching to a queue; process it in the background; results arrive later |
| Pros | Simple; predictable VO latency; no GPU contention; satellite results used when ready |
| Cons | No throughput gain; total GPU time unchanged |
| Source | BigDL async pipeline (CPU-stage overlap, not single-GPU compute overlap) |


4. Recommended Architecture

  1. Prioritize VO on the GPU: run SuperPoint + LightGlue first; they are latency-critical for navigation.
  2. Run satellite matching sequentially after VO, or in a separate phase when VO is idle.
  3. Use async Python for the pipeline structure: don't block the main loop waiting for satellite results. Queue frame N for satellite matching; when frame N+1 arrives, start VO for N+1 while satellite matching for N may still be running (see the sketch after this list).
  4. Overlap data transfer with compute: prefetch the next frame with pin_memory() and non_blocking=True while the current frame is processed (see Section 2).
  5. Avoid CUDA streams for concurrent model execution (no benefit) and concurrent ONNX + PyTorch on the same GPU (contention).
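A minimal sketch of this structure, with `run_vo`, `run_satellite`, `update_navigation`, and `apply_absolute_fix` as placeholders for the real pipeline functions. The worker thread shares the GPU, so its kernels still serialize with VO's, but the main loop never blocks on satellite results.

```python
import queue
import threading
import torch

sat_in = queue.Queue()
sat_out = queue.Queue()

def satellite_worker():
    # Background consumer: satellite matching for frame N may still be
    # running while the main loop starts VO for frame N+1.
    while (frame := sat_in.get()) is not None:
        with torch.no_grad():
            sat_out.put(run_satellite(frame))   # DINOv2 + LiteSAM, ~250 ms

threading.Thread(target=satellite_worker, daemon=True).start()

def main_loop(frames):
    for frame in frames:
        sat_in.put(frame)                       # queue frame N for a satellite fix
        with torch.no_grad():
            pose = run_vo(frame)                # SuperPoint + LightGlue, ~180 ms
        update_navigation(pose)                 # placeholder consumer
        while not sat_out.empty():              # drain any fixes that finished
            apply_absolute_fix(sat_out.get())
```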

Why this works:

  • VO latency stays predictable (~180ms).
  • Satellite matching completes when it can; results are consumed asynchronously.
  • No GPU contention; no risk of streams serializing and adding overhead.
  • Data transfer overlap can shave a few ms per frame.

5. Memory Overhead of CUDA MPS on RTX 2060

| Scenario | VRAM Overhead |
|---|---|
| Without MPS (2 processes) | ~500 MB × 2 ≈ 1 GB for CUDA contexts alone |
| With MPS | Shared context; well under 1 GB total for contexts |
| RTX 2060 (6 GB) | Model weights + activations + ~1 GB of context → tight; MPS helps when using 2 processes |

Source: pytorch/serve#2128, NVIDIA MPS blog
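To verify how tight 6 GB actually is, a small probe along these lines could help. torch.cuda.mem_get_info wraps cudaMemGetInfo and reports device-wide numbers, so run one process at a time to attribute context cost (an assumption about the measurement setup, not project code).

```python
import torch

torch.cuda.init()                        # force CUDA context creation
free, total = torch.cuda.mem_get_info()  # device-wide free/total bytes
used_mib = (total - free) / 2**20
print(f"device usage after context creation: "
      f"{used_mib:.0f} MiB of {total / 2**20:.0f} MiB")
```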


6. ONNX Runtime + PyTorch Concurrent on Same GPU?

Answer: Not recommended.

  • ONNX Runtime: running 2 models in parallel on the same GPU → ~3× slower than sequential (onnxruntime#15202).
  • ORT maintainer: "You can't tie both sessions to the same device id. Each session should be associated with a different device id." With one GPU, true parallel execution is not supported.
  • Use threading (not multiprocessing) if you must share one GPU; multiprocessing adds overhead and performs worse than sequential.
  • Batching is preferred: single session, batch dimension, one inference call (see the sketch below).
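If ONNX Runtime stays in the picture, the batching pattern would look roughly like this; "model.onnx" and the input name "images" are placeholders for the real exported graph.

```python
import numpy as np
import onnxruntime as ort

# One session, one call: the batch dimension replaces parallel sessions.
session = ort.InferenceSession("model.onnx",             # placeholder path
                               providers=["CUDAExecutionProvider"])

frame_a = np.random.rand(3, 224, 224).astype(np.float32)  # stand-in frames
frame_b = np.random.rand(3, 224, 224).astype(np.float32)
batch = np.stack([frame_a, frame_b])                      # shape (2, 3, 224, 224)
outputs = session.run(None, {"images": batch})            # "images": placeholder input name
```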

Source: microsoft/onnxruntime#15202, microsoft/onnxruntime#9795


Summary Table

| Option | True Overlap? | Throughput vs Sequential | Complexity | Recommended? |
|---|---|---|---|---|
| CUDA streams (same process) | No (compute-bound) | Same or worse | Low | No |
| CUDA MPS (2 processes) | Yes (if underutilized) | Possible gain | Medium | Maybe (Linux) |
| Sequential + async Python | No | Same | Low | Yes |
| Data transfer overlap | Yes (DMA + compute) | Small gain | Low | Yes |
| ONNX + PyTorch concurrent | No | Worse | High | No |
