GPU Concurrent Workloads Research: VO + Satellite Matching on Single NVIDIA GPU
Context: UAV image processing pipeline with two GPU-intensive stages on RTX 2060 (6GB VRAM):
- Visual Odometry (VO): SuperPoint (~80ms) + LightGlue (~100ms) ≈ 180ms/frame
- Satellite matching: DINOv2 (~50ms) + LiteSAM (~200ms) ≈ 250ms/frame
Goal: Determine whether "overlap" of satellite matching for frame N with VO for frame N+1 is achievable and what architecture to use.
1. Can Two PyTorch Models Run Truly Concurrently on a Single GPU?
Short Answer: Usually No for compute-bound workloads like yours.
What happens with CUDA streams:
- CUDA streams allow asynchronous submission of operations; they do not guarantee parallel execution.
- PyTorch's `torch.cuda.Stream()` API submits work to different streams, but the GPU scheduler decides actual concurrency.
- When kernels fully saturate the GPU (e.g., large matmuls, feature extraction, matching), the GPU runs them sequentially because there is no spare compute capacity for overlap.
PyTorch maintainer (ngimel) on pytorch/pytorch#59692:
"If operations on a single stream are big enough to saturate the GPU [...], using multiple streams is not expected to help. The situations where multi-stream execution actually helps are pretty limited - you want kernels that don't utilize all the GPU resources (e.g. don't launch many threadblocks), and also you want kernels that run long enough so that overhead of launching and synchronizing streams does not increase total time."
When overlap can occur:
- Kernels that do not saturate the GPU (small thread counts, low occupancy).
- Data transfer + compute overlap: DMA transfers (CPU↔GPU) run on a separate hardware path and can overlap with kernel execution. This is the most reliable overlap pattern.
Sources:
- pytorch/pytorch#48279 – PyTorch streams don't execute concurrently (closed; resolved with large workloads)
- pytorch/pytorch#59692 – cuda streams run sequentially
- Stack Overflow: Why is torch.cuda.stream() not asynchronous?
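The stream-submission pattern described above can be sketched as follows. The callables here are toy stand-ins for the real models (SuperPoint/LightGlue etc.), and the helper falls back to plain sequential execution when CUDA is unavailable; note that separate streams *permit*, but do not *guarantee*, concurrent execution:

```python
import torch

def run_on_streams(fn_a, fn_b, x_a, x_b):
    """Submit two model-like callables on separate CUDA streams.

    Falls back to sequential execution without CUDA. Compute-bound
    kernels will still serialize on the GPU even with two streams.
    """
    if not torch.cuda.is_available():
        return fn_a(x_a), fn_b(x_b)

    s_a, s_b = torch.cuda.Stream(), torch.cuda.Stream()
    with torch.cuda.stream(s_a):
        out_a = fn_a(x_a)
    with torch.cuda.stream(s_b):
        out_b = fn_b(x_b)
    torch.cuda.synchronize()  # wait for both streams to finish
    return out_a, out_b

# Toy stand-ins for the VO / satellite models (illustrative only):
vo = lambda t: t @ t.T
sat = lambda t: t.sum()
a, b = run_on_streams(vo, sat, torch.ones(4, 4), torch.ones(8))
```

Profiling a run like this with Nsight Systems is the quickest way to confirm whether your kernels actually overlap or serialize.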
2. Actual Throughput: Overlapping vs Sequential
Compute Overlap (VO + Satellite on Same GPU)
| Scenario | Expected Result |
|---|---|
| Sequential | ~180ms + ~250ms = ~430ms per frame pair |
| "Overlapping" with streams | ~430ms or worse – kernels serialize; stream overhead can add 10–20% |
| ONNX Runtime, 2 models in parallel | Issue #15202: ~30ms parallel vs ~10ms sequential – roughly 3× slower |
Conclusion: True compute overlap for two GPU-bound models on one GPU is not achievable in practice. Throughput is at best equal to sequential; often worse due to context switching and contention.
Data Transfer + Compute Overlap
This does work and is the main optimization lever:
- Use `pin_memory()` and `non_blocking=True` for CPU→GPU transfers.
- Overlap: transfer frame N+1 to GPU while processing frame N.
- PyTorch pinmem_nonblock tutorial
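A minimal double-buffered prefetch sketch of this pattern (the `model` callable and `process_stream` name are illustrative, not from any library; it degrades to synchronous transfers on CPU-only builds):

```python
import torch

def process_stream(frames, model):
    """Transfer frame N+1 to the GPU while frame N is being processed.

    Uses pinned host memory and non_blocking copies on a side stream
    when CUDA is available; plain synchronous transfers otherwise.
    """
    use_cuda = torch.cuda.is_available()
    device = "cuda" if use_cuda else "cpu"
    copy_stream = torch.cuda.Stream() if use_cuda else None

    def to_device(t):
        if not use_cuda:
            return t
        t = t.pin_memory()  # page-locked memory enables true async DMA
        with torch.cuda.stream(copy_stream):
            return t.to(device, non_blocking=True)

    results = []
    cur = to_device(frames[0])
    if use_cuda:  # ensure the first copy has landed before computing
        torch.cuda.current_stream().wait_stream(copy_stream)
    for i in range(len(frames)):
        if i + 1 < len(frames):
            nxt = to_device(frames[i + 1])  # copy overlaps compute below
        results.append(model(cur))
        if i + 1 < len(frames):
            if use_cuda:  # don't touch nxt until its copy is complete
                torch.cuda.current_stream().wait_stream(copy_stream)
            cur = nxt
    return results

# Toy usage: a sum stands in for the real per-frame model.
outs = process_stream([torch.tensor([1.0, 2.0]), torch.tensor([3.0, 4.0])],
                      lambda t: t.sum())
```

The `wait_stream` calls keep the default (compute) stream from reading a frame whose copy is still in flight.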
3. Options for Pipeline Parallelism on Single GPU
Option A: CUDA Streams (torch.cuda.Stream)
| Aspect | Assessment |
|---|---|
| True overlap | No for compute-bound models (SuperPoint, LightGlue, DINOv2, LiteSAM) |
| Best for | Overlapping data transfer with compute; small, non-saturating kernels |
| Pros | No process overhead; same address space; simple to try |
| Cons | No benefit for your workload; can be 20%+ slower than sequential |
| Source | Concurrently test several Pytorch models on single GPU slower than iterative |
Option B: CUDA Multi-Process Service (MPS)
| Aspect | Assessment |
|---|---|
| True overlap | Yes, for multi-process apps; kernels from different processes can run concurrently via Hyper-Q |
| Requirements | Linux or QNX only; compute capability ≥ 3.5; nvidia-smi -c 3 (exclusive mode); nvidia-cuda-mps-control -d |
| RTX 2060 | Compute capability 7.5 (Turing) – supported on Linux |
| Memory overhead | Without MPS: ~500MB VRAM per process. With MPS: shared context, reduced overhead |
| Pros | Real kernel overlap across processes; no code changes; better utilization when each process underutilizes GPU |
| Cons | Linux-only; requires exclusive GPU mode; designed for MPI/cooperative processes; pre-Volta: 16 clients max; Volta+: 48 clients |
| Source | NVIDIA MPS docs, TorchServe MPS |
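For reference, enabling MPS on Linux is a short setup sequence (a sketch based on the NVIDIA MPS docs; assumes GPU id 0 and root privileges):

```shell
# Put GPU 0 into exclusive-process mode (required for MPS)
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
# Start the MPS control daemon; client processes launched afterwards
# share a single GPU context via Hyper-Q
nvidia-cuda-mps-control -d
# ...launch the VO and satellite-matching processes here...
# Shut the daemon down when done
echo quit | nvidia-cuda-mps-control
```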
Option C: Sequential Execution + Async Python
| Aspect | Assessment |
|---|---|
| True overlap | No GPU overlap; logical overlap via async/threading |
| Pattern | Run VO synchronously (latency-critical). Submit satellite matching to a queue; process in background; results arrive later |
| Pros | Simple; predictable VO latency; no GPU contention; satellite results used when ready |
| Cons | No throughput gain; total GPU time unchanged |
| Source | BigDL async pipeline (CPU-stage overlap, not single-GPU compute overlap) |
4. Recommended Pattern: Latency-Critical VO + Background Satellite
Recommended architecture:
- Prioritize VO on GPU – Run SuperPoint + LightGlue first; they are latency-critical for navigation.
- Run satellite matching sequentially after VO – Or in a separate phase when VO is idle.
- Use async Python for pipeline structure – Don’t block the main loop waiting for satellite. Queue frame N for satellite matching; when frame N+1 arrives, start VO for N+1 while satellite for N may still be running.
- Overlap data transfer with compute – Prefetch the next frame with `pin_memory()` and `non_blocking=True` while the current frame is processed.
- Avoid – CUDA streams for concurrent model execution (no benefit); ONNX + PyTorch concurrent on the same GPU (contention).
Why this works:
- VO latency stays predictable (~180ms).
- Satellite matching completes when it can; results are consumed asynchronously.
- No GPU contention; no risk of streams serializing and adding overhead.
- Data transfer overlap can shave a few ms per frame.
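The queue-based structure above needs only the standard library. A sketch (the satellite matcher is stubbed with a trivial callable; all names are illustrative):

```python
import queue
import threading

def make_background_worker(process_fn):
    """Run `process_fn` (e.g. satellite matching) in a background thread
    while the main loop does latency-critical work (VO). Results arrive
    later on the `results` queue; GPU work itself stays sequential.
    """
    tasks = queue.Queue()
    results = queue.Queue()

    def worker():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: shut down
                break
            frame_id, frame = item
            results.put((frame_id, process_fn(frame)))
            tasks.task_done()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return tasks, results, t

# Usage sketch with a toy stand-in for satellite matching:
tasks, results, t = make_background_worker(lambda f: f * 2)
for n, frame in enumerate([10, 20, 30]):
    tasks.put((n, frame))  # queue frame N for background matching
    # ... run VO for frame N here (synchronously, latency-critical) ...
tasks.put(None)
t.join()
out = dict(results.get() for _ in range(3))
```

Because Python threads release the GIL during CUDA calls, the worker thread's GPU work simply interleaves with VO rather than contending via streams.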
5. Memory Overhead of CUDA MPS on RTX 2060
| Scenario | VRAM Overhead |
|---|---|
| Without MPS (2 processes) | ~500MB × 2 ≈ 1GB for contexts alone |
| With MPS | Shared context; less than 1GB total for contexts |
| RTX 2060 (6GB) | Model weights + activations + ~1GB context → tight. MPS helps if using 2 processes |
Source: pytorch/serve#2128, NVIDIA MPS blog
6. ONNX Runtime + PyTorch Concurrent on Same GPU?
Answer: Not recommended.
- ONNX Runtime: running 2 models in parallel on same GPU → ~3× slower than sequential (onnxruntime#15202).
- ORT maintainer: "You can't tie both sessions to the same device id. Each session should be associated with a different device id." With one GPU, true parallel execution is not supported.
- Use threading (not multiprocessing) if you must share one GPU; multiprocessing adds overhead and performs worse than sequential.
- Batching is preferred: single session, batch dimension, one inference call.
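The batching recommendation looks like this in PyTorch terms (a toy `nn.Linear` stands in for a real exported model; the same idea applies to an ONNX Runtime session with a dynamic batch axis):

```python
import torch

# Instead of two concurrent inference calls, stack inputs along the
# batch dimension and make one call.
model = torch.nn.Linear(4, 2)
model.eval()
frames = [torch.ones(4), torch.zeros(4)]
with torch.no_grad():
    batched = torch.stack(frames)  # shape (2, 4): one batch, two frames
    out = model(batched)           # one inference call for both
```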
Source: microsoft/onnxruntime#15202, microsoft/onnxruntime#9795
Summary Table
| Option | True Overlap? | Throughput vs Sequential | Complexity | Recommended? |
|---|---|---|---|---|
| CUDA streams (same process) | No (compute-bound) | Same or worse | Low | No |
| CUDA MPS (2 processes) | Yes (if underutilized) | Possible gain | Medium | Maybe (Linux) |
| Sequential + async Python | No | Same | Low | Yes |
| Data transfer overlap | Yes (DMA + compute) | Small gain | Low | Yes |
| ONNX + PyTorch concurrent | No | Worse | High | No |
Source URLs
- https://github.com/pytorch/pytorch/issues/48279
- https://github.com/pytorch/pytorch/issues/59692
- https://stackoverflow.com/questions/79697025/why-is-with-torch-cuda-stream-not-asynchronous
- https://stackoverflow.com/questions/78669860/concurrently-test-several-pytorch-models-on-a-single-gpu-slower-than-iterative-a
- https://docs.nvidia.com/deploy/mps/
- https://pytorch.org/serve/hardware_support/nvidia_mps.html
- https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html
- https://github.com/microsoft/onnxruntime/issues/15202
- https://github.com/microsoft/onnxruntime/issues/9795
- https://www.codegenes.net/blog/pytorch-multiple-models-same-gpu/
- https://bigdl.readthedocs.io/en/v2.4.0/doc/Nano/Howto/Inference/PyTorch/accelerate_pytorch_inference_async_pipeline.html
- https://developer.nvidia.com/blog/boost-gpu-memory-performance-with-no-code-changes-using-nvidia-cuda-mps/
- https://github.com/pytorch/serve/issues/2128