GPU Concurrent Workloads Research: VO + Satellite Matching on Single NVIDIA GPU

Context: UAV image processing pipeline with two GPU-intensive stages on RTX 2060 (6GB VRAM):

  • Visual Odometry (VO): SuperPoint (~80ms) + LightGlue (~100ms) ≈ 180ms/frame
  • Satellite matching: DINOv2 (~50ms) + LiteSAM (~200ms) ≈ 250ms/frame

Goal: Determine whether satellite matching for frame N can overlap with VO for frame N+1 on a single GPU, and which architecture to use.


1. Can Two PyTorch Models Run Truly Concurrently on a Single GPU?

Short Answer: Usually No for compute-bound workloads like yours.

What happens with CUDA streams:

  • CUDA streams allow asynchronous submission of operations; they do not guarantee parallel execution.
  • PyTorch's torch.cuda.Stream() API submits work to different streams, but the GPU scheduler decides actual concurrency.
  • When kernels fully saturate the GPU (e.g., large matmuls, feature extraction, matching), the GPU runs them sequentially because there is no spare compute capacity for overlap.

PyTorch maintainer (ngimel) on pytorch/pytorch#59692:

"If operations on a single stream are big enough to saturate the GPU [...], using multiple streams is not expected to help. The situations where multi-stream execution actually helps are pretty limited - you want kernels that don't utilize all the GPU resources (e.g. don't launch many threadblocks), and also you want kernels that run long enough so that overhead of launching and synchronizing streams does not increase total time."

When overlap can occur:

  • Kernels that do not saturate the GPU (small thread counts, low occupancy).
  • Data transfer + compute overlap: DMA transfers (CPU↔GPU) run on a separate hardware path and can overlap with kernel execution. This is the most reliable overlap pattern.
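To make the stream behavior concrete, here is a minimal sketch of the torch.cuda.Stream API. The Linear layers are stand-ins for the real models (an assumption; any kernels that saturate the GPU behave the same way): both submissions return immediately, but timing this on an RTX 2060 shows little or no speedup over sequential execution.

```python
import torch

# Two "models" submitted on separate CUDA streams. The large matmuls
# are stand-ins for SuperPoint/LightGlue and DINOv2/LiteSAM.
vo_stream = torch.cuda.Stream()
sat_stream = torch.cuda.Stream()

vo_model = torch.nn.Linear(4096, 4096).cuda().eval()
sat_model = torch.nn.Linear(4096, 4096).cuda().eval()
x_vo = torch.randn(1024, 4096, device="cuda")
x_sat = torch.randn(1024, 4096, device="cuda")

torch.cuda.synchronize()
with torch.no_grad():
    with torch.cuda.stream(vo_stream):
        out_vo = vo_model(x_vo)      # enqueued on vo_stream
    with torch.cuda.stream(sat_stream):
        out_sat = sat_model(x_sat)   # enqueued on sat_stream
torch.cuda.synchronize()
# Each matmul leaves no spare SMs, so the GPU runs the two workloads
# back to back: wall time is roughly the same as sequential execution.
```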


2. Actual Throughput: Overlapping vs Sequential

Compute Overlap (VO + Satellite on Same GPU)

| Scenario | Expected Result |
|---|---|
| Sequential | ~180 ms + ~250 ms = ~430 ms per frame pair |
| "Overlapping" with streams | ~430 ms or worse; kernels serialize, and stream overhead can add 10–20% |
| ONNX Runtime, 2 models in parallel | Issue #15202: ~30 ms in parallel vs ~10 ms sequential, i.e. ~3× slower |

Conclusion: True compute overlap for two GPU-bound models on one GPU is not achievable in practice. Throughput is at best equal to sequential; often worse due to context switching and contention.

Data Transfer + Compute Overlap

This does work and is the main optimization lever; a minimal prefetch sketch follows.
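A minimal sketch, assuming frames arrive as CPU tensors; `model` and the frame iterator are placeholders for the real pipeline. Pinned (page-locked) host memory plus `non_blocking=True` lets the host-to-device copy run on the DMA copy engine while the previous frame's kernels execute.

```python
import torch

copy_stream = torch.cuda.Stream()

def prefetch(frame_cpu):
    """Begin an async host-to-device copy; returns (gpu_tensor, ready_event)."""
    pinned = frame_cpu.pin_memory()                 # page-locked, so the copy can be async
    with torch.cuda.stream(copy_stream):
        gpu = pinned.to("cuda", non_blocking=True)  # runs on the DMA copy engine
        ready = torch.cuda.Event()
        ready.record()                              # marks when this copy finishes
    return gpu, ready

def run(frames, model):
    it = iter(frames)
    current, ready = prefetch(next(it))
    for nxt in it:
        upcoming, up_ready = prefetch(nxt)              # frame N+1's copy starts now
        torch.cuda.current_stream().wait_event(ready)   # wait only for frame N's copy
        current.record_stream(torch.cuda.current_stream())  # inform the allocator
        with torch.no_grad():
            model(current)                              # compute on N overlaps N+1's copy
        current, ready = upcoming, up_ready
    # (final frame's compute omitted for brevity)
```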


3. Options for Pipeline Parallelism on Single GPU

Option A: CUDA Streams (torch.cuda.Stream)

| Aspect | Assessment |
|---|---|
| True overlap | No for compute-bound models (SuperPoint, LightGlue, DINOv2, LiteSAM) |
| Best for | Overlapping data transfer with compute; small, non-saturating kernels |
| Pros | No process overhead; same address space; simple to try |
| Cons | No benefit for this workload; can be 20%+ slower than sequential |
| Source | "Concurrently test several Pytorch models on single GPU slower than iterative" |

Option B: CUDA Multi-Process Service (MPS)

| Aspect | Assessment |
|---|---|
| True overlap | Yes for multi-process apps; kernels from different processes can run concurrently via Hyper-Q |
| Requirements | Linux or QNX only; compute capability ≥ 3.5; `nvidia-smi -c 3` (exclusive mode); `nvidia-cuda-mps-control -d` |
| RTX 2060 | Compute capability 7.5 (Turing); supported on Linux |
| Memory overhead | Without MPS: ~500 MB VRAM per process. With MPS: shared context, reduced overhead |
| Pros | Real kernel overlap across processes; no code changes; better utilization when each process underutilizes the GPU |
| Cons | Linux-only; requires exclusive GPU mode; designed for MPI/cooperative processes; pre-Volta: 16 clients max, Volta+: 48 |
| Source | NVIDIA MPS docs, TorchServe MPS |
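A sketch of what the two-process layout could look like, assuming the MPS daemon is already running (`nvidia-smi -c 3`, then `nvidia-cuda-mps-control -d`, as in the table above). `load_vo_model` and `load_sat_model` are hypothetical loaders standing in for the real pipelines.

```python
import torch
import torch.multiprocessing as mp

def vo_worker(frames, poses):
    model = load_vo_model().cuda().eval()      # placeholder: SuperPoint + LightGlue
    while (frame := frames.get()) is not None:
        with torch.no_grad():
            poses.put(model(frame.cuda()).cpu())

def sat_worker(frames, fixes):
    model = load_sat_model().cuda().eval()     # placeholder: DINOv2 + LiteSAM
    while (frame := frames.get()) is not None:
        with torch.no_grad():
            fixes.put(model(frame.cuda()).cpu())

if __name__ == "__main__":
    mp.set_start_method("spawn")               # required for CUDA in subprocesses
    vo_in, vo_out = mp.Queue(), mp.Queue()
    sat_in, sat_out = mp.Queue(), mp.Queue()
    for target, args in [(vo_worker, (vo_in, vo_out)),
                         (sat_worker, (sat_in, sat_out))]:
        mp.Process(target=target, args=args, daemon=True).start()
    # With MPS active, kernels from the two processes can genuinely overlap,
    # but only to the extent that neither saturates the GPU on its own.
```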

Option C: Sequential Execution + Async Python

| Aspect | Assessment |
|---|---|
| True overlap | No GPU overlap; logical overlap via async/threading |
| Pattern | Run VO synchronously (latency-critical). Submit satellite matching to a queue; process it in the background; results arrive later |
| Pros | Simple; predictable VO latency; no GPU contention; satellite results used when ready |
| Cons | No throughput gain; total GPU time unchanged |
| Source | BigDL async pipeline (CPU-stage overlap, not single-GPU compute overlap) |


4. Recommended Architecture

  1. Prioritize VO on the GPU: run SuperPoint + LightGlue first; they are latency-critical for navigation.
  2. Run satellite matching sequentially after VO, or in a separate phase when VO is idle.
  3. Use async Python for the pipeline structure: don't block the main loop waiting for satellite results. Queue frame N for satellite matching; when frame N+1 arrives, start VO for N+1 while satellite matching for N may still be running (see the sketch after this list).
  4. Overlap data transfer with compute: prefetch the next frame with pin_memory() and non_blocking=True while the current frame is processed (see Section 2).
  5. Avoid CUDA streams for concurrent model execution (no benefit) and concurrent ONNX + PyTorch on the same GPU (contention).
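A minimal sketch of this structure, with `run_vo`, `run_satellite`, `update_navigation`, and `apply_absolute_fix` as placeholders for the real pipeline functions. The worker thread shares the GPU, so its kernels still serialize with VO's, but the main loop never blocks on satellite results.

```python
import queue
import threading
import torch

sat_in = queue.Queue()
sat_out = queue.Queue()

def satellite_worker():
    # Background consumer: satellite matching for frame N may still be
    # running while the main loop starts VO for frame N+1.
    while (frame := sat_in.get()) is not None:
        with torch.no_grad():
            sat_out.put(run_satellite(frame))   # DINOv2 + LiteSAM, ~250 ms

threading.Thread(target=satellite_worker, daemon=True).start()

def main_loop(frames):
    for frame in frames:
        sat_in.put(frame)                       # queue frame N for a satellite fix
        with torch.no_grad():
            pose = run_vo(frame)                # SuperPoint + LightGlue, ~180 ms
        update_navigation(pose)                 # placeholder consumer
        while not sat_out.empty():              # drain any fixes that finished
            apply_absolute_fix(sat_out.get())
```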

Why this works:

  • VO latency stays predictable (~180ms).
  • Satellite matching completes when it can; results are consumed asynchronously.
  • No GPU contention; no risk of streams serializing and adding overhead.
  • Data transfer overlap can shave a few ms per frame.

5. Memory Overhead of CUDA MPS on RTX 2060

| Scenario | VRAM Overhead |
|---|---|
| Without MPS (2 processes) | ~500 MB × 2 ≈ 1 GB for CUDA contexts alone |
| With MPS | Shared context; well under 1 GB total for contexts |
| RTX 2060 (6 GB) | Model weights + activations + ~1 GB of context → tight; MPS helps when using 2 processes |

Source: pytorch/serve#2128, NVIDIA MPS blog
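To verify how tight 6 GB actually is, a small probe along these lines could help. torch.cuda.mem_get_info wraps cudaMemGetInfo and reports device-wide numbers, so run one process at a time to attribute context cost (an assumption about the measurement setup, not project code).

```python
import torch

torch.cuda.init()                        # force CUDA context creation
free, total = torch.cuda.mem_get_info()  # device-wide free/total bytes
used_mib = (total - free) / 2**20
print(f"device usage after context creation: "
      f"{used_mib:.0f} MiB of {total / 2**20:.0f} MiB")
```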


6. ONNX Runtime + PyTorch Concurrent on Same GPU?

Answer: Not recommended.

  • ONNX Runtime: running 2 models in parallel on the same GPU → ~3× slower than sequential (onnxruntime#15202).
  • ORT maintainer: "You can't tie both sessions to the same device id. Each session should be associated with a different device id." With one GPU, true parallel execution is not supported.
  • Use threading (not multiprocessing) if you must share one GPU; multiprocessing adds overhead and performs worse than sequential.
  • Batching is preferred: single session, batch dimension, one inference call (see the sketch below).
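If ONNX Runtime stays in the picture, the batching pattern would look roughly like this; "model.onnx" and the input name "images" are placeholders for the real exported graph.

```python
import numpy as np
import onnxruntime as ort

# One session, one call: the batch dimension replaces parallel sessions.
session = ort.InferenceSession("model.onnx",             # placeholder path
                               providers=["CUDAExecutionProvider"])

frame_a = np.random.rand(3, 224, 224).astype(np.float32)  # stand-in frames
frame_b = np.random.rand(3, 224, 224).astype(np.float32)
batch = np.stack([frame_a, frame_b])                      # shape (2, 3, 224, 224)
outputs = session.run(None, {"images": batch})            # "images": placeholder input name
```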

Source: microsoft/onnxruntime#15202, microsoft/onnxruntime#9795


Summary Table

| Option | True Overlap? | Throughput vs Sequential | Complexity | Recommended? |
|---|---|---|---|---|
| CUDA streams (same process) | No (compute-bound) | Same or worse | Low | No |
| CUDA MPS (2 processes) | Yes (if underutilized) | Possible gain | Medium | Maybe (Linux) |
| Sequential + async Python | No | Same | Low | Yes |
| Data transfer overlap | Yes (DMA + compute) | Small gain | Low | Yes |
| ONNX + PyTorch concurrent | No | Worse | High | No |
