# GPU Concurrent Workloads Research: VO + Satellite Matching on Single NVIDIA GPU

**Context:** UAV image processing pipeline with two GPU-intensive stages on RTX 2060 (6GB VRAM):

- **Visual Odometry (VO):** SuperPoint (~80ms) + LightGlue (~100ms) ≈ 180ms/frame
- **Satellite matching:** DINOv2 (~50ms) + LiteSAM (~200ms) ≈ 250ms/frame

**Goal:** Determine whether "overlap" of satellite matching for frame N with VO for frame N+1 is achievable and what architecture to use.

---

## 1. Can Two PyTorch Models Run Truly Concurrently on a Single GPU?

### Short Answer: **Usually No** for compute-bound workloads like yours

**What happens with CUDA streams:**

- CUDA streams allow *asynchronous* submission of operations; they do **not** guarantee *parallel* execution.
- PyTorch's `torch.cuda.Stream()` API submits work to different streams, but the GPU scheduler decides actual concurrency.
- **When kernels fully saturate the GPU** (e.g., large matmuls, feature extraction, matching), the GPU runs them **sequentially** because there is no spare compute capacity for overlap.

**PyTorch maintainer (ngimel) on [pytorch/pytorch#59692](https://github.com/pytorch/pytorch/issues/59692):**

> "If operations on a single stream are big enough to saturate the GPU [...], using multiple streams is not expected to help. The situations where multi-stream execution actually helps are pretty limited - you want kernels that don't utilize all the GPU resources (e.g. don't launch many threadblocks), and also you want kernels that run long enough so that overhead of launching and synchronizing streams does not increase total time."

**When overlap *can* occur:**

- Kernels that **do not** saturate the GPU (small thread counts, low occupancy).
- **Data transfer + compute overlap:** DMA transfers (CPU↔GPU) run on a separate hardware path and can overlap with kernel execution. This is the most reliable overlap pattern.
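The transfer/compute overlap pattern can be sketched as a one-ahead prefetcher. This is a minimal sketch, not the tutorial's exact code: `prefetch_frames` is a hypothetical helper, and the CUDA-specific parts (pinning, async copy) are skipped when no GPU is present so the logic also runs on CPU.

```python
import torch

def prefetch_frames(frames, device):
    """Yield each frame already resident on `device`. The host-to-device
    copy for frame N+1 is issued with non_blocking=True before frame N is
    handed to the caller, so the DMA engine can overlap that transfer with
    whatever compute the caller runs on frame N."""
    use_cuda = device.type == "cuda"

    def to_dev(t):
        # Pinned (page-locked) host memory is what makes the copy truly async.
        host = t.pin_memory() if use_cuda else t
        return host.to(device, non_blocking=use_cuda)

    it = iter(frames)
    try:
        nxt = to_dev(next(it))
    except StopIteration:
        return
    for frame in it:
        cur, nxt = nxt, to_dev(frame)  # queue the next copy before yielding
        yield cur
    yield nxt

# usage sketch: for frame in prefetch_frames(loader, torch.device("cuda")): vo(frame)
```

For brevity this issues copies on the default stream; copies there still serialize with kernels on the same stream, so for full overlap the copy should be issued on a dedicated `torch.cuda.Stream` (as DataLoader-style prefetchers do).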
**Sources:**

- [pytorch/pytorch#48279](https://github.com/pytorch/pytorch/issues/48279) – PyTorch streams don't execute concurrently (closed; resolved with large workloads)
- [pytorch/pytorch#59692](https://github.com/pytorch/pytorch/issues/59692) – CUDA streams run sequentially
- [Stack Overflow: Why is torch.cuda.stream() not asynchronous?](https://stackoverflow.com/questions/79697025/why-is-with-torch-cuda-stream-not-asynchronous)

---

## 2. Actual Throughput: Overlapping vs Sequential

### Compute Overlap (VO + Satellite on Same GPU)

| Scenario | Expected Result |
|----------|-----------------|
| **Sequential** | ~180ms + ~250ms = **~430ms** per frame pair |
| **"Overlapping" with streams** | **~430ms or worse** – kernels serialize; stream overhead can add 10–20% |
| **ONNX Runtime, 2 models in parallel** | [Issue #15202](https://github.com/microsoft/onnxruntime/issues/15202): **30ms vs 10ms** – parallel was ~3× slower than sequential |

**Conclusion:** True compute overlap for two GPU-bound models on one GPU is **not achievable** in practice. Throughput is at best equal to sequential, and often worse due to context switching and contention.

### Data Transfer + Compute Overlap

This **does** work and is the main optimization lever:

- Use `pin_memory()` and `non_blocking=True` for CPU→GPU transfers.
- Overlap: transfer frame N+1 to the GPU while processing frame N.
- [PyTorch pinmem_nonblock tutorial](https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html)

---

## 3. Options for Pipeline Parallelism on a Single GPU

### Option A: CUDA Streams (`torch.cuda.Stream`)

| Aspect | Assessment |
|--------|------------|
| **True overlap** | No for compute-bound models (SuperPoint, LightGlue, DINOv2, LiteSAM) |
| **Best for** | Overlapping data transfer with compute; small, non-saturating kernels |
| **Pros** | No process overhead; same address space; simple to try |
| **Cons** | No benefit for this workload; can be 20%+ slower than sequential |
| **Source** | [Concurrently test several PyTorch models on a single GPU slower than iterative](https://stackoverflow.com/questions/78669860/concurrently-test-several-pytorch-models-on-a-single-gpu-slower-than-iterative-a) |

### Option B: CUDA Multi-Process Service (MPS)

| Aspect | Assessment |
|--------|------------|
| **True overlap** | Yes, for *multi-process* apps; kernels from different processes can run concurrently via Hyper-Q |
| **Requirements** | Linux or QNX only; compute capability ≥ 3.5; `nvidia-smi -c 3` (exclusive mode); `nvidia-cuda-mps-control -d` |
| **RTX 2060** | Compute capability 7.5 (Turing) – **supported** on Linux |
| **Memory overhead** | Without MPS: ~500MB VRAM per process context. With MPS: shared context, **reduced** overhead |
| **Pros** | Real kernel overlap across processes; no code changes; better utilization when each process underutilizes the GPU |
| **Cons** | Linux-only; requires exclusive GPU mode; designed for MPI/cooperative processes; pre-Volta: 16 clients max; Volta+: 48 clients |
| **Source** | [NVIDIA MPS docs](https://docs.nvidia.com/deploy/mps/), [TorchServe MPS](https://pytorch.org/serve/hardware_support/nvidia_mps.html) |

### Option C: Sequential Execution + Async Python

| Aspect | Assessment |
|--------|------------|
| **True overlap** | No GPU overlap; **logical** overlap via async/threading |
| **Pattern** | Run VO synchronously (latency-critical). Submit satellite matching to a queue; process in background; results arrive later |
| **Pros** | Simple; predictable VO latency; no GPU contention; satellite results used when ready |
| **Cons** | No throughput gain; total GPU time unchanged |
| **Source** | [BigDL async pipeline](https://bigdl.readthedocs.io/en/v2.4.0/doc/Nano/Howto/Inference/PyTorch/accelerate_pytorch_inference_async_pipeline.html) (CPU-stage overlap, not single-GPU compute overlap) |

---

## 4. Recommended Pattern: Latency-Critical VO + Background Satellite

**Recommended architecture:**

1. **Prioritize VO on the GPU** – Run SuperPoint + LightGlue first; they are latency-critical for navigation.
2. **Run satellite matching sequentially after VO** – Or in a separate phase when VO is idle.
3. **Use async Python for pipeline structure** – Don't block the main loop waiting for satellite results. Queue frame N for satellite matching; when frame N+1 arrives, start VO for N+1 while satellite matching for N may still be running.
4. **Overlap data transfer with compute** – Prefetch the next frame with `pin_memory()` and `non_blocking=True` while the current frame is processed.
5. **Avoid** – CUDA streams for concurrent model execution (no benefit); ONNX Runtime + PyTorch running concurrently on the same GPU (contention).

**Why this works:**

- VO latency stays predictable (~180ms).
- Satellite matching completes when it can; results are consumed asynchronously.
- No GPU contention; no risk of streams serializing and adding overhead.
- Data transfer overlap can shave a few ms per frame.

---

## 5. Memory Overhead of CUDA MPS on RTX 2060

| Scenario | VRAM Overhead |
|----------|---------------|
| **Without MPS** (2 processes) | ~500MB × 2 ≈ **1GB** for contexts alone |
| **With MPS** | Shared context; **less** than 1GB total for contexts |
| **RTX 2060 (6GB)** | Model weights + activations + ~1GB context → tight. MPS helps if using 2 processes |

**Source:** [pytorch/serve#2128](https://github.com/pytorch/serve/issues/2128), [NVIDIA MPS blog](https://developer.nvidia.com/blog/boost-gpu-memory-performance-with-no-code-changes-using-nvidia-cuda-mps/)

---

## 6. ONNX Runtime + PyTorch Concurrent on Same GPU?

**Answer: Not recommended.**

- ONNX Runtime: running 2 models in parallel on the same GPU was **~3× slower** than sequential ([onnxruntime#15202](https://github.com/microsoft/onnxruntime/issues/15202)).
- ORT maintainer: *"You can't tie both sessions to the same device id. Each session should be associated with a different device id."* With one GPU, true parallel execution is not supported.
- Use **threading** (not multiprocessing) if you must share one GPU; multiprocessing adds overhead and performs worse than sequential.
- **Batching** is preferred: single session, batch dimension, one inference call.

**Source:** [microsoft/onnxruntime#15202](https://github.com/microsoft/onnxruntime/issues/15202), [microsoft/onnxruntime#9795](https://github.com/microsoft/onnxruntime/issues/9795)

---

## Summary Table

| Option | True Overlap? | Throughput vs Sequential | Complexity | Recommended? |
|--------|---------------|--------------------------|------------|--------------|
| CUDA streams (same process) | No (compute-bound) | Same or worse | Low | No |
| CUDA MPS (2 processes) | Yes (if underutilized) | Possible gain | Medium | Maybe (Linux) |
| Sequential + async Python | No | Same | Low | **Yes** |
| Data transfer overlap | Yes (DMA + compute) | Small gain | Low | **Yes** |
| ONNX + PyTorch concurrent | No | Worse | High | No |

---

## Source URLs

- https://github.com/pytorch/pytorch/issues/48279
- https://github.com/pytorch/pytorch/issues/59692
- https://stackoverflow.com/questions/79697025/why-is-with-torch-cuda-stream-not-asynchronous
- https://stackoverflow.com/questions/78669860/concurrently-test-several-pytorch-models-on-a-single-gpu-slower-than-iterative-a
- https://docs.nvidia.com/deploy/mps/
- https://pytorch.org/serve/hardware_support/nvidia_mps.html
- https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html
- https://github.com/microsoft/onnxruntime/issues/15202
- https://github.com/microsoft/onnxruntime/issues/9795
- https://www.codegenes.net/blog/pytorch-multiple-models-same-gpu/
- https://bigdl.readthedocs.io/en/v2.4.0/doc/Nano/Howto/Inference/PyTorch/accelerate_pytorch_inference_async_pipeline.html
- https://developer.nvidia.com/blog/boost-gpu-memory-performance-with-no-code-changes-using-nvidia-cuda-mps/
- https://github.com/pytorch/serve/issues/2128
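## Appendix: Sketch of the Recommended Async Pattern

The "sequential + async Python" pattern recommended in section 4 can be sketched as below. This is a minimal sketch: `run_vo` and `run_satellite` are hypothetical stand-ins for the real SuperPoint+LightGlue and DINOv2+LiteSAM calls, and on one GPU the two stages still share compute time; only the latencies are decoupled.

```python
import queue
import threading

def run_pipeline(frames, run_vo, run_satellite):
    """Run VO synchronously per frame (latency-critical) while satellite
    matching for earlier frames drains from a queue in a background
    thread. Total GPU time is unchanged; VO latency stays predictable."""
    sat_in = queue.Queue()
    results = {"vo": [], "sat": []}

    def sat_worker():
        while True:
            frame = sat_in.get()
            if frame is None:  # sentinel: no more frames
                break
            results["sat"].append(run_satellite(frame))

    worker = threading.Thread(target=sat_worker, daemon=True)
    worker.start()

    for frame in frames:
        results["vo"].append(run_vo(frame))  # blocking: navigation needs this now
        sat_in.put(frame)  # satellite matching for frame N runs when the GPU frees up
    sat_in.put(None)
    worker.join()
    return results
```

A plain thread suffices here because PyTorch releases the GIL while GPU kernels execute, so the worker's satellite calls simply interleave with VO submissions on the device's single work queue.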