Overview
GPU acceleration using NVIDIA CUDA dramatically improves inference performance:
- CPU inference: ~100-300ms per frame (YOLOv4)
- GPU inference: ~10-30ms per frame (YOLOv4)
- Speedup: 5-10x
GPU acceleration is automatic when CUDA is available. No code changes required.
Benefits of GPU Acceleration
Faster Detection
A 5-10x speedup means more frequent detection or the ability to process more streams.
Lower frame_skip
Process every 5-10 frames instead of every 30 frames.
More Streams
Handle 12-16 cameras instead of 4-8 with acceptable performance.
Better Responsiveness
Detect persons entering frame within 0.5 seconds instead of 2-3 seconds.
How GPU Detection Works
RTSP Human Capture automatically detects and uses CUDA GPUs.

GPU Detection Logic

From person_detector.py:62-72:
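The referenced code is not reproduced here; the following is a minimal illustrative sketch (not the project's exact code) of the behavior described: select the CUDA DNN backend when a CUDA device is visible, otherwise fall back to CPU.

```python
# Illustrative sketch of the fallback logic (not the project's exact code).
def choose_backend(cuda_device_count: int) -> str:
    """Pick 'cuda' when at least one CUDA device is visible, else 'cpu'."""
    return "cuda" if cuda_device_count > 0 else "cpu"

def configure_net(net, backend: str):
    """Apply the chosen backend to a cv2.dnn network."""
    import cv2
    # These cv2.dnn constants exist in both CPU and CUDA builds of OpenCV;
    # the CUDA targets only take effect in a CUDA-enabled build.
    if backend == "cuda":
        net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
        net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
    else:
        net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
        net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)

print(choose_backend(0))  # → cpu
```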
Requirements
To enable GPU acceleration, you need:

1. NVIDIA GPU
Compatible GPUs:
- NVIDIA GTX 10-series or newer
- NVIDIA RTX 20/30/40-series
- NVIDIA Tesla/Quadro data center GPUs
- Compute Capability 3.5 or higher
2. CUDA Toolkit
Supported versions:
- CUDA 11.2 or newer
- CUDA 12.x recommended

Install the CUDA Toolkit for your platform:
- Ubuntu/Debian
- Fedora/RHEL
- Windows
3. cuDNN (Optional but Recommended)
NVIDIA cuDNN (CUDA Deep Neural Network library) further optimizes performance.

Download:
- Go to https://developer.nvidia.com/cudnn
- Sign up for NVIDIA Developer Program (free)
- Download cuDNN for your CUDA version
- Extract the archive and copy the headers and libraries into your CUDA installation directory
4. CUDA-enabled OpenCV
This is the critical requirement! Standard OpenCV does not include CUDA support. You need opencv-contrib-python compiled with CUDA.

Installing CUDA-enabled OpenCV

Standard OpenCV from PyPI does NOT include CUDA support. You have three options:

Option 1: Pre-built CUDA Wheels (Recommended)
Use pre-compiled wheels from the opencv-python-cuda-wheels project.

Download the appropriate wheel from https://github.com/cudawarped/opencv-python-cuda-wheels/releases/latest, selecting one that matches:
- Your Python version (e.g., cp312 = Python 3.12)
- Your platform (e.g., linux_x86_64)
- Your CUDA version (e.g., cuda122 = CUDA 12.2)
The pyproject.toml in this repository is configured to look for CUDA wheels in the deps/ directory.

Option 2: Build from Source

Build OpenCV with CUDA support yourself, configuring with CMake.
Set CUDA_ARCH_BIN to your GPU's compute capability:
- RTX 3060/3070/3080/3090: 8.6
- RTX 4060/4070/4080/4090: 8.9
- RTX 2060/2070/2080: 7.5
- GTX 1060/1070/1080: 6.1
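The mapping above can be expressed as a small lookup table. This is an illustrative helper (not part of the project) that returns the CUDA_ARCH_BIN value to pass to CMake:

```python
# Illustrative helper (not part of the project): map a GPU model name to
# the CUDA_ARCH_BIN value for CMake, using the table above.
COMPUTE_CAPABILITY = {
    "RTX 40": "8.9",  # RTX 4060/4070/4080/4090
    "RTX 30": "8.6",  # RTX 3060/3070/3080/3090
    "RTX 20": "7.5",  # RTX 2060/2070/2080
    "GTX 10": "6.1",  # GTX 1060/1070/1080
}

def cuda_arch_bin(gpu_name: str) -> str:
    """Return the compute capability string for a known GPU series."""
    for prefix, arch in COMPUTE_CAPABILITY.items():
        if gpu_name.upper().startswith(prefix):
            return arch
    raise ValueError(f"Unknown GPU series: {gpu_name}")

print(cuda_arch_bin("RTX 3080"))  # → 8.6
```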
Option 3: Docker with CUDA
Use NVIDIA's official CUDA container as the base image.

Requires nvidia-docker2 or the NVIDIA Container Toolkit installed on the host.
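A minimal sketch of such a Dockerfile, assuming a Python entry point named main.py and a requirements.txt (both hypothetical names for this illustration; pick a CUDA tag matching your toolkit version):

```dockerfile
# Base image with the CUDA runtime; choose a tag matching your toolkit.
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY . .

# A CUDA-enabled OpenCV wheel must still be installed (see options above).
CMD ["python3", "main.py"]
```

Run the container with GPU access enabled, e.g. `docker run --gpus all ...` (requires the NVIDIA Container Toolkit).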
Verifying GPU Acceleration
Check 1: CUDA Device Count
- Success (devices > 0)
- Failure (devices = 0)
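A quick way to run this check from Python; the call exists in all OpenCV builds, and CPU-only builds simply report 0:

```python
# Reports 0 on CPU-only OpenCV builds or when no NVIDIA GPU is present.
try:
    import cv2
    count = cv2.cuda.getCudaEnabledDeviceCount()
except (ImportError, AttributeError):
    count = 0  # OpenCV not installed, or built without the cuda module
print(f"CUDA devices: {count}")
```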
Check 2: Application Output
Run the application and look for the startup message indicating whether the CUDA backend was selected.

Check 3: GPU Utilization

Monitor GPU usage with nvidia-smi during processing:
- GPU-Util: should be 20-50% during inference
- Memory-Usage: ~300-500 MB for YOLOv4
- Power: should increase when processing
Check 4: Performance Benchmark
Compare inference times:
- CPU Baseline
- GPU Accelerated

To get the CPU baseline, disable GPU temporarily, run, and observe frame processing times in the console output. Expected: detection messages every 1-3 seconds (with frame_skip=15).
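A simple timing harness for such a comparison might look like the following sketch; `benchmark_ms` and the stand-in workload are illustrative, not part of the project:

```python
import time

def benchmark_ms(infer, frames, warmup=3):
    """Illustrative helper: average per-frame inference time in ms."""
    for f in frames[:warmup]:      # warm-up runs, excluded from timing
        infer(f)
    start = time.perf_counter()
    for f in frames:
        infer(f)
    return (time.perf_counter() - start) * 1000.0 / len(frames)

# Stand-in workload; swap in the real detector's forward pass to measure
# CPU vs. GPU inference on identical frames.
avg = benchmark_ms(lambda f: sum(f), [list(range(10_000))] * 30)
print(f"avg: {avg:.3f} ms/frame")
```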
Performance Comparison
Single Stream
| Configuration | Detection Latency | Max Streams |
|---|---|---|
| CPU (YOLOv4) | ~100-300ms | 1-2 |
| GPU (YOLOv4) | ~10-30ms | 8-16 |
| CPU (YOLOv3) | ~80-250ms | 1-3 |
| GPU (YOLOv3) | ~8-25ms | 10-20 |
| CPU (HOG) | ~50-150ms | 2-4 |
Times measured on:
- CPU: Intel i7-9700K @ 3.6GHz
- GPU: NVIDIA RTX 3060 12GB
- Resolution: 1920×1080
Multi-Stream Scalability
With frame_skip=15 (2 fps detection rate):

| Streams | CPU Load | GPU Load | Recommended Hardware |
|---|---|---|---|
| 1-2 | 40-80% | 10-20% | Any |
| 3-4 | 80-100% | 20-35% | CPU: Mid-range, GPU: Any |
| 5-8 | >100% (bottleneck) | 35-60% | GPU: Mid-range (GTX 1660+) |
| 9-16 | N/A | 60-85% | GPU: High-end (RTX 3060+) |
Optimizing GPU Performance
1. Adjust Batch Size
YOLO processes one frame at a time. For multi-stream use, this is actually optimal since:
- Threads queue up at the inference lock
- GPU processes frames sequentially
- No benefit to batching in this architecture
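The queueing behavior above can be sketched with a shared lock; this is an illustrative model of the architecture, not the project's code:

```python
import threading

# Multiple stream threads share one detector; a lock serializes access
# so the GPU sees one inference at a time.
inference_lock = threading.Lock()
results = []

def detect(frame):
    # Stand-in for the real YOLO forward pass.
    return f"detections for {frame}"

def stream_worker(frame):
    with inference_lock:           # threads queue up here
        results.append(detect(frame))

threads = [threading.Thread(target=stream_worker, args=(f"frame-{i}",))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # → 4
```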
2. Lower frame_skip for GPU
With GPU, you can afford more frequent detection (e.g. frame_skip=10 instead of 30):
- 3x more frequent detection
- Still faster than CPU at frame_skip=30
- Better responsiveness
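The arithmetic behind the setting is simple: for a 30 fps stream, frame_skip=15 yields the 2 fps detection rate used elsewhere in this page.

```python
# Effective detection rate for a given stream frame rate and frame_skip.
def detection_fps(stream_fps: float, frame_skip: int) -> float:
    return stream_fps / frame_skip

print(detection_fps(30, 30))  # → 1.0 (CPU-friendly)
print(detection_fps(30, 10))  # → 3.0 (GPU: 3x more frequent)
```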
3. Monitor GPU Memory
Each model is loaded into GPU memory:

| Model | GPU Memory |
|---|---|
| YOLOv4 | ~250 MB |
| YOLOv3 | ~248 MB |
- 1920×1080: ~8 MB per frame
- Intermediate layers: ~50-100 MB
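A rough per-stream VRAM estimate follows from the figures above; the 4-bytes-per-pixel frame size is an assumption, and the layer figure is the upper bound listed:

```python
# Rough, illustrative per-stream VRAM estimate (assumptions noted inline).
frame_mb = 1920 * 1080 * 4 / 1e6   # ≈ 8 MB per 1080p frame (4 B/px assumed)
model_mb = 250                      # YOLOv4 weights (table above)
layers_mb = 100                     # intermediate layers (upper bound above)
total = frame_mb + model_mb + layers_mb
print(f"~{total:.0f} MB per stream")  # → ~358 MB per stream
```

This lands comfortably inside the ~300-500 MB range quoted under Check 3.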
4. Use Appropriate CUDA Arch
When building OpenCV from source, match CUDA_ARCH_BIN to your GPU's compute capability (see the values listed under Option 2 above).
Troubleshooting
CUDA devices: 0

Issue: OpenCV doesn't detect GPU

Diagnosis:
- Check the NVIDIA driver (nvidia-smi): should show GPU info. If not, the driver is not installed.
- Check the CUDA toolkit (nvcc --version): should show the CUDA version. If not, the toolkit is not installed.
- Check the OpenCV build (cv2.getBuildInformation()): should show CUDA-related flags. If not, OpenCV was not built with CUDA.

Solutions:
- Install NVIDIA driver
- Install CUDA toolkit
- Install/build CUDA-enabled OpenCV
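The OpenCV build check can be scripted; this small sketch prints the CUDA-related lines from the build summary (builds compiled without CUDA typically report "NVIDIA CUDA: NO"):

```python
# List CUDA-related lines from OpenCV's build summary.
try:
    import cv2
    cuda_lines = [ln.strip() for ln in cv2.getBuildInformation().splitlines()
                  if "CUDA" in ln or "cuDNN" in ln]
    report = "\n".join(cuda_lines) or "No CUDA entries in build info"
except ImportError:
    report = "OpenCV is not installed"
print(report)
```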
Application still says 'using CPU'

Issue: The application reports CPU mode even though cv2.cuda.getCudaEnabledDeviceCount() returns > 0.

Cause: The OpenCV DNN module was built without CUDA support (it needs OPENCV_DNN_CUDA=ON).

Solution: Use pre-built wheels from opencv-python-cuda-wheels (they have DNN CUDA enabled).
Out of memory errors
Causes:
- GPU doesn’t have enough VRAM
- Multiple applications using GPU
- Memory leak
Solutions:
- Check available memory with nvidia-smi
- Close other GPU applications (Chrome, games, etc.)
- Use a smaller model (YOLOv3-tiny instead of YOLOv4)
- Process fewer streams
GPU utilization is low

Issue: GPU-Util in nvidia-smi shows less than 10%

Causes:
- Not enough streams (GPU waiting for CPU)
- frame_skip too high
- Display rendering is the bottleneck
Solutions:
- Lower frame_skip
- Add more streams: the GPU can handle 8-16 streams efficiently
- Disable display rendering
Slower than expected

Issue: GPU not providing the expected speedup

Check:
- GPU actually being used: GPU-Util should be >0% and spike during detection.
- Power mode: should show "P2" or "P0" (performance mode), not "P8" (idle).
- Thermal throttling: check the temperature in nvidia-smi. If >80°C, the GPU may be throttling.
- CUDA architecture mismatch: rebuild OpenCV with the correct CUDA_ARCH_BIN for your GPU.
GPU Selection (Multi-GPU Systems)
If you have multiple GPUs, OpenCV uses GPU 0 by default. To select a different GPU, call cv2.cuda.setDevice() (or restrict visibility with the CUDA_VISIBLE_DEVICES environment variable) before loading the model.

Best Practices
Use Pre-built Wheels
Easier and more reliable than building from source. Get from opencv-python-cuda-wheels.
Monitor GPU Usage
Keep nvidia-smi running in a separate terminal to watch GPU utilization.

Update Drivers Regularly
Newer NVIDIA drivers often include performance improvements.
Match CUDA Versions
Ensure OpenCV CUDA version matches installed CUDA toolkit version.
Test Before Deploying
Verify GPU acceleration works with test streams before production deployment.
Plan for Scaling
GPU allows 3-5x more streams than CPU. Plan hardware accordingly.
Recommended Hardware
Budget Setup ($300-500)
GPU: NVIDIA GTX 1660 Super (6GB)
- 4-8 streams at 1080p
- 2 fps detection rate
- YOLOv4
Mid-Range Setup ($500-800)
GPU: NVIDIA RTX 3060 (12GB)
- 8-12 streams at 1080p
- 2-3 fps detection rate
- YOLOv4
- Room for growth
High-End Setup ($1000+)
GPU: NVIDIA RTX 4070 (12GB) or RTX 3080 (10GB)
- 12-16 streams at 1080p
- 3-5 fps detection rate
- YOLOv4
- Multiple instances possible
Enterprise/Data Center
GPU: NVIDIA A4000/A5000 or Tesla T4
- 16-24 streams at 1080p
- 5+ fps detection rate
- ECC memory
- 24/7 reliability
All recommendations assume:
- 1920×1080 resolution streams
- YOLOv4 model
- frame_skip tuned appropriately