
Overview

GPU acceleration using NVIDIA CUDA dramatically improves inference performance:
  • CPU inference: ~100-300ms per frame (YOLOv4)
  • GPU inference: ~10-30ms per frame (YOLOv4)
  • Speedup: 5-10x faster
GPU acceleration is automatic when CUDA is available. No code changes required.

Benefits of GPU Acceleration

Faster Detection

A 10x speedup enables more frequent detection or lets you process more streams.

Lower frame_skip

Process every 5-10 frames instead of every 30 frames.

More Streams

Handle 12-16 cameras instead of 4-8 with acceptable performance.

Better Responsiveness

Detect persons entering frame within 0.5 seconds instead of 2-3 seconds.
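The responsiveness figures above follow from frame skipping plus inference time. A back-of-the-envelope helper (illustrative values, not taken from the codebase):

```python
def detection_latency_s(frame_skip: int, stream_fps: float, inference_ms: float) -> float:
    """Worst-case delay before a newly appeared person is detected:
    wait for the next processed frame, then run one inference."""
    return frame_skip / stream_fps + inference_ms / 1000.0

# GPU: process every 15th frame of a 30 fps stream, ~20 ms inference
print(round(detection_latency_s(15, 30.0, 20.0), 2))  # 0.52
# CPU: process every 30th frame, ~200 ms inference
print(round(detection_latency_s(30, 30.0, 200.0), 2))  # 1.2
```

Queueing and decode delays add on top of this, which is why CPU setups in practice drift toward the 2-3 second range mentioned above.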

How GPU Detection Works

RTSP Human Capture automatically detects and uses CUDA GPUs:

GPU Detection Logic

From person_detector.py:62-72:
# Try to use NVIDIA GPU via CUDA backend
if self.net is not None:
    cuda_available = cv2.cuda.getCudaEnabledDeviceCount() > 0
    if cuda_available:
        print("CUDA available, using GPU for inference")
        self.net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
        self.net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
    else:
        print("CUDA not available, using CPU for inference")
        self.net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
        self.net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)
1. Check CUDA device count

cuda_available = cv2.cuda.getCudaEnabledDeviceCount() > 0
Returns number of CUDA-capable GPUs detected by OpenCV.
2. Configure DNN backend

If CUDA available:
self.net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
self.net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
If not:
self.net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
self.net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)
3. Run inference

self.net.setInput(blob)
outputs = self.net.forward(self.output_layers)
Automatically uses GPU if configured, CPU otherwise.
No configuration needed! If you have CUDA installed and CUDA-enabled OpenCV, GPU acceleration is automatic.

Requirements

To enable GPU acceleration, you need:
Compatible GPUs:
  • NVIDIA GTX 10-series or newer
  • NVIDIA RTX 20/30/40-series
  • NVIDIA Tesla/Quadro data center GPUs
  • Compute Capability 5.0 or higher (required by CUDA 12; CUDA 11 still supports 3.5+)
Check your GPU:
lspci | grep -i nvidia
Expected output:
01:00.0 VGA compatible controller: NVIDIA Corporation GA106 [GeForce RTX 3060]
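Compute capability can also be read from nvidia-smi on recent drivers via the `compute_cap` query field (the field name and its availability on your driver are the assumptions here). A small parser for that output:

```python
def parse_gpu_query(csv_line: str) -> tuple[str, float]:
    """Parse one line of:
      nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
    (the compute_cap field requires a reasonably recent driver)."""
    name, cap = (field.strip() for field in csv_line.split(","))
    return name, float(cap)

print(parse_gpu_query("NVIDIA GeForce RTX 3060, 8.6"))
```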
Supported versions:
  • CUDA 11.2 or newer
  • CUDA 12.x recommended
Check CUDA version:
nvcc --version
Expected output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Install CUDA Toolkit:
# Add NVIDIA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# Install CUDA
sudo apt-get install cuda-toolkit-12-4
This is the critical requirement! Standard OpenCV doesn’t include CUDA support. You need opencv-contrib-python compiled with CUDA.

Installing CUDA-enabled OpenCV

Standard OpenCV from PyPI does NOT include CUDA support. You have three options.

Option 1: Pre-built Wheels (Recommended)

Use pre-compiled wheels from the opencv-python-cuda-wheels project:
1. Download appropriate wheel

Visit: https://github.com/cudawarped/opencv-python-cuda-wheels/releases/latest
Select a wheel matching:
  • Your Python version (e.g., cp312 = Python 3.12)
  • Your platform (e.g., linux_x86_64)
  • Your CUDA version (e.g., cuda122 = CUDA 12.2)
Example filename:
opencv_contrib_python-4.9.0+cuda122-cp312-cp312-linux_x86_64.whl
2. Create deps directory

mkdir -p deps
cd deps
3. Download wheel

wget https://github.com/cudawarped/opencv-python-cuda-wheels/releases/download/latest/opencv_contrib_python-4.9.0+cuda122-cp312-cp312-linux_x86_64.whl
Replace with your specific wheel URL.
4. Install with uv

cd ..
uv pip install deps/opencv_contrib_python-4.9.0+cuda122-cp312-cp312-linux_x86_64.whl
5. Verify CUDA support

python -c "import cv2; print('CUDA devices:', cv2.cuda.getCudaEnabledDeviceCount())"
Expected output:
CUDA devices: 1
If you see 0, CUDA is not available.
The pyproject.toml in this repository is configured to look for CUDA wheels in the deps/ directory.

Option 2: Build from Source

Build OpenCV with CUDA support yourself:
This is time-consuming (1-2 hours) and error-prone. Only recommended if pre-built wheels don’t work.
1. Install build dependencies

sudo apt-get update
sudo apt-get install -y \
    build-essential cmake git pkg-config \
    libjpeg-dev libpng-dev libtiff-dev \
    libavcodec-dev libavformat-dev libswscale-dev \
    libv4l-dev libxvidcore-dev libx264-dev \
    libgtk-3-dev libatlas-base-dev gfortran \
    python3-dev
2. Clone OpenCV repositories

git clone https://github.com/opencv/opencv.git
git clone https://github.com/opencv/opencv_contrib.git
cd opencv
mkdir build
cd build
3. Configure with CMake

cmake -D CMAKE_BUILD_TYPE=RELEASE \
    -D CMAKE_INSTALL_PREFIX=/usr/local \
    -D OPENCV_EXTRA_MODULES_PATH=../../opencv_contrib/modules \
    -D WITH_CUDA=ON \
    -D CUDA_ARCH_BIN=8.6 \
    -D WITH_CUDNN=ON \
    -D OPENCV_DNN_CUDA=ON \
    -D ENABLE_FAST_MATH=ON \
    -D CUDA_FAST_MATH=ON \
    -D WITH_CUBLAS=ON \
    -D BUILD_opencv_python3=ON \
    ..
Replace CUDA_ARCH_BIN with your GPU’s compute capability:
  • RTX 3060/3070/3080/3090: 8.6
  • RTX 4060/4070/4080/4090: 8.9
  • RTX 2060/2070/2080: 7.5
  • GTX 1060/1070/1080: 6.1
Check your GPU: https://developer.nvidia.com/cuda-gpus
4. Build (this takes 1-2 hours)

make -j$(nproc)
sudo make install
sudo ldconfig
5. Verify installation

python3 -c "import cv2; print(cv2.getBuildInformation())" | grep -i cuda
Should show CUDA-related build flags.

Option 3: Docker with CUDA

Use NVIDIA’s official CUDA container:
Dockerfile
FROM nvidia/cuda:12.4.0-cudnn-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y \
    python3.12 python3-pip \
    libgl1-mesa-glx libglib2.0-0

WORKDIR /app
COPY . /app

RUN pip install uv
RUN uv sync

CMD ["uv", "run", "main.py"]
# Build container
docker build -t rtsp-human-capture .

# Run with GPU access
docker run --gpus all rtsp-human-capture \
    --rtsp "rtsp://camera.local/stream" --save image
Requires nvidia-docker2 or NVIDIA Container Toolkit installed on host.

Verifying GPU Acceleration

Check 1: CUDA Device Count

python -c "import cv2; print('CUDA devices:', cv2.cuda.getCudaEnabledDeviceCount())"
Expected output:
CUDA devices: 1
OpenCV detects your GPU. GPU acceleration will work.

Check 2: Application Output

Run the application and look for the startup message:
python main.py --rtsp "rtsp://camera.local/stream" --save image
With GPU:
Loading person detection model...
CUDA available, using GPU for inference
Model loaded: YOLOv4
Without GPU:
Loading person detection model...
CUDA not available, using CPU for inference
Model loaded: YOLOv4

Check 3: GPU Utilization

Monitor GPU usage during processing:
watch -n 1 nvidia-smi
Expected output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01   Driver Version: 535.183.01   CUDA Version: 12.4   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 30%   45C    P2    85W / 170W |   1024MiB / 12288MiB |     35%      Default |
+-------------------------------+----------------------+----------------------+
Key metrics:
  • GPU-Util: Should be 20-50% during inference
  • Memory-Usage: ~300-500 MB for YOLOv4
  • Power: Should increase when processing
If GPU-Util stays at 0%, GPU is not being used despite CUDA being available.

Check 4: Performance Benchmark

Compare inference times:
Disable GPU temporarily:
# Modify person_detector.py line 64-65:
cuda_available = False  # Force CPU
Run and observe frame processing times in console output.
Expected: Detection messages every 1-3 seconds (with frame_skip=15).
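A minimal timing harness for this comparison; with a loaded net you would wrap the forward pass in a closure (the workload below is only a stand-in):

```python
import time

def bench_ms(fn, warmup: int = 3, iters: int = 20) -> float:
    """Average wall-clock milliseconds per call; the warmup iterations
    absorb one-time costs such as CUDA kernel initialization."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1000.0

# With a loaded network this would be:
#   bench_ms(lambda: net.forward(output_layers))
print(bench_ms(lambda: sum(range(10_000))) > 0.0)  # True
```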

Performance Comparison

Single Stream

| Configuration | Detection Latency | Max Streams |
|---------------|-------------------|-------------|
| CPU (YOLOv4)  | ~100-300ms        | 1-2         |
| GPU (YOLOv4)  | ~10-30ms          | 8-16        |
| CPU (YOLOv3)  | ~80-250ms         | 1-3         |
| GPU (YOLOv3)  | ~8-25ms           | 10-20       |
| CPU (HOG)     | ~50-150ms         | 2-4         |
Times measured on:
  • CPU: Intel i7-9700K @ 3.6GHz
  • GPU: NVIDIA RTX 3060 12GB
  • Resolution: 1920×1080

Multi-Stream Scalability

With frame_skip=15 (2 fps detection rate):
| Streams | CPU Load           | GPU Load | Recommended Hardware       |
|---------|--------------------|----------|----------------------------|
| 1-2     | 40-80%             | 10-20%   | Any                        |
| 3-4     | 80-100%            | 20-35%   | CPU: Mid-range, GPU: Any   |
| 5-8     | >100% (bottleneck) | 35-60%   | GPU: Mid-range (GTX 1660+) |
| 9-16    | N/A                | 60-85%   | GPU: High-end (RTX 3060+)  |
CPU bottleneck: With >3 streams on CPU, frame_skip must be increased to 30+ for usable performance.
GPU bottleneck: With >12 streams on a mid-range GPU, consider raising frame_skip or using multiple instances.
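The GPU load column can be sanity-checked with a rough model: serialized inference keeps the GPU busy for roughly streams × detection rate × inference time. This ignores decode and host-device transfer overhead, so treat it as a lower bound:

```python
def gpu_busy_fraction(streams: int, detection_fps: float, inference_ms: float) -> float:
    """Fraction of each second the GPU spends in forward passes when
    inference calls are serialized across stream threads."""
    return streams * detection_fps * inference_ms / 1000.0

print(round(gpu_busy_fraction(8, 2.0, 20.0), 2))   # 0.32 -> ~32% GPU-Util
print(round(gpu_busy_fraction(16, 2.0, 20.0), 2))  # 0.64 -> ~64% GPU-Util
```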

Optimizing GPU Performance

1. Batch Size: No Tuning Needed

YOLO processes one frame at a time. For multi-stream use this is already optimal, since:
  • Threads queue up at the inference lock
  • GPU processes frames sequentially
  • No benefit to batching in this architecture
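That queueing behaviour can be sketched with a shared lock; the sleep below is a stand-in for `net.forward()`, not the project's actual code:

```python
import threading
import time

# Many stream threads share one network, so a single lock serializes
# inference: threads queue here and the GPU sees one frame at a time.
inference_lock = threading.Lock()
processed = []

def detect(frame_id: int) -> None:
    with inference_lock:
        time.sleep(0.005)          # stand-in for net.forward()
        processed.append(frame_id)

threads = [threading.Thread(target=detect, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(processed))  # [0, 1, 2, 3]
```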

2. Lower frame_skip for GPU

With GPU, you can afford more frequent detection:
# CPU: Process every 30th frame
python main.py --rtsp-file streams.txt --save video --frame-skip 30

# GPU: Process every 10th frame
python main.py --rtsp-file streams.txt --save video --frame-skip 10
Result:
  • 3x more frequent detection
  • Still faster than CPU at frame_skip=30
  • Better responsiveness
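How frame_skip translates into detection rate, using the values from the example above:

```python
def detections_per_second(stream_fps: float, frame_skip: int) -> float:
    """Detection rate when every frame_skip-th frame is processed."""
    return stream_fps / frame_skip

print(detections_per_second(30.0, 30))  # 1.0  (CPU setting)
print(detections_per_second(30.0, 10))  # 3.0  (GPU setting: 3x more frequent)
```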

3. Monitor GPU Memory

GPU memory used by each loaded model:

| Model  | GPU Memory |
|--------|------------|
| YOLOv4 | ~250 MB    |
| YOLOv3 | ~248 MB    |
Plus per-frame buffers:
  • 1920×1080: ~8 MB per frame
  • Intermediate layers: ~50-100 MB
Total: ~350-450 MB for single instance
Running multiple instances? Ensure total GPU memory usage < 80% of available VRAM:
  • RTX 3060 (12GB): Can run 20+ instances
  • GTX 1660 (6GB): Can run 10+ instances
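The instance counts above come from dividing usable VRAM by the ~450 MB per-instance estimate; a small capacity-planning helper (the 80% headroom matches the guideline above):

```python
def max_instances(vram_mb: int, per_instance_mb: int = 450,
                  headroom: float = 0.8) -> int:
    """Number of detector instances that fit in headroom * VRAM."""
    return int(vram_mb * headroom) // per_instance_mb

print(max_instances(12288))  # 21 -> "20+" on an RTX 3060 12GB
print(max_instances(6144))   # 10 -> "10+" on a GTX 1660 6GB
```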

4. Use Appropriate CUDA Arch

When building OpenCV from source, match CUDA_ARCH_BIN to your GPU:
cmake -D CUDA_ARCH_BIN=8.6 ...  # RTX 3060
Mismatch causes performance loss (10-30% slower).

Troubleshooting

Issue: OpenCV doesn’t detect GPU
Diagnosis:
  1. Check NVIDIA driver:
    nvidia-smi
    
    Should show GPU info. If not, driver not installed.
  2. Check CUDA toolkit:
    nvcc --version
    
    Should show CUDA version. If not, toolkit not installed.
  3. Check OpenCV build:
    python -c "import cv2; print(cv2.getBuildInformation())" | grep -i cuda
    
    Should show CUDA-related flags. If not, OpenCV not built with CUDA.
Solution:
  • Install NVIDIA driver
  • Install CUDA toolkit
  • Install/build CUDA-enabled OpenCV
Issue:
CUDA not available, using CPU for inference
Even though cv2.cuda.getCudaEnabledDeviceCount() returns > 0.
Cause: OpenCV DNN module built without CUDA support (needs OPENCV_DNN_CUDA=ON).
Verify:
python -c "import cv2; print(cv2.getBuildInformation())" | grep -i "dnn.*cuda"
Should show:
OPENCV_DNN_CUDA:                 YES
Solution: Use pre-built wheels from opencv-python-cuda-wheels (they have DNN CUDA enabled).
Error:
CUDA error: out of memory
Causes:
  • GPU doesn’t have enough VRAM
  • Multiple applications using GPU
  • Memory leak
Solutions:
  1. Check available memory:
    nvidia-smi
    
  2. Close other GPU applications (Chrome, games, etc.)
  3. Use smaller model (YOLOv3-tiny instead of YOLOv4)
  4. Process fewer streams
Issue: GPU-Util in nvidia-smi shows less than 10%
Causes:
  • Not enough streams (GPU waiting for CPU)
  • frame_skip too high
  • Display rendering is bottleneck
Solutions:
  1. Lower frame_skip:
    --frame-skip 5  # More frequent detection
    
  2. Add more streams: GPU can handle 8-16 streams efficiently
  3. Disable display:
    # Remove --display flag
    
Issue: GPU not providing expected speedup
Check:
  1. GPU actually being used:
    watch -n 1 nvidia-smi
    
    GPU-Util should be >0% and spike during detection.
  2. Power mode:
    nvidia-smi -q -d PERFORMANCE
    
    Should show “P2” or “P0” (performance mode), not “P8” (idle).
  3. Thermal throttling: Check temperature in nvidia-smi. If >80°C, may be throttling.
  4. CUDA architecture mismatch: Rebuild OpenCV with correct CUDA_ARCH_BIN for your GPU.

GPU Selection (Multi-GPU Systems)

If you have multiple GPUs, OpenCV uses GPU 0 by default. To select a different GPU:
# Use GPU 1
export CUDA_VISIBLE_DEVICES=1
python main.py --rtsp "rtsp://..." --save image

# Use GPUs 0 and 2 (for multiple instances)
export CUDA_VISIBLE_DEVICES=0,2
For multiple instances across GPUs:
# Terminal 1: Use GPU 0
CUDA_VISIBLE_DEVICES=0 python main.py --rtsp-file cameras_1-8.txt --save video

# Terminal 2: Use GPU 1
CUDA_VISIBLE_DEVICES=1 python main.py --rtsp-file cameras_9-16.txt --save video
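Splitting cameras across GPUs can be scripted; `partition_streams` below is a hypothetical helper (not in the codebase). Note that CUDA_VISIBLE_DEVICES must be set before the process creates its CUDA context, so set it per-process at launch rather than after importing cv2:

```python
def partition_streams(streams: list[str], gpu_ids: list[int]) -> dict[int, list[str]]:
    """Round-robin RTSP URLs across GPUs; launch one process per
    group with CUDA_VISIBLE_DEVICES set to that group's GPU id."""
    groups: dict[int, list[str]] = {gpu: [] for gpu in gpu_ids}
    for i, url in enumerate(streams):
        groups[gpu_ids[i % len(gpu_ids)]].append(url)
    return groups

cams = [f"rtsp://cam{i}.local/stream" for i in range(4)]
print(partition_streams(cams, [0, 1]))
# {0: ['rtsp://cam0.local/stream', 'rtsp://cam2.local/stream'],
#  1: ['rtsp://cam1.local/stream', 'rtsp://cam3.local/stream']}
```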

Best Practices

Use Pre-built Wheels

Easier and more reliable than building from source. Get from opencv-python-cuda-wheels.

Monitor GPU Usage

Keep nvidia-smi running in a separate terminal to watch GPU utilization.

Update Drivers Regularly

Newer NVIDIA drivers often include performance improvements.

Match CUDA Versions

Ensure OpenCV CUDA version matches installed CUDA toolkit version.

Test Before Deploying

Verify GPU acceleration works with test streams before production deployment.

Plan for Scaling

GPU allows 3-5x more streams than CPU. Plan hardware accordingly.

Hardware Recommendations

Budget Setup ($300-500)

GPU: NVIDIA GTX 1660 Super (6GB)
  • 4-8 streams at 1080p
  • 2 fps detection rate
  • YOLOv4

Mid-Range Setup ($500-800)

GPU: NVIDIA RTX 3060 (12GB)
  • 8-12 streams at 1080p
  • 2-3 fps detection rate
  • YOLOv4
  • Room for growth

High-End Setup ($1000+)

GPU: NVIDIA RTX 4070 (12GB) or RTX 3080 (10GB)
  • 12-16 streams at 1080p
  • 3-5 fps detection rate
  • YOLOv4
  • Multiple instances possible

Enterprise/Data Center

GPU: NVIDIA A4000/A5000 or Tesla T4
  • 16-24 streams at 1080p
  • 5+ fps detection rate
  • ECC memory
  • 24/7 reliability
All recommendations assume:
  • 1920×1080 resolution streams
  • YOLOv4 model
  • frame_skip tuned appropriately

Docker GPU Setup

For containerized deployments with GPU:

Install NVIDIA Container Toolkit

# Add repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

Test GPU Access

docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
Should show your GPU info.

Run Application with GPU

docker run --gpus all \
  -v $(pwd)/output:/app/output \
  -v $(pwd)/model:/app/model \
  rtsp-human-capture \
  --rtsp "rtsp://camera.local/stream" --save image

Next Steps