Overview
GPU acceleration using NVIDIA CUDA dramatically improves inference performance:
- CPU inference: ~100-300ms per frame (YOLOv4)
- GPU inference: ~10-30ms per frame (YOLOv4)
- Speedup: 5-10x
GPU acceleration is automatic when CUDA is available. No code changes required.
Benefits of GPU Acceleration
Faster Detection
A 5-10x speedup means more frequent detection or the ability to process more streams.
Lower frame_skip
Process every 5-10 frames instead of every 30 frames.
More Streams
Handle 12-16 cameras instead of 4-8 with acceptable performance.
Better Responsiveness
Detect persons entering frame within 0.5 seconds instead of 2-3 seconds.
How GPU Detection Works
RTSP Human Capture automatically detects and uses CUDA GPUs.

GPU Detection Logic

From person_detector.py:62-72:
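The referenced code is not reproduced here; the following is a minimal illustrative sketch (not the project's exact code) of the behavior described: select the CUDA DNN backend when a CUDA device is visible, otherwise fall back to CPU.

```python
# Illustrative sketch of the fallback logic (not the project's exact code).
def choose_backend(cuda_device_count: int) -> str:
    """Pick 'cuda' when at least one CUDA device is visible, else 'cpu'."""
    return "cuda" if cuda_device_count > 0 else "cpu"

def configure_net(net, backend: str):
    """Apply the chosen backend to a cv2.dnn network."""
    import cv2
    # These cv2.dnn constants exist in both CPU and CUDA builds of OpenCV;
    # the CUDA targets only take effect in a CUDA-enabled build.
    if backend == "cuda":
        net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
        net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
    else:
        net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
        net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)

print(choose_backend(0))  # → cpu
```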
Requirements
To enable GPU acceleration, you need:

1. NVIDIA GPU
Compatible GPUs:
- NVIDIA GTX 10-series or newer
- NVIDIA RTX 20/30/40-series
- NVIDIA Tesla/Quadro data center GPUs
- Compute Capability 3.5 or higher
2. CUDA Toolkit
Supported versions:
- CUDA 11.2 or newer
- CUDA 12.x recommended

Install the CUDA Toolkit for your platform:
- Ubuntu/Debian
- Fedora/RHEL
- Windows
3. cuDNN (Optional but Recommended)
NVIDIA cuDNN (CUDA Deep Neural Network library) further optimizes performance.

Download:
- Go to https://developer.nvidia.com/cudnn
- Sign up for NVIDIA Developer Program (free)
- Download cuDNN for your CUDA version
- Extract the archive and copy the headers and libraries into your CUDA installation directory
4. CUDA-enabled OpenCV
This is the critical requirement! Standard OpenCV does not include CUDA support. You need opencv-contrib-python compiled with CUDA.

Installing CUDA-enabled OpenCV

Standard OpenCV from PyPI does NOT include CUDA support. You have three options:

Option 1: Pre-built CUDA Wheels (Recommended)
Use pre-compiled wheels from the opencv-python-cuda-wheels project.

Download the appropriate wheel from https://github.com/cudawarped/opencv-python-cuda-wheels/releases/latest, selecting one that matches:
- Your Python version (e.g., cp312 = Python 3.12)
- Your platform (e.g., linux_x86_64)
- Your CUDA version (e.g., cuda122 = CUDA 12.2)
The pyproject.toml in this repository is configured to look for CUDA wheels in the deps/ directory.

Option 2: Build from Source

Build OpenCV with CUDA support yourself, configuring with CMake.
Set CUDA_ARCH_BIN to your GPU's compute capability:
- RTX 3060/3070/3080/3090: 8.6
- RTX 4060/4070/4080/4090: 8.9
- RTX 2060/2070/2080: 7.5
- GTX 1060/1070/1080: 6.1
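The mapping above can be expressed as a small lookup table. This is an illustrative helper (not part of the project) that returns the CUDA_ARCH_BIN value to pass to CMake:

```python
# Illustrative helper (not part of the project): map a GPU model name to
# the CUDA_ARCH_BIN value for CMake, using the table above.
COMPUTE_CAPABILITY = {
    "RTX 40": "8.9",  # RTX 4060/4070/4080/4090
    "RTX 30": "8.6",  # RTX 3060/3070/3080/3090
    "RTX 20": "7.5",  # RTX 2060/2070/2080
    "GTX 10": "6.1",  # GTX 1060/1070/1080
}

def cuda_arch_bin(gpu_name: str) -> str:
    """Return the compute capability string for a known GPU series."""
    for prefix, arch in COMPUTE_CAPABILITY.items():
        if gpu_name.upper().startswith(prefix):
            return arch
    raise ValueError(f"Unknown GPU series: {gpu_name}")

print(cuda_arch_bin("RTX 3080"))  # → 8.6
```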
Option 3: Docker with CUDA
Use NVIDIA's official CUDA container as the base image.

Requires nvidia-docker2 or the NVIDIA Container Toolkit installed on the host.
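A minimal sketch of such a Dockerfile, assuming a Python entry point named main.py and a requirements.txt (both hypothetical names for this illustration; pick a CUDA tag matching your toolkit version):

```dockerfile
# Base image with the CUDA runtime; choose a tag matching your toolkit.
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY . .

# A CUDA-enabled OpenCV wheel must still be installed (see options above).
CMD ["python3", "main.py"]
```

Run the container with GPU access enabled, e.g. `docker run --gpus all ...` (requires the NVIDIA Container Toolkit).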
Verifying GPU Acceleration
Check 1: CUDA Device Count
- Success (devices > 0)
- Failure (devices = 0)
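A quick way to run this check from Python; the call exists in all OpenCV builds, and CPU-only builds simply report 0:

```python
# Reports 0 on CPU-only OpenCV builds or when no NVIDIA GPU is present.
try:
    import cv2
    count = cv2.cuda.getCudaEnabledDeviceCount()
except (ImportError, AttributeError):
    count = 0  # OpenCV not installed, or built without the cuda module
print(f"CUDA devices: {count}")
```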
Check 2: Application Output
Run the application and look for the startup message indicating whether the CUDA backend was selected.

Check 3: GPU Utilization

Monitor GPU usage with nvidia-smi during processing:
- GPU-Util: should be 20-50% during inference
- Memory-Usage: ~300-500 MB for YOLOv4
- Power: should increase when processing
Check 4: Performance Benchmark
Compare inference times:
- CPU Baseline
- GPU Accelerated

To get the CPU baseline, disable GPU temporarily, run, and observe frame processing times in the console output. Expected: detection messages every 1-3 seconds (with frame_skip=15).
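A simple timing harness for such a comparison might look like the following sketch; `benchmark_ms` and the stand-in workload are illustrative, not part of the project:

```python
import time

def benchmark_ms(infer, frames, warmup=3):
    """Illustrative helper: average per-frame inference time in ms."""
    for f in frames[:warmup]:      # warm-up runs, excluded from timing
        infer(f)
    start = time.perf_counter()
    for f in frames:
        infer(f)
    return (time.perf_counter() - start) * 1000.0 / len(frames)

# Stand-in workload; swap in the real detector's forward pass to measure
# CPU vs. GPU inference on identical frames.
avg = benchmark_ms(lambda f: sum(f), [list(range(10_000))] * 30)
print(f"avg: {avg:.3f} ms/frame")
```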
Performance Comparison
Single Stream
| Configuration | Detection Latency | Max Streams |
|---|---|---|
| CPU (YOLOv4) | ~100-300ms | 1-2 |
| GPU (YOLOv4) | ~10-30ms | 8-16 |
| CPU (YOLOv3) | ~80-250ms | 1-3 |
| GPU (YOLOv3) | ~8-25ms | 10-20 |
| CPU (HOG) | ~50-150ms | 2-4 |
Times measured on:
- CPU: Intel i7-9700K @ 3.6GHz
- GPU: NVIDIA RTX 3060 12GB
- Resolution: 1920×1080
Multi-Stream Scalability
With frame_skip=15 (2 fps detection rate):

| Streams | CPU Load | GPU Load | Recommended Hardware |
|---|---|---|---|
| 1-2 | 40-80% | 10-20% | Any |
| 3-4 | 80-100% | 20-35% | CPU: Mid-range, GPU: Any |
| 5-8 | >100% (bottleneck) | 35-60% | GPU: Mid-range (GTX 1660+) |
| 9-16 | N/A | 60-85% | GPU: High-end (RTX 3060+) |
Optimizing GPU Performance
1. Adjust Batch Size
YOLO processes one frame at a time. For multi-stream use, this is actually optimal since:
- Threads queue up at the inference lock
- GPU processes frames sequentially
- No benefit to batching in this architecture
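The queueing behavior above can be sketched with a shared lock; this is an illustrative model of the architecture, not the project's code:

```python
import threading

# Multiple stream threads share one detector; a lock serializes access
# so the GPU sees one inference at a time.
inference_lock = threading.Lock()
results = []

def detect(frame):
    # Stand-in for the real YOLO forward pass.
    return f"detections for {frame}"

def stream_worker(frame):
    with inference_lock:           # threads queue up here
        results.append(detect(frame))

threads = [threading.Thread(target=stream_worker, args=(f"frame-{i}",))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # → 4
```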
2. Lower frame_skip for GPU
With GPU, you can afford more frequent detection (e.g. frame_skip=10 instead of 30):
- 3x more frequent detection
- Still faster than CPU at frame_skip=30
- Better responsiveness
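The arithmetic behind the setting is simple: for a 30 fps stream, frame_skip=15 yields the 2 fps detection rate used elsewhere in this page.

```python
# Effective detection rate for a given stream frame rate and frame_skip.
def detection_fps(stream_fps: float, frame_skip: int) -> float:
    return stream_fps / frame_skip

print(detection_fps(30, 30))  # → 1.0 (CPU-friendly)
print(detection_fps(30, 10))  # → 3.0 (GPU: 3x more frequent)
```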
3. Monitor GPU Memory
Each model is loaded into GPU memory:

| Model | GPU Memory |
|---|---|
| YOLOv4 | ~250 MB |
| YOLOv3 | ~248 MB |
- 1920×1080: ~8 MB per frame
- Intermediate layers: ~50-100 MB
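A rough per-stream VRAM estimate follows from the figures above; the 4-bytes-per-pixel frame size is an assumption, and the layer figure is the upper bound listed:

```python
# Rough, illustrative per-stream VRAM estimate (assumptions noted inline).
frame_mb = 1920 * 1080 * 4 / 1e6   # ≈ 8 MB per 1080p frame (4 B/px assumed)
model_mb = 250                      # YOLOv4 weights (table above)
layers_mb = 100                     # intermediate layers (upper bound above)
total = frame_mb + model_mb + layers_mb
print(f"~{total:.0f} MB per stream")  # → ~358 MB per stream
```

This lands comfortably inside the ~300-500 MB range quoted under Check 3.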
4. Use Appropriate CUDA Arch
When building OpenCV from source, match CUDA_ARCH_BIN to your GPU's compute capability (see the values listed under Option 2 above).
Troubleshooting
CUDA devices: 0

Issue: OpenCV doesn't detect GPU

Diagnosis:
- Check the NVIDIA driver (nvidia-smi): should show GPU info. If not, the driver is not installed.
- Check the CUDA toolkit (nvcc --version): should show the CUDA version. If not, the toolkit is not installed.
- Check the OpenCV build (cv2.getBuildInformation()): should show CUDA-related flags. If not, OpenCV was not built with CUDA.

Solutions:
- Install NVIDIA driver
- Install CUDA toolkit
- Install/build CUDA-enabled OpenCV
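The OpenCV build check can be scripted; this small sketch prints the CUDA-related lines from the build summary (builds compiled without CUDA typically report "NVIDIA CUDA: NO"):

```python
# List CUDA-related lines from OpenCV's build summary.
try:
    import cv2
    cuda_lines = [ln.strip() for ln in cv2.getBuildInformation().splitlines()
                  if "CUDA" in ln or "cuDNN" in ln]
    report = "\n".join(cuda_lines) or "No CUDA entries in build info"
except ImportError:
    report = "OpenCV is not installed"
print(report)
```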
Application still says 'using CPU'

Issue: The application reports CPU mode even though cv2.cuda.getCudaEnabledDeviceCount() returns > 0.

Cause: The OpenCV DNN module was built without CUDA support (it needs OPENCV_DNN_CUDA=ON).

Solution: Use pre-built wheels from opencv-python-cuda-wheels (they have DNN CUDA enabled).
Out of memory errors
Causes:
- GPU doesn’t have enough VRAM
- Multiple applications using GPU
- Memory leak
Solutions:
- Check available memory with nvidia-smi
- Close other GPU applications (Chrome, games, etc.)
- Use a smaller model (YOLOv3-tiny instead of YOLOv4)
- Process fewer streams
GPU utilization is low

Issue: GPU-Util in nvidia-smi shows less than 10%

Causes:
- Not enough streams (GPU waiting for CPU)
- frame_skip too high
- Display rendering is the bottleneck
Solutions:
- Lower frame_skip
- Add more streams: the GPU can handle 8-16 streams efficiently
- Disable display rendering
Slower than expected

Issue: GPU not providing the expected speedup

Check:
- GPU actually being used: GPU-Util should be >0% and spike during detection.
- Power mode: should show "P2" or "P0" (performance mode), not "P8" (idle).
- Thermal throttling: check the temperature in nvidia-smi. If >80°C, the GPU may be throttling.
- CUDA architecture mismatch: rebuild OpenCV with the correct CUDA_ARCH_BIN for your GPU.
GPU Selection (Multi-GPU Systems)
If you have multiple GPUs, OpenCV uses GPU 0 by default. To select a different GPU, call cv2.cuda.setDevice() (or restrict visibility with the CUDA_VISIBLE_DEVICES environment variable) before loading the model.

Best Practices
Use Pre-built Wheels
Easier and more reliable than building from source. Get from opencv-python-cuda-wheels.
Monitor GPU Usage
Keep nvidia-smi running in a separate terminal to watch GPU utilization.

Update Drivers Regularly
Newer NVIDIA drivers often include performance improvements.
Match CUDA Versions
Ensure OpenCV CUDA version matches installed CUDA toolkit version.
Test Before Deploying
Verify GPU acceleration works with test streams before production deployment.
Plan for Scaling
GPU allows 3-5x more streams than CPU. Plan hardware accordingly.
Recommended Hardware
Budget Setup ($300-500)
GPU: NVIDIA GTX 1660 Super (6GB)
- 4-8 streams at 1080p
- 2 fps detection rate
- YOLOv4
Mid-Range Setup ($500-800)
GPU: NVIDIA RTX 3060 (12GB)
- 8-12 streams at 1080p
- 2-3 fps detection rate
- YOLOv4
- Room for growth
High-End Setup ($1000+)
GPU: NVIDIA RTX 4070 (12GB) or RTX 3080 (10GB)
- 12-16 streams at 1080p
- 3-5 fps detection rate
- YOLOv4
- Multiple instances possible
Enterprise/Data Center
GPU: NVIDIA A4000/A5000 or Tesla T4
- 16-24 streams at 1080p
- 5+ fps detection rate
- ECC memory
- 24/7 reliability
All recommendations assume:
- 1920×1080 resolution streams
- YOLOv4 model
- frame_skip tuned appropriately