Programming Comments - CUDA and OpenCV

Summary

OpenCV is complex enough to build by itself, and building it with support for CUDA complicates things further. The potential is there for increased performance when backed by a good CUDA-capable GPU, but as we'll discover, it isn't as simple as most people think. Source code has to be changed for OpenCV to take advantage of CUDA; it doesn't happen automatically when your application links against a CUDA-enabled OpenCV.

Using CUDA with OpenCV

Most systems don't have the luxury of having OpenCV pre-built with support for CUDA. One big exception is the NVIDIA Jetson devices, such as the Jetson Nano and Jetson NX, both of which are popular choices for IoT devices.

On your Jetson device, run jtop and look at the "INFO" tab to confirm that OpenCV is built with support for CUDA. It should look similar to this:

* CUDA:      10.2.89
* OpenCV:    4.4.0 (compiled CUDA: YES)
* TensorRT:  7.1.3.0

However, having CUDA support does not mean OpenCV automatically takes advantage of it. You need to do things differently in your OpenCV applications to gain access to the CUDA-accelerated functionality.

Primarily, the cv::Mat objects involved in GPU work need to be replaced with cv::cuda::GpuMat objects. You'll still use cv::Mat, but you'll also have some cv::cuda::GpuMat objects wherever you need to work with the GPU.

For example, you cannot load an image directly into a GPU mat. Instead, you put the image into a cv::Mat, and then you upload it to the GPU memory using cv::cuda::GpuMat::upload(). Your code would look like this:

cv::Mat cpu_mat = cv::imread("test.png");
cv::cuda::GpuMat gpu_mat;
gpu_mat.upload(cpu_mat);

Note how this is dst.upload(src) and not the other way around.

This upload into the GPU memory space takes some time. With a 1400x690 BGR image on my Jetson NX, this takes ~168 milliseconds, or the duration of five consecutive frames when working with 30 FPS video! As you'll see later in this post, the length of time it takes will become very important.

Resizing images

As an example, let's consider what we need to do to resize an image. On the CPU, you'd normally do something similar to this:

cv::Mat resized;
cv::resize(cpu_mat, resized, {700, 345});

The equivalent when you want to use CUDA and the GPU would be:

cv::cuda::GpuMat resized;
cv::cuda::resize(gpu_mat, resized, {700, 345});

You would expect the CUDA-accelerated resize to be much faster than the usual CPU-only call to cv::resize(). All of this seems simple enough, but when you measure the length of time spent in the two previous calls to resize an image, you'll find something similar to this:

CPU resize: 3.565 milliseconds
GPU resize: 412.891 milliseconds

Unfortunately, that is not a mistake. The GPU version appears to be more than 100 times slower than the CPU version. There are two things happening here:

There is an initial cost to using cv::cuda::GpuMat::upload() which we must not ignore.
There is a non-trivial cost to allocating GPU memory. Every time one of the GPU mat objects needs to allocate or re-allocate memory, there is a significant cost in time.

How to make it work

If you are resizing an image one time (or multiple images, each of which is resized once) and doing nothing else with the cv::Mat, it turns out to be faster to do it on the CPU. But if you can spread the cost of the GPU mat allocation and upload across multiple image operations, then it can be faster to use the GPU than the CPU.
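To get a feel for where that crossover sits, here is a deliberately simplified cost model. It assumes the GPU path pays one fixed cost (upload plus allocation) and then a constant cost per operation, which is my own assumption rather than something I measured; break_even_ops() is a hypothetical helper, and you would feed it your own timings.

```cpp
#include <cmath>

// Simplified cost model (an assumption, not a measurement):
//   CPU total = n * cpu_ms_per_op
//   GPU total = fixed_gpu_ms + n * gpu_ms_per_op
// Returns the smallest n for which the GPU total is strictly lower than the
// CPU total, or -1 when the GPU never catches up.
long break_even_ops(double fixed_gpu_ms, double cpu_ms_per_op, double gpu_ms_per_op)
{
    const double saving_per_op = cpu_ms_per_op - gpu_ms_per_op;
    if (saving_per_op <= 0.0)
    {
        return -1; // the GPU is not faster per operation, so there is no break-even
    }
    return static_cast<long>(std::floor(fixed_gpu_ms / saving_per_op)) + 1;
}
```

With made-up inputs of 100 ms fixed cost, 2 ms per CPU operation and 1 ms per GPU operation, the GPU only pulls ahead at the 101st operation.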

For example, if you have a sequence of operations you perform on the image (resize, perform operations across the mat, convert to float, normalize, find the minimum/maximum/mean value, and so on), then at some point you'll reach a threshold where it becomes faster to execute those operations on the GPU than on the CPU. Instrument the code to record some measurements and see which is faster.
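For the instrumentation itself, a small wall-clock helper built on std::chrono is usually enough. This is a sketch and time_ms() is a name of my own choosing; it simply runs whatever callable you hand it and reports the elapsed milliseconds.

```cpp
#include <chrono>
#include <utility>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename Callable>
double time_ms(Callable && work)
{
    const auto start = std::chrono::steady_clock::now();
    std::forward<Callable>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrap the CPU version and the GPU version of the same sequence in a lambda each and compare the two numbers. If you pass your own cv::cuda::Stream to the CUDA calls, call its waitForCompletion() inside the lambda, otherwise the clock may stop before the GPU has finished.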

Not everything done with cv::Mat can necessarily be done with cv::cuda::GpuMat, as the two classes are not interchangeable. Some CUDA-specific calls which exist in the cv::cuda namespace:

Canny edge detection: cv::cuda::CannyEdgeDetector
Hough circle and lines detection: cv::cuda::HoughCirclesDetector and cv::cuda::HoughLinesDetector
bitwise or, bitwise and: cv::cuda::bitwise_or() and cv::cuda::bitwise_and()
normalize: cv::cuda::normalize()
resize: cv::cuda::resize()
rotate: cv::cuda::rotate()

You'll need header files such as:

#include <opencv2/cudawarping.hpp>
#include <opencv2/cudaarithm.hpp>

...which should be in /usr/include/opencv4/.

Timing results

I wrote a few lines of code to test my Jetson NX.[1] The code resizes the exact same image 10,000 times, then uploads it to the GPU and runs the same resizing 10,000 more times. The results were:

On the CPU: 5768.181 milliseconds
On the GPU: 2827.898 milliseconds

So once the cost of upload and download is factored in and spread across many operations, working with a GPU mat can definitely be beneficial: here the GPU run took roughly half the time of the CPU run.

But linking against a CUDA-enabled version of OpenCV won't magically give you any CUDA-accelerated functionality.