Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share

c++ - How to multiply two openCV matrices in a kernel function in CUDA?

I have the following minimal piece of code and would like to know how I can multiply two matrices in my kernel function. For example, I cannot create a Mat (as in OpenCV) inside the kernel function.

  __global__ void myMatKernel(int N, Mat *b)
  {
       Mat a;   // creates compilation error 1

      // b = a*b;      <---- what I would need


  }

  int main (void)
  {
        Mat a(10, 1, CV_64F);
        a.setTo(Scalar(2.2));
        Mat c(1, 10, CV_64F);
        c.setTo(Scalar(3.35));
        Mat d;

        d = a*c;    // works perfectly fine, but would like to do this operation on the GPU

        Mat *b;
        cudaMallocManaged(&b, sizeof(Mat));
        cudaDeviceSynchronize();
       //assign somehow values to matrix b before passing it to the function

        myMatKernel<<<1,256>>>(1, b) ;   
        cudaFree(b);
  }

compilation error 1: "error: calling a __host__ function("cv::Mat::Mat") from a __global__ function("myKernel") is not allowed"

Could someone explain/show how I can solve these issues?


1 Answer


cv::Mat keeps its data in host memory, and its constructors and member functions are host functions, so you cannot use Mat or the related OpenCV APIs inside a kernel the way you would in host code. You have to write your own kernel for the matrix multiplication.

OpenCV also provides a class called cv::cuda::GpuMat, for which it allocates device memory. However, the APIs related to GpuMat are meant to be called from host code, not from inside a kernel. For an in-kernel matrix multiplication you still have to write the kernel yourself.

That said, I find some GpuMat APIs convenient, such as allocating device memory through its constructor and copying data between a host matrix and a device matrix with upload() and download(). The GpuMat class also keeps the matrix's attributes, such as rows, cols, type() and step, in a single structure, which can come in handy.

The following sample code uses GpuMat.

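#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>

using namespace cv;

// Each thread computes one element of c = a * b.
// The *_step arguments are row strides in bytes, as reported by GpuMat::step.
// The kernel is defined before main() so that it is declared at the launch site.
__global__ void MatMul(const double* const a, const int a_step,
                       const double* const b, const int b_step,
                       double* const c, const int c_step,
                       const int a_cols)
{
    int c_row = threadIdx.y;
    int c_col = threadIdx.x;

    double sum = 0;
    for (int i = 0; i < a_cols; i++)
        sum += ((double*)((unsigned char*)a + c_row * a_step))[i]
             * ((double*)((unsigned char*)b + i * b_step))[c_col];

    ((double*)((unsigned char*)c + c_row * c_step))[c_col] = sum;
}

int main (void)
{
    // Parentheses are used on purpose: with braces, recent OpenCV versions
    // may pick Mat's initializer_list constructor instead of (rows, cols, type).
    Mat a(10, 1, CV_64FC1); // 10x1 matrix
    Mat b(1, 10, CV_64FC1); // 1x10 matrix
    Mat c(10, 10, CV_64FC1); // multiplying a and b results in 10x10 matrix
    a.setTo(Scalar(2.2));
    b.setTo(Scalar(3.35));

    cv::cuda::GpuMat d_a(a.rows, a.cols, CV_64FC1);
    cv::cuda::GpuMat d_b(b.rows, b.cols, CV_64FC1);
    cv::cuda::GpuMat d_c(c.rows, c.cols, CV_64FC1);

    d_a.upload(a);
    d_b.upload(b);

    // One block with one thread per element of c; step is size_t, so cast it to int.
    MatMul<<<1, dim3(c.cols, c.rows)>>>((double*)d_a.data, (int)d_a.step,
                                        (double*)d_b.data, (int)d_b.step,
                                        (double*)d_c.data, (int)d_c.step,
                                        a.cols);

    d_c.download(c); // copies the result back to the host matrix
}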

Note that this code works only if the number of elements of c, the result matrix, does not exceed the maximum number of threads per block (1024 for compute capability >= 2.0), because the kernel is launched with a single block. For larger matrices the kernel has to be designed differently.
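
One common way to lift that limit is to launch a grid of fixed-size blocks, compute each thread's global row and column from the block and thread indices, and skip threads that fall outside the matrix. The following is only a host-side C++ sketch of that index arithmetic (no CUDA needed to run it); the names div_up and emulate_launch and the 16x16 block size in the usage note are assumptions for illustration, not part of the original answer.

```cpp
#include <vector>

// Ceiling division: number of blocks of size `block` needed to cover `n` items.
inline int div_up(int n, int block) { return (n + block - 1) / block; }

// Emulates a 2D kernel launch on the host and returns how many cells of a
// rows x cols matrix were visited exactly once.
int emulate_launch(int rows, int cols, int block_x, int block_y)
{
    const int grid_x = div_up(cols, block_x);  // would be gridDim.x
    const int grid_y = div_up(rows, block_y);  // would be gridDim.y
    std::vector<int> hits(static_cast<std::size_t>(rows) * cols, 0);

    for (int by = 0; by < grid_y; ++by)
        for (int bx = 0; bx < grid_x; ++bx)
            for (int ty = 0; ty < block_y; ++ty)
                for (int tx = 0; tx < block_x; ++tx) {
                    // In a kernel: blockIdx.x * blockDim.x + threadIdx.x, etc.
                    const int col = bx * block_x + tx;
                    const int row = by * block_y + ty;
                    // Bounds check: the grid may overshoot the matrix edges.
                    if (row >= rows || col >= cols) continue;
                    ++hits[static_cast<std::size_t>(row) * cols + col];
                }

    int covered_once = 0;
    for (int h : hits)
        if (h == 1) ++covered_once;
    return covered_once;
}
```

For a 100x100 result with 16x16 blocks, div_up(100, 16) gives 7 blocks per dimension, and emulate_launch(100, 100, 16, 16) visits all 10000 elements exactly once. In the real kernel, c_row and c_col would be computed the same way and the launch would pass a dim3 grid instead of 1.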


EDIT

((double*)((unsigned char*)c + c_row * c_step))[c_col];

The above statement accesses the element in the c_row-th row and c_col-th column of matrix c. This matrix is a single-channel matrix and its element type is double. Its step is given by c_step. In OpenCV, step is the number of bytes allocated per row. It is greater than or equal to the total size of the actual elements in each row; the extra padding keeps rows aligned in memory, which in turn makes memory access faster.

The statement first casts c (which is of type double*) to unsigned char*, because c_step is counted in bytes. Adding c_row * c_step to (unsigned char*)c gives a pointer to the 0-th column of the c_row-th row. That pointer is then cast back to double* so that the c_col-th column can be accessed with the standard array access operator [].
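
To make the pointer arithmetic concrete, here is a small host-only C++ sketch that applies the same casts to a padded buffer. The helper row_ptr and the 64-byte step are assumptions chosen for illustration; row_ptr is not an OpenCV function, and GpuMat chooses its own step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an OpenCV API): pointer to the first element of
// `row` in a buffer whose rows start `step` bytes apart.
inline double* row_ptr(double* base, std::size_t step, int row)
{
    return reinterpret_cast<double*>(
        reinterpret_cast<unsigned char*>(base) + row * step);
}

// Fills a 3x2 matrix stored with padded rows and returns element (2, 1).
double stepped_access_demo()
{
    const int rows = 3, cols = 2;
    // Assumed step: 64 bytes per row, although only 2 * 8 = 16 bytes are used.
    const std::size_t step = 8 * sizeof(double);
    std::vector<double> storage(rows * step / sizeof(double), 0.0);

    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            row_ptr(storage.data(), step, r)[c] = r * 10 + c;

    // Row 2 starts 2 * 64 bytes, i.e. 16 doubles, into the buffer.
    assert(storage[2 * 8 + 1] == 21.0);
    return row_ptr(storage.data(), step, 2)[1];
}
```

Indexing the buffer as storage[row * cols + col] would be wrong here, because it ignores the padding at the end of each row; that is why the kernel works with the byte step rather than the column count.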

