In this chapter, we will discuss the keywords and thread organisation in CUDA.

The following keywords are used while declaring a CUDA function. As an example, while declaring the kernel, we have to use the __global__ keyword. This provides a hint to the compiler that this function will be executed on the device and can be called from the host.

dim3 is an integer vector type that can be used in CUDA code. Its most common application is to pass the grid and block dimensions in a kernel invocation, but it can also be used in any user code for holding values of 3 dimensions. Note: any dim3 dimension not specified is initialized to 1. When you declare with dim3, such as dim3 blockSize(5), it is one-dimensional. It can also be used to declare 2-dimensional spaces by passing the two dimensions to the dim3, as in dim3(MaxXBlkDim, MaxYBlkDim, 1); for example, dim3 blockShape = dim3(2, 3) declares a 2 x 3 block shape. A grid is declared the same way. Each of its elements is a block, such that a grid declared as dim3 grid(10, 10, 2) would have 10 * 10 * 2 = 200 total blocks.

In this section, we will see a sample CUDA C code.
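The following is a minimal sketch of such a code, assuming a toy kernel; the name addOne, the array size of 256, and the block size of 128 are illustrative choices, not anything mandated by CUDA.

#include <cstdio>

// __global__ marks a kernel: it runs on the device and is callable from the host.
__global__ void addOne(int *data, int n)
{
    // One-dimensional thread ID built from the block and thread indices.
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main()
{
    const int n = 256;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));

    dim3 block(128);     // one-dimensional block: y and z default to 1
    dim3 grid(n / 128);  // 2 blocks of 128 threads cover all 256 elements
    addOne<<<grid, block>>>(d_data, n);
    cudaDeviceSynchronize();

    int h_data[n];
    cudaMemcpy(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_data[0] = %d\n", h_data[0]);  // expect 1

    cudaFree(d_data);
    return 0;
}

Replacing the grid above with dim3 grid(10, 10, 2) would launch the same kernel over the 200 blocks counted earlier.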
In the example above, we used a one-dimensional kernel: each thread has a one-dimensional thread ID that can be computed from the built-in variables blockDim.x, blockIdx.x and threadIdx.x.

2D matrices can be stored in computer memory using two layouts: row-major and column-major. In row-major order the elements of each row are contiguous in memory; in column-major order the elements of each column are. The manner in which matrices are stored affects performance a great deal, so before we delve into matrix multiplication we need to understand how matrices are laid out in memory.
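As a small illustration (the helper names below are ours, not from the original text), the linear index of element (row, col) differs between the two layouts:

// Row-major: elements of one row are contiguous in memory.
__host__ __device__ int rowMajorIndex(int row, int col, int numCols)
{
    return row * numCols + col;
}

// Column-major: elements of one column are contiguous in memory.
__host__ __device__ int colMajorIndex(int row, int col, int numRows)
{
    return col * numRows + row;
}

A warp whose consecutive threads read along one row therefore makes contiguous (coalesced) accesses under row-major storage but strided ones under column-major storage, which is one reason the layout matters so much for GPU performance.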
Let us go ahead and use our knowledge to do matrix multiplication using CUDA. The question-and-answer exchange below works through a typical set of problems with a first attempt.

Question: In this program, I want one thread to compute multiple elements of the matrix product. Here is my code to measure the time taken on the GPU and on the CPU; the CPU reference has the signature

void cpu_matrixMul(int *a, int *b, int *c, int N)

I see that the time computed by the GPU is larger than by the CPU, and it grows further when the size of the matrix is increased. When I first came into contact with CUDA there were still many unclear points, so can you suggest if I am missing something here?

I have applied the suggested code and see that it works correctly when the size of the data area (tile_width) covered by one thread is equal between threads. For example, if matrices A and B are 16x16 and I use 4 threads (BWIDTH = 2), then each thread computes an 8x8 tile (tile_width = 8). In the other case, still using 4 threads but assigning tile_width = 10, the result goes wrong. So I have to define a (grid, block) structure that fits the data. Besides, I also have some related questions; please help to clarify:

1) How should I handle cases where MWIDTH / (MTILE * BWIDTH) is not an integer? Of course, in all cases we can use one thread to compute all elements, like the CPU does.

2) How does the structure of the grid and block affect the performance of the program? How do I choose the grid and block structure with the best effect?

3) I have tried some data samples (increasing the size of the matrix significantly) to compare the time calculated by the GPU and the CPU.

Answer: You seem to be a bit confused about the thread hierarchy that CUDA has. In a nutshell, for a kernel there will be 1 grid (which I always visualize as a 3-dimensional cube) whose elements are blocks, and each block in turn contains threads. It would help if you showed a complete code. If each thread in your 2D thread array is responsible for a tile_width x tile_width portion of the matrix, then I don't think these calculations are correct:

int start_row = blockDim.y*blockIdx.y + threadIdx.y*tile_width;
int start_col = blockDim.x*blockIdx.x + threadIdx.x*tile_width;

I think they should be like this:

int start_row = (blockDim.y*blockIdx.y + threadIdx.y)*tile_width;
int start_col = (blockDim.x*blockIdx.x + threadIdx.x)*tile_width;

Also, this line in your kernel doesn't make sense:

d_p = P_val

d_p and P_val aren't defined anywhere in your kernel; probably you meant something like c = sum. So I don't see how you could actually be running this code. Nevertheless, I'm confused by the statement "I have compiled but it gets wrong result," since I don't see how you could have compiled the code you have shown. Anyway, with the above changes, I was able to compute a sensible result in a piece of test code. (The number of threads per block should be some multiple of 32.) If you're still having trouble, make sure you are using proper cuda error checking (hint: google "proper cuda error checking"), and if you still want help, please post a complete code that someone else could copy, paste and compile without having to add anything or change anything.
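A minimal sketch of what the corrected kernel might look like follows; the kernel name, the bounds guards, and the launch parameters are our assumptions for illustration, not the poster's actual code. The guards on row and col also answer question 1: when the matrix width is not an exact multiple of tile_width times the thread count per side, launch enough threads to cover it and let the out-of-range iterations fall through.

// Each thread computes one tile_width x tile_width tile of the n x n product C = A * B.
__global__ void gpu_matrixMul(const int *a, const int *b, int *c,
                              int n, int tile_width)
{
    // The corrected index math: parenthesize before scaling by tile_width.
    int start_row = (blockDim.y * blockIdx.y + threadIdx.y) * tile_width;
    int start_col = (blockDim.x * blockIdx.x + threadIdx.x) * tile_width;

    for (int row = start_row; row < start_row + tile_width && row < n; ++row) {
        for (int col = start_col; col < start_col + tile_width && col < n; ++col) {
            int sum = 0;
            for (int k = 0; k < n; ++k)
                sum += a[row * n + k] * b[k * n + col];  // row-major layout
            c[row * n + col] = sum;  // presumably the "c = sum" the answer refers to
        }
    }
}

// Launch for the poster's 16x16 example; d_a, d_b, d_c are device buffers
// of 16 * 16 ints allocated and filled elsewhere:
//   dim3 block(2, 2);                                           // BWIDTH = 2, i.e. 4 threads
//   gpu_matrixMul<<<dim3(1, 1), block>>>(d_a, d_b, d_c, 16, 8); // tile_width = 8

With the guards in place, the failing tile_width = 10 case also works: the second thread along each side starts at row 10 and simply stops after row 15.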