I want to measure how much of the peak memory bandwidth my kernel achieves.
Say I have an NVIDIA Tesla C1060, which has a maximum bandwidth of 102.4 GB/s. In my kernel I have the following accesses to global memory:
...
float result = 0.0f;
for (int k = 0; k < 4000; k++) {
    // counted as 2 global-memory reads per iteration
    result = (in_data[index] - loc_mem[k]) * (in_data[index] - loc_mem[k]);
    ....
}
// counted as 2 global-memory writes per thread
out_data[index] = result;
out_data2[index] = sqrt(result);
...
I count 4000*2+2 accesses to global memory per thread. With 1,000,000 threads, and all accesses being 4-byte floats, that comes to ~32 GB of global-memory traffic (reads and writes combined). Since my kernel only takes 0.1 s, I would be achieving ~320 GB/s, which is higher than the maximum bandwidth, so there must be an error in my calculations / assumptions (the arithmetic is sketched below, after the questions). I assume CUDA does some caching, so not all memory accesses count. Now my questions:
- What is my error?
- Which accesses to global memory are cached and which are not?
- Is it correct that I don't count accesses to registers, local, shared, and constant memory?
- Can I use the CUDA profiler for easier and more accurate results? Which counters would I need to use? How would I need to interpret them?
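For reference, here is the back-of-the-envelope arithmetic behind the ~320 GB/s figure, as a small host-side sketch (the 1,000,000 threads, the 4000*2+2 float accesses per thread, and the 0.1 s kernel time are my assumptions from above, not measured counters):

#include <stdio.h>

int main(void)
{
    // assumptions stated in the question text above
    const double threads             = 1.0e6;            // ~1,000,000 threads
    const double accesses_per_thread = 4000.0*2.0 + 2.0; // global-memory reads + writes per thread
    const double bytes_per_access    = 4.0;              // sizeof(float)
    const double kernel_time_s       = 0.1;              // measured kernel runtime

    const double total_bytes = threads * accesses_per_thread * bytes_per_access;
    const double bandwidth   = total_bytes / kernel_time_s;

    printf("total traffic: %.1f GB\n",  total_bytes / 1e9); // ~32 GB
    printf("bandwidth:     %.1f GB/s\n", bandwidth  / 1e9); // ~320 GB/s
    return 0;
}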
Profiler output:
method         gputime   cputime  occupancy  instruction  warp_serial  memtransfer
memcpyHtoD      10.944        17                                             16384
fill            64.32         93          1        14556            0
fill            64.224        83          1        14556            0
memcpyHtoD      10.656        11                                             16384
fill            64.064        82          1        14556            0
memcpyHtoD    1172.96       1309                                           4194304
memcpyHtoD      10.688        12                                             16384
cu_more_regT 93223.906     93241          1     40716656            0
memcpyDtoH    1276.672      1974                                           4194304
memcpyDtoH    1291.072      2019                                           4194304
memcpyDtoH    1278.72       2003                                           4194304
memcpyDtoH    1840          3172                                           4194304
New question:
- When 4194304 bytes = 4 bytes * 1024*1024 data points = 4 MB and gpu_time ~= 0.1 s, then I achieve a bandwidth of 10 * 40 MB/s = 400 MB/s. That seems very low. Where is the error?
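The arithmetic behind that 400 MB/s figure, as a sketch (4194304 bytes per array from the memtransfer column, cu_more_regT gputime ~= 93224 us ~= 0.1 s; the factor of 10 is taken unchanged from my calculation above):

#include <stdio.h>

int main(void)
{
    const double MB              = 1024.0 * 1024.0;
    const double bytes_per_array = 4194304.0; // memtransfer column: 4 bytes * 1024*1024
    const double kernel_time_s   = 0.1;       // cu_more_regT gputime ~= 93224 us

    const double per_array_mb_s = (bytes_per_array / MB) / kernel_time_s; // 40 MB/s
    const double total_mb_s     = 10.0 * per_array_mb_s;                  // factor of 10 from above

    printf("per array: %.0f MB/s\n", per_array_mb_s); // 40 MB/s
    printf("total:     %.0f MB/s\n", total_mb_s);     // 400 MB/s
    return 0;
}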
p.s. Tell me if you need other counters for your answer.
sister question: How to calculate Gflops of a kernel