Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
2.1k views
in Technique[技术] by (71.8m points)

visual c++ - CUDA: Nsight VS2010 profile __device__ function

I would like to know how to profile a __device__ function which is inside a __global__ function with Nsight 2.2 on visual studio 2010. I need to know which function is consuming a lot of resources and time. I have CUDA 5.0 on CC 2.0.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Nsight Visual Studio Edition 3.0 CUDA Profiler introduces source correlated experiments. The Profile CUDA Activity supports the following source level experiments:

  • Instruction Count - Collects instructions executed, thread instructions executed, active thread histogram, predicated thread histogram for every user instruction in the kernel. Information on syscalls (printf) is not collected.

  • Divergent Branch - Collects branch taken, branch not taken, and divergence count for flow control instructions.

  • Memory Transactions - Collects transaction counts, ideal transaction counter, and requested bytes for global, local, and shared memory instructions.

This information is collected per SASS instruction. If the kernel is compiled with -lineinfo (--generate-line-info) the information can be rolled up to PTX and high level source code. Since this data is rolled up from SASS some statistics may not be intuitive to the high level source. For example a branch statistic may show 100% not taken when you expected 100% taken. If you look at the SASS code you may see that the compiler reversed the conditional.

Please also not that on optimized builds the compiler is sometimes unable to maintain line table information.

enter image description here

At this time hardware performance counters and timing is only available at the kernel level.

Device code timing can be done using clock() and clock64() as mentioned in comments. This is a very advanced technique which requires both ability to understand SASS and interpret results with respect to the SM warp schedulers.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...