Demystifying GPU Microarchitecture through Microbenchmarking
- Paper: Demystifying GPU Microarchitecture through Microbenchmarking
- Source code: cudabmk.tar.bz2 cudabmk.zip
- Compile time: 30 minutes (Core2 2.83 GHz)
- Run time: 13 minutes
- Disk space: 268 MB
- Requires CUDA compilers on Linux, but not the SDK.
- make compiles all the tests. Outputs are in ./bin/linux/release/.
- Run each test as needed. Tests take no command-line parameters or input files.
- Don't be surprised if something breaks.
The compile process is based on the process used to compile the CUDA SDK projects on Linux, using common.mk.
- Hint: To compile main, use make program, not make main. The other binaries are compiled using, e.g., make icache1, which is a shortcut for make -f Makefile-icache1 program. make verbose=1 is sometimes useful.
The source code is split up into multiple executables, each of which is separately compiled. This was done mainly because the CUDA compilation tools would crash when the size of the compiled code got too big. Some of our microbenchmarks are big, especially those measuring the instruction cache where we inflate the code size as part of the test. Compiler warnings about optimizing huge functions and that the compiler may run out of memory or run very slowly are expected for the instruction cache tests. "Cannot tell what pointer points to, assuming global memory space" warnings are expected for global.cu.
Because these benchmarks are rather low-level, they are affected by changes in the compiler or hardware. Here are a few issues we've encountered when changing CUDA compiler or driver versions.
The translation page size used by the GPU has been found to depend on both the hardware and the CUDA driver version used by the system (Analyse de l'architecture GPU Tesla). Our published experiments observed 4 KB pages, but we have observed 64 KB pages after our CUDA driver was updated to version 190.18. Microbenchmarks that depend on the page size have been modified to scale strides and array sizes with the page size, set using const int page_size = 4; (in KB) near the top of the relevant source files. You will likely have to change this value for your system. Our microbenchmarks were designed for 4 KB pages and were only briefly tested with 64 KB pages, so weird things may happen when the page size changes, such as running out of memory.
The global memory tests use pointer-chasing code where the pointer values are stored in an array. Pointers on GT200 are 32 bits, so the global memory test will need to be changed if the pointer size changes, e.g., for 64-bit pointers on Fermi.
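The pointer-chasing idea can be sketched as follows (a simplified illustration with a hypothetical kernel name, not the actual global.cu code):

```cuda
// Simplified pointer-chasing sketch. Each array element holds the address
// of the next element to visit, so every load depends on the previous one
// and individual latencies cannot be overlapped. The element type must
// match the device pointer width (32-bit on GT200), which is why the test
// needs changes for 64-bit pointers on Fermi.
__global__ void kchase(void **head, unsigned int *out, int iters)
{
    void **p = head;
    unsigned int start = clock();
    for (int i = 0; i < iters; i++)
        p = (void **)*p;                  // dependent load
    unsigned int stop = clock();
    out[0] = stop - start;
    out[1] = (unsigned int)(size_t)p;     // keep p live so the loop survives
}
```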
Microbenchmarks strike a delicate balance: the compiler must optimize well enough to minimize measurement overhead, but not so aggressively that the entire microbenchmark is optimized away. Improved compiler optimizations can therefore break some microbenchmarks. Two recent optimizations have affected our code.
Consecutive calls to __syncthreads with no intervening arithmetic are optimized away by nvcc. The compiler will also remove consecutive syncthreads calls even if there is a conditional branch (but no arithmetic) separating them. This optimization breaks program correctness unless the code follows the CUDA programming guide's directive that syncthreads may only be called in conditional code if the branch evaluates identically across the entire thread block. As a result, our syncthreads latency test now uses a hand-tuned cubin to avoid having consecutive syncthreads calls optimized away.
We also noticed that compiler optimization of useless chains of abs() operations has improved. Declaring the variables involved as volatile has so far been sufficient to defeat this optimization.
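The workaround can be sketched as follows (a hypothetical fragment; the real timing kernels are in pipeline.cu):

```cuda
// Without volatile, newer compilers recognize that abs(abs(x)) == abs(x)
// and collapse the chain; volatile forces each intermediate result to be
// written and re-read, preserving the dependent sequence being timed.
__global__ void kabs_chain(int *in, unsigned int *out)
{
    volatile int x = in[0];
    unsigned int start = clock();
    x = abs(x); x = abs(x); x = abs(x); x = abs(x);  // repeated many more
    x = abs(x); x = abs(x); x = abs(x); x = abs(x);  // times in the real tests
    unsigned int stop = clock();
    out[0] = stop - start;
    out[1] = (unsigned int)x;                        // keep x live
}
```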
Our compile flow is modified to allow patching compiled code with an alternate copy of the cubin. This is accomplished with the
obj/release/%.cu_sm_13_o rule in morerules.mk and the
build_cubin script. The cubin is usually created by running a compile with -keep and then editing the cubin and renaming it to filename.real_cubin. The current makefile will look for filename.real_cubin during a compile for any filename.cu files listed in
CUFILES_sm_13. This method of patching in pre-compiled cubin code is fragile and will likely break for different versions of nvcc.
Details of our measurements on GTX280 can be found in the paper. Following is a summary of the contents of the source code.
| Test | Binary | Source files |
|---|---|---|
| Arithmetic pipelines | main | pipeline.cu, instructions.h |
| Syncthreads latency | main | pipeline.cu, ksync_uint_dep128.real_cubin |
| Control flow | diverge | diverge.cu, diverge2.cu, diverge3.cu, diverge4.cu, path.cu |
| Texture TLBs | texture4, texture2 | texture4.cu, texture2.cu |
| Instruction caches | icache[1-4] | icache.cu, icache_kernels*, icache2.cu, icache2_1.cu, icache2_2.cu, icache2_ibuffer.cu, icache3.cu, icache3_1.cu, icache3_2.cu, icache3_kernel.h, icache4.cu, icache4_L1.cu* |
| Test | Source files | Description |
|---|---|---|
| empty threads | empty.cu | Measure time to launch and complete an empty kernel |
| kclock | clock.cu | Latency for reading the clock register |
| kclock_test2 | clock.cu | Measure clock register "skew" between TPCs. Figure 5. |
| arithmetic | pipeline.cu, instructions.h, ksync_uint_dep128 | Latency/throughput for various operations |
| histogram | pipeline.cu | Measure throughput vs. number of warps. Figure 6. |
| dual issue | pipeline.cu | Try to exercise dual issue with MAD and MUL |
| syncthreads | pipeline.cu, ksync_uint_dep128 | Measure syncthreads latency vs. number of warps needing to sync. |
| Register file capacity | regfile.cu | Register file capacity (and banking). Figure 9. |
| execution order | path.cu | Execution order of a two-way diverged warp |
| kdiverge_test | diverge.cu | Measure execution times for diverged warps for various group sizes and warp counts. Figure 7. Subgroups that branch coherently execute concurrently. |
| kdiverge2_test[1-5] | diverge2.cu | Various code sequences that deadlock due to divergence. Only kdiverge2_test3 does not. Behaviour could potentially change if compiler code generation changes. Listing 4 is kdiverge2_test1. |
| kdiverge3_test1 | diverge3.cu | Test showing thread ID does not affect execution order under divergence. Listing 3, Figure 8. |
| kdiverge3_test2 | diverge3.cu | Test showing the size of the group that branches coherently does not affect execution order. |
| kdiverge3_test3 | diverge3.cu | Test showing reconvergence after a loop containing divergence inside. |
| kdiverge3_test4 | diverge3.cu | Test containing a loop that does not reconverge at the end of the loop due to the use of "break" inside the loop. Weird code generation that may be fixed by a smarter compiler. |
| kdiverge3_test5 | diverge3.cu | Same as test4, except the number of loop iterations depends on the thread index. This causes the compiler to generate a reconvergence point after the loop. |
| ksync_test1 | sync.cu | Only some threads in a warp calling __syncthreads() does not cause a kernel hang. |
| ksync_test2 | sync.cu | Two half-warps cannot sync using syncthreads and transfer values through a shared variable. Listing 5. |
| ksync_test3 | sync.cu | One of two warps calling syncthreads does not cause a kernel hang, but will wait until the other warp terminates. |
| ksync_test4 | sync.cu | Testing for a timeout mechanism for syncthreads between warps. None detected. Listing 6. |
| ksync2_test4 | sync2.cu | Two half-warps synchronizing with a warp calling syncthreads twice. Listing 7. |
| shared_latency | shared.cu | Shared memory latency for varying stride. |
| global1 | global.cu | Global memory latency for a small array. |
| global4 | global.cu | Global L2 TLB line/page size. Figure 15. |
| global5 | global.cu | Global memory TLBs. Figure 13. |
| global6 | global.cu | Global L1 TLB associativity. Figure 14. |
| texture2 | texture2.cu | Texture TLB line/page size. |
| texture4 | texture4.cu | Strided accesses into linear texture memory at various strides and array sizes. Measures texture cache and TLB properties. Figures 10, 11, 12, and 16. |
| cmem | cmem.cu | All the constant memory tests: constant memory latency and caching, constant memory sharing, constant memory sharing with the instruction cache, and constant cache bandwidth. Figures 17 to 21. |
| icache1 | icache.cu, icache_kernels.h, icache_kernels[1-4].cu | Instruction cache latency vs. code size in 0.5 KB and 2 KB steps. Also tests instruction cache sharing with two concurrent thread blocks. Figure 22. |
| L2 | icache2.cu, icache2_1.cu, icache2_2.cu | Measure instruction cache L2 (8 KB) parameters. Similar to icache1, but finer granularity. |
| Instruction buffer | icache2.cu, icache_ibuffer.cu | Measure the size of an instruction fetch. Figure 24. |
| L3 | icache3.cu, icache3_1.cu, icache3_2.cu, icache3_kernel.h | Measure instruction cache L3 (32 KB) parameters. Similar to icache2, but larger footprint and coarser granularity. |
| L1 | icache4.cu, icache4_L1.cu | Measure L1 cache parameters by creating contention for the L2 from threads running on other TPCs. Figure 23. |