
Demystifying GPU Microarchitecture through Microbenchmarking



  1. make compiles all the tests. Outputs are in ./bin/linux/release/.
  2. Run each test as needed. Tests take no command-line parameters or input files.
  3. Don't be surprised if something breaks.

Compilation Process

The compile process is based on the process used to compile the CUDA SDK projects on Linux, using

  • Hint: To compile main, use make program, not make main. The other binaries are compiled using e.g., make icache1, which is a shortcut for make -f Makefile-icache1 program.
  • Hint: make verbose=1 is sometimes useful.

The source code is split up into multiple executables, each of which is separately compiled. This was done mainly because the CUDA compilation tools would crash when the size of the compiled code got too big. Some of our microbenchmarks are big, especially those measuring the instruction cache where we inflate the code size as part of the test. Compiler warnings about optimizing huge functions and that the compiler may run out of memory or run very slowly are expected for the instruction cache tests. "Cannot tell what pointer points to, assuming global memory space" warnings are expected for

Potential Issues

Because these benchmarks are rather low-level, they are affected by changes in the compiler or hardware. Here are a few issues we've encountered when changing CUDA compiler or driver versions.

Page Size

The translation page size used by the GPU has been found to depend on both the hardware and the CUDA driver version used by the system (Analyse de l'architecture GPU Tesla). Our published experiments observed 4 KB pages, but we observed 64 KB pages after our CUDA driver was updated to version 190.18. Microbenchmarks that depend on the page size have been modified to scale strides and array sizes with the page size, which is set using const int page_size = 4; (in KB) near the top of the relevant source files. You will likely have to change this value for your system. Our microbenchmarks were designed for 4 KB pages and were only briefly tested with 64 KB pages, so weird things may happen when the page size changes, such as running out of memory.
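As an illustration of what scaling strides and array sizes with the page size can look like, here is a minimal host-side C++ sketch. Only the constant page_size (in KB) comes from the text above; the helper names stride_bytes and elements_for_pages are hypothetical and are not taken from the benchmark sources.

```cpp
#include <cassert>
#include <cstddef>

// Page size in KB, mirroring the constant described in the text.
const int page_size = 4;

// Hypothetical helper: stride in bytes that advances `pages` pages per access.
constexpr std::size_t stride_bytes(int page_size_kb, int pages) {
    return static_cast<std::size_t>(page_size_kb) * 1024u *
           static_cast<std::size_t>(pages);
}

// Hypothetical helper: number of elements of `elem_size` bytes needed
// for an array that spans `n_pages` pages.
std::size_t elements_for_pages(int page_size_kb, std::size_t n_pages,
                               std::size_t elem_size) {
    assert(elem_size > 0);
    return stride_bytes(page_size_kb, 1) * n_pages / elem_size;
}
```

Under these assumptions, covering the same number of pages with 64 KB pages needs an array 16 times larger than with 4 KB pages, which is one way a test can run out of memory after a page size change.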

Pointer Size

The global memory tests use pointer-chasing code in which the pointer values are stored in an array. Pointers on GT200 are 32 bits wide. The global memory tests will need to be changed if the pointer size changes, e.g., to 64-bit pointers on Fermi.
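The pointer-chasing idea can be sketched on the host in C++ as follows. This is not the benchmark's own code: it stores 32-bit indices rather than raw device pointers as a stand-in for 32-bit pointer values, and the names build_chain and chase are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a circular chain: element i holds the index of the element
// `stride` slots ahead, wrapping around at the end of the array.
std::vector<std::uint32_t> build_chain(std::size_t n, std::size_t stride) {
    assert(n > 0);
    std::vector<std::uint32_t> chain(n);
    for (std::size_t i = 0; i < n; ++i) {
        chain[i] = static_cast<std::uint32_t>((i + stride) % n);
    }
    return chain;
}

// Follow the chain for `steps` dependent loads. Each load's address
// depends on the value returned by the previous load, so the loads
// cannot overlap and the time per step approximates access latency.
std::uint32_t chase(const std::vector<std::uint32_t>& chain,
                    std::uint32_t start, std::size_t steps) {
    std::uint32_t j = start;
    for (std::size_t k = 0; k < steps; ++k) {
        j = chain[j];
    }
    return j;
}
```

The element type is where the pointer-size dependence shows up: with 4-byte entries, stride and array-size arithmetic assumes 4 bytes per slot, so a move to 64-bit pointers would change both the element type and the byte stride per slot.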

Compiler Optimizations

Microbenchmarks require a delicate balance: the compiler must optimize well enough to minimize overhead, but not so aggressively that the entire microbenchmark is optimized away. Improvements in compiler optimization can break some microbenchmarks. Two recent optimizations have affected our code.

Consecutive calls to __syncthreads with no intervening arithmetic are optimized away by nvcc. The compiler also removes consecutive syncthreads calls even when a conditional branch (but no arithmetic) separates them. This optimization preserves program correctness only if the code follows the directive in the CUDA programming guide that syncthreads may be called in conditional code only if the branch evaluates identically within the entire block. As a result, our syncthreads latency test now uses a hand-tuned cubin so that consecutive syncthreads calls are not optimized away.

We also noticed that compiler optimization of useless chains of abs() operations has improved. Declaring the variables involved as volatile has so far been sufficient in defeating this optimization.
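A host-side C++ analogue of the volatile workaround is sketched below. The function name abs_chain is hypothetical, and whether nvcc treats device code the same way is an assumption that this sketch does not verify.

```cpp
#include <cassert>
#include <cstdlib>

// Apply abs() `n` times through a volatile variable. Every read and
// write of `v` is an observable access, so the compiler may not
// collapse the chain into a single abs() call.
int abs_chain(int x, int n) {
    assert(n >= 0);
    volatile int v = x;
    for (int i = 0; i < n; ++i) {
        v = std::abs(v);  // volatile read, abs, volatile write
    }
    return v;
}
```

The result is the same as a single abs() for any n of at least 1; the point of volatile is only to keep the dependent chain of operations in the generated code so that it can be timed.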

Compile Flow

Our compile flow is modified to allow patching compiled code with an alternate copy of the cubin. This is accomplished with the obj/release/%.cu_sm_13_o rule in and the build_cubin script. The cubin is usually created by running a compile with -keep and then editing the cubin and renaming it to filename.real_cubin. The current makefile will look for filename.real_cubin during a compile for any files listed in CUFILES_sm_13. This method of patching in pre-compiled cubin code is fragile and will likely break for different versions of nvcc.


Details of our measurements on the GTX280 can be found in the paper. The following is a summary of the contents of the source code.

Test Executable Source files
Clock main
Arithmetic pipelines main, instructions.h
Syncthreads latency main, ksync_uint_dep128.real_cubin
Control flow diverge,,,,
Syncthreads sync,
Register file main
Shared memory shared
Global memory global
Texture memory texture4
Global TLBs global
Texture TLBs texture4, texture2,
Constant memory cmem
Instruction caches icache[1-4], icache_kernels*,,,,,,,, icache3_kernel.h,,*

Test Descriptions

Files Description
empty threads Measure time to launch and complete an empty kernel
kclock Latency for reading the clock register
kclock_test2 Measure clock register "skew" between TPCs. Figure 5.
arithmetic, instructions.h, ksync_uint_dep128 Latency/throughput for various operations
histogram Measure throughput vs. number of warps. Figure 6.
dual issue Try to exercise dual issue with MAD and MUL
syncthreads, ksync_uint_dep128 Measure syncthreads latency vs. number of warps needing to sync.
Register file capacity Register file capacity (and banking). Figure 9.
diverge Files Description
execution order Execution order of two-way diverged warp
kdiverge_test Measure execution times for diverged warps for various group sizes and warp counts. Figure 7. Subgroups that branch coherently execute concurrently.
kdiverge2_test[1-5] Various code sequences that deadlock due to divergence. Only kdiverge2_test3 does not. Behaviour could potentially change if compiler code generation changes. Listing 4 is kdiverge2_test1.
kdiverge3_test1 Test showing thread ID does not affect execution order under divergence. Listing 3, Figure 8.
kdiverge3_test2 Test showing size of group that branches coherently does not affect execution order.
kdiverge3_test3 Test showing reconvergence after a loop containing divergence inside.
kdiverge3_test4 Test containing a loop that does not reconverge at the end of the loop due to the use of "break" inside the loop. Weird code generation that may be fixed by a smarter compiler.
kdiverge3_test5 Same as test4, except the number of loop iterations depends on the thread index. This causes the compiler to generate a reconvergence point after the loop.
sync Files Description
ksync_test1 Only some threads in a warp calling __syncthreads() does not cause kernel hang.
ksync_test2 Two half-warps cannot sync using syncthreads and transfer values in a shared variable. Listing 5.
ksync_test3 One of two warps calling syncthreads does not cause kernel hang, but will wait until other warp terminates.
ksync_test4 Testing for timeout mechanism for syncthreads between warps. None detected. Listing 6.
ksync2_test4 Two half-warps synchronizing with a warp calling syncthreads twice. Listing 7.
shared Files Description
shared_latency Shared memory latency for varying stride.
global Files Description
global1 Global memory latency for small array.
global4 Global L2 TLB line/page size. Figure 15.
global5 Global memory TLBs. Figure 13.
global6 Global L1 TLB associativity. Figure 14.
texture2 Files Description
texture2 Texture TLB line/page size.
texture4 Files Description
texture4 Stride accesses into linear texture memory at various strides and array sizes. Measures texture caches and TLB properties. Figures 10, 11, 12, and 16.
cmem Files Description
cmem All the constant memory tests. Constant memory latency and caching, constant memory sharing, constant memory sharing with instruction cache, constant cache bandwidth. Figures 17 to 21.
icache1 Files Description
icache1, icache_kernels.h, icache_kernels[1-4].cu Instruction cache latency vs. code size in 0.5 KB and 2 KB steps. Also tests instruction cache sharing with two concurrent thread blocks. Figure 22.
icache2 Files Description
L2,, Measure instruction cache L2 (8 KB) parameters. Similar to icache1, but finer granularity.
Instruction buffer, Measure size of an instruction fetch. Figure 24.
icache3 Files Description
L3,,, icache3_kernel.h Measure instruction cache L3 (32 KB) parameters. Similar to icache2, but larger footprint and coarser granularity.
icache4 Files Description
L1, Measure L1 cache parameters by creating contention for the L2 from threads running on other TPCs. Figure 23.