Demystifying GPU Microarchitecture through Microbenchmarking

Download

Paper: Demystifying GPU Microarchitecture through Microbenchmarking
Source code: cudabmk.tar.bz2 cudabmk.zip
- Compile time: 30 minutes (Core2 2.83 GHz)
- Run time: 13 minutes
- Disk space: 268 MB
- Requires CUDA compilers on Linux, but not the SDK.

Compile

make compiles all the tests. Outputs are in ./bin/linux/release/.
Run each test as needed. Tests take no command-line parameters or input files.
Don't be surprised if something breaks.

Compilation Process

The compile process is based on the process used to compile the CUDA SDK projects on Linux, using common.mk.

Hint: To compile main, use make program, not make main. The other binaries are compiled using e.g., make icache1, which is a shortcut for make -f Makefile-icache1 program.
Hint: make verbose=1 is sometimes useful.

The source code is split up into multiple executables, each of which is separately compiled. This was done mainly because the CUDA compilation tools would crash when the size of the compiled code got too big. Some of our microbenchmarks are big, especially those measuring the instruction cache where we inflate the code size as part of the test. Compiler warnings about optimizing huge functions and that the compiler may run out of memory or run very slowly are expected for the instruction cache tests. "Cannot tell what pointer points to, assuming global memory space" warnings are expected for global.cu.
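As a rough illustration of how code size can be inflated for an instruction cache test, the sketch below pastes one statement many times into a straight-line function using preprocessor macros. It is plain C++ so that it compiles on the host. The macro names and the unrolled_chain/looped_chain functions are hypothetical and are not taken from the benchmark sources.

```cpp
#include <cassert>

// Hypothetical helper macros (illustrative only, not from the benchmark
// sources). Each level pastes its argument several times, so the amount
// of source text grows geometrically.
#define REPEAT2(S)   S S
#define REPEAT4(S)   REPEAT2(S) REPEAT2(S)
#define REPEAT8(S)   REPEAT4(S) REPEAT4(S)
#define REPEAT16(S)  REPEAT8(S) REPEAT8(S)
#define REPEAT64(S)  REPEAT16(S) REPEAT16(S) REPEAT16(S) REPEAT16(S)

// Straight-line body: the statement is expanded 64 times with no loop.
// Each copy depends on the previous one. An optimizing compiler may
// still simplify the chain, so real tests must check the generated code.
inline unsigned unrolled_chain(unsigned x) {
    REPEAT64(x = x * 3u + 1u;)
    return x;
}

// Loop-based reference used to check that the expansion count is right.
// Unsigned arithmetic wraps, so both versions are well defined.
inline unsigned looped_chain(unsigned x, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) x = x * 3u + 1u;
    return x;
}
```

Comparing unrolled_chain(x) with looped_chain(x, 64) confirms that the macro expanded to exactly 64 copies. Larger code footprints follow by adding further REPEAT levels, which is also the point at which compile time and compiler memory use start to grow.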

Potential Issues

Because these benchmarks are rather low-level, they are affected by changes in the compiler or hardware. Here are a few issues we've encountered when changing CUDA compiler or driver versions.

Page Size

The translation page size used by the GPU depends on both the hardware and the CUDA driver version used by the system (Analyse de l'architecture GPU Tesla). Our published experiments observed 4 KB pages, but we observed 64 KB pages after our CUDA driver was updated to version 190.18. Microbenchmarks that depend on the page size have been modified to scale strides and array sizes with the page size, which is set using const int page_size = 4; (in KB) near the top of the relevant source files. You will likely have to change this value for your system. Our microbenchmarks were designed for 4 KB pages and were only briefly tested with 64 KB pages, so unexpected failures, such as running out of memory, may occur when the page size changes.
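To make the scaling concrete, here is a minimal host-side C++ sketch of sizes derived from page_size. Only the page_size constant comes from the description above. The helpers pages_to_bytes and pages_touched are hypothetical names introduced for illustration and do not appear in the benchmark sources.

```cpp
#include <cassert>
#include <cstddef>

// Mirrors the tunable described above: translation page size in KB.
const int page_size = 4;

// Hypothetical helper: convert a size expressed in pages into bytes.
inline std::size_t pages_to_bytes(std::size_t pages, int page_kb = page_size) {
    assert(page_kb > 0);
    return pages * static_cast<std::size_t>(page_kb) * 1024;
}

// Hypothetical helper: number of distinct pages touched when walking
// array_bytes of memory with a fixed stride from a page-aligned base.
// Offsets only increase, so comparing with the previous page suffices.
inline std::size_t pages_touched(std::size_t array_bytes,
                                 std::size_t stride_bytes,
                                 int page_kb = page_size) {
    assert(stride_bytes > 0 && page_kb > 0);
    const std::size_t page_bytes = static_cast<std::size_t>(page_kb) * 1024;
    std::size_t count = 0;
    std::size_t last_page = static_cast<std::size_t>(-1);
    for (std::size_t off = 0; off < array_bytes; off += stride_bytes) {
        const std::size_t page = off / page_bytes;
        if (page != last_page) { ++count; last_page = page; }
    }
    return count;
}
```

With these definitions, a 16-page array walked at a one-page stride touches 16 pages for either page size, as long as both the array size and the stride are scaled. A walk sized for 4 KB pages (64 KB array, 4 KB stride) touches only a single 64 KB page, which is why a TLB test with unscaled sizes would no longer exercise the TLB.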

Pointer Size

The global memory tests use pointer chasing code where the pointer values are stored in an array. Pointers on GT200 are 32 bits. The global memory test will need to be changed if the pointer size changes, e.g., 64-bit pointers on Fermi.
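The following host-side C++ sketch shows the general shape of a pointer-chasing array and why the width of the stored link matters. It is an illustration under assumptions, not the code in global.cu: it stores array indices rather than raw device pointers, and build_chain and chase are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a circular chain: slot i holds the index of slot i + stride,
// wrapping at the end. Link is the stored link type: uint32_t stands in
// for 32-bit pointers, uint64_t for 64-bit pointers.
template <typename Link>
std::vector<Link> build_chain(std::size_t n, std::size_t stride) {
    assert(n > 0);
    std::vector<Link> chain(n);
    for (std::size_t i = 0; i < n; ++i)
        chain[i] = static_cast<Link>((i + stride) % n);
    return chain;
}

// Follow the chain for a number of dependent loads. Each load needs the
// result of the previous one, so the loads cannot overlap and the time
// per step approximates the access latency.
template <typename Link>
Link chase(const std::vector<Link>& chain, std::size_t steps) {
    Link j = 0;
    for (std::size_t k = 0; k < steps; ++k) j = chain[j];
    return j;
}
```

For the same number of slots, a chain of 64-bit links occupies twice as many bytes as a chain of 32-bit links, so any stride or array size that was computed in elements covers a different byte range once the pointer size changes.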

Compiler Optimizations

Microbenchmarks strike a delicate balance: the compiler must optimize well enough to minimize overhead, but not so aggressively that the entire microbenchmark is optimized away. Improved compiler optimizations can therefore break some microbenchmarks. Two recent optimizations have affected our code.

Consecutive calls to __syncthreads with no intervening arithmetic are optimized away by nvcc. The compiler also removes consecutive syncthreads calls when they are separated by a conditional branch but no arithmetic. This optimization breaks program correctness unless the code follows the CUDA programming guide's rule that syncthreads may be called in conditional code only if the branch evaluates identically across the entire block. As a result, our syncthreads latency test now uses a hand-tuned cubin so that consecutive syncthreads calls are not optimized away.

We also noticed that compiler optimization of useless chains of abs() operations has improved. Declaring the variables involved as volatile has so far been sufficient in defeating this optimization.
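A minimal sketch of the volatile workaround, written as host-side C++ rather than as a CUDA kernel. The function names are hypothetical, and whether a given compiler version actually collapses the plain chain is an assumption that has to be checked against the generated code.

```cpp
#include <cassert>
#include <cstdlib>

// abs() is idempotent after its first application, so a compiler is
// allowed to collapse this whole chain into a single abs().
inline int abs_chain_plain(int x) {
    for (int i = 0; i < 8; ++i) x = std::abs(x);
    return x;
}

// Accesses to a volatile object are observable behavior, so every read
// and write of v must be kept. That keeps the loop iterations in the
// generated code, at the cost of extra loads and stores per step.
inline int abs_chain_volatile(int x) {
    volatile int v = x;
    for (int i = 0; i < 8; ++i) v = std::abs(v);
    return v;
}
```

Both functions return the same value. The difference is only in the generated code, so the effect is visible in a disassembly or in a timing measurement, not in the result.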

Compile Flow

Our compile flow is modified to allow patching compiled code with an alternate copy of the cubin. This is accomplished with the obj/release/%.cu_sm_13_o rule in morerules.mk and the build_cubin script. The cubin is usually created by running a compile with -keep and then editing the cubin and renaming it to filename.real_cubin. The current makefile will look for filename.real_cubin during a compile for any filename.cu files listed in CUFILES_sm_13. This method of patching in pre-compiled cubin code is fragile and will likely break for different versions of nvcc.

Tests

Details of our measurements on GTX280 can be found in the paper. The following is a summary of the contents of the source code.

Test	Executable	Source files
Clock	main	clock.cu
Arithmetic pipelines	main	pipeline.cu, instructions.h
Syncthreads latency	main	pipeline.cu, ksync_uint_dep128.real_cubin
Control flow	diverge	diverge.cu, diverge2.cu, diverge3.cu, diverge4.cu, path.cu
Syncthreads	sync	sync.cu, sync2.cu
Register file	main	regfile.cu
Shared memory	shared	shared.cu
Global memory	global	global.cu
Texture memory	texture4	texture4.cu
Global TLBs	global	global.cu
Texture TLBs	texture4, texture2	texture4.cu, texture2.cu
Constant memory	cmem	cmem.cu
Instruction caches	icache[1-4]	icache.cu, icache_kernels, icache2.cu, icache2_1.cu, icache2_2.cu, icache2_ibuffer.cu, icache3.cu, icache3_1.cu, icache3_2.cu, icache3_kernel.h, icache4.cu, icache4_L1.cu

Test Descriptions

main	Files	Description
empty threads	empty.cu	Measure time to launch and complete an empty kernel
kclock	clock.cu	Latency for reading the clock register
kclock_test2	clock.cu	Measure clock register "skew" between TPCs. Figure 5.
arithmetic	pipeline.cu, instructions.h, ksync_uint_dep128	Latency/throughput for various operations
histogram	pipeline.cu	Measure throughput vs. number of warps. Figure 6.
dual issue	pipeline.cu	Try to exercise dual issue with MAD and MUL
syncthreads	pipeline.cu, ksync_uint_dep128	Measure syncthreads latency vs. number of warps needing to sync.
Register file capacity	regfile.cu	Register file capacity (and banking). Figure 9.
diverge	Files	Description
execution order	path.cu	Execution order of two-way diverged warp
kdiverge_test	diverge.cu	Measure execution times for diverged warps for various group sizes and warp counts. Figure 7. Subgroups that branch coherently execute concurrently.
kdiverge2_test[1-5]	diverge2.cu	Various code sequences that deadlock due to divergence. Only kdiverge2_test3 does not. Behaviour could potentially change if compiler code generation changes. Listing 4 is kdiverge2_test1.
kdiverge3_test1	diverge3.cu	Test showing thread ID does not affect execution order under divergence. Listing 3, Figure 8.
kdiverge3_test2	diverge3.cu	Test showing size of group that branches coherently does not affect execution order.
kdiverge3_test3	diverge3.cu	Test showing reconvergence after a loop containing divergence inside.
kdiverge3_test4	diverge3.cu	Test containing a loop that does not reconverge at the end of the loop due to the use of "break" inside the loop. Weird code generation that may be fixed by a smarter compiler.
kdiverge3_test5	diverge3.cu	Same as test4, except the number of loop iterations depends on the thread index. This causes the compiler to generate a reconvergence point after the loop.
sync	Files	Description
ksync_test1	sync.cu	Only some threads in a warp calling __syncthreads() does not cause kernel hang.
ksync_test2	sync.cu	Two half-warps cannot sync using syncthreads and transfer values in a shared variable. Listing 5.
ksync_test3	sync.cu	One of two warps calling syncthreads does not cause kernel hang, but will wait until other warp terminates.
ksync_test4	sync.cu	Testing for timeout mechanism for syncthreads between warps. None detected. Listing 6.
ksync2_test4	sync2.cu	Two half-warps synchronizing with a warp calling syncthreads twice. Listing 7.
shared	Files	Description
shared_latency	shared.cu	Shared memory latency for varying stride.
global	Files	Description
global1	global.cu	Global memory latency for small array.
global4	global.cu	Global L2 TLB line/page size. Figure 15.
global5	global.cu	Global memory TLBs. Figure 13.
global6	global.cu	Global L1 TLB associativity. Figure 14.
texture2	Files	Description
texture2	texture2.cu	Texture TLB line/page size.
texture4	Files	Description
texture4	texture4.cu	Stride accesses into linear texture memory at various strides and array sizes. Measures texture caches and TLB properties. Figures 10, 11, 12, and 16.
cmem	Files	Description
cmem	cmem.cu	All the constant memory tests. Constant memory latency and caching, constant memory sharing, constant memory sharing with instruction cache, constant cache bandwidth. Figures 17 to 21.
icache1	Files	Description
icache1	icache.cu, icache_kernels.h, icache_kernels[1-4].cu	Instruction cache latency vs. code size in 0.5 KB and 2 KB steps. Also tests instruction cache sharing with two concurrent thread blocks. Figure 22.
icache2	Files	Description
L2	icache2.cu, icache2_1.cu, icache2_2.cu	Measure instruction cache L2 (8 KB) parameters. Similar to icache1, but finer granularity.
Instruction buffer	icache2.cu, icache_ibuffer.cu	Measure size of an instruction fetch. Figure 24.
icache3	Files	Description
L3	icache3.cu, icache3_1.cu, icache3_2.cu, icache3_kernel.h	Measure instruction cache L3 (32 KB) parameters. Similar to icache2, but larger footprint and coarser granularity.
icache4	Files	Description
L1	icache4.cu, icache4_L1.cu	Measure L1 cache parameters by creating contention for the L2 from threads running on other TPCs. Figure 23.