MTIMES Benchmarks
From Jacket Wiki
Description
Performance timings for General Matrix Multiply (GEMM) for Jacket.
Jacket's custom GEMM (MTIMES) makes heavy use of registers, shared memory, texture memory, optimized CUDA thread block sizes, and multiple outputs per CUDA thread block.
Jacket-1.3 is based on CUBLAS, while Jacket-1.4 has a custom GEMM implementation.
Also check out the three blog posts on Jacket's SGEMM performance.
NOTE: Our GEMM assumes Alpha=1 and Beta=0 in C = Alpha*A*B + Beta*C, so it computes C = A*B.
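To make the Alpha/Beta note concrete, here is a minimal reference sketch in plain C (not Jacket's GPU kernel). The names gemm_ref and mtimes_ref are hypothetical, and square row-major single-precision matrices are assumed for brevity.

```c
#include <stddef.h>

/* General form on row-major N x N matrices: C = alpha*A*B + beta*C.
 * Hypothetical reference routine, written for clarity, not speed. */
void gemm_ref(size_t n, float alpha, const float *a, const float *b,
              float beta, float *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float s = 0.0f;
            for (size_t k = 0; k < n; k++)
                s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = alpha * s + beta * c[i * n + j];
        }
    }
}

/* Special case alpha = 1, beta = 0: a plain product C = A*B.
 * The old contents of C are never read. */
void mtimes_ref(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float s = 0.0f;
            for (size_t k = 0; k < n; k++)
                s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
    }
}
```

With a zero-initialized C, gemm_ref(n, 1.0f, a, b, 0.0f, c) and mtimes_ref(n, a, b, c) produce the same result. The special case skips the scaling and the read of C.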
Jacket 1.3 vs Jacket 1.4 vs Jacket 1.5 (all from MATLAB)
Measuring every NxN matrix from 10×10 to 2500×2500 | Tesla C2050
Jacket-1.4 Final vs CUBLAS-3.1 vs MAGMA-0.2 (all from C++)
Measuring every NxN matrix from 2×2 to 2500×2500 | Tesla C2050
(all from the same C++ binary executable)
Jacket 1.3 (CUBLAS-3.1) vs Jacket 1.4 Final (all from MATLAB)
Measuring every NxN matrix from 2×2 to 2500×2500 | Tesla C2050
Formula used for computing GFLOPS
Note:
There seem to be inconsistencies around the web in how to calculate GFLOPS.
According to some, the only true way to do it is to step through the GPU code
and count operations line by line, figure out how many times the kernel got called,
and then divide by the time. While this approach may give the "true" GFLOPS,
it is very tedious and architecture dependent.
After searching through various academic papers and NVIDIA forum posts and talking to some professors,
I found that the majority of people use 2*N^3 / seconds (divided by 10^9 to get GFLOPS) because it represents the traditional
"naive" CPU version of matrix multiplication:
    for (int j = 0; j < p; j++) {
        for (int i = 0; i < m; i++) {
            double s = 0;
            for (int k = 0; k < n; k++) {
                s += a[i][k] * b[k][j];
            }
            c[i][j] = s;
        }
    }
Also, in computer science theory, all of the operations + - * / are treated as equal in cost.
So, analyzing the loop above for a square matrix (m = n = p = N), there are 2 'ops' (one multiply and one add) done N^3 times.
This seemed to be the most widely used formula
for comparison, so I went with it. I am not opposed to others, though.
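The counting argument and the resulting formula can be sketched in C as follows. This is an illustration only, not the benchmark harness used for the plots. The function names matmul_count_flops and gemm_gflops are hypothetical, and row-major flat arrays are assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Naive (m x n) times (n x p) product on row-major arrays.
 * Returns the number of floating point operations performed,
 * counting one multiply and one add per inner iteration. */
unsigned long long matmul_count_flops(size_t m, size_t n, size_t p,
                                      const double *a, const double *b,
                                      double *c)
{
    unsigned long long flops = 0;
    for (size_t j = 0; j < p; j++) {
        for (size_t i = 0; i < m; i++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++) {
                s += a[i * n + k] * b[k * p + j];
                flops += 2; /* one multiply, one add */
            }
            c[i * p + j] = s;
        }
    }
    return flops; /* equals 2*m*n*p, i.e. 2*N^3 for square matrices */
}

/* GFLOPS for an N x N multiply that took `seconds` of wall-clock time. */
double gemm_gflops(size_t n, double seconds)
{
    assert(seconds > 0.0);
    double nd = (double)n;
    return 2.0 * nd * nd * nd / seconds / 1e9;
}
```

For example, a 1000×1000 multiply that takes one second rates 2 GFLOPS under this formula. How the elapsed time is measured is a separate choice that the formula does not fix.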
Hardware used
    >> ginfo
    CUDA driver 256.35, CUDA toolkit 3.1
    Detected CUDA-capable GPUs:
    GPU0 Tesla C2050, 1147 MHz, 2688 MB VRAM, Compute 2.0 (single,double)

    $ cat /proc/cpuinfo
    vendor_id  : GenuineIntel
    model name : Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz
    cpu MHz    : 2666.630
    cache size : 3072 KB
    cpu cores  : 4
    ...
References
- Volkov, V., and Demmel, J. W. Benchmarking GPUs to tune dense linear algebra, SC08.
- SGEMM work by Vasily Volkov
- SGEMM paper by Lung-Sheng Chien
Raw Mat Files (in gflops)
- SGEMM (Benched in MATLAB)
  - CUBLAS
  - Jacket 1.4
  - New! Jacket 1.5
- DGEMM (Benched in MATLAB)
- GEMM (Benched in C++)
DIY
Timeout
If you are on Windows Vista or Windows 7 and get a timeout error, you can try modifying the TDR (Timeout Detection and Recovery) delay. The TDR delay is how long Windows waits for the GPU to finish its current work. Once this time expires, the GPU driver is reset. The default value is 2 seconds on a fresh install of Windows. The Jacket installer asks to increase this to 7 seconds. The original purpose of the TDR reset is to keep the desktop responsive, so the only negative side effect of increasing the timeout is that a hung GPU takes longer to return control to the OS.
Here are some easy registry files to change the timeout value:
- TDR_Fix zip file
- Windows 7 / Vista only
- 7 seconds is the Jacket default
- To Use: Just double-click the tdr_fix_20.reg file (inside the .zip) to increase the timeout delay to 20 seconds.
Links
- Torben's page on Jacket's floating point performance
- Watch the "7-Tricks" video presented at the 2010 GTC on Convolution and Matrix Multiplication
