MTIMES Benchmarks

From Jacket Wiki

Description

Performance timings for General Matrix Multiply (GEMM) for Jacket.

Jacket's custom GEMM (MTIMES) makes heavy use of registers, shared memory, texture memory, optimized CUDA thread block sizes, and multiple outputs per CUDA thread block.

Jacket-1.3 is based on CUBLAS, while Jacket-1.4 has a custom GEMM implementation.

Also check out the three blog posts on Jacket's SGEMM performance.

NOTE: Our GEMM assumes Alpha=1 and Beta=0 in C = Alpha*A*B + Beta*C, so it computes C = A*B.
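
To illustrate the general GEMM form C = Alpha*A*B + Beta*C and the Alpha=1, Beta=0 special case in the note above, here is a minimal CPU reference sketch in C. It is not Jacket's GPU kernel; the function name gemm_ref, the square row-major layout, and the single-precision type are assumptions made for this example only.

```c
#include <stddef.h>

/* Hypothetical CPU reference, not Jacket's implementation.
   Computes C = alpha*A*B + beta*C for n x n row-major matrices.
   With alpha = 1 and beta = 0 this reduces to C = A*B. */
void gemm_ref(size_t n, float alpha, const float *A, const float *B,
              float beta, float *C)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float s = 0.0f;
            for (size_t k = 0; k < n; k++) {
                s += A[i * n + k] * B[k * n + j];
            }
            if (beta == 0.0f) {
                /* Do not read C at all, so stale or uninitialized
                   values in the output cannot leak into the result. */
                C[i * n + j] = alpha * s;
            } else {
                C[i * n + j] = alpha * s + beta * C[i * n + j];
            }
        }
    }
}
```

Calling gemm_ref(n, 1.0f, A, B, 0.0f, C) corresponds to the case benchmarked on this page: the previous contents of C are ignored and the result is the plain product A*B.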


Jacket 1.3 vs Jacket 1.4 vs Jacket 1.5 (all from MATLAB)

Measuring every NxN matrix from 10×10 to 2500×2500 | Tesla C2050
[Figure: Mtimes fermi j15 sgemm.jpg]

Jacket-1.4 Final vs CUBLAS-3.1 vs MAGMA-0.2 (all from C++)

Measuring every NxN matrix from 2×2 to 2500×2500 | Tesla C2050
(all from the same C++ binary executable)
[Figure: Jacket14 sgemm 3.jpg]

[Figure: Jacket14 dgemm 3.jpg]

Jacket 1.3 (CUBLAS-3.1) vs Jacket 1.4 Final (all from MATLAB)

Measuring every NxN matrix from 2×2 to 2500×2500 | Tesla C2050
[Figure: Jacket14 sgemm.jpg]

[Figure: Jacket14 dgemm.jpg]

[Figure: Jacket14 cgemm.jpg]


Formula used for computing GFLOPS

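GFLOPS = (2 * N^3) / (t * 10^9), where N is the matrix dimension and t is the elapsed time in seconds (SquareMatGflopsEqn.gif)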

Note:

There seem to be inconsistencies around the web in how GFLOPS is calculated.

According to some, the only true way to do it is to step through the GPU code
and count operations line by line, figure out how many times the kernel got called,
and then divide by the time. While this approach may give the "true" GFLOPS,
it is very tedious and architecture dependent.

After searching through various academic papers and NVIDIA forum posts and talking to some professors,
I found that the 'majority' of people use 2*N^3/sec because it represents the traditional
"naive" CPU version of matrix multiplication:

// a is m x n, b is n x p, c is m x p
for (int j = 0; j < p; j++) {
  for (int i = 0; i < m; i++) {
    double s = 0;
    for (int k = 0; k < n; k++) {
      s += a[i][k] * b[k][j];  // one multiply and one add per iteration
    }
    c[i][j] = s;
  }
}

Also, in computer science theory, the operations + - * / are all treated as equal in cost.
So, analyzing the loop above for a square matrix (m = n = p = N), there are 2 'ops'
(one multiply and one add) done N^3 times.

The above formula seemed to be the most widely used
for comparison, so I went with it. I am not opposed to others, though.

~ Chris McClanahan
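
The 2*N^3 convention described in the note can be sketched as a small C helper. The function name square_gemm_gflops and the use of doubles for N and the elapsed time are assumptions for illustration; this is not the script that produced the plots on this page.

```c
#include <assert.h>

/* Hypothetical helper: GFLOPS of an N x N matrix multiply under the
   2*N^3 convention (one multiply and one add per inner-loop step). */
double square_gemm_gflops(double n, double seconds)
{
    assert(n > 0.0 && seconds > 0.0);     /* reject meaningless inputs */
    double flops = 2.0 * n * n * n;       /* total operations counted */
    return flops / (seconds * 1e9);       /* operations per second, in units of 10^9 */
}
```

As a purely made-up timing, a 1000×1000 multiply that took 0.01 seconds would be reported as 2*1000^3 / (0.01 * 10^9) = 200 GFLOPS. Under this convention the reported number depends only on N and the wall-clock time, not on how many instructions the kernel actually executed.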


Hardware used

>> ginfo
CUDA driver 256.35, CUDA toolkit 3.1
Detected CUDA-capable GPUs:
GPU0 Tesla C2050, 1147 MHz, 2688 MB VRAM, Compute 2.0 (single,double)
$ cat /proc/cpuinfo
vendor_id	: GenuineIntel
model name	: Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz
cpu MHz         : 2666.630
cache size	: 3072 KB
cpu cores	: 4
...


References



Raw Mat Files (in gflops)



DIY


Timeout

If you are on Windows Vista or Windows 7 and get a timeout error, try increasing the TDR (Timeout Detection and Recovery) delay. The TDR delay is how long Windows waits for the GPU to finish its current work; once this time expires, the GPU driver is reset. The default value on a fresh install of Windows is 2 seconds, and the Jacket installer asks to increase it to 7. The purpose of the TDR reset is to keep the desktop responsive, so the only negative side effect of increasing the timeout is a longer wait before control of an unresponsive GPU returns to the OS.

Here are some easy registry files to change the timeout value:

  • TDR_Fix zip file
    • Windows 7 / Vista only
    • 7 seconds is the Jacket default
    • To Use: Just double-click the tdr_fix_20.reg file (inside the .zip) to increase the timeout delay to 20 seconds.
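
The contents of tdr_fix_20.reg are not reproduced on this page. As a sketch only, a registry file that sets a 20-second delay would typically look like the fragment below, assuming it uses the standard TdrDelay value under the GraphicsDrivers key as documented by Microsoft; the actual file in the zip may differ.

```
Windows Registry Editor Version 5.00

; Assumed contents, for illustration only.
; TdrDelay is in seconds; 0x14 hexadecimal = 20 decimal.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:00000014
```

Changing this value requires administrator rights, and a restart is needed before the new delay takes effect.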



Links

  • Torben's page on Jacket's floating point performance
  • Watch the "7-Tricks" video presented at the 2010 GTC on Convolution and Matrix Multiplication