Jacket Floating Point Performance (GFlops)


Most of what Jacket does boils down to floating point operations, and lots of them. This page shows two ways of measuring the floating point performance of a GPU using matrix computations: the first is based on matrix multiplication, the second on element-wise matrix operations.

Abstract

To provide some recent results, the following two figures show measured floating point performance: the first in single precision GFLOPS and the second in double precision GFLOPS. For double precision data, the FX-3800 and the C1060 both come close to their theoretical maximum. The C2050 clearly outperforms the C1060, and in single precision it does so by considerably more than the figures announced by NVIDIA suggest.

Fig. 0a: Key data: Matrix element multiply; K=2; Measured floating point performance in single precision for a number of different GPUs. The CPU reference is an Intel Core i7-975 Extreme. Jacket version: 1.4.0, NVIDIA Windows 7 driver: 257.21.
Fig. 0b: Key data: Matrix element multiply; K=2; Measured floating point performance in double precision for a number of different GPUs. The CPU reference is an Intel Core i7-975 Extreme. Jacket version: 1.4.0, NVIDIA Windows 7 driver: 257.21.



Matrix Multiplications

In the following, the first approach to measuring floating point performance is based on matrix multiplication. First, the theory is briefly described, with focus on finding an expression for the number of multiplications and additions needed to multiply any number of square matrices. Next, the source code for the benchmark is shown. Finally, some practical results are presented.


Theory

When performing matrix multiplications of K square matrices \mathbf{A}_1 \cdot \mathbf{A}_2 \cdots \mathbf{A}_K the result is:



\mathbf{R} = \mathbf{A}_1 \cdot \mathbf{A}_2 \cdots \mathbf{A}_K,\quad \mathbf{A}_1,\mathbf{A}_2, \ldots, \mathbf{A}_K \in \mathbb{R}^{N\times N}


The number of floating point operations needed to perform this matrix multiplication is as follows:


  1. Number of multiplications: (K-1)\cdot N^3.
  2. Number of additions: (K-1)\cdot N^2\cdot (N-1)


In total the number of floating point operations is:



F_{\rm floats} = (K-1) \cdot N^2 \cdot (2\cdot N-1)
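
The total follows from counting the operations in one product of two N \times N matrices and scaling by the K-1 products in the chain. The short derivation below assumes the chain is evaluated as successive pairwise products, as the two counts above imply; F_{\rm pair} is a symbol introduced here for illustration only:

```latex
\begin{align}
F_{\rm pair}   &= \underbrace{N^2 \cdot N}_{\rm multiplications}
                + \underbrace{N^2 \cdot (N-1)}_{\rm additions}
                = N^2 \cdot (2\cdot N-1) \\
F_{\rm floats} &= (K-1) \cdot F_{\rm pair}
                = (K-1) \cdot N^2 \cdot (2\cdot N-1)
\end{align}
```

For K = 2 this reduces to 2\cdot N^3 - N^2, which is the expression used in the benchmark function flopsMx2.m.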

When measuring floating point performance it is essential to have as many arithmetic operations as possible per memory access (read/write). The matrix product above is one of the better choices because F_{\rm floats} \propto N^3 while the amount of data only grows as N^2. For element-wise operations, by contrast, F_{\rm floats} \propto N^2, so the measured floating point performance is significantly lower simply because it is limited by memory access.
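
To make the memory argument concrete, the sketch below compares the number of floating point operations per matrix element touched for the two cases with K = 2. It assumes the idealised situation where each of the two inputs is read once and the result is written once (3N^2 elements in total); the variable names are illustrative and not part of the benchmark code.

```matlab
% Illustrative only: operations per matrix element touched, K = 2.
% Assumes each input is read once and the result written once (3*N^2 elements).
N = [10 100 1000];

flopsMatMul   = N.^2 .* (2*N - 1);   % matrix product: N^2*(2N-1)
flopsElement  = N.^2;                % element-wise product: N^2
elementsMoved = 3 * N.^2;            % two inputs + one output

intensityMatMul  = flopsMatMul  ./ elementsMoved;   % grows linearly with N
intensityElement = flopsElement ./ elementsMoved;   % constant 1/3

for k = 1:numel(N)
    fprintf('N = %4d: matmul %7.1f, element-wise %4.2f flops/element\n', ...
            N(k), intensityMatMul(k), intensityElement(k));
end
```

Under these assumptions the matrix product offers roughly 2N/3 operations per element moved, whereas the element-wise product stays at 1/3 regardless of N.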


Source Code

The source code comes in two parts: 1) a function that performs the actual benchmark, and 2) a master file that sets up the names of the figures and data files and the directory in which to store them. There is one benchmark function for each number of matrices to be multiplied, K. This could of course have been done in a single file, but a separate function per K was chosen here.
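
As a rough illustration of the single-file alternative, the product of K matrices can be written as a loop over a cell array. This is only a CPU sketch of the kernel and its operation count; the names K, A and R and the chosen values are illustrative, and the actual benchmark files keep one function per K.

```matlab
% Sketch: chained product of K square matrices and its operation count.
K = 5;                          % number of matrices (illustrative value)
N = 64;                         % matrix size (illustrative value)

A = cell(1, K);
for k = 1:K
    A{k} = randn(N, N, 'single');
end

% K-1 successive matrix products
R = A{1};
for k = 2:K
    R = R * A{k};
end

% Operation counts from the theory section
nMul    = (K-1) * N^3;
nAdd    = (K-1) * N^2 * (N-1);
Ffloats = (K-1) * N^2 * (2*N - 1);
fprintf('K = %d, N = %d: %d floating point operations\n', K, N, Ffloats);
```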


An example of source code for this analysis with K = 2 is shown immediately below. A few things should be noted in relation to this code:


  1. There is a sweep of matrix size from 1 \times 1 to N_{\rm max} \times N_{\rm max}.
  2. MATLAB is used to generate the matrices, which are then transferred to Jacket with gsingle(). I have seen cases where the specific data values affect the speed of computation (strange but true), so converting the same MATLAB data with gsingle() is the safe way to perform the test.
  3. Averaging is used such that small matrices are repeated more often than large ones. The reason is that the timed interval must be long enough for tic-toc to give repeatable results. A while-end construct ensures that the total time used for each benchmark point is at least Tmin. A Tmin of 0.5 second has provided good and repeatable results; the listing below uses Tmin = 1.
  4. Both the CPU (meaning standard MATLAB) and GPU (meaning Jacket) runs are warmed up by running the matrix multiplication twice before the actual test.
  5. Since memory problems were seen with Jacket 1.3RC2 (and later), the trick "clear gpu_hook" is used to free up memory.


With these few issues in mind the benchmarking function named flopsMx2.m is quite simply:


function [] = flopsMx2( Size, Dname, Fname, TitleStr, State )
% flopsMx2 GFlops count by use of matrix multiplication
 
%% GFlops benchmark based on multiplication of two matrices.
% (C) Torben Larsen, Aalborg University, 24-JUL-2010
%     E-mail: tl.jacket@es.aau.dk
%     http://wiki.accelereyes.com/wiki/index.php/Torben%27s_Corner
%
 
% Minimum execution time for the individual benchmark point
Tmin = 1;
 
% Max. number of repetitions in loop timing estimation
MaxAvg = 1E9;
 
 
%% INITIALIZE VECTORS
Mem_MB = zeros(length(Size),1);
Fname = ['flopsMx2_' Fname];
 
 
%% PREALLOCATE ARRAYS ETC. - IF "RESUME" IS USED THEN LOAD DATA
% If State==RESUME, data is loaded from the existing file for the given
% benchmark and continued from where it came to.
if strcmp(State,'RESUME')
    load([ Dname '/' Fname '.mat']);
    ii = length(find(GFlops_cpu>0));
    SizeN = Size(ii+1:end);
else
    GFlops_cpu = zeros(length(Size),1);
    GFlops_gpu = zeros(length(Size),1);
    T_CPU = zeros(length(Size),1);
    T_CPU_tot = zeros(length(Size),1);
    T_GPU = zeros(length(Size),1);
    T_GPU_tot = zeros(length(Size),1);
    ii = 0;
    SizeN = Size;
end
 
 
%% PERFORM ANALYSIS
for N=SizeN
    ii = ii + 1;
 
    % Define matrices
    Ac = randn(N,N,'single');
    Bc = randn(N,N,'single');
 
    % Print matrix size
    fprintf('%4.0f / %4.0f', N, max(Size));
 
    % CPU test begin --------------------------------------------------
    whilecount = 0;
    Telap_cpu = -1;
    while Telap_cpu < Tmin
        whilecount = whilecount + 1;
        if Telap_cpu == -1
            t1 = tic;
            Rc = Ac*Bc;   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            Rc = Ac*Bc;   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            Telap_cpu = toc(t1)/2;
            NoRunsCPU = ceil(1.5*Tmin/Telap_cpu);
        else
            NoRunsCPU = ceil(1.5*whilecount*NoRunsCPU/Telap_cpu*Tmin);
        end
 
        % Warm-up        
        for no=1:NoRunsCPU
            Rc = Ac*Bc;   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
        end
 
        % Benchmark
        tstart1 = tic;
        for no=1:NoRunsCPU
            Rc = Ac*Bc;   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
        end
        Telap_cpu = toc(tstart1);
    end
 
    % Determine time for CPU loop alone
    RPT = min(5E3,ceil(MaxAvg/NoRunsCPU));
    tstart = tic;
    for AvgNo=1:RPT
        for no=1:NoRunsCPU
        end
    end
    T_CPU_Loop = toc(tstart)/RPT;
 
    % Compute CPU times
    T_CPU(ii) = max((Telap_cpu-T_CPU_Loop)/NoRunsCPU,2.5E-10);
    T_CPU_tot(ii) = Telap_cpu;
    fprintf('  |  T_CPU: %6.1f,', T_CPU_tot(ii));
    GFlops_cpu(ii) = (2*N^3-N^2)/(T_CPU(ii)*1E9);
    fprintf('   %7.1f [GFlops]', GFlops_cpu(ii));
    % CPU test end   --------------------------------------------------
 
    % GPU test begin --------------------------------------------------
    Ag = gsingle(Ac);
    Bg = gsingle(Bc);
    geval(Ag,Bg);
 
    whilecount = 0;
    Telap_gpu = -1;
    while Telap_gpu < Tmin
        whilecount = whilecount + 1;
        if Telap_gpu == -1
            gsync;
            t1 = tic;
            Rg = Ag*Bg;   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            geval(Rg);
            Rg = Ag*Bg;   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            geval(Rg);
            gsync;
            Telap_gpu = toc(t1)/2;
            NoRunsGPU = ceil(1.5*Tmin/Telap_gpu);
        else
            NoRunsGPU = ceil(1.5*whilecount*NoRunsGPU/Telap_gpu*Tmin);
        end
 
        % Warm-up
        gsync;
        for no=1:NoRunsGPU
            Rg = Ag*Bg;   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            geval(Rg);
        end
 
        % Benchmark
        gsync;
        tstart1 = tic;
        for no=1:NoRunsGPU
            Rg = Ag*Bg;   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            geval(Rg);
        end
        gsync;
        Telap_gpu = toc(tstart1);
    end
 
    % Determine time for GPU loop alone
    RPT = min(5E3,ceil(MaxAvg/NoRunsGPU));
    tstart = tic;
    for AvgNo=1:RPT
        for no=1:NoRunsGPU
        end
    end
    T_GPU_Loop = toc(tstart)/RPT;
 
    % Compute GPU times
    T_GPU(ii) = max((Telap_gpu-T_GPU_Loop)/NoRunsGPU,2.5E-10);
    T_GPU_tot(ii) = Telap_gpu;
    fprintf('   |   T_GPU: %6.1f,', T_GPU_tot(ii));
    GFlops_gpu(ii) = (2*N^3-N^2)/(T_GPU(ii)*1E9);
    fprintf('   %7.1f [GFlops]', GFlops_gpu(ii));
    gpu_info = gpu_entry(13);
    Mem_MB(ii) = gpu_info.gpu_free/1E6;
    clear gpu_hook;
    fprintf('   |   Mem free [MB]:  %6.1f', Mem_MB(ii));
    % GPU test end   --------------------------------------------------
 
    % Print *** as a warning for simulation time violation
    % (should not be possible unless something spooky is going on)
    if T_CPU_tot(ii)>=Tmin && T_GPU_tot(ii)>=Tmin
        fprintf('\n');
    else
        fprintf('  ***\n');
    end
 
    % Save data and plot for every 10 data points
    if ii/10==floor(ii/10)
        save([ Dname '/' Fname '.mat'], 'Size', ...
             'T_CPU', 'T_CPU_tot', 'GFlops_cpu', ...
             'T_GPU', 'T_GPU_tot', 'GFlops_gpu');
 
        figure(1); clf(1);
        plot((2*Size(1:ii).^3-Size(1:ii).^2)/1E9, GFlops_cpu(1:ii), 'r-', ...
             (2*Size(1:ii).^3-Size(1:ii).^2)/1E9, GFlops_gpu(1:ii), 'g-', ...
             'Linewidth',1.5);
        grid;
        xlabel('Complexity   [GFlop]');
        ylabel('Performance   [GFlops]');
        legend('CPU', 'GPU', 'Location', 'SouthEast');
        title(['Mx2: ' TitleStr]);
 
        % Save figure
        print( gcf, '-djpeg99', '-r100', [ Dname '/' Fname '.jpg'] );
        print( gcf, '-depsc2', '-r2400', [ Dname '/' Fname '.eps'] );
    end
end
 
end


The master file that runs the function is, for example, Asus_flopsMx2.m, with the following content:


%% MASTER SCRIPT FOR GFLOPS COUNT
 
% IMPORTANT NOTE:  The name of the benchmark file MUST be like:
%                  Name_FCT.m where FCT is the name of the function
%                  to be benchmarked (all capital letters) - e.g. BESSELJ.
%                  A full name could be: AsusG51J_BESSELJ.m or
%                  Asus_G51J_BESSELJ.
%
% Platform:        Asus G51J
% CPU:             Intel Core i7-720QM 1.6GHz
% CPU GFlops:      25.6 (http://www.intel.com/support/processors/sb/cs-023143.htm)
% CPU mem.:        4 GB
% GPU:             NVIDIA GTX260M
% GPU GFlops:      462 (http://www.nvidia.com/object/product_geforce_gtx_260m_us.html)
% GPU mem.:        1024 MB
% Operating sys.:  Microsoft Windows 7 x64
% Jacket ver.:     1.4.0 (build 6080)
% NVIDIA driver:   257.21
% CUDA Toolkit:    3.1
 
%% SET UP INPUT DATA
% Set number of threads to 1
maxNumCompThreads(1);
 
% Matrix size
Size = [2:1:3390];
 
% Core name of plot and data files
Fname = 'Asus_G51J_01';
Dname = './Asus.GTX260M';
 
% Name of title in plot
TitleStr = 'Asus G51J: Core i7-720QM (25.6 GFlops) & GTX260M (462 GFlops)';
 
% Computation type;
%   State = 'BENCH' for cold start
%   State = 'RESUME' for continuation of computations
State = 'BENCH';
 
% Perform computation
flopsMx2( Size, Dname, Fname, TitleStr, State );

Place the function file flopsMx2.m and the master script Asus_flopsMx2.m in the same directory. In addition, a directory named ./Asus.GTX260M (the value of Dname in the master script) must exist; this is where the execution data and figures are stored.
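
If the output directory does not exist yet, it can be created from MATLAB before the benchmark is started. A minimal sketch, assuming the same Dname as in the master script:

```matlab
% Create the output directory used by the master script if it is missing
Dname = './Asus.GTX260M';
if ~exist(Dname, 'dir')
    mkdir(Dname);
end
```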

The complete list of source code files can be found here:

The function files use quite advanced timing techniques. More information on these principles can be found here.
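
The core of the timing principle can be sketched in a few lines. The version below is a simplified, CPU-only illustration and not the code used for the results on this page: it repeats the operation until the timed loop lasts at least Tmin and then subtracts the cost of the empty loop. The name op and the values of Tmin and N are illustrative; with Jacket, the gsync and geval calls shown in flopsMx2.m would also be needed.

```matlab
% Sketch of the timing principle: repeat the operation until the timed
% loop lasts at least Tmin, then subtract the cost of the empty loop.
Tmin = 0.1;                          % illustrative; shorter than in flopsMx2.m
N    = 50;                           % illustrative matrix size
A = randn(N, N, 'single');
B = randn(N, N, 'single');
op = @() A * B;                      % operation under test (illustrative name)

% First estimate of the time per operation from two runs
t0 = tic; R = op(); R = op();
tEst  = max(toc(t0)/2, 1e-9);        % guard against a zero reading
nRuns = ceil(1.5 * Tmin / tEst);

tElapsed = 0;
while tElapsed < Tmin
    for n = 1:nRuns, R = op(); end   % warm-up
    t0 = tic;
    for n = 1:nRuns, R = op(); end   % timed loop
    tElapsed = toc(t0);
    if tElapsed < Tmin               % too short: scale up and try again
        nRuns = ceil(1.5 * nRuns * Tmin / max(tElapsed, 1e-9));
    end
end

% Time of the empty loop with the same number of iterations
t0 = tic;
for n = 1:nRuns, end
tLoop = toc(t0);

% Time per operation (same lower bound as in flopsMx2.m) and GFlops for K = 2
tPerOp = max((tElapsed - tLoop) / nRuns, 2.5e-10);
GFlops = (2*N^3 - N^2) / (tPerOp * 1e9);
fprintf('%d runs, %.3g s per product, %.2f GFlops\n', nRuns, tPerOp, GFlops);
```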


Results For An Asus G51J With A GeForce GTX260M GPU

The results from running the test on an Asus G51J with an Intel Core i7-720QM CPU and an NVIDIA GeForce GTX260M GPU are shown in the figures below. The peak floating point performance is 462 GFlops according to NVIDIA.


Fig. 1a: Key data: Matrix multiply; K=2; GTX260M; Measurement of floating point performance for multiplying two matrices using an Asus G51J with an Intel Core i7-720QM CPU and an NVIDIA GeForce GTX260M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Windows 7 driver: 257.21.
Fig. 1b: Key data: Matrix multiply; K=5; GTX260M; Measurement of floating point performance for multiplying five matrices using an Asus G51J with an Intel Core i7-720QM CPU and an NVIDIA GeForce GTX260M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Windows 7 driver: 257.21.
Fig. 1c: Key data: Matrix multiply; K=10; GTX260M; Measurement of floating point performance for multiplying ten matrices using an Asus G51J with an Intel Core i7-720QM CPU and an NVIDIA GeForce GTX260M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Windows 7 driver: 257.21.
Fig. 1d: Key data: Matrix multiply; K=20; GTX260M; Measurement of floating point performance for multiplying twenty matrices using an Asus G51J with an Intel Core i7-720QM CPU and an NVIDIA GeForce GTX260M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Windows 7 driver: 257.21.


The results in Figs. 1a-1c agree very well. The variation of performance with matrix size suggests that advanced implementations are used. Also note that the results are almost identical whether two, five or ten matrices are multiplied, which makes it likely that the desired floating point operations are what is actually measured. The administrative overhead of loading data into registers etc. would be expected to differ somewhat depending on how many matrices are multiplied; since the results are virtually the same, the GFlops count is likely to hold.


Results For An Apple MacBook Pro With A GeForce GT330M GPU

The results from running the test on an Apple MacBook Pro with an Intel Core i7-620M CPU and an NVIDIA GeForce GT330M GPU are shown in the figures below. The peak floating point performance is 182 GFlops according to NVIDIA information.


Fig. 2a: Key data: Matrix multiply; K=2; GT330M; Measurement of floating point performance for multiplying two matrices using an Apple MacBook Pro with an Intel Core i7-620M CPU and an NVIDIA GeForce GT330M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Mac driver: 19.5.8f01
Fig. 2b: Key data: Matrix multiply; K=5; GT330M; Measurement of floating point performance for multiplying five matrices using an Apple MacBook Pro with an Intel Core i7-620M CPU and an NVIDIA GeForce GT330M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Mac driver: 19.5.8f01
Fig. 2c: Key data: Matrix multiply; K=10; GT330M; Measurement of floating point performance for multiplying ten matrices using an Apple MacBook Pro with an Intel Core i7-620M CPU and an NVIDIA GeForce GT330M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Mac driver: 19.5.8f01
...


Results For A Colfax CXT2000 With A Tesla C1060 GPU

The results from running the test on a Colfax CXT2000 with an Intel Core i7-975 CPU and an NVIDIA Tesla C1060 GPU are shown in the figure below. The peak floating point performance is 933 GFlops according to [1].

TO APPEAR SOON >>> Peak performance around 350 GFlops.



Results For A Colfax CXT2000 With A Tesla C2050 GPU

The results from running the test on a Colfax CXT2000 with an Intel Core i7-975 CPU and an NVIDIA Tesla C2050 GPU are shown in the figure below. The peak floating point performance is 1030 GFlops according to [2].

Fig. 4a: Key data: Matrix multiply; K=2; C2050; Measurement of floating point performance for multiplying two matrices using a Colfax CXT2000 with an Intel Core i7-975 Extreme CPU and an NVIDIA Tesla C2050 GPU. The benchmark is conducted in single precision. Jacket version: 1.4.0 (build 6080), NVIDIA driver: 258.96, NVIDIA Toolkit 3.1, OS: Windows 7 Enterprise x64



Results For A Colfax WS With A GeForce GTX470 GPU

The results from running the test on a Colfax WS with an Intel Core i7-920 CPU and an NVIDIA GeForce GTX470 GPU are shown in the figures below. The peak floating point performance was measured at 571 GFlops in single precision and 135 GFlops in double precision.


Fig. 5a: Measurement of single precision floating point performance by use of the SGeMM method using a Colfax WS with an Intel Core i7-920 CPU and an NVIDIA GeForce GTX470 GPU. The benchmark is conducted in single precision. Ubuntu Linux: 9.04, Jacket version: 1.4.1 (build 6737), NVIDIA driver: 256.40, CUDA Toolkit: 3.1.
Fig. 5b: Measurement of double precision floating point performance by use of the DGeMM method using a Colfax WS with an Intel Core i7-920 CPU and an NVIDIA GeForce GTX470 GPU. Ubuntu Linux: 9.04, Jacket version: 1.4.1 (build 6737), NVIDIA driver: 256.40, CUDA Toolkit: 3.1.



Element Wise Matrix Multiplication

When performing element-wise multiplication of two square matrices \mathbf{A} and \mathbf{B} the result is:


\mathbf{R} = \mathbf{A} .* \mathbf{B},\quad \mathbf{A},\mathbf{B} \in \mathbb{R}^{N\times N}

The number of floating point operations needed to perform this element-wise multiplication is as follows:

  1. Number of multiplications: N^2.
  2. Number of additions: 0

In total the number of floating point operations is:

F_{\rm floats} = N^2
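
A CPU-only sketch of the corresponding measurement is shown below. It assumes that a fixed repetition count is good enough for illustration; a real benchmark would use the adaptive timing of flopsMx2.m, and with Jacket the gsync and geval calls would be needed as well. The names and values are illustrative.

```matlab
% Sketch: GFlops for element-wise multiplication of two N x N matrices (CPU).
N     = 512;                        % illustrative matrix size
nRuns = 200;                        % fixed repetition count (illustrative)
A = randn(N, N, 'single');
B = randn(N, N, 'single');

R = A .* B;                         % warm-up
t0 = tic;
for n = 1:nRuns
    R = A .* B;
end
tPerOp = max(toc(t0) / nRuns, 2.5e-10);   % guard against a zero reading

Ffloats = N^2;                      % N^2 multiplications, no additions
GFlops  = Ffloats / (tPerOp * 1e9);
fprintf('N = %d: %.2f GFlops (element-wise)\n', N, GFlops);
```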


Fig. 3a: Key data: Matrix element multiply; K=2; GTX260M; Measurement of floating point performance for element multiplication of two matrices using an Asus G51J with an Intel Core i7-720QM CPU and an NVIDIA GeForce GTX260M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Windows 7 driver: 257.21.
Fig. 3b: Key data: Matrix element multiply; K=10; GTX260M; Measurement of floating point performance for element multiplication of ten matrices using an Asus G51J with an Intel Core i7-720QM CPU and an NVIDIA GeForce GTX260M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Windows 7 driver: 257.21.


Conclusions

More Information

  1. See Chris McClanahan's blog here and Chris' Wiki here.



Go Home: Torben's Corner

