Jacket Floating Point Performance (GFlops)
From Jacket Wiki
Most of what we do in Jacket boils down to floating point operations, and lots of them. This page shows two ways of measuring the floating point performance of the GPU based on matrix computations. The first is based on matrix multiplication and the second on element-wise matrix operations.
Abstract
To provide some recent results, see the following two figures for floating point performance. The first shows single precision GFLOPS and the second double precision GFLOPS. Note that for double precision data, the FX-3800 and the C1060 both come close to their theoretical maximum. Note also that the C2050 clearly outperforms the C1060, and in single precision by considerably more than the figures announced by NVIDIA would suggest.
Matrix Multiplications
The first approach to measuring floating point performance is based on matrix multiplication. First, the theory is briefly described, with a focus on finding an expression for the number of multiplications and additions when multiplying any number of square matrices. Next, the source code used to perform the benchmark is shown. Finally, some practical results are presented.
Theory
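When multiplying K square matrices $\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_K$, each of size $N \times N$, the result is:
$$\mathbf{R} = \mathbf{A}_1 \mathbf{A}_2 \cdots \mathbf{A}_K$$
The number of floating point operations needed to perform this matrix multiplication is as follows:
- Number of multiplications: $(K-1)\,N^3$.
- Number of additions: $(K-1)\,N^2 (N-1)$.
In total the number of floating point operations is:
$$F_{\text{floats}} = (K-1)\left(2N^3 - N^2\right)$$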
When measuring floating point performance it is essential to have as many arithmetic computations as possible per memory access (read/write). One of the better ways is the matrix multiplication shown above, where the number of floating point operations grows as $N^3$ while the amount of data read and written grows only as $N^2$. For element-wise operations, by contrast, both the number of floating point operations and the amount of data grow as $N^2$, which means that the measured floating point performance is significantly lower, simply because it is limited by memory access.
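To make the ratio argument concrete, the following MATLAB sketch tabulates floating point operations per matrix element moved for both kinds of product. It assumes that memory traffic can be approximated as K input matrices read once plus one result written once. That traffic model, the operation count used for a chain of element-wise products and all variable names are illustrative assumptions, not part of the original benchmark.

```matlab
% Flop count and flop-per-element ratio for K square N-by-N matrices.
% Assumption: memory traffic is approximated as K inputs read once plus
% one result written once; caches and intermediate products are ignored.
K = 2;                         % number of matrices in the product
N = [10 100 1000];             % matrix sizes to tabulate

flopsMatMul  = (K-1) .* (2*N.^3 - N.^2);   % matrix product chain
flopsElement = (K-1) .* N.^2;              % element-wise product chain (assumed)
memElements  = (K+1) .* N.^2;              % elements read plus written

for k = 1:numel(N)
    fprintf('N = %5d | matmul: %8.2f flop/element | element-wise: %4.2f flop/element\n', ...
        N(k), flopsMatMul(k)/memElements(k), flopsElement(k)/memElements(k));
end
```

Under these assumptions and with K = 2, the matrix product ratio is (2N-1)/3 and so grows roughly linearly with N, while the element-wise ratio stays at 1/3 regardless of N.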
Source Code
The source code is made in two parts: 1) a function to do the actual benchmark, and 2) a master file which sets up names of figures and produced data and the directory where to put the data. A number of functions are made where these are different in the number of matrices being multiplied. This could of course have been done in one file but here it has been chosen to have a function for each number of matrices to be multiplied, K.
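For comparison, a single routine covering any K could look roughly like the CPU-only sketch below, which holds the matrices in a cell array and multiplies them in a loop. The cell array A, the loop and the chosen values of K and N are hypothetical and are not taken from the flopsMxK files. The extra loop and indexing overhead, which may matter for small matrices, is one possible reason to prefer a separate hand-written function per K.

```matlab
% Sketch: multiply K square matrices held in a cell array (CPU only).
K = 5;  N = 64;                      % illustrative values
A = cell(1, K);
for k = 1:K
    A{k} = randn(N, N, 'single');
end

R = A{1};
for k = 2:K
    R = R * A{k};                    % K-1 matrix products in total
end

flops = (K-1) * (2*N^3 - N^2);       % operation count for the whole chain
fprintf('K = %d, N = %d: %d floating point operations\n', K, N, flops);
```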
An example of source code to do this analysis for K = 2 is shown immediately below. A few things should be noted in relation to this code:
- There is a sweep of matrix size from $N = 2$ to $N = 3390$, as set by Size in the master script.
- MATLAB is used to generate the matrices, which are then converted to Jacket GPU data with gsingle();. I have seen cases where the specific data values affect the speed of computation, strange but true. Converting the same MATLAB data with gsingle(); is therefore the safe way to perform the test.
- Averaging is used such that smaller matrices are repeated more often than large matrices. The reason is that a timed run must take a significant amount of time for tic-toc to give repeatable results. The code uses a while-end construct to ensure that the total time used for each benchmark point is at least Tmin. A value of 0.5 second has given good and repeatable results; the listing below sets Tmin to 1 second.
- Both the CPU (meaning standard MATLAB) and GPU (meaning Jacket) runs are warmed up by running the matrix multiplication twice before the actual test.
- Since memory problems were seen with Jacket 1.3RC2 (and later) it was chosen to use the trick "clear gpu_hook" to free up memory.
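The repeat-until-Tmin idea from the list above can be sketched in isolation as follows. This is a simplified CPU-only illustration, not the method used in flopsMx2.m: it doubles the repetition count on each pass instead of estimating it from a first measurement, and the short Tmin, the matrix size and the variable names are assumptions chosen so that the sketch runs quickly.

```matlab
% Sketch of the repeat-until-Tmin idea for a CPU-only matrix product.
Tmin = 0.05;                          % short minimum time, for illustration only
N    = 128;
A = randn(N, N, 'single');
B = randn(N, N, 'single');

R = A * B;                            % warm-up run, not timed
noRuns = 1;
Telap  = 0;
while Telap < Tmin
    t0 = tic;
    for no = 1:noRuns
        R = A * B;
    end
    Telap = toc(t0);
    if Telap < Tmin
        noRuns = 2 * noRuns;          % too fast to time reliably: repeat more
    end
end

Tsingle = Telap / noRuns;             % average time of one multiplication
GFlops  = (2*N^3 - N^2) / (Tsingle * 1e9);
fprintf('N = %d: %d runs, %.1f GFlops\n', N, noRuns, GFlops);
```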
With these few issues in mind the benchmarking function named flopsMx2.m is quite simply:
function [] = flopsMx2( Size, Dname, Fname, TitleStr, State )
% flopsMx2 GFlops count by use of matrix multiplication
%% GFlops benchmark based on multiplication of two matrices.
% (C) Torben Larsen, Aalborg University, 24-JUL-2010
% E-mail: tl.jacket@es.aau.dk
% http://wiki.accelereyes.com/wiki/index.php/Torben%27s_Corner
%

% Minimum execution time for the individual benchmark point
Tmin = 1;
% Max. number of repetitions in loop timing estimation
MaxAvg = 1E9;

%% INITIALIZE VECTORS
Mem_MB = zeros(length(Size),1);
Fname = ['flopsMx2_' Fname];

%% PREALLOCATE ARRAYS ETC. - IF "RESUME" IS USED THEN LOAD DATA
% If State==RESUME, data is loaded from the existing file for the given
% benchmark and continued from where it came to.
if strcmp(State,'RESUME')
    load([ Dname '/' Fname '.mat']);
    ii = length(find(GFlops_cpu>0));
    SizeN = Size(ii+1:end);
else
    GFlops_cpu = zeros(length(Size),1);
    GFlops_gpu = zeros(length(Size),1);
    T_CPU = zeros(length(Size),1);
    T_CPU_tot = zeros(length(Size),1);
    T_GPU = zeros(length(Size),1);
    T_GPU_tot = zeros(length(Size),1);
    ii = 0;
    SizeN = Size;
end

%% PERFORM ANALYSIS
for N=SizeN
    ii = ii + 1;
    % Define matrices
    Ac = randn(N,N,'single');
    Bc = randn(N,N,'single');
    % Print matrix size
    fprintf('%4.0f / %4.0f', N, max(Size));

    % CPU test begin --------------------------------------------------
    whilecount = 0;
    Telap_cpu = -1;
    while Telap_cpu < Tmin
        whilecount = whilecount + 1;
        if Telap_cpu == -1
            t1 = tic;
            Rc = Ac*Bc;     %%% HERE %%%
            Rc = Ac*Bc;     %%% HERE %%%
            Telap_cpu = toc(t1)/2;
            NoRunsCPU = ceil(1.5*Tmin/Telap_cpu);
        else
            NoRunsCPU = ceil(1.5*whilecount*NoRunsCPU/Telap_cpu*Tmin);
        end
        % Warm-up
        for no=1:NoRunsCPU
            Rc = Ac*Bc;     %%% HERE %%%
        end
        % Benchmark
        tstart1 = tic;
        for no=1:NoRunsCPU
            Rc = Ac*Bc;     %%% HERE %%%
        end
        Telap_cpu = toc(tstart1);
    end
    % Determine time for CPU loop alone
    RPT = min(5E3,ceil(MaxAvg/NoRunsCPU));
    tstart = tic;
    for AvgNo=1:RPT
        for no=1:NoRunsCPU
        end
    end
    T_CPU_Loop = toc(tstart)/RPT;
    % Compute CPU times
    T_CPU(ii) = max((Telap_cpu-T_CPU_Loop)/NoRunsCPU,2.5E-10);
    T_CPU_tot(ii) = Telap_cpu;
    fprintf(' | T_CPU: %6.1f,', T_CPU_tot(ii));
    GFlops_cpu(ii) = (2*N^3-N^2)/(T_CPU(ii)*1E9);
    fprintf(' %7.1f [GFlops]', GFlops_cpu(ii));
    % CPU test end ----------------------------------------------------

    % GPU test begin --------------------------------------------------
    Ag = gsingle(Ac);
    Bg = gsingle(Bc);
    geval(Ag,Bg);
    whilecount = 0;
    Telap_gpu = -1;
    while Telap_gpu < Tmin
        whilecount = whilecount + 1;
        if Telap_gpu == -1
            gsync;
            t1 = tic;
            Rg = Ag*Bg;     %%% HERE %%%
            geval(Rg);
            Rg = Ag*Bg;     %%% HERE %%%
            geval(Rg);
            gsync;
            Telap_gpu = toc(t1)/2;
            NoRunsGPU = ceil(1.5*Tmin/Telap_gpu);
        else
            NoRunsGPU = ceil(1.5*whilecount*NoRunsGPU/Telap_gpu*Tmin);
        end
        % Warm-up
        gsync;
        for no=1:NoRunsGPU
            Rg = Ag*Bg;     %%% HERE %%%
            geval(Rg);
        end
        % Benchmark
        gsync;
        tstart1 = tic;
        for no=1:NoRunsGPU
            Rg = Ag*Bg;     %%% HERE %%%
            geval(Rg);
        end
        gsync;
        Telap_gpu = toc(tstart1);
    end
    % Determine time for GPU loop alone
    RPT = min(5E3,ceil(MaxAvg/NoRunsGPU));
    tstart = tic;
    for AvgNo=1:RPT
        for no=1:NoRunsGPU
        end
    end
    T_GPU_Loop = toc(tstart)/RPT;
    % Compute GPU times
    T_GPU(ii) = max((Telap_gpu-T_GPU_Loop)/NoRunsGPU,2.5E-10);
    T_GPU_tot(ii) = Telap_gpu;
    fprintf(' | T_GPU: %6.1f,', T_GPU_tot(ii));
    GFlops_gpu(ii) = (2*N^3-N^2)/(T_GPU(ii)*1E9);
    fprintf(' %7.1f [GFlops]', GFlops_gpu(ii));
    gpu_info = gpu_entry(13);
    Mem_MB(ii) = gpu_info.gpu_free/1E6;
    clear gpu_hook;
    fprintf(' | Mem free [MB]: %6.1f', Mem_MB(ii));
    % GPU test end ----------------------------------------------------

    % Print *** as a warning for simulation time violation
    % (should not be possible unless something spooky is going on)
    if T_CPU_tot(ii)>=Tmin && T_GPU_tot(ii)>=Tmin
        fprintf('\n');
    else
        fprintf(' ***\n');
    end

    % Save data and plot for every 10 data points
    if ii/10==floor(ii/10)
        save([ Dname '/' Fname '.mat'], 'Size', ...
            'T_CPU', 'T_CPU_tot', 'GFlops_cpu', ...
            'T_GPU', 'T_GPU_tot', 'GFlops_gpu');
        figure(1); clf(1);
        plot((2*Size(1:ii).^3-Size(1:ii).^2)/1E9, GFlops_cpu(1:ii), 'r-', ...
            (2*Size(1:ii).^3-Size(1:ii).^2)/1E9, GFlops_gpu(1:ii), 'g-', ...
            'Linewidth',1.5);
        grid;
        xlabel('Complexity [GFlop]');
        ylabel('Performance [GFlops]');
        legend('CPU', 'GPU', 'Location', 'SouthEast');
        title(['Mx2: ' TitleStr]);
        % Save figure
        print( gcf, '-djpeg99', '-r100', [ Dname '/' Fname '.jpg'] );
        print( gcf, '-depsc2', '-r2400', [ Dname '/' Fname '.eps'] );
    end
end
end
and the master file to run the function is for example Asus_flopsMx2.m with the content:
%% MASTER SCRIPT FOR GFLOPS COUNT
% IMPORTANT NOTE: The name of the benchmark file MUST be like:
%                 Name_FCT.m where FCT is the name of the function
%                 to be benchmarked (all capital letters) - e.g. BESSELJ.
%                 A full name could be: AsusG51J_BESSELJ.m or
%                 Asus_G51J_BESSELJ.
%
% Platform:       Asus G51J
% CPU:            Intel Core i7-720QM 1.6GHz
% CPU GFlops:     25.6 (http://www.intel.com/support/processors/sb/cs-023143.htm)
% CPU mem.:       4 GB
% GPU:            NVIDIA GTX260M
% GPU GFlops:     462 (http://www.nvidia.com/object/product_geforce_gtx_260m_us.html)
% GPU mem.:       1024 MB
% Operating sys.: Microsoft Windows 7 x64
% Jacket ver.:    1.4.0 (build 6080)
% NVIDIA driver:  257.21
% CUDA Toolkit:   3.1

%% SET UP INPUT DATA
% Set number of threads to 1
maxNumCompThreads(1);
% Matrix size
Size = [2:1:3390];
% Core name of plot and data files
Fname = 'Asus_G51J_01';
Dname = './Asus.GTX260M';
% Name of title in plot
TitleStr = 'Asus G51J: Core i7-720QM (25.6 GFlops) & GTX260M (462 GFlops)';
% Computation type;
%   State = 'BENCH' for cold start
%   State = 'RESUME' for continuation of computations
State = 'BENCH';
% Perform computation
flopsMx2( Size, Dname, Fname, TitleStr, State );
Place the function file flopsMx2.m and the master script Asus_flopsMx2.m in the same directory. A directory named ./Asus.GTX260M (the value of Dname in the master script) must also exist, since this is where the execution data and figures are stored.
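If preferred, the output directory can be created from MATLAB before the master script is run. The following is a minimal sketch using the standard exist and mkdir functions; it assumes the directory name is the value of Dname in the master script above and that it is run from the directory holding the two files.

```matlab
% Create the output directory used by the master script if it is missing.
Dname = './Asus.GTX260M';             % same value as Dname in the master script
if ~exist(Dname, 'dir')
    mkdir(Dname);
end
```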
The complete list of source code files can be found here:
- K=2: Function file: flopsMx2. Master file: Asus_GTX260M_flopsMx2.m.
- K=5: Function file: flopsMx5. Master file: Asus_GTX260M_flopsMx5.m.
- K=10: Function file: flopsMx10. Master file: Asus_GTX260M_flopsMx10.m.
- K=20: Function file: flopsMx20. Master file: Asus_GTX260M_flopsMx20.m.
- K=40: Function file: flopsMx40. Master file: Asus_GTX260M_flopsMx40.m.
The function files use quite advanced timing techniques. More information on these principles can be found here.
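One of the timing steps in flopsMx2.m, the subtraction of the time spent in the for loop itself, can be illustrated on its own. The CPU-only sketch below follows the same pattern (time the loop with the multiplication, time an empty loop, subtract, and floor the result at 2.5E-10 s), but the matrix size, repetition counts and variable names are assumptions chosen for a quick run. Depending on the MATLAB version, the empty loop may be optimised so that its measured time is close to zero.

```matlab
% Sketch: estimate empty-loop overhead and subtract it from a timed loop.
N = 8;  noRuns = 20000;               % small matrices, many repetitions
A = randn(N, N, 'single');
B = randn(N, N, 'single');

t0 = tic;
for no = 1:noRuns
    R = A * B;
end
Telap = toc(t0);                      % time including loop overhead

RPT = 50;                             % repetitions of the empty loop
t0 = tic;
for avgNo = 1:RPT
    for no = 1:noRuns
    end
end
Tloop = toc(t0) / RPT;                % average time of one empty loop

% Floor the result at 2.5E-10 s, as flopsMx2.m does, so that noise can
% never produce a zero or negative time per multiplication.
Tsingle = max((Telap - Tloop) / noRuns, 2.5e-10);
fprintf('Loop overhead: %.3g s of %.3g s total\n', Tloop, Telap);
```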
Results For An Asus G51J With A GeForce GTX260M GPU
The results from running the test on an Asus G51J with an Intel Core i7-720QM CPU and an NVIDIA GeForce GTX260M GPU are shown in the figures below. The peak floating point performance is 462 GFlops according to NVIDIA.
| Fig. 1a: Key data: Matrix multiply; K=2; GTX260M; Measurement of floating point performance for multiplying two matrices using an Asus G51J with an Intel Core i7-720QM CPU and an NVIDIA GeForce GTX260M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Windows 7 driver: 257.21. | Fig. 1b: Key data: Matrix multiply; K=5; GTX260M; Measurement of floating point performance for multiplying five matrices using an Asus G51J with an Intel Core i7-720QM CPU and an NVIDIA GeForce GTX260M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Windows 7 driver: 257.21. |
| Fig. 1c: Key data: Matrix multiply; K=10; GTX260M; Measurement of floating point performance for multiplying ten matrices using an Asus G51J with an Intel Core i7-720QM CPU and an NVIDIA GeForce GTX260M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Windows 7 driver: 257.21. | Fig. 1d: Key data: Matrix multiply; K=20; GTX260M; Measurement of floating point performance for multiplying twenty matrices using an Asus G51J with an Intel Core i7-720QM CPU and an NVIDIA GeForce GTX260M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Windows 7 driver: 257.21. |
The results from Figs. 1-3 agree very well. The way the performance varies with matrix size suggests that advanced implementations are used for the multiplication. Also note that the results are almost identical whether two, five or ten matrices are multiplied, which makes it likely that it is the desired floating point operations that are being measured. The administrative overhead of loading data into registers etc. would be expected to differ somewhat with the number of matrices multiplied, so since the results are virtually the same, the GFlops count is likely to hold.
Results For An Apple MacBook Pro With A GeForce GT330M GPU
The results from running the test on an Apple MacBook Pro with an Intel Core i7-620M CPU and an NVIDIA GeForce GT330M GPU are shown in the figures below. The peak floating point performance is 182 GFlops according to NVIDIA.
| Fig. 2a: Key data: Matrix multiply; K=2; GT330M; Measurement of floating point performance for multiplying two matrices using an Apple MacBook Pro with an Intel Core i7-620M CPU and an NVIDIA GeForce GT330M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Mac driver: 19.5.8f01 | Fig. 2b: Key data: Matrix multiply; K=5; GT330M; Measurement of floating point performance for multiplying five matrices using an Apple MacBook Pro with an Intel Core i7-620M CPU and an NVIDIA GeForce GT330M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Mac driver: 19.5.8f01 |
| Fig. 2c: Key data: Matrix multiply; K=10; GT330M; Measurement of floating point performance for multiplying ten matrices using an Apple MacBook Pro with an Intel Core i7-620M CPU and an NVIDIA GeForce GT330M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Mac driver: 19.5.8f01 | ... |
Results For A Colfax CXT2000 With A Tesla C1060 GPU
The results from running the test on a Colfax CXT2000 with an Intel Core i7-975 CPU and an NVIDIA Tesla C1060 GPU are shown in the figure below. The peak floating point performance is 933 GFlops according to [1].
TO APPEAR SOON >>> Peak performance around 350 GFlops.
Results For A Colfax CXT2000 With A Tesla C2050 GPU
The results from running the test on a Colfax CXT2000 with an Intel Core i7-975 CPU and an NVIDIA Tesla C2050 GPU are shown in the figure below. The peak floating point performance is 1030 GFlops according to [2].
Results For A Colfax WS With A GeForce GTX470 GPU
The results from running the test on a Colfax WS with an Intel Core i7-920 CPU and an NVIDIA GeForce GTX470 GPU are shown in the figure below. The peak single precision floating point performance was measured at 571 GFlops, and the double precision floating point performance at 135 GFlops.
| Fig. 5a: Measurement of single precision floating point performance by use of the SGeMM method using a Colfax WS with an Intel Core i7-920 CPU and an NVIDIA GeForce GTX470 GPU. The benchmark is conducted in single precision. Ubuntu Linux: 9.04, Jacket version: 1.4.1 (build 6737), NVIDIA driver: 256.40, CUDA Toolkit: 3.1. |
Element Wise Matrix Multiplication
When performing element-wise multiplication of two square matrices $\mathbf{A}$ and $\mathbf{B}$, each of size $N \times N$, the result is:
$$\mathbf{R} = \mathbf{A} \circ \mathbf{B}, \qquad R_{ij} = A_{ij} B_{ij}$$
The number of floating point operations needed to perform this multiplication is as follows:
- Number of multiplications: $N^2$.
- Number of additions: 0.
In total the number of floating point operations is:
$$F_{\text{floats}} = N^2$$
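A CPU-only sketch of how this operation count turns into a GFlops figure is given below. It is a deliberately simplified illustration with a fixed number of repetitions, not the benchmark code behind the figures, and the matrix size, repetition count and variable names are assumptions.

```matlab
% Sketch: CPU-only GFlops estimate for an element-wise product of two matrices.
N = 512;  noRuns = 200;
A = randn(N, N, 'single');
B = randn(N, N, 'single');

R = A .* B;                           % warm-up run, not timed
t0 = tic;
for no = 1:noRuns
    R = A .* B;                       % N^2 multiplications, no additions
end
Tsingle = toc(t0) / noRuns;

flops  = N^2;                         % operation count from the text above
GFlops = flops / (Tsingle * 1e9);
fprintf('N = %d: %.2f GFlops (element-wise)\n', N, GFlops);
```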
| Fig. 3a: Key data: Matrix element multiply; K=2; GTX260M; Measurement of floating point performance for element multiplication of two matrices using an Asus G51J with an Intel Core i7-720QM CPU and an NVIDIA GeForce GTX260M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Windows 7 driver: 257.21. | Fig. 3b: Key data: Matrix element multiply; K=10; GTX260M; Measurement of floating point performance for element multiplication of ten matrices using an Asus G51J with an Intel Core i7-720QM CPU and an NVIDIA GeForce GTX260M GPU. The benchmark is conducted in single precision. Jacket version: 1.4RC2, NVIDIA Windows 7 driver: 257.21. |
Conclusions
More Information
Go Home: Torben's Corner