Benchmarking Jacket Applications
Being able to benchmark the developed applications is essential to all use of Jacket. There are different ways to benchmark, of which profiling (gprofile) is one. However, using tic-toc is also very useful. When the elapsed time is very small, there is usually quite a bit of variation in the results. In these cases it is often an advantage to include a for-end loop that repeats the computations so that the elapsed time becomes longer. This tends to produce reproducible results, which is obviously very important in order to trust them. The use of tic-toc is described in further detail in the following.
Basic Procedure
Methodology
The general procedure involves a few steps. First of all it is necessary to warm up Jacket and CUDA (described in further detail elsewhere on this wiki). Then a time mark must be set, and the computations must be done. After that it is essential to force the computations to be carried out on the GPU, because Jacket evaluates 'lazily' and may otherwise postpone the work until after the timer has stopped. Finally, another time stamp is set to find the elapsed (wall-clock) time. The procedure can be outlined as follows:
% WARM UP JACKET AND CUDA     % This is essential to create reproducible results
gsync;                        % Finalize pending GPU activity, and synchronize CPU and GPU
T1 = tic;                     % Set mark for start of computations
for counter=1:NoRuns          % Repeat computations if one computation is short in time duration
    % PERFORM COMPUTATIONS
    geval(VARIABLES);         % Force computation of result on the GPU
end
gsync;                        % Finalize GPU activity and synchronize the CPU and GPU
Tcomp = toc(T1)/NoRuns;       % Time to do one set of computations
When a single sequence of the computations has a very short duration, reproducibility improves if a for-end loop is used, as indicated above, to repeat the computations a number of times. In my experience the overhead of the loop itself is generally very small. This time can of course also be subtracted if desired.
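As a minimal sketch of how the loop overhead could be subtracted, the following plain MATLAB (CPU only) script times an empty loop with the same number of runs and removes its per-iteration time. The computation is an arbitrary placeholder, and the names Tloop and TcompCorrected are chosen here for illustration only:

```matlab
% Sketch: estimate and subtract the for-end loop overhead (CPU only).
NoRuns = 1000;                 % Number of repetitions (illustrative value)
A = rand(50, 'single');        % Placeholder data

% Time the computations inside the loop
T1 = tic;
for counter = 1:NoRuns
    R = A.^2 + A;              % Placeholder computation
end
Tcomp = toc(T1)/NoRuns;        % Time per run, including loop overhead

% Time an empty loop with the same number of runs
T2 = tic;
for counter = 1:NoRuns
end
Tloop = toc(T2)/NoRuns;        % Loop overhead per iteration

% Subtract the overhead; clamp at zero to guard against timer noise
TcompCorrected = max(Tcomp - Tloop, 0);
fprintf('Per run: %g s, loop overhead: %g s, corrected: %g s\n', ...
        Tcomp, Tloop, TcompCorrected);
```

For GPU timing the same idea applies, with geval inside the first loop and gsync before tic and toc as in the outline above.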
Example 1
This subsection describes a timing example where the functionality is contained in a Jacket/MATLAB function. The master .m file is given by:
%% Step 0: User defined settings
Sz = 500;        % Matrix size
NoRuns = 15;     % Number of runs to increase reproducibility

%% Step 1: Warm-up the GPU by calling benchJKT with a small array size
[ dummy ] = benchJKT( 5, 2, 1 );

%% Step 2: Do the computations for both CPU and GPU computations
% Get time when using the GPU
[ T_GPU ] = benchJKT( Sz, NoRuns, 1 );
% Get time when using the CPU
[ T_CPU ] = benchJKT( Sz, NoRuns, -1 );

%% Step 3: Display results
fprintf('===========================================================\n');
fprintf('CPU run time: %10.3f\n', T_CPU);
fprintf('GPU run time: %10.3f\n', T_GPU);
fprintf('Speed-up: %10.3f\n', T_CPU/T_GPU);
fprintf('===========================================================\n');
As seen above the same function benchJKT is used for both the CPU and the GPU computations - the only difference is a GPU switch, which tells the function to either use the CPU or the GPU. Having the functionality in Jacket/MATLAB functions is an advantage for different reasons - it makes it easier to avoid errors as the basic equations are only coded once, and it makes it easier to warm up the functions. The function benchJKT is described as:
function [ Telapsed ] = benchJKT( Sz, NoRuns, GPU )
%benchJKT Example of function to benchmark CPU and GPU runs.

% Step 1: Preallocate variables
% (the timer is started first, so array generation is included in the timing)
gsync;
Tstart = tic;                      % Start the timer
if GPU>0,
    A = grand(Sz,Sz,'single');
    B = grand(Sz,Sz,'single');
    C = grand(Sz,Sz,'single');
    D = grand(Sz,Sz,'single');
    R = grand(Sz,Sz,'single');
else
    A = rand(Sz,Sz,'single');
    B = rand(Sz,Sz,'single');
    C = rand(Sz,Sz,'single');
    D = rand(Sz,Sz,'single');
    R = rand(Sz,Sz,'single');
end
geval(A, B, C, D, R);

% Step 2: Make computations
for counter=1:NoRuns               % Repeat the computations
    R = A.^7 * B.^11;              % Processing
    R = R.^3 ./ C.^2;              % Processing
    R = R + D.^2;                  % Processing
    geval(R);                      % Force Jacket to compute R
end
gsync;
Telapsed = toc(Tstart)/NoRuns;     % Stop the timer and compute elapsed time
end
Running the above code on the Reference System #3 (with the GeForce 9400M GPU) results in the following:
>> master_benchJKT
===========================================================
CPU run time: 0.159
GPU run time: 0.041
Speed-up: 3.852
===========================================================
>> master_benchJKT
===========================================================
CPU run time: 0.157
GPU run time: 0.043
Speed-up: 3.629
===========================================================
>> master_benchJKT
===========================================================
CPU run time: 0.157
GPU run time: 0.043
Speed-up: 3.657
===========================================================
>>
Advanced Concepts
Methodology
When doing more advanced benchmarks such as the ones presented for the JPI (Jacket Performance Index) on Torben's Corner there are more things to consider.
- Computer: When doing benchmarks the computer should of course be left alone, so that the operating system does not have to handle other, disturbing requests. Be sure to disable any power saving features. For laptops, always run with the charger connected; otherwise it is typical to see an immediate drop in performance as soon as the charger is disconnected.
- Multithreading (Hyper Threading): MATLAB uses multithreaded computation by default on multi-core systems. Unless the benchmark is meant to show precisely this, be sure to restrict MATLAB to a single computational thread. At the moment this can be done by issuing the command maxNumCompThreads(1);. Note that this limits the number of computational threads MATLAB uses; it does not switch off the Hyper-Threading feature of the CPU itself. In later versions of MATLAB this possibility may be removed, and the setting must then be made in MATLAB preferences, where it is effective for the entire MATLAB session.
- Repetitions: When running benchmarks, a single run usually takes far less time for small sweep parameters (e.g. small matrix sizes) than for large sweep parameters (e.g. large matrix sizes). A typical approach is to average over a certain fixed number of repetitions, e.g. 20, 50 or 100. This is however not a good approach in my experience. The critical issue is to avoid timing short events, since in that case the results tend not to be reproducible. I have tried several approaches to setting the number of repetitions and have found the following to be the best: the number of repetitions is set iteratively such that the timed event always lasts at least Tmin seconds. Using for example Tmin = 1 ensures in my experience that the results are reproducible.
- Warming up: To reach reproducible results it is necessary to warm up both MATLAB and Jacket. The best way to do this is simply to perform precisely the same operation (with the same number of iterations) as the event that is later to be timed. By doing this it is also possible to stop and resume the benchmark with about the same results as when the benchmark is not stopped.
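The iterative choice of the number of repetitions can be sketched in a compact form as follows. This is a simplified illustration in plain MATLAB, not the code used for the JPI benchmarks; the function name time_at_least and its arguments are hypothetical. It warms up with the same number of iterations as the timed loop and increases the number of runs until the timed loop lasts at least Tmin seconds:

```matlab
function [tPerCall, nRuns] = time_at_least(fcn, Tmin)
% time_at_least  Hypothetical helper: time fcn() so that the timed loop
%   lasts at least Tmin seconds. Returns the time per call and the number
%   of runs used in the final (accepted) timing.
nRuns   = 1;
elapsed = 0;
while elapsed < Tmin
    % Warm-up: same operation and same number of iterations as timed below
    for k = 1:nRuns
        fcn();
    end
    % Timed loop
    t0 = tic;
    for k = 1:nRuns
        fcn();
    end
    elapsed = toc(t0);
    if elapsed < Tmin
        % Too short: scale the number of runs with a 50% margin,
        % and at least double it so the loop always makes progress
        nRuns = max(2*nRuns, ceil(1.5*nRuns*Tmin/max(elapsed, eps)));
    end
end
tPerCall = elapsed/nRuns;
end
```

A call could look like [t, n] = time_at_least(@() besselj(1/pi, randn(200)), 1);. For GPU timing the function handle would also have to force evaluation with geval, and the timed loop would have to be bracketed by gsync calls as in the basic procedure; this is left out so that the sketch runs in plain MATLAB.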
So to sum up, the following features would be an advantage:
- It should be possible to set a minimum time that the benchmark will run to increase the chance of producing reproducible results. The benchmark should by itself meet this requirement for minimum benchmarking time.
- It should be possible to start/stop/resume the benchmark without creating artifacts when resuming the benchmark.
- Data should frequently be saved to disk in case of a crash or simply to be able to resume the benchmark later.
- Plots of the results should be shown during execution - it is then possible to terminate the execution if some strange behavior is seen or if some mistake in the script is made.
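The save and resume requirements can be illustrated with a small sketch. The function name sweep_with_checkpoint, the variable names and the single-shot matrix product timing are all placeholders chosen for this example; a real benchmark would use repeated timing with a minimum duration as described above. Points not yet measured are marked by zeros, which is the same convention the benchmark code uses to find where to resume:

```matlab
function results = sweep_with_checkpoint(sizes, fileName, state)
% sweep_with_checkpoint  Hypothetical sketch of a resumable sweep.
%   state = 'FULL' starts from scratch, state = 'RESUME' continues from
%   the data stored in fileName (a .mat file).
if strcmp(state, 'RESUME') && exist(fileName, 'file')
    S = load(fileName, 'results');
    results = S.results;
else
    results = zeros(numel(sizes), 1);   % Zero marks "not measured yet"
end

first = find(results == 0, 1);          % First point still to be measured
if isempty(first)
    return;                             % Nothing left to do
end

for ii = first:numel(sizes)
    N = sizes(ii);
    A = rand(N, 'single');
    t0 = tic;
    B = A*A; %#ok<NASGU>                % Placeholder for the timed event
    % Keep the value strictly positive so it is never mistaken for
    % "not measured"
    results(ii) = max(toc(t0), realmin);
    % Save for every 10 data points and at the very end
    if mod(ii, 10) == 0 || ii == numel(sizes)
        save(fileName, 'results', 'sizes');
    end
end
end
```

If the run is interrupted, calling the function again with state = 'RESUME' continues after the last saved point, so at most the points since the previous save are measured again.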
Source Code
As one example the following shows the function code to do a benchmark as well as the source code for the master file to set up the data. The key file is the function file, which in this example is named bench_BESSELJ.m:
function [] = bench_BESSELJ( Size, Dname, Fname, TitleStr, State )
% bench_BESSELJ Generic benchmark file for matrix based benchmarks
%   Name of the function file must be bench_FCT where FCT is the specific
%   name of the function to be benchmarked. For example a valid function name
%   could be bench_BESSELJ.m or bench_MULTIPLY.m.
%
%   (C) Torben Larsen, Aalborg University, 23-JUL-2010
%   E-mail: tl.jacket@es.aau.dk
%   http://wiki.accelereyes.com/wiki/index.php/Torben%27s_Corner
%

% Minimum execution time for the individual benchmark point
Tmin = 1;

% Max. number of repetitions in loop timing estimation. The max.
% number that can be handled in loops is MaxAvg = 2147483647.
MaxAvg = 1E9;

%% INITIALIZE VECTORS ETC.
% Vector of available GPU memory
Mem_MB = zeros(length(Size),1);

% Extract function name from the function m-file name - update Fname
S = regexp(mfilename, '_', 'split');
FCT = char(S(2));
Fname = [FCT '_' Fname];

%% LOAD DATA/PLOT OR PERFORM BENCHMARK
if strcmp(State,'PLOT')
    % Load data to define:
    % Size, Speedup, FCT, T_CPU, T_CPU_tot, T_GPU, T_GPU_tot
    load([ Dname '/' Fname '.mat']);
    bench_plot( Size, Speedup, FCT, T_CPU, T_GPU, TitleStr )
    return;
end

%% PREALLOCATE ARRAYS ETC. - IF "RESUME" IS USED THEN LOAD DATA
% If State==RESUME, data is loaded from the existing file for the given
% benchmark and continued from where it came to.
if strcmp(State,'RESUME')
    load([ Dname '/' Fname '.mat']);
    ii = length(find(T_CPU>0));
    SizeN = Size(ii+1:end);
else
    Speedup = zeros(length(Size),1);
    T_CPU = zeros(length(Size),1);
    T_CPU_tot = zeros(length(Size),1);
    T_GPU = zeros(length(Size),1);
    T_GPU_tot = zeros(length(Size),1);
    ii = 0;
    SizeN = Size;
end

%% PERFORM ANALYSIS
for N=SizeN
    ii = ii + 1;

    % Set PRNG to ensure same starting state for reproducibility
    RandStream.setDefaultStream(RandStream('mt19937ar','seed',1004397));

    % Define arrays
    Ac = randn(N,N,'single') + 1j*randn(N,N,'single'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%

    % Print current matrix size
    fprintf('%4.0f / %4.0f', N, max(Size));

    % CPU test begin --------------------------------------------------
    whilecount = 0;
    Telap_cpu = -1;
    while Telap_cpu < Tmin
        whilecount = whilecount + 1;
        if Telap_cpu == -1
            t1 = tic;
            Rc = besselj(1/pi,Ac); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            Rc = besselj(1/pi,Ac); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            Telap_cpu = toc(t1)/2;
            NoRunsCPU = ceil(1.5*Tmin/Telap_cpu);
        else
            NoRunsCPU = ceil(1.5*whilecount*NoRunsCPU/Telap_cpu*Tmin);
        end

        % Warm-up
        for no=1:NoRunsCPU
            Rc = besselj(1/pi,Ac); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
        end

        % Benchmark
        tstart1 = tic;
        for no=1:NoRunsCPU
            Rc = besselj(1/pi,Ac); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
        end
        Telap_cpu = toc(tstart1);
    end

    % Determine time for CPU loop alone
    RPT = min(5E3,ceil(MaxAvg/NoRunsCPU));
    tstart = tic;
    for AvgNo=1:RPT
        for no=1:NoRunsCPU
        end
    end
    T_CPU_Loop = toc(tstart)/RPT;

    % Compute CPU times
    T_CPU(ii) = max((Telap_cpu-T_CPU_Loop)/NoRunsCPU,2.5E-10);
    T_CPU_tot(ii) = Telap_cpu;
    fprintf(' | T_CPU: %6.1f,', T_CPU_tot(ii));
    % CPU test end --------------------------------------------------

    % GPU test begin --------------------------------------------------
    Ag = gsingle(Ac);
    geval(Ag);
    whilecount = 0;
    Telap_gpu = -1;
    while Telap_gpu < Tmin
        whilecount = whilecount + 1;
        if Telap_gpu == -1
            gsync;
            t1 = tic;
            Rg = besselj(1/pi,Ag); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            geval(Rg);
            Rg = besselj(1/pi,Ag); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            geval(Rg);
            gsync;
            Telap_gpu = toc(t1)/2;
            NoRunsGPU = ceil(1.5*Tmin/Telap_gpu);
        else
            NoRunsGPU = ceil(1.5*whilecount*NoRunsGPU/Telap_gpu*Tmin);
        end

        % Warm-up
        gsync;
        for no=1:NoRunsGPU
            Rg = besselj(1/pi,Ag); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            geval(Rg);
        end

        % Benchmark
        gsync;
        tstart1 = tic;
        for no=1:NoRunsGPU
            Rg = besselj(1/pi,Ag); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            geval(Rg);
        end
        gsync;
        Telap_gpu = toc(tstart1);
    end

    % Determine time for GPU loop alone
    RPT = min(5E3,ceil(MaxAvg/NoRunsGPU));
    tstart = tic;
    for AvgNo=1:RPT
        for no=1:NoRunsGPU
        end
    end
    T_GPU_Loop = toc(tstart)/RPT;

    % Compute GPU times
    T_GPU(ii) = max((Telap_gpu-T_GPU_Loop)/NoRunsGPU,2.5E-10);
    T_GPU_tot(ii) = Telap_gpu;
    fprintf(' T_GPU: %6.1f', T_GPU_tot(ii));

    % Speed-up
    Speedup(ii) = T_CPU(ii)/T_GPU(ii);
    fprintf(' | Speed-up: %6.1f', Speedup(ii));

    % Memory
    gpu_info = gpu_entry(13);
    Mem_MB(ii) = gpu_info.gpu_free/1E6;
    clear gpu_hook;
    fprintf(' | Mem free [MB]: %6.1f', Mem_MB(ii));
    % GPU test end --------------------------------------------------

    % Print *** as a warning for simulation time violation
    % (should not be possible unless something spooky is going on)
    if T_CPU_tot(ii)>=Tmin && T_GPU_tot(ii)>=Tmin
        fprintf('\n');
    else
        fprintf(' ***\n');
    end

    % Save data and plot for every 10 data points
    if ii/10==floor(ii/10)
        save([ Dname '/' Fname '.mat'], 'Size', 'Speedup', 'FCT', ...
            'T_CPU', 'T_CPU_tot', 'T_GPU', 'T_GPU_tot');
        % Plot data
        bench_plot( Size, Speedup, FCT, T_CPU, T_GPU, TitleStr );
    end
end

%% PLOT RESULTS, SAVE FIGURE AND DATA
print( gcf, '-djpeg99', '-r100', [ Dname '/' Fname '.jpg'] );
print( gcf, '-depsc2', '-r2400', [ Dname '/' Fname '.eps'] );
save([ Dname '/' Fname '.mat'], 'Size', 'Speedup', 'FCT', ...
    'T_CPU', 'T_CPU_tot', 'T_GPU', 'T_GPU_tot');
end

function [] = bench_plot( Size, Speedup, FCT, T_CPU, T_GPU, TitleStr )
% bench_plot Plot data for the benchmark function
ii = length(find(T_CPU>0));

% Plot data
figure(1);
clf(1);
subplot(2,1,1);
plot(Size(1:ii), Speedup(1:ii), 'g-', ...
    'Linewidth',1.5);
grid;
ylabel('GPU Speed-up [-]');
title([FCT ': ' TitleStr]);
subplot(2,1,2);
plot(Size(1:ii), 1E3*T_CPU(1:ii), 'r-', ...
    Size(1:ii), 1E3*T_GPU(1:ii), 'b-', ...
    'LineWidth', 1.5);
grid;
legend('CPU', 'GPU', 'Location', 'NorthWest');
xlabel('Square Matrix Size, N\timesN [-]');
ylabel('Execution time [ms]');
end
The source file for the master file named Asus_GTX260M_BESSELJ.m is:
%% MASTER SCRIPT FOR BESSELJ BENCHMARK
% IMPORTANT NOTE: The name of the benchmark file MUST be like:
%                 Name_FCT.m where FCT is the name of the function
%                 to be benchmarked (all capital letters) - e.g. BESSELJ.
%                 A full name could be: AsusG51J_BESSELJ.m or
%                 Asus_G51J_BESSELJ.
%
% Platform:       Asus G51J
% CPU:            Intel Core i7-720QM 1.6GHz
% CPU GFlops:     25.6 (http://www.intel.com/support/processors/sb/cs-023143.htm)
% CPU mem.:       4 GB
% GPU:            NVIDIA GTX260M
% GPU GFlops:     462 (http://www.nvidia.com/object/product_geforce_gtx_260m_us.html)
% GPU mem.:       1024 MB
% Operating sys.: Microsoft Windows 7 x64
% Jacket ver.:    1.4.0 (build 6080)
% NVIDIA driver:  257.21
% CUDA Toolkit:   3.1

%% SET UP INPUT DATA
% Set number of threads to 1
maxNumCompThreads(1);

% Matrix size
Size = [2:1:2000];

% Core name of plot and data files
Fname = 'Asus_G51J_01';
Dname = './Asus.GTX260M';

% Name of title in plot
TitleStr = 'Asus G51J: Core i7-720QM & GTX260M';

% Computation type;
%   State = 'FULL'   for cold start
%   State = 'RESUME' for continuation of computations
%   State = 'PLOT'   plot existing data
State = 'FULL';

% Run benchmark
bench_BESSELJ( Size, Dname, Fname, TitleStr, State );
Before running the master file a directory named ./Asus.GTX260M must be created. The files Asus_GTX260M_BESSELJ.m and bench_BESSELJ.m must be placed in the same directory and the directory Asus.GTX260M must be a sub-directory.
The benchmark can be stopped with CTRL-C while it is running. It can then be resumed by setting State = 'RESUME' in the master file Asus_GTX260M_BESSELJ.m and running that master file again. The benchmark then continues from the last data point that was saved.