Benchmarking Jacket Applications

From Jacket Wiki

Being able to benchmark an application is essential when working with Jacket. There are several ways to benchmark, of which profiling (gprofile) is one. However, timing with tic-toc is also very useful. When the elapsed time is very small, the results usually show considerable variation. In these cases it is often an advantage to repeat the computations in a for-end loop so that the elapsed time becomes longer. This tends to produce reproducible results, which is essential for the results to be trusted. The use of tic-toc is described in further detail in the following.


Basic Procedure

Methodology

The general procedure involves a few steps. First, Jacket and CUDA must be warmed up. This is described in further detail here. Next, a time mark is set and the computations are performed. Because of the 'lazy' computation behavior of Jacket, the computations must then be forced to run on the GPU. Finally, a second time stamp is taken to find the elapsed wall-clock time. The procedure can be outlined as follows:

% Placeholders: NoRuns is the number of repetitions, VARIABLES the GPU results to evaluate
% WARM UP JACKET AND CUDA   % This is essential to create reproducible results
gsync;                      % Finalize pending GPU activity, and synchronize CPU and GPU
T1 = tic;                   % Set mark for start of computations
for counter=1:NoRuns        % Repeat computations if one computation is short in time duration
    % PERFORM COMPUTATIONS
    geval(VARIABLES);       % Force computation of result on the GPU
end
gsync;                      % Finalize GPU activity and synchronize the CPU and GPU
Tcomp = toc(T1)/NoRuns;     % Time to do one set of computations

When a single pass through the computations takes very little time, repeating it in a for-end loop as indicated above improves reproducibility. Experience shows that the overhead of the loop itself is generally very small. If desired, this overhead can also be measured and subtracted.
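As a rough illustration of subtracting the loop overhead, the following CPU-only MATLAB sketch times a repeated computation and then an empty loop with the same number of iterations. The variable names, the matrix size and the number of repetitions are chosen for this illustration only. For GPU code, the gsync and geval calls from the outline above would be needed as well.

```matlab
% Sketch: per-run time with the empty-loop overhead subtracted (CPU only).
NoRuns = 20;                         % Illustrative number of repetitions
A = rand(200, 200, 'single');        % Illustrative test data

% Time the repeated computation
T1 = tic;
for counter = 1:NoRuns
    R = A * A;                       % Computation being benchmarked
end
Ttotal = toc(T1);

% Time an empty loop with the same number of iterations
T2 = tic;
for counter = 1:NoRuns
end
Tloop = toc(T2);

% Per-run time; clamp at zero in case timer noise makes the difference negative
Tcomp = max(Ttotal - Tloop, 0) / NoRuns;
fprintf('Time per run: %10.6f s (loop overhead: %10.6f s)\n', Tcomp, Tloop);
```

For computations of any appreciable duration, Tloop is expected to be negligible compared with Ttotal, which is consistent with the remark above that the overhead is generally very small.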


Example 1

This subsection describes a timing example where the functionality is contained in a Jacket/MATLAB function. The master .m file is given by:

%% Step 0: User defined settings
Sz = 500;      % Matrix size
NoRuns = 15;   % Number of runs to increase reproducibility
 
 
%% Step 1: Warm-up the GPU by calling benchJKT with a small array size
[ dummy ] = benchJKT( 5, 2, 1 );
 
 
%% Step 2: Do the computations for both CPU and GPU computations
% Get time when using the GPU
[ T_GPU ] = benchJKT( Sz, NoRuns, 1 );
 
% Get time when using the CPU
[ T_CPU ] = benchJKT( Sz, NoRuns, -1 );
 
 
%% Step 3: Display results
fprintf('===========================================================\n');
fprintf('CPU run time:                    %10.3f\n', T_CPU);
fprintf('GPU run time:                    %10.3f\n', T_GPU);
fprintf('Speed-up:                        %10.3f\n', T_CPU/T_GPU);
fprintf('===========================================================\n');

As seen above, the same function benchJKT is used for both the CPU and the GPU computations. The only difference is the GPU switch, which tells the function whether to use the CPU or the GPU. Keeping the functionality in a Jacket/MATLAB function has two advantages: the basic equations are coded only once, which makes errors easier to avoid, and the function is easier to warm up. The function benchJKT is defined as:

function [ Telapsed ] = benchJKT( Sz, NoRuns, GPU )
%benchJKT Example of function to benchmark CPU and GPU runs.
 
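% Step 1: Create the input arrays
% Note: the timer starts before the arrays are created, so array creation
% is included in the elapsed time.
gsync;
Tstart = tic;                       % Start the timer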
if GPU>0,
    A = grand(Sz,Sz,'single');
    B = grand(Sz,Sz,'single');
    C = grand(Sz,Sz,'single');
    D = grand(Sz,Sz,'single');
    R = grand(Sz,Sz,'single');
else
    A = rand(Sz,Sz,'single');
    B = rand(Sz,Sz,'single');
    C = rand(Sz,Sz,'single');
    D = rand(Sz,Sz,'single');
    R = rand(Sz,Sz,'single');
end
geval(A, B, C, D, R);
 
 
% Step 2: Make computations
for counter=1:NoRuns            % Repeat the computations
    R = A.^7 * B.^11;           % Processing
    R = R.^3 ./ C.^2;           % Processing
    R = R + D.^2;               % Processing
    geval(R);                   % Force Jacket to compute R
end
gsync;
Telapsed = toc(Tstart)/NoRuns;  % Stop the timer and compute elapsed time
 
end

Running the above code on Reference System #3 (with the GeForce 9400M GPU) gives the following results (times in seconds):

>> master_benchJKT
===========================================================
CPU run time:                         0.159
GPU run time:                         0.041
Speed-up:                             3.852
===========================================================
>> master_benchJKT
===========================================================
CPU run time:                         0.157
GPU run time:                         0.043
Speed-up:                             3.629
===========================================================
>> master_benchJKT
===========================================================
CPU run time:                         0.157
GPU run time:                         0.043
Speed-up:                             3.657
===========================================================
>>


Advanced Concepts

Methodology

For more advanced benchmarks, such as the ones presented for the JPI (Jacket Performance Index) on Torben's Corner, there are more things to consider.

  1. Computer: When running benchmarks, the computer should be left alone so that the operating system does not have to handle disturbing requests. Be sure to disable any power saving features. For laptops, always run with the charger connected. Otherwise it is typical to see an immediate drop in performance when the charger is disconnected.
  2. Hyper Threading: MATLAB uses multithreaded computation by default on multi-core systems. Unless the benchmark is specifically meant to show this, be sure to disable it. At the moment this can be done by issuing the command maxNumCompThreads(1);, which limits MATLAB to a single computational thread rather than switching off the processor's Hyper-Threading feature itself. In later versions of MATLAB this option is expected to be removed, in which case multithreading must be disabled in the MATLAB preferences, which then applies to the entire MATLAB session.
  3. Repetitions: When running benchmarks, the time for a single run differs greatly between small sweep parameters (e.g. small matrix sizes) and large sweep parameters (e.g. large matrix sizes). A typical approach is to average over a fixed number of repetitions, e.g. 20, 50 or 100. In my experience this is not a good approach. The critical issue is to avoid timing short events, since in that case the results tend not to be reproducible. I have tried several ways of setting the number of repetitions and have found the following to be the best: the number of repetitions is set iteratively such that the timed event always lasts at least Tmin seconds. In my experience, using for example Tmin = 1 ensures that the results are reproducible.
  4. Warming up: To reach reproducible results it is necessary to warm up both MATLAB and Jacket. The best way to do this is simply to perform precisely the same operation (with the same number of iterations) as the event that is timed afterwards. Doing this also makes it possible to stop and resume the benchmark with about the same results as when the benchmark is not stopped.
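The iterative choice of repetitions described in item 3 can be sketched as follows. This is a simplified, CPU-only illustration and not the full scheme used in the benchmark source: the function handle work, the test data and the reduced value of Tmin are assumptions made to keep the example short. The first passes through the loop also act as a warm-up in the sense of item 4.

```matlab
% Sketch: choose the number of repetitions so the timed loop lasts >= Tmin.
Tmin = 0.2;                          % Minimum timed duration [s] (illustrative; the text suggests 1)
A = rand(100, 100);                  % Illustrative test data
work = @() A * A;                    % Hypothetical operation to benchmark

NoRuns = 1;                          % Start with a single repetition
Telap = 0;
while Telap < Tmin
    tstart = tic;
    for no = 1:NoRuns
        R = work();
    end
    Telap = toc(tstart);
    if Telap < Tmin
        % Scale the count towards Tmin with a 1.5x margin and at least double it.
        % The lower bound of 1e-6 s guards against a zero timer reading.
        NoRuns = max(2*NoRuns, ceil(1.5*NoRuns*Tmin/max(Telap, 1e-6)));
    end
end
Tsingle = Telap / NoRuns;            % Average time per repetition
fprintf('%d repetitions, %.3f s total, %.3e s per run\n', NoRuns, Telap, Tsingle);
```

Because the repetition count at least doubles on every failed attempt, the loop ends after a small number of passes even when the first estimate is far too low.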

So to sum up, the following features would be an advantage:

  1. It should be possible to set a minimum time that the benchmark will run to increase the chance of producing reproducible results. The benchmark should by itself meet this requirement for minimum benchmarking time.
  2. It should be possible to start/stop/resume the benchmark without creating artifacts when resuming the benchmark.
  3. Data should frequently be saved to disk in case of a crash or simply to be able to resume the benchmark later.
  4. Plots of the results should be shown during execution - it is then possible to terminate the execution if some strange behavior is seen or if some mistake in the script is made.
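Features 2 and 3 can be illustrated with a minimal, CPU-only sketch of periodic saving and resuming. The file name, the sweep values and the save interval are hypothetical and chosen for this illustration only; plotting (feature 4) is left out to keep the example short.

```matlab
% Sketch: save results periodically and resume from the last saved point.
Size  = 10:10:100;                                     % Illustrative sweep values
Fsave = fullfile(tempdir, 'bench_checkpoint_demo.mat'); % Hypothetical checkpoint file
T = zeros(length(Size), 1);                            % Timing results (0 = not yet measured)

if exist(Fsave, 'file')
    S = load(Fsave, 'T');                              % Resume: recover earlier results
    T = S.T;
end
first = find(T == 0, 1);                               % First point not yet measured

for ii = first:length(Size)                            % Skipped entirely if nothing is left
    A = rand(Size(ii));
    t1 = tic;
    R = A * A;                                         % Operation being benchmarked
    T(ii) = max(toc(t1), realmin);                     % Keep measured entries nonzero
    if mod(ii, 5) == 0 || ii == length(Size)
        save(Fsave, 'Size', 'T');                      % Checkpoint every 5 points and at the end
    end
end
```

Running the script a second time finds the checkpoint file and skips the points that were already measured, which is the behavior wanted when a benchmark is interrupted and restarted.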

Source Code

As one example the following shows the function code to do a benchmark as well as the source code for the master file to set up the data. The key file is the function file, which in this example is named bench_BESSELJ.m:

function [] = bench_BESSELJ( Size, Dname, Fname, TitleStr, State )
% bench_BESSELJ Generic benchmark file for matrix based benchmarks
% Name of the function file must be bench_FCT where FCT is the specific
% name of the function to be benchmarked. For example a valid function name
% could be bench_BESSELJ.m or bench_MULTIPLY.m.
%
% (C) Torben Larsen, Aalborg University, 23-JUL-2010
%     E-mail: tl.jacket@es.aau.dk
%     http://wiki.accelereyes.com/wiki/index.php/Torben%27s_Corner
%
 
% Minimum execution time for the individual benchmark point
Tmin = 1;
 
% Max. number of repetitions in loop timing estimation. The max.
% number that can be handled in loops is MaxAvg = 2147483647.
MaxAvg = 1E9;
 
 
%% INITIALIZE VECTORS ETC.
% Vector of available GPU memory
Mem_MB = zeros(length(Size),1);
 
% Extract function name from the function m-file name - update Fname
S = regexp(mfilename, '_', 'split');
FCT   = char(S(2));
Fname = [FCT '_' Fname];
 
 
%% LOAD DATA/PLOT OR PERFORM BENCHMARK
if strcmp(State,'PLOT')
    % Load data to define:
    %   Size, Speedup, FCT, T_CPU, T_CPU_tot, T_GPU, T_GPU_tot
    load([ Dname '/' Fname '.mat']);
    bench_plot( Size, Speedup, FCT, T_CPU, T_GPU, TitleStr )
    return;
end
 
 
%% PREALLOCATE ARRAYS ETC. - IF "RESUME" IS USED THEN LOAD DATA
% If State==RESUME, data is loaded from the existing file for the given
% benchmark and continued from where it came to.
if strcmp(State,'RESUME')
    load([ Dname '/' Fname '.mat']);
    ii = length(find(T_CPU>0));
    SizeN = Size(ii+1:end);
else
    Speedup = zeros(length(Size),1);
    T_CPU = zeros(length(Size),1);
    T_CPU_tot = zeros(length(Size),1);
    T_GPU = zeros(length(Size),1);
    T_GPU_tot = zeros(length(Size),1);
    ii = 0;
    SizeN = Size;
end
 
 
 
%% PERFORM ANALYSIS
for N=SizeN
    ii = ii + 1;
 
    % Set PRNG to ensure same starting state for reproducibility
    RandStream.setDefaultStream(RandStream('mt19937ar','seed',1004397));
 
    % Define arrays
    Ac = randn(N,N,'single') + 1j*randn(N,N,'single');   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
 
    % Print current matrix size
    fprintf('%4.0f / %4.0f', N, max(Size));
 
    % CPU test begin --------------------------------------------------
    whilecount = 0;
    Telap_cpu = -1;
    while Telap_cpu < Tmin
        whilecount = whilecount + 1;
        if Telap_cpu == -1
            t1 = tic;
            Rc = besselj(1/pi,Ac);         %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            Rc = besselj(1/pi,Ac);         %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            Telap_cpu = toc(t1)/2;
            NoRunsCPU = ceil(1.5*Tmin/Telap_cpu);
        else
            NoRunsCPU = ceil(1.5*whilecount*NoRunsCPU/Telap_cpu*Tmin);
        end
 
        % Warm-up
        for no=1:NoRunsCPU
            Rc = besselj(1/pi,Ac);         %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
        end
 
        % Benchmark
        tstart1 = tic;
        for no=1:NoRunsCPU
            Rc = besselj(1/pi,Ac);         %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
        end
        Telap_cpu = toc(tstart1);
    end
 
    % Determine time for CPU loop alone
    RPT = min(5E3,ceil(MaxAvg/NoRunsCPU));
    tstart = tic;
    for AvgNo=1:RPT
        for no=1:NoRunsCPU
        end
    end
    T_CPU_Loop = toc(tstart)/RPT;
 
    % Compute CPU times
    T_CPU(ii) = max((Telap_cpu-T_CPU_Loop)/NoRunsCPU,2.5E-10);
    T_CPU_tot(ii) = Telap_cpu;
    fprintf('  |  T_CPU: %6.1f,', T_CPU_tot(ii));
    % CPU test end   --------------------------------------------------
 
    % GPU test begin --------------------------------------------------
    Ag = gsingle(Ac);
    geval(Ag);
 
    whilecount = 0;
    Telap_gpu = -1;
    while Telap_gpu < Tmin
        whilecount = whilecount + 1;
        if Telap_gpu == -1
            gsync;
            t1 = tic;
            Rg = besselj(1/pi,Ag);         %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            geval(Rg);
            Rg = besselj(1/pi,Ag);         %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            geval(Rg);
            gsync;
            Telap_gpu = toc(t1)/2;
            NoRunsGPU = ceil(1.5*Tmin/Telap_gpu);
        else
            NoRunsGPU = ceil(1.5*whilecount*NoRunsGPU/Telap_gpu*Tmin);
        end
 
        % Warm-up
        gsync;
        for no=1:NoRunsGPU
            Rg = besselj(1/pi,Ag);         %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            geval(Rg);
        end
 
        % Benchmark
        gsync;
        tstart1 = tic;
        for no=1:NoRunsGPU
            Rg = besselj(1/pi,Ag);         %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% HERE %%%
            geval(Rg);
        end
        gsync;
        Telap_gpu = toc(tstart1);
    end
 
    % Determine time for GPU loop alone
    RPT = min(5E3,ceil(MaxAvg/NoRunsGPU));
    tstart = tic;
    for AvgNo=1:RPT
        for no=1:NoRunsGPU
        end
    end
    T_GPU_Loop = toc(tstart)/RPT;
 
    % Compute GPU times
    T_GPU(ii) = max((Telap_gpu-T_GPU_Loop)/NoRunsGPU,2.5E-10);
    T_GPU_tot(ii) = Telap_gpu;
    fprintf('   T_GPU: %6.1f', T_GPU_tot(ii));
 
    % Speed-up
    Speedup(ii) = T_CPU(ii)/T_GPU(ii);
    fprintf('   |   Speed-up:  %6.1f', Speedup(ii));
 
    % Memory
    gpu_info = gpu_entry(13);
    Mem_MB(ii) = gpu_info.gpu_free/1E6;
    clear gpu_hook;
    fprintf('   |   Mem free [MB]:  %6.1f', Mem_MB(ii));
    % GPU test end   --------------------------------------------------
 
 
    % Print *** as a warning for simulation time violation
    % (should not be possible unless something spooky is going on)
    if T_CPU_tot(ii)>=Tmin && T_GPU_tot(ii)>=Tmin
        fprintf('\n');
    else
        fprintf('  ***\n');
    end
 
    % Save data and plot for every 10 data points
    if ii/10==floor(ii/10)
        save([ Dname '/' Fname '.mat'], 'Size', 'Speedup', 'FCT', ...
             'T_CPU', 'T_CPU_tot', 'T_GPU', 'T_GPU_tot');
 
        % Plot data
        bench_plot( Size, Speedup, FCT, T_CPU, T_GPU, TitleStr );
    end
end
 
 
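%% PLOT RESULTS, SAVE FIGURE AND DATA
% Refresh the plot so that points measured since the last periodic update are included
bench_plot( Size, Speedup, FCT, T_CPU, T_GPU, TitleStr );
print( gcf, '-djpeg99', '-r100', [ Dname '/' Fname '.jpg'] );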
print( gcf, '-depsc2', '-r2400', [ Dname '/' Fname '.eps'] );
save([ Dname '/' Fname '.mat'], 'Size', 'Speedup', 'FCT', ...
     'T_CPU', 'T_CPU_tot', 'T_GPU', 'T_GPU_tot');
end
 
 
 
function [] = bench_plot( Size, Speedup, FCT, T_CPU, T_GPU, TitleStr )
% bench_plot Plot data for the benchmark function
ii = length(find(T_CPU>0));
 
% Plot data
figure(1); clf(1);
subplot(2,1,1);
plot(Size(1:ii), Speedup(1:ii), 'g-', ...
     'Linewidth',1.5);
grid;
ylabel('GPU Speed-up   [-]');
title([FCT ': ' TitleStr]);
subplot(2,1,2);
plot(Size(1:ii), 1E3*T_CPU(1:ii), 'r-', ...
     Size(1:ii), 1E3*T_GPU(1:ii), 'b-', ...
     'LineWidth', 1.5);
grid;
legend('CPU', 'GPU', 'Location', 'NorthWest');
xlabel('Square Matrix Size, N\timesN   [-]');
ylabel('Execution time   [ms]');
end

The source file for the master file named Asus_GTX260M_BESSELJ.m is:

%% MASTER SCRIPT FOR BESSELJ BENCHMARK
 
% IMPORTANT NOTE:  The name of the benchmark file MUST be like:
%                  Name_FCT.m where FCT is the name of the function
%                  to be benchmarked (all capital letters) - e.g. BESSELJ.
%                  A full name could be: AsusG51J_BESSELJ.m or
%                  Asus_G51J_BESSELJ.
%
% Platform:        Asus G51J
% CPU:             Intel Core i7-720QM 1.6GHz
% CPU GFlops:      25.6 (http://www.intel.com/support/processors/sb/cs-023143.htm)
% CPU mem.:        4 GB
% GPU:             NVIDIA GTX260M
% GPU GFlops:      462 (http://www.nvidia.com/object/product_geforce_gtx_260m_us.html)
% GPU mem.:        1024 MB
% Operating sys.:  Microsoft Windows 7 x64
% Jacket ver.:     1.4.0 (build 6080)
% NVIDIA driver:   257.21
% CUDA Toolkit:    3.1
 
 
%% SET UP INPUT DATA
% Set number of threads to 1
maxNumCompThreads(1);
 
% Matrix size
Size = [2:1:2000];
 
% Core name of plot and data files
Fname = 'Asus_G51J_01';
Dname = './Asus.GTX260M';
 
% Name of title in plot
TitleStr = 'Asus G51J: Core i7-720QM & GTX260M';
 
% Computation type;
%   State = 'FULL' for cold start
%   State = 'RESUME' for continuation of computations
%   State = 'PLOT' plot existing data
State = 'FULL';
 
% Run benchmark 
bench_BESSELJ( Size, Dname, Fname, TitleStr, State );

Before running the master file, a directory named ./Asus.GTX260M must be created. The files Asus_GTX260M_BESSELJ.m and bench_BESSELJ.m must be placed in the same directory, and Asus.GTX260M must be a sub-directory of that directory.
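If preferred, the directory can be created from MATLAB rather than by hand. The following lines are a small sketch that could be run from the directory containing the two .m files; they are not part of the original master file.

```matlab
% Sketch: create the output directory used by the master file if it is missing.
Dname = './Asus.GTX260M';            % Same directory name as in the master file
if ~exist(Dname, 'dir')
    mkdir(Dname);                    % Create the sub-directory for data and plots
end
```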

The benchmark can be stopped with CTRL-C while it is running. It can then be resumed by setting State = 'RESUME' in the master file Asus_GTX260M_BESSELJ.m and running that master file again. The benchmark then continues from the last saved data point.




