Warming up Jacket

Jacket uses Just-In-Time (JIT) compilation, so the first run of a piece of code takes extra time. In some cases this can be annoying, for example when running timing tests. Warming up Jacket before the timed section is one way to avoid this. Warm-up is essentially a two-step procedure: Jacket itself must be warmed up, and so must CUDA.


Warm-up procedure

Often ginfo is used for warm-up, but this only warms up the core part of Jacket, not CUDA. To ensure that CUDA is warmed up as well, something more than ginfo is needed. Suppose we have a function Jmyfunction that contains some GPU code, for example:

function [ out_ ] = Jmyfunction( in_ )
% Jacket function where out_ = const*in_.^2
const_ = gsingle(0.667);
out_ = gzeros(size(in_));      % Preallocate out on the GPU
out_ = const_ * in_.^2;        % Compute output
end

where in_ is the gsingle input to be processed and out_ is the output of Jmyfunction. A trailing underscore identifies Jacket (GPU) variables. In the m-file that controls our simulation or test, here Jmyfunction_master.m, we can then do the following:

% Size of quadratic input matrix
Size = 1000;
 
% Preallocate variables
in_ = grand(Size,Size);
out_  = gzeros(Size,Size);
 
% Warm up Jacket and CUDA for Jmyfunction
dummy_ = gzeros(5,5);                     % Predefine dummy variable
[ dummy_ ] = Jmyfunction( grand(5,5) );   % Warm up the GPU for Jmyfunction
clear dummy_;                             % Clear workspace for dummy_
 
% Do the computation and get the timing
t1 = tic;
[ out_ ] = Jmyfunction( in_ );
Telapsed = toc(t1)

As seen above, the warm-up is done by calling exactly the Jacket function that will be timed, but with a very small input matrix. Many tests have been conducted with different types of functions, and warming up with small input arrays appears to be fully sufficient. The output from running Jmyfunction_master could for example be:

>> Jmyfunction_master
 
Telapsed =
 
   2.2580e-04
 
>>

This shows the core principle: simple but very effective. Calling the Jacket functions once with small arrays warms up both Jacket and CUDA, so that the following runs execute as fast as is possible on the given computer and GPU.
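
To see how much first-call overhead the warm-up absorbs, the warm-up call itself can be timed too. The following is a minimal sketch in plain MATLAB, so that it runs without Jacket installed. The handle myfunction is a hypothetical stand-in for Jmyfunction, and the variable names Tfirst and Ttimed are assumptions made for this illustration. With Jacket, the inputs would be created with grand instead of rand.

```matlab
% Hypothetical plain-MATLAB stand-in for Jmyfunction (out = const * in.^2)
myfunction = @(in) single(0.667) * in.^2;

% Warm-up call on a tiny input; its duration includes any first-call overhead
t0 = tic;
dummy = myfunction(rand(5, 5, 'single'));
Tfirst = toc(t0);
clear dummy;

% Timed call on the full-size input, made after the warm-up
Size = 1000;
in = rand(Size, Size, 'single');
t1 = tic;
out = myfunction(in);
Ttimed = toc(t1);

fprintf('Warm-up call: %.3e s, timed call: %.3e s\n', Tfirst, Ttimed);
```

The two numbers depend on the system and on what has already run in the session, so no particular ratio between them should be expected.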


Example: Matrix multiplication

As a simple example, consider a function that uses the GPU to multiply two matrices of the same size. A function that only performs a matrix multiplication is not very typical, but it serves as a simple illustration. It also makes sense to use MATLAB functions here, since most MATLAB programmers quickly start using functions to keep a clear overview of their programs. In the following, a trailing underscore marks GPU variables so that they are easy to identify. We define a Jacket multiply function Jmultiply as:

function [ Res_ , T_GPU ] = Jmultiply( A1_, A2_ )
% Jacket function to multiply two matrices, and time the computation
Res_ = gzeros(size(A1_));      % Preallocate the result matrix
tstart = tic;                  % Start the timer
Res_ = A1_ * A2_;              % Make the computation
gforce(Res_);                  % Force the GPU to compute Res_
T_GPU = toc(tstart);           % Stop the timer
end

The Jmultiply_master.m m-file, which requests the computation of the matrix multiplication, contains the following piece of code:

%% Jmultiply_master.m
% MATLAB .m file to control the warm-up of Jacket.
clear all;
 
% Set matrix size
Size = 800;
 
%% Preallocate variables
H1_ = grand(Size,Size);
H2_ = grand(Size,Size);
R_  = gzeros(Size,Size);
 
%% Warm up Jacket and CUDA for Jmultiply
Rtmp_ = gzeros(5,5);
if exist('spmd')==5,
    spmd;
        [Rtmp_, T] = Jmultiply( grand(5,5), grand(5,5) );   % Warm-up the GPU
    end
else
    [Rtmp_, T] = Jmultiply( grand(5,5), grand(5,5) );       % Warm-up the GPU
end
clear Rtmp_ T;
 
%% Do the computation and get the timing
[ R_, T_GPU ] = Jmultiply( H1_, H2_ );
fprintf('Time to multiply two %3.0f x %3.0f matrices: %10.1f [ms]\n', ...
    Size, Size, T_GPU*1E3);

The spmd-end block pushes the execution of Jmultiply to all workers, if workers are defined. This is used when multiple GPUs are available, and it requires the MATLAB Parallel Computing Toolbox (PCT). The if-else-end statement checks whether PCT is available: if it is, the warm-up runs inside spmd, otherwise it runs as a plain function call.

The trick to warming up the computation is to invoke the Jmultiply function with small matrices. In some cases it is sufficient to run only the type of computation used in the function, meaning that the .m file just includes an A_ * B_ operation. However, calling the Jacket function itself is the safe way to warm up properly: should you add more functionality to the Jacket function, the warm-up call will still cover it. Generally, there does not seem to be any major advantage in using larger vectors or matrices for the warm-up; small ones seem to do the trick. Running the example could result in:

>> Jmultiply_master
Time to multiply two 800 x 800 matrices:      115.0 [ms]
>>

The above is the output from a typical run on Reference System #3.
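
When a test script uses several functions, the same idea can be applied to each of them in turn. The sketch below is one possible way to organize that, written in plain MATLAB so that it runs without Jacket. The handles square_fn and multiply_fn, the cell array jobs, and the counter nWarmed are all hypothetical names for this illustration. With Jacket, the handles would point at functions such as Jmyfunction and Jmultiply, and the small inputs would come from grand(5,5).

```matlab
% Hypothetical stand-ins for the functions to be warmed up
square_fn   = @(a) single(0.667) * a.^2;
multiply_fn = @(a, b) a * b;

% Each row pairs a function handle with a cell array of tiny inputs
jobs = {square_fn,   {rand(5, 'single')}; ...
        multiply_fn, {rand(5, 'single'), rand(5, 'single')}};

% Call every function once with its small inputs and discard the result
nWarmed = 0;
for k = 1:size(jobs, 1)
    fn   = jobs{k, 1};
    args = jobs{k, 2};
    tmp  = fn(args{:});        % Throwaway warm-up call
    nWarmed = nWarmed + 1;
end
clear tmp fn args k;

fprintf('Warmed up %d functions\n', nWarmed);
```

Because each function is warmed up by calling it directly, the list stays valid when the body of a function changes; only new functions need a new row.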


Further Reading

Read this discussion on time measurement on the NVIDIA forums that also talks about CUDA warmup.

