Handling Scalars In Jacket

From Jacket Wiki


When starting to use Jacket, one of the first things the manual tells us is to use gsingle/gdouble/gXXXXX to cast variables to the GPU for subsequent computations there. However, with Jacket 1.3.0, and even more so with Jacket 1.3.1, casting scalars to the GPU is often not the best choice.


Benchmarking Technique

To examine this, we use a computationally heavy function that takes an array and two constants as input. The code for the function computefun is the following:

function [R] = computefun( a, k1, k2 )
RPT = 40;
 
if strcmp(class(a),'gsingle')
    Rt = gzeros(RPT,1,'single');
else
    Rt = zeros(RPT,1,'single');
end
 
for rpt=1:RPT
    Rx = k2*a - k1.^2*a.^2 ...
         + k2.^2*a.^3 - k1.^3*a.^4 ...
         + k2.^4*a.^5 - k1.^5*a.^6 ...
         + k2.^7*a.^7 - k1.^8*a.^8 ...
         + k2.^9*a.^9 - k1.^10*a.^10 ...
         + k2.^11*a.^11 - k1.^12*a.^12 ...
         + k2.^13*a.^13 - k1.^14*a.^14 ...
         + k2.^15*a.^15 - k1.^16*a.^16 ...
         + k2.^17*a.^17 - k1.^18*a.^18 ...
         + k2.^19*a.^19 - k1.^20*a.^20;
    Rt(rpt) = sum(abs(fft(Rx)).^2)/(length(a)^2);
end
R = sum(Rt);
end

This function contains various computations: a power series, fft, abs, sum, etc. A number of repetitions are also included so that the execution time can be controlled easily; the loop does nothing more than increase the run time. Since we are testing the GPU, the input vector a is always defined as a gsingle in the following. A gdouble could just as well be used if the GPU supports it.


Master Script

The master script runs five different benchmarks, set up as follows:

  1. CONSTANT-1 (k1) is a CPU variable (a single) and CONSTANT-2 (k2) is a CPU variable (a single).
  2. CONSTANT-1 (k1) is a CPU variable (a single) and CONSTANT-2 (k2) is a GPU variable (a gsingle).
  3. CONSTANT-1 (k1) is a GPU variable (a gsingle) and CONSTANT-2 (k2) is a CPU variable (a single).
  4. CONSTANT-1 (k1) is a GPU variable (a gsingle) and CONSTANT-2 (k2) is a GPU variable (a gsingle).
  5. CONSTANT-1 (k1) and CONSTANT-2 (k2) are not cast in any way but are passed as the MATLAB default double type. This means that Jacket decides how the constants are transferred and computed.

The script to perform this benchmark is the following:

%% MASTER_COMPUTEFUN - BENCHMARK SCRIPT TO TEST SCALARS WITH JACKET
clear all;
 
% User must set the length of the vector
LEN = 2^21;
 
% Create reference data
aref = randn(LEN,1);
k1ref = pi/4;
k2ref = pi/5;
 
 
 
%% CONSTANT-1: CPU, CONSTANT-2: CPU
a = gsingle(aref);
k1 = single(k1ref);
k2 = single(k2ref);
 
[R] = computefun(a, k1, k2);
geval(R); gsync;
tstart = tic;
[R] = computefun(a, k1, k2);
geval(R); gsync;
T_cpu_cpu = toc(tstart);
 
 
%% CONSTANT-1: GPU, CONSTANT-2: CPU
a = gsingle(aref);
k1 = gsingle(k1ref);
k2 = single(k2ref);
 
[R] = computefun(a, k1, k2);
geval(R); gsync;
tstart = tic;
[R] = computefun(a, k1, k2);
geval(R); gsync;
T_gpu_cpu = toc(tstart);
 
 
%% CONSTANT-1: CPU, CONSTANT-2: GPU
a = gsingle(aref);
k1 = single(k1ref);
k2 = gsingle(k2ref);
 
[R] = computefun(a, k1, k2);
geval(R); gsync;
tstart = tic;
[R] = computefun(a, k1, k2);
geval(R); gsync;
T_cpu_gpu = toc(tstart);
 
 
%% CONSTANT-1: GPU, CONSTANT-2: GPU
a = gsingle(aref);
k1 = gsingle(k1ref);
k2 = gsingle(k2ref);
 
[R] = computefun(a, k1, k2);
geval(R); gsync;
tstart = tic;
[R] = computefun(a, k1, k2);
geval(R); gsync;
T_gpu_gpu = toc(tstart);
 
 
%% CONSTANT-1: Jacket decides, CONSTANT-2: Jacket decides
a = gsingle(aref);
 
[R] = computefun(a, k1ref, k2ref);
geval(R); gsync;
tstart = tic;
[R] = computefun(a, k1ref, k2ref);
geval(R); gsync;
T_jkt_jkt = toc(tstart);
 
 
%% PRINT RESULTS
fprintf('=====================================================\n');
fprintf('CONSTANT-1: CPU,  CONSTANT-2: CPU   >>   %8.3f [s]\n', T_cpu_cpu);
fprintf('CONSTANT-1: CPU,  CONSTANT-2: GPU   >>   %8.3f [s]\n', T_cpu_gpu);
fprintf('CONSTANT-1: GPU,  CONSTANT-2: CPU   >>   %8.3f [s]\n', T_gpu_cpu);
fprintf('CONSTANT-1: GPU,  CONSTANT-2: GPU   >>   %8.3f [s]\n', T_gpu_gpu);
fprintf('CONSTANT-1: JKT,  CONSTANT-2: JKT   >>   %8.3f [s]\n', T_jkt_jkt);
fprintf('\n');
fprintf('JKT = Jacket decides (MATLAB double in this case)\n');
fprintf('=====================================================\n');

Note from the code above that in every case a warm-up call to computefun is made before the benchmark call. After the warm-up, geval(R); and gsync; are called before the clock is started, which ensures that all pending GPU computations have finished and that the CPU and GPU are synchronized. In the same way, geval(R); and gsync; are called before the clock is stopped, so that the timed computations have actually completed.
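The warm-up and synchronization pattern described here could be collected in a small helper, so that each benchmark becomes a single call. The following is only a sketch: the names timedcall and syncgpu are hypothetical and not part of Jacket, and it assumes that results which are not gsingle need no synchronization.

```matlab
function T = timedcall(fn, a, k1, k2)
% TIMEDCALL  Hypothetical helper: one warm-up call, then one timed call.
%   fn is a function handle, e.g. @computefun. Example (assumed usage):
%     T_cpu_cpu = timedcall(@computefun, gsingle(aref), single(k1ref), single(k2ref));
%     T_jkt_jkt = timedcall(@computefun, gsingle(aref), k1ref, k2ref);
R = fn(a, k1, k2);      % warm-up call, result only used for synchronization
syncgpu(R);             % make sure the warm-up work is finished
tstart = tic;
R = fn(a, k1, k2);      % timed call
syncgpu(R);             % make sure the timed work is finished
T = toc(tstart);
end

function syncgpu(R)
% Force evaluation and wait for the GPU, as the master script does.
% Assumption: only gsingle results need this; other results are skipped.
if isa(R, 'gsingle')
    geval(R); gsync;
end
end
```

Passing the constants as arguments keeps the warm-up call and the timed call on exactly the same types, which is easy to get wrong when the two calls are written out by hand.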

Results

The time has come to put the benchmark to work. I use a MacBook Pro (see Reference System #2) for the following results. The vector size is 2^21 = 2097152 and the constants are k1 = pi/4 and k2 = pi/5. The results are:

=====================================================
CONSTANT-1: CPU,  CONSTANT-2: CPU   >>      8.579 [s]
CONSTANT-1: CPU,  CONSTANT-2: GPU   >>     11.252 [s]
CONSTANT-1: GPU,  CONSTANT-2: CPU   >>     11.667 [s]
CONSTANT-1: GPU,  CONSTANT-2: GPU   >>     13.639 [s]
CONSTANT-1: JKT,  CONSTANT-2: JKT   >>      8.508 [s]
 
JKT = Jacket decides (MATLAB double in this case)
=====================================================

As seen above, the differences are quite large, and for other examples they may be considerably larger. First of all, it is important to notice that the worst thing we can do is to try to help Jacket by casting both constants to the GPU: this takes about 60% more time than the fastest solution (13.639 s versus 8.508 s). That is worth considering. Hybrid computations (mixed GPU and CPU constants) are also bad, as expected, taking roughly 32% to 37% longer than the fastest solution. The fastest options are to keep the scalars as CPU single types or, slightly better, to let Jacket decide what is best.
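The relative overhead of each configuration can be recomputed directly from the timings in the results table. The snippet below is a small sketch in plain MATLAB (no Jacket required); the variable names labels, T and overhead are chosen here for illustration only.

```matlab
% Timings in seconds, copied from the results table above
labels = {'CPU/CPU', 'CPU/GPU', 'GPU/CPU', 'GPU/GPU', 'JKT/JKT'};
T      = [8.579, 11.252, 11.667, 13.639, 8.508];

% Extra time relative to the fastest configuration, in percent
overhead = 100 * (T / min(T) - 1);

% Print one line per configuration
for i = 1:numel(T)
    fprintf('%-8s %8.3f s   +%5.1f %%\n', labels{i}, T(i), overhead(i));
end
```

Measuring against the fastest run (JKT/JKT) rather than against CPU/CPU makes the cost of each manual cast directly comparable.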


Conclusions

The Jacket JIT (Just In Time) compiler is getting better and better at this type of computation. Therefore the best solution is to let Jacket decide what to do. The worst thing that can be done is to cast the scalars to the GPU. In the test example this took more than 60% more time than when Jacket decides what to do.



Go Home: Torben's Corner

