Programming With Multiple Heterogeneous GPUs

From Jacket Wiki

Although not officially supported by AccelerEyes, using multiple heterogeneous GPUs is of interest both to users who have several different GPUs and want to apply all of them to a given problem, and to anyone who wants to understand how such a setup is programmed. In the following we use MATLAB's SPMD construct to put the GPUs to work.


The Problem

Consider a typical class of parallel problems in which an input data matrix \mathbf{A} is processed column by column, each column independently of the others. That is, we have:



  \quad \mathbf{A} \; = \; [\mathbf{a}_{1}, \ldots , \mathbf{a}_{M}] \quad \in \quad \mathbf{R}^{N \times M}


where:



  \quad \mathbf{a}_m \; = \; [a_{1,m}, \ldots , a_{N,m}]^{\rm T} \quad \in \quad \mathbf{R}^{N \times 1}


We map a function f(\cdot) onto the matrix \mathbf{A} such that:



  \quad f(\mathbf{A}) \; = \; g([h(\mathbf{a}_{1}), \ldots , h(\mathbf{a}_{M})])


This means that f(\cdot) can be split into a function h(\cdot), which works on the individual columns, and a function g(\cdot), which post-processes the results further if needed. The tasks h(\mathbf{a}_{1}), \ldots , h(\mathbf{a}_{M}) can therefore easily be distributed onto different GPUs, and only g(\cdot) has to be processed sequentially.
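The decomposition can be sketched in a few lines. This is a minimal illustration in plain Python (chosen so that it runs without MATLAB or Jacket); the functions h_example and g_example and the tiny matrix are assumptions made up for the example, not part of the benchmark on this page. It shows that doing the per-column work in two independent parts gives the same result as doing it in one pass.

```python
def apply_f(columns, h, g):
    """Apply h to each column independently, then combine the results with g."""
    return g([h(col) for col in columns])


def apply_f_split(columns, h, g, split):
    """Same result, but the per-column work is done in two independent parts,
    as it would be on two GPUs."""
    part1 = [h(col) for col in columns[:split]]   # work for worker 1
    part2 = [h(col) for col in columns[split:]]   # work for worker 2
    return g(part1 + part2)                       # sequential post-processing


# Hypothetical example functions: h = sum of squares of a column, g = total.
def h_example(col):
    return sum(v * v for v in col)


g_example = sum

A_columns = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three columns, N = 2
whole = apply_f(A_columns, h_example, g_example)
split = apply_f_split(A_columns, h_example, g_example, split=2)
```

Because h never looks at more than one column, the split point can be chosen freely, which is what makes load balancing between unequal GPUs possible.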


A Case For Illustration

To illustrate the ideas, consider a specific case. As described above, a function f(\mathbf{A}) operates on the matrix \mathbf{A}. In the case chosen for illustration, this function does the following:

  1. Map a 20th-order power series with a set of different coefficients onto each column vector \mathbf{a}_{m}.
  2. Compute the FFT of each column vector.
  3. Compute the squared magnitude of every element of the transformed column vector.
  4. Sum all elements in each column vector and normalize by the squared length of the vector (N^2).

The function repeats the steps above 20 times and adds up the resulting values. The only reason for this is to increase the execution time to a relevant level: nobody would use multiple GPUs for a computation that takes just a few seconds.

The core function, which in the following is named computefun, is then made as:

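function [R] = computefun( a, k1, k2 )
% a: one column of the input matrix; k1, k2: scalar coefficients
for rpt=1:20
    % Step 1: 20th-order power series applied element-wise to the column
    Rx = k2*a - k1.^2*a.^2 ...
         + k2.^2*a.^3 - k1.^3*a.^4 ...
         + k2.^4*a.^5 - k1.^5*a.^6 ...
         + k2.^7*a.^7 - k1.^8*a.^8 ...
         + k2.^9*a.^9 - k1.^10*a.^10 ...
         + k2.^11*a.^11 - k1.^12*a.^12 ...
         + k2.^13*a.^13 - k1.^14*a.^14 ...
         + k2.^15*a.^15 - k1.^16*a.^16 ...
         + k2.^17*a.^17 - k1.^18*a.^18 ...
         + k2.^19*a.^19 - k1.^20*a.^20;
    % Steps 2-4: FFT, squared magnitude, sum, normalize by squared length
    Rt(rpt) = sum(abs(fft(Rx)).^2)/(length(a)^2);
end
% Add up the 20 repetitions
R = sum(Rt);
end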
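To make the four steps concrete, here is a small plain-Python port of one repetition of computefun for a single column. It is only a sketch under stated assumptions: a naive O(N^2) DFT stands in for MATLAB's fft, the four-element column is made up, and the helper names are hypothetical. As a sanity check it compares the result with Parseval's relation for an unnormalized DFT, which says that the sum of the squared magnitudes after the transform equals N times the sum of the squares before it.

```python
import cmath
import math

# Exponent applied to the coefficient for the powers a^1 .. a^20,
# copied term by term from computefun (odd powers use k2, even powers k1).
COEFF_EXP = [1, 2, 2, 3, 4, 5, 7, 8, 9, 10,
             11, 12, 13, 14, 15, 16, 17, 18, 19, 20]


def power_series(x, k1, k2):
    """Step 1: evaluate the 20th-order series for one element."""
    total = 0.0
    for power, ce in enumerate(COEFF_EXP, start=1):
        if power % 2 == 1:
            total += (k2 ** ce) * x ** power   # odd powers are added
        else:
            total -= (k1 ** ce) * x ** power   # even powers are subtracted
    return total


def naive_dft(x):
    """Step 2: unnormalized DFT (same convention as MATLAB's fft), O(N^2)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def one_pass(column, k1, k2):
    """Steps 1-4 for one column and one repetition."""
    rx = [power_series(v, k1, k2) for v in column]
    n = len(column)
    return sum(abs(z) ** 2 for z in naive_dft(rx)) / n ** 2


column = [0.1, -0.5, 0.3, 0.8]        # hypothetical tiny column, N = 4
k1, k2 = math.pi / 4, math.pi / 5     # same constants as the master program
value = one_pass(column, k1, k2)

# Parseval cross-check: sum |X_k|^2 = N * sum |x_n|^2 for an unnormalized DFT.
rx = [power_series(v, k1, k2) for v in column]
parseval = sum(v * v for v in rx) / len(column)
```

Since every one of the 20 repetitions in computefun works on the same input, each repetition returns the same value, so the repetitions only add run time.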


Handling the Heterogeneous Environment

By a heterogeneous environment we mean two or more GPUs of at least two different types (differing in GPU chip, in memory, or in something else). For simplicity, and since the principles are the same anyway, assume that we have two different GPUs. Jacket refers to GPUs as GPU0, ..., GPUX, where X+1 is the number of GPUs. They are generally ordered such that the most powerful GPU has the lowest number and the weakest GPU has the highest number.

For the setup described above we can use a simple model for the time t_x needed to do the computation on GPU number x. Here the GPUs are numbered from 1, so that GPU number 1 is Jacket's GPU0 and GPU number 2 is Jacket's GPU1:



\quad t_1 \; = \; t_{\rm init} \; + \; q \cdot T_1



\quad t_2 \; = \; t_{\rm init} \; + \; (1-q) \cdot T_2


where t_{\rm init} is the time needed to set up the parallel execution environment, transfer the (likely large) matrix to the GPU, etc., q is the fraction of the workload assigned to GPU number 1 (0 \leq q \leq 1), and T_x is the time GPU number x needs to do the entire computation alone, excluding the initialization time. To utilize the two GPUs as well as possible, we need to determine the value of q that solves:



\quad \min_q \max\left\{ t_1 , t_2 \right\}


Say we number the GPUs such that T_2 \geq T_1, and write T_2 = \mu T_1. In this case we need to solve



\quad \min_q \max\left\{ t_{\rm init} \; + \; q \cdot T_1 , t_{\rm init} \; + \; (1-q) \cdot \mu \cdot T_1 \right\}


The optimum is reached when t_1 = t_2, since t_1 increases with q while t_2 decreases with q. The optimum loading of GPU number 1 is therefore:



\quad q = \frac{\mu}{\mu + 1}
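For completeness, the step from t_1 = t_2 to this expression can be written out as follows, using T_2 = \mu T_1 and assuming T_1 > 0:

```latex
\begin{align*}
t_{\rm init} + q \cdot T_1 &= t_{\rm init} + (1-q) \cdot \mu \cdot T_1 \\
q &= (1-q) \cdot \mu \\
q \, (1 + \mu) &= \mu \\
q &= \frac{\mu}{\mu + 1}
\end{align*}
```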


As an example, an Apple MacBook Pro with an NVIDIA GeForce 9400M and an NVIDIA GeForce 9600M GT has \mu \simeq 2.0-2.3 when the 9600M GT is used as GPU number 1. This leads to a loading of the faster 9600M GT of q \simeq 0.67-0.70.
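The model is small enough to check numerically. The following plain-Python sketch (function names are hypothetical, and the values of t_init and T_1 passed to the grid search are placeholders) evaluates the closed-form optimum and confirms it with a brute-force scan of q over the model max{t_1, t_2}.

```python
def optimal_load(mu):
    """Closed-form optimum share of the work for GPU number 1."""
    return mu / (mu + 1.0)


def makespan(q, t_init, T1, mu):
    """Time until both GPUs are done, according to the simple model."""
    t1 = t_init + q * T1
    t2 = t_init + (1.0 - q) * mu * T1
    return max(t1, t2)


def grid_search(t_init, T1, mu, steps=1000):
    """Brute-force check: the q on a regular grid with the smallest makespan."""
    candidates = (i / steps for i in range(steps + 1))
    return min(candidates, key=lambda q: makespan(q, t_init, T1, mu))


q_low = optimal_load(2.0)    # lower end of the quoted range for mu
q_high = optimal_load(2.3)   # upper end of the quoted range for mu
q_grid = grid_search(t_init=3.0, T1=44.2, mu=2.0)
```

Note that t_init shifts both times equally and therefore does not affect the optimum q in this model; it only affects the achievable speed-up.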


The Parallelization Function

Having looked at the computation function, we now turn to the function that controls the computations. We need a construct that allows us to distribute the load between two (or more) GPUs. One such construct is SPMD (Single Program, Multiple Data). Its advantage is that it gives access to a labindex variable. For a pool of two parallel workers, labindex=1 corresponds to Jacket GPU number 0 (the faster of the two GPUs), and labindex=2 corresponds to Jacket GPU number 1 (the slower of the two GPUs).

One way to distribute the load between the GPUs is to define a variable, say LAB1LoadPct, which gives the share of the load, in percent, to be put on the GPU with labindex=1. The code can then look something like this:

function [ R ] = multibenchXspmd( Ain, k1, k2, noWorkers, LAB1LoadPct )
%multibenchXspmd Benchmark based on use of non-linear processing.
 
%% SET UP VARIABLES
if noWorkers==2 && strcmp(class(Ain),'gsingle')
    eLOW = round(LAB1LoadPct/100*size(Ain,2));
    R = gzeros(size(Ain,2),1,'single');
elseif noWorkers==2 && ~strcmp(class(Ain),'gsingle')
    eLOW = round(size(Ain,2)/2);
    R = zeros(size(Ain,2),1,'single');
else
    eLOW = size(Ain,2);
    R = zeros(size(Ain,2),1,'single');
end
 
 
%% PERFORM COMPUTATIONS
spmd;
    if labindex==1
        for e=1:eLOW
            R(e) = computefun(Ain(:,e), k1, k2);
        end
    else
        for e=eLOW+1:size(Ain,2)
            R(e) = computefun(Ain(:,e), k1, k2);
        end
    end
 
end
 
end

The important thing to notice here is that eLOW, which depends on LAB1LoadPct, is used to distribute the load between the two GPUs. The loading factor is passed in from a main program, which sets up the computations from the top level. Notice also that the class of the input matrix is used to decide how some variables are preallocated. This is an easy and convenient way to pass on the type of the variables.
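The index arithmetic behind eLOW can be tried out in isolation. The sketch below is plain Python, not Jacket code; the function names are hypothetical, and accepting a list of percentages for more than two workers is an assumption that goes beyond the two-worker function above. It uses 1-based inclusive column ranges to match the MATLAB loops, and rounds halves upward, which matches MATLAB's round for the non-negative values used here.

```python
import math


def round_half_up(x):
    """Round a non-negative number to the nearest integer, halves upward."""
    return int(math.floor(x + 0.5))


def column_ranges(n_cols, load_pcts):
    """Return one 1-based inclusive (first, last) column range per worker.

    load_pcts holds the load share of each worker in percent. A range with
    first > last is empty, like the MATLAB loop 'for e=1:0'.
    """
    ranges = []
    start = 1
    cumulative = 0.0
    for i, pct in enumerate(load_pcts):
        cumulative += pct
        if i == len(load_pcts) - 1:
            end = n_cols                      # last worker takes the rest
        else:
            end = round_half_up(cumulative / 100.0 * n_cols)
        ranges.append((start, end))
        start = end + 1
    return ranges


two_gpus = column_ranges(150, [70.0, 30.0])   # 150 columns as in the master program
all_on_first = column_ranges(150, [100.0, 0.0])
all_on_second = column_ranges(150, [0.0, 100.0])
```

With 150 columns, a step of 0.5 percentage points in the load moves the split by at most one column, which is the finest resolution this scheme offers.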


The Master Program

The master program sets up the computation with the correct input arrays, parameters, etc., and also makes the relevant calls to the multibenchXspmd function, which performs the actual parallel computations on the available GPUs. The master program, named master_multibenchXspmd.m, is as follows:

%% MASTER_MULTIBENCH - BENCHMARK SCRIPT TO TEST JACKET WITH MATLAB PCT
clear all;
 
% User must set number of columns and rows. Computations are done
% down columns to speed up computations. Some large values of number
% of rows and columns may cause a crash of the GPU due to
% insufficient amount of memory. It may be necessary to restart
% MATLAB, reduce the number of rows and/or columns, and then redo
% the test.
Rows = 2^17;
Cols = 150;
 
% Set number of workers
noWorkers = 2;
 
% Load percentage on LAB1. 69% on GPU with labindex=1 is about optimum
LAB1LoadPct = 0;
 
% Create reference data
Aref = randn(Rows,Cols);
k1 = pi/4;
k2 = pi/5;
 
save data.mat Rows Cols noWorkers LAB1LoadPct Aref k1 k2;
 
 
%% SINGLE WORKER GPU
%==========================================================================
if matlabpool('size') > 0
    matlabpool close force;
end
 
% Input matrix
A = gsingle(Aref);
 
% Warm-up run (result discarded), followed by the timed run
[~] = multibenchXspmd( A, k1, k2, 1 );
tstart = tic;
R_single_GPU = multibenchXspmd( A, k1, k2, 1 );
T_single_GPU = toc(tstart);
save data.mat T_single_GPU -append;
 
%% TWO WORKER GPU
%==========================================================================
isOpenCorr = matlabpool('size') == noWorkers;
if ~isOpenCorr,
    matlabpool close force
    matlabpool(noWorkers)
end
 
% Reference matrix
A = gsingle(Aref);
 
% Reference when GPU with labindex=1 is doing all
[~] = multibenchXspmd( A, k1, k2, 2, 100 );
spmd; ginfo; end
tstart = tic;
R_multi_GPU_ref = multibenchXspmd( A, k1, k2, 2, 100 );
T_multi_GPU_ref = toc(tstart);
save data.mat T_multi_GPU_ref -append;
 
% Perform test and clear reference matrices
[~] = multibenchXspmd( A, k1, k2, noWorkers, LAB1LoadPct );
spmd; ginfo; end
clear all;   load data.mat;
A = gsingle(Aref);
tstart = tic;
R_multi_GPU = multibenchXspmd( A, k1, k2, 2, LAB1LoadPct );
T_multi_GPU = toc(tstart);
save data.mat T_multi_GPU -append;
 
 
%% PRINT DATA
fprintf('=============================================\n');
fprintf('SPMD TEST\n');
fprintf('---------------------------------------------\n');
fprintf('LAB 1 Load Percentage:               %8.1f\n', LAB1LoadPct);
fprintf('=============================================\n');
strGPU1 = 'Single GPU      ->   GPU Time [s]:   %8.1f\n';
fprintf(strGPU1, T_single_GPU);
strGPU1 = '2 Workers Ref.: ->   GPU Time [s]:   %8.1f\n';
fprintf(strGPU1, T_multi_GPU_ref);
strGPU1 = '2 Workers:      ->   GPU Time [s]:   %8.1f\n';
fprintf(strGPU1, T_multi_GPU);
str = 'Speed-up; 1-GPU / M-GPU [-]:         %8.1f\n';
fprintf(str, T_single_GPU/T_multi_GPU);
str = 'Speed-up Ref.; 1-GPU / M-GPU [-]:    %8.1f\n';
fprintf(str, T_multi_GPU_ref/T_multi_GPU);
fprintf('=============================================\n');

The reference (Ref.) case is a two-worker environment in which 100% of the load is put on the faster GPU. This is the reference needed to check whether the predicted speed-up is achieved; in the output printed by the script it is labeled 2 Workers Ref. The user will obviously also want to see what the faster GPU can do when it is not burdened with the added overhead of a multi-worker setting; this is the Single GPU value. The value listed for 2 Workers is the time used when both GPUs are used, with the share LAB1LoadPct of the load put on the GPU with labindex=1.


Computational Equipment

The results shown later on this page were obtained on an Apple MacBook Pro with an Intel Core 2 Duo (2.66 GHz) CPU and NVIDIA GeForce 9400M and 9600M GT GPUs. Issuing the ginfo command on all labs gives:

>> spmd; ginfo; end
Lab 1: 
  AccelerEyes Jacket v1.3.0 (build 4162M)
  CUDA driver: 19.5.1f01, CUDA toolkit 3.0
  Memory: 0 CPU-used, 0 GPU-used, 342 GPU-free (in MB)
  License Type: Designated Computer
  License Features: jacket sdk mgl4 dla 
  Multi-GPU: Licensed for 4 GPUs
 
  Detected CUDA-capable GPUs:
  GPU0 GeForce 9600M GT, 1220 MHz, 511 MB VRAM, Compute 1.1 (single) (in use)
  GPU1 GeForce 9400M, 1074 MHz, 253 MB VRAM, Compute 1.1 (single)
 
Lab 2: 
  AccelerEyes Jacket v1.3.0 (build 4162M)
  CUDA driver: 19.5.1f01, CUDA toolkit 3.0
  Memory: 0 CPU-used, 0 GPU-used, 235 GPU-free (in MB)
  License Type: Designated Computer
  License Features: jacket sdk mgl4 dla 
  Multi-GPU: Licensed for 4 GPUs
 
  Detected CUDA-capable GPUs:
  GPU0 GeForce 9600M GT, 1220 MHz, 511 MB VRAM, Compute 1.1 (single)
  GPU1 GeForce 9400M, 1074 MHz, 253 MB VRAM, Compute 1.1 (single) (in use)


Initialization Time

As seen above, the initialization time must be determined. It covers the time needed to set up the spmd environment and to move the data to the workers. The code of the script T_init.m is the following:

matlabpool close force;
matlabpool(2);
Aref = randn(2^17, 300);
 
%% TIME TO INITIALIZE GPU0 (LABINDEX=1)
tic;
spmd
   if labindex==1, R = gsingle(Aref); end
end
toc
 
tic;
spmd
   if labindex==1, R = gsingle(Aref); end
end
toc
 
%% TIME TO INITIALIZE GPU1 (LABINDEX=2)
tic;
spmd
   if labindex==2, R = gsingle(Aref); end
end
toc
 
tic;
spmd
   if labindex==2, R = gsingle(Aref); end
end
toc


Running this script gives the initialization time. Remember that it changes with the array size.


Results

First, the initialization time t_{\rm init} is determined. This is done by running the code immediately above with the same array size as will be used in later benchmarking:

>> T_init
Sending a stop signal to all the labs ... stopped.
Did not find any pre-existing parallel jobs created by matlabpool.
 
Starting matlabpool using the 'local' configuration ... connected to 2 labs.
Elapsed time is 3.163369 seconds.
Elapsed time is 3.055415 seconds.
Elapsed time is 3.261742 seconds.
Elapsed time is 2.911093 seconds.

The first two times are for GPU0 and the last two are for GPU1. This means that t_{\rm init}\simeq 3.0 seconds. The next step is to run the master_multibenchXspmd.m script in two settings: one where the loading on GPU number 1 (labindex=1) is 100%, and one where the loading on GPU number 2 (labindex=2) is 100% (meaning that the loading on GPU number 1 is 0%). This provides the information needed to compute the relative performance \mu of the two GPUs and the optimum loading q. The result of these two runs is the following:

=============================================
SPMD TEST
---------------------------------------------
LAB 1 Load Percentage:                  100.0
=============================================
Single GPU      ->   GPU Time [s]:       45.6
2 Workers Ref.: ->   GPU Time [s]:       47.2
2 Workers:      ->   GPU Time [s]:       47.9
Speed-up; 1-GPU / M-GPU [-]:              1.0
Speed-up Ref.; 1-GPU / M-GPU [-]:         1.0
=============================================
 
 
=============================================
SPMD TEST
---------------------------------------------
LAB 1 Load Percentage:                    0.0
=============================================
Single GPU      ->   GPU Time [s]:       46.4
2 Workers Ref.: ->   GPU Time [s]:       47.0
2 Workers:      ->   GPU Time [s]:      107.6
Speed-up; 1-GPU / M-GPU [-]:              0.4
Speed-up Ref.; 1-GPU / M-GPU [-]:         0.4
=============================================

With q = 1 the time T_1 can be determined as T_1 = t_1 - t_{\rm init} = 47.2 - 3.0 = 44.2 seconds. Correspondingly, with q = 0 the time T_2 can be determined as T_2 = t_2 - t_{\rm init} = 107.6 - 3.0 = 104.6 seconds. The relative computational power of the 9600M GT and the 9400M is therefore \mu = T_2 / T_1 = 104.6 / 44.2 = 2.37.


The optimum value of q is therefore q = \frac{\mu}{\mu + 1} = \frac{2.37}{2.37 + 1} = 0.70, and this means that the minimum execution time when using both GPUs is predicted to be t_{\rm init} \; + \; q \cdot T_1 = 3.0 + 0.70 \cdot 44.2 \approx 34 seconds.
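The arithmetic of this section can be repeated with a few lines of plain Python. This is only a sketch: the function name is hypothetical, and the three input numbers are the measured values quoted above (t_init of 3.0 seconds, and the two-worker times with 100% load on GPU number 1 and on GPU number 2). Because q is not rounded before it is used, the predicted time may differ slightly from a hand calculation with rounded values.

```python
def load_balance(t_init, t1_full, t2_full):
    """Derive T1, T2, mu, the optimum q and the predicted two-GPU time.

    t1_full: measured time with 100% of the load on GPU number 1
    t2_full: measured time with 100% of the load on GPU number 2
    """
    T1 = t1_full - t_init          # pure computation time, GPU number 1
    T2 = t2_full - t_init          # pure computation time, GPU number 2
    mu = T2 / T1                   # relative performance of the two GPUs
    q = mu / (mu + 1.0)            # optimum share for GPU number 1
    t_pred = t_init + q * T1       # predicted time with both GPUs
    return T1, T2, mu, q, t_pred


T1, T2, mu, q, t_pred = load_balance(3.0, 47.2, 107.6)
print("mu = %.2f, q = %.3f, predicted time = %.1f s" % (mu, q, t_pred))
```

At the optimum both GPUs finish at the same time, so evaluating t_init + (1 - q) * T2 gives the same predicted time.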


The following shows the results when the benchmark is run with a loading of GPU number 1 (labindex=1) of q = 0.70, and with loadings 0.5 percentage points below and above that:

=============================================
SPMD TEST
---------------------------------------------
LAB 1 Load Percentage:                   70.0
=============================================
Single GPU      ->   GPU Time [s]:       44.8
2 Workers Ref.: ->   GPU Time [s]:       45.7
2 Workers:      ->   GPU Time [s]:       34.9
Speed-up; 1-GPU / M-GPU [-]:              1.3
Speed-up Ref.; 1-GPU / M-GPU [-]:         1.3
=============================================
 
 
=============================================
SPMD TEST
---------------------------------------------
LAB 1 Load Percentage:                   69.5
=============================================
Single GPU      ->   GPU Time [s]:       44.9
2 Workers Ref.: ->   GPU Time [s]:       45.7
2 Workers:      ->   GPU Time [s]:       35.6
Speed-up; 1-GPU / M-GPU [-]:              1.3
Speed-up Ref.; 1-GPU / M-GPU [-]:         1.3
=============================================
 
 
=============================================
SPMD TEST
---------------------------------------------
LAB 1 Load Percentage:                   70.5
=============================================
Single GPU      ->   GPU Time [s]:       44.9
2 Workers Ref.: ->   GPU Time [s]:       45.9
2 Workers:      ->   GPU Time [s]:       35.0
Speed-up; 1-GPU / M-GPU [-]:              1.3
Speed-up Ref.; 1-GPU / M-GPU [-]:         1.3
=============================================

As seen above, the computed q = 0.70 agrees quite well with the measurements. The measured computation time is about 35 seconds, against a predicted time of about 34 seconds. Better agreement can hardly be expected.
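The speed-up figures printed by the master script are plain ratios of the measured times. As a small check, the plain-Python sketch below (the function name is hypothetical) recomputes them from the values in the 70.0% table above.

```python
def speed_up(t_reference, t_parallel):
    """Ratio of a reference time to the two-GPU time."""
    return t_reference / t_parallel


# Times in seconds, taken from the table for LAB 1 Load Percentage 70.0
t_single_gpu = 44.8      # Single GPU
t_two_workers_ref = 45.7 # 2 Workers Ref.
t_two_workers = 34.9     # 2 Workers

s_single = speed_up(t_single_gpu, t_two_workers)
s_ref = speed_up(t_two_workers_ref, t_two_workers)
print("Speed-up vs single GPU: %.2f, vs two-worker reference: %.2f"
      % (s_single, s_ref))
```

Both ratios round to the 1.3 shown in the table; the second is slightly larger because the two-worker reference carries the extra overhead of the multi-worker setting.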


Conclusions

An example of how to use multiple heterogeneous GPUs to solve a specific task has been presented. The distribution of the workload between the GPUs has been considered and validated experimentally: the predicted execution time when using two GPUs was about 34 seconds, and the measured execution time was about 35 seconds.


Further Reading

  1. J. Dongarra, I. Foster, G. Fox, W. Gropp, K. Kennedy, L. Torczon, and A. White. Sourcebook of Parallel Computing. Elsevier, San Francisco, USA. 2005.



Go Home: Torben's Corner

