Programming With Multiple Heterogeneous GPUs
From Jacket Wiki
Although not officially supported by AccelerEyes, the use of multiple heterogeneous GPUs is still of interest, both for users who actually have several different GPUs and would like to use them together on a given problem, and for anyone who wants to understand how to program such a setup. In the following we use MATLAB's SPMD command to put the GPUs to work.
The Problem
Consider a typical class of parallel problems in which we work on an input data matrix A and process each column independently. This means that we have the following:
$$A = [a_1, a_2, \ldots, a_N]$$
where each a_n is a column vector of length M, for n = 1, ..., N.
We map a function F onto the matrix A such that:
$$F(A) = g\big(f(a_1), f(a_2), \ldots, f(a_N)\big)$$
meaning that we can split the function F into a function f working on the columns and a function g, which post-processes the data further if needed. This means that we can easily distribute the tasks f onto different GPUs, and 'just' have to process the function g sequentially.
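The decomposition above can be sketched in plain MATLAB, without Jacket or the Parallel Computing Toolbox. The column function f (here the mean square of a column) and the post-processing function g (here a plain sum) are hypothetical stand-ins chosen only for illustration; they are not the functions used later on this page.

```matlab
% Hypothetical stand-ins for the column function f and the post-processing g.
f = @(a) sum(a.^2) / numel(a);   % works on one column vector at a time
g = @(r) sum(r);                 % sequential post-processing of all column results

A = [1 2 3; 4 5 6];              % M = 2 rows, N = 3 columns
N = size(A, 2);

% Each f(A(:,n)) is independent of the others, so the iterations of this
% loop could be handed to different GPUs in any order.
r = zeros(1, N);
for n = 1:N
    r(n) = f(A(:, n));
end

F = g(r);                        % F(A) = g(f(a_1), ..., f(a_N))
```

Only the loop over the columns is a candidate for distribution; the call to g needs all column results and therefore runs once, after the loop.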
A Case For Illustration
Let's find a case that may serve to illustrate the ideas. As described above, we have a function F operating on the matrix A. In the specific case chosen for illustration, this function does the following:
- The first step is to map a power series of 20th order with some different coefficients onto each column vector.
- The next step is to compute an FFT of every column vector.
- Then the squared magnitude of every element in the column vector is computed.
- Finally, the elements of each column vector are summed and the sum is normalized to the squared length of the vector (M).
The function repeats the above 20 times and adds the results. The only purpose of the repetition is to increase the execution time to a level that is relevant for simulations; nobody would use multiple GPUs for execution times of just a few seconds.
The core function, which in the following is named computefun, is then written as:
function [R] = computefun( a, k1, k2 )
    for rpt=1:20
        Rx = k2*a - k1.^2*a.^2 ...
            + k2.^2*a.^3 - k1.^3*a.^4 ...
            + k2.^4*a.^5 - k1.^5*a.^6 ...
            + k2.^7*a.^7 - k1.^8*a.^8 ...
            + k2.^9*a.^9 - k1.^10*a.^10 ...
            + k2.^11*a.^11 - k1.^12*a.^12 ...
            + k2.^13*a.^13 - k1.^14*a.^14 ...
            + k2.^15*a.^15 - k1.^16*a.^16 ...
            + k2.^17*a.^17 - k1.^18*a.^18 ...
            + k2.^19*a.^19 - k1.^20*a.^20;
        Rt(rpt) = sum(abs(fft(Rx)).^2)/(length(a)^2);
    end
    R = sum(Rt);
end
Handling the Heterogeneous Environment
By a heterogeneous environment we mean two or more GPUs of at least two different types (differing in the GPU chip, in the amount of memory, or perhaps in something else). For simplicity, and since the principles are the same anyway, say that we have two different GPUs. Jacket refers to GPUs as GPU0, ..., GPUX, where X+1 is the number of GPUs. They are generally ordered such that the most powerful GPU has the lowest number and the weakest GPU has the highest number.
For the setup described above we can use a simple model for the time tx needed to do the computation on GPU number x:
$$t_1 = t_\mathrm{init} + q\,T_1, \qquad t_2 = t_\mathrm{init} + (1-q)\,T_2$$
where tinit is the time needed to set up the parallel execution environment, transfer the likely large matrix to the GPU etc., q is the fraction of the load placed on GPU number 1 (0 ≤ q ≤ 1), and Tx is the time needed for GPU number x to do the whole computation alone, excluding the initialization time. To utilize the two GPUs in the best possible way we need to determine the value of q that minimizes the function:
$$t(q) = \max(t_1, t_2)$$
Say we order the GPUs such that T1 ≤ T2, with T2 = μT1 and μ ≥ 1. In this case we need to solve for
$$\min_{q}\, \max\big(t_\mathrm{init} + q\,T_1,\; t_\mathrm{init} + (1-q)\,\mu T_1\big)$$
The optimum is reached when t1 = t2. This means that the optimum loading of GPU number 1 is:
$$q = \frac{\mu}{1+\mu}$$
As an example, for an Apple MacBook Pro with an NVIDIA GeForce 9400M and an NVIDIA GeForce 9600M GT, the measurements in the Results section give μ ≈ 2.37 when the 9600M GT is used as GPU number 1. This leads to a loading of the faster 9600M GT of q ≈ 0.70.
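For completeness, the step from t1 = t2 to the optimum loading can be written out. This is a sketch that assumes the linear timing model implied by the text, in which GPU number 1 receives the fraction q of the columns and GPU number 2 the remaining 1 − q; the symbol t_min for the resulting execution time is introduced here for illustration.

```latex
\begin{align*}
t_1 &= t_\mathrm{init} + q\,T_1, &
t_2 &= t_\mathrm{init} + (1-q)\,T_2 = t_\mathrm{init} + (1-q)\,\mu T_1 \\
t_1 = t_2 \;&\Rightarrow\; q\,T_1 = (1-q)\,\mu T_1 &
&\Rightarrow\; q = \frac{\mu}{1+\mu} \\
t_\mathrm{min} &= t_\mathrm{init} + \frac{\mu}{1+\mu}\,T_1
\end{align*}
```

Since t1 grows and t2 shrinks as q increases, the larger of the two is smallest where they cross. For two identical GPUs (μ = 1) this gives q = 1/2, as one would expect.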
The Parallelization Function
Above we have looked at the computation function; now the time has come to look at the function that controls the computations. We need a construct that allows us to distribute the load between two (or more) GPUs. One such construct is SPMD (Single Program Multiple Data). Its advantage is that we get access to a labindex variable. For a pool of two parallel workers, labindex=1 corresponds to Jacket GPU number 0 (the faster of the two GPUs), and labindex=2 corresponds to Jacket GPU number 1 (the slower of the two GPUs).
One way to distribute the load between the GPUs is to define a variable, say LAB1LoadPct, which describes the load in percent to be put on the GPU with labindex=1. The code can therefore be something like:
function [ R ] = multibenchXspmd( Ain, k1, k2, noWorkers, LAB1LoadPct )
%multibenchXspmd Benchmark based on use of non-linear processing.

    %% SET UP VARIABLES
    if noWorkers==2 && strcmp(class(Ain),'gsingle')
        eLOW = round(LAB1LoadPct/100*size(Ain,2));
        R = gzeros(size(Ain,2),1,'single');
    elseif noWorkers==2 && ~strcmp(class(Ain),'gsingle')
        eLOW = round(size(Ain,2)/2);
        R = zeros(size(Ain,2),1,'single');
    else
        eLOW = size(Ain,2);
        R = zeros(size(Ain,2),1,'single');
    end

    %% PERFORM COMPUTATIONS
    spmd;
        if labindex==1
            for e=1:eLOW
                R(e) = computefun(Ain(:,e), k1, k2);
            end
        else
            for e=eLOW+1:size(Ain,2)
                R(e) = computefun(Ain(:,e), k1, k2);
            end
        end
    end
end
The important thing to notice here is that eLOW, which depends on LAB1LoadPct, is used to distribute the load between the two GPUs. This loading factor is passed in from a main program, which sets up the computations from the top level. Notice also that the class of the input matrix is used to preallocate/define some variables. This is an easy and convenient way to pass on the type of the variables.
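The effect of eLOW on the column ranges can be checked in isolation with a few lines of plain MATLAB. This is a minimal sketch: the names colsLab1 and colsLab2 are hypothetical and do not appear in multibenchXspmd, and the load of 70% is just an example value.

```matlab
% Split of the column range for a given load percentage on labindex=1.
Cols        = 150;   % number of columns, as set in the master program
LAB1LoadPct = 70;    % example load on labindex=1 in percent (assumed value)

eLOW = round(LAB1LoadPct/100 * Cols);   % last column handled by labindex=1

colsLab1 = 1:eLOW;          % columns processed on the faster GPU
colsLab2 = eLOW+1:Cols;     % columns processed on the slower GPU

fprintf('labindex=1: %d columns, labindex=2: %d columns\n', ...
        numel(colsLab1), numel(colsLab2));
```

With these values labindex=1 gets 105 columns and labindex=2 gets 45. Setting LAB1LoadPct to 0 makes colsLab1 empty, so all columns go to labindex=2, and setting it to 100 makes colsLab2 empty.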
The Master Program
The master program sets up the computation with the correct input arrays, parameters etc., and also makes the relevant calls to the multibenchXspmd function, which performs the actual parallel computations on the available GPUs. The master program, named master_multibenchXspmd.m, is as follows:
%% MASTER_MULTIBENCH - BENCHMARK SCRIPT TO TEST JACKET WITH MATLAB PCT
clear all;

% User must set number of columns and rows. Computations are done
% down columns to speed up computations. Some large values of number
% of rows and columns may cause a crash of the GPU due to
% insufficient amount of memory. It may be necessary to restart
% MATLAB, reduce the number of rows and/or columns, and then redo
% the test.
Rows = 2^17;
Cols = 150;

% Set number of workers
noWorkers = 2;

% Load percentage on LAB1. 69% on GPU with labindex=1 is about optimum
LAB1LoadPct = 0;

% Create reference data
Aref = randn(Rows,Cols);
k1 = pi/4;
k2 = pi/5;
save data.mat Rows Cols noWorkers LAB1LoadPct Aref k1 k2;

%% SINGLE WORKER GPU
%==========================================================================
if matlabpool('size') > 0
    matlabpool close force;
end

% Input matrix
A = gsingle(Aref);

% Define GPU variables
[~] = multibenchXspmd( A, k1, k2, 1 );

tstart = tic;
R_single_GPU = multibenchXspmd( A, k1, k2, 1 );
T_single_GPU = toc(tstart);
save data.mat T_single_GPU -append;

%% TWO WORKER GPU
%==========================================================================
isOpenCorr = matlabpool('size') == noWorkers;
if ~isOpenCorr,
    matlabpool close force
    matlabpool(noWorkers)
end

% Reference matrix
A = gsingle(Aref);

% Reference when GPU with labindex=1 is doing all
[~] = multibenchXspmd( A, k1, k2, 2, 100 );
spmd; ginfo; end

tstart = tic;
R_multi_GPU_ref = multibenchXspmd( A, pi/3, pi/5, 2, 100 );
T_multi_GPU_ref = toc(tstart);
save data.mat T_multi_GPU_ref -append;

% Perform test and clear reference matrices
[~] = multibenchXspmd( A, k1, k2, noWorkers, LAB1LoadPct );
spmd; ginfo; end
clear all;
load data.mat;
A = gsingle(Aref);

tstart = tic;
R_multi_GPU = multibenchXspmd( A, pi/3, pi/5, 2, LAB1LoadPct );
T_multi_GPU = toc(tstart);
save data.mat T_multi_GPU -append;

%% PRINT DATA
fprintf('=============================================\n');
fprintf('SPMD TEST\n');
fprintf('---------------------------------------------\n');
fprintf('LAB 1 Load Percentage: %8.1f\n', LAB1LoadPct);
fprintf('=============================================\n');
strGPU1 = 'Single GPU -> GPU Time [s]: %8.1f\n';
fprintf(strGPU1, T_single_GPU);
strGPU1 = '2 Workers Ref.: -> GPU Time [s]: %8.1f\n';
fprintf(strGPU1, T_multi_GPU_ref);
strGPU1 = '2 Workers: -> GPU Time [s]: %8.1f\n';
fprintf(strGPU1, T_multi_GPU);
str = 'Speed-up; 1-GPU / M-GPU [-]: %8.1f\n';
fprintf(str, T_single_GPU/T_multi_GPU);
str = 'Speed-up Ref.; 1-GPU / M-GPU [-]: %8.1f\n';
fprintf(str, T_multi_GPU_ref/T_multi_GPU);
fprintf('=============================================\n');
The reference (Ref.) case is when a two-worker environment is defined and 100% of the load is put on the fastest GPU. This is the reference needed to check whether the predicted speed-up is achieved. In the printed output it is the line 2 Workers Ref. Obviously, the user also wants to see what the fastest GPU can do when it is not burdened with the added overhead of a multi-worker setting; this is the Single GPU value. The value listed for 2 Workers is the time used when both GPUs are used, with the loading LAB1LoadPct on the GPU with labindex=1.
Computational Equipment
The results shown later on this page were obtained on an Apple MacBook Pro with an Intel Core 2 Duo (2.66 GHz) CPU and NVIDIA GeForce 9400M and 9600M GT GPUs. Issuing the ginfo command on all labs gives:
>> spmd; ginfo; end
Lab 1:
  AccelerEyes Jacket v1.3.0 (build 4162M)
  CUDA driver: 19.5.1f01, CUDA toolkit 3.0
  Memory: 0 CPU-used, 0 GPU-used, 342 GPU-free (in MB)
  License Type: Designated Computer
  License Features: jacket sdk mgl4 dla
  Multi-GPU: Licensed for 4 GPUs
  Detected CUDA-capable GPUs:
  GPU0 GeForce 9600M GT, 1220 MHz, 511 MB VRAM, Compute 1.1 (single) (in use)
  GPU1 GeForce 9400M, 1074 MHz, 253 MB VRAM, Compute 1.1 (single)
Lab 2:
  AccelerEyes Jacket v1.3.0 (build 4162M)
  CUDA driver: 19.5.1f01, CUDA toolkit 3.0
  Memory: 0 CPU-used, 0 GPU-used, 235 GPU-free (in MB)
  License Type: Designated Computer
  License Features: jacket sdk mgl4 dla
  Multi-GPU: Licensed for 4 GPUs
  Detected CUDA-capable GPUs:
  GPU0 GeForce 9600M GT, 1220 MHz, 511 MB VRAM, Compute 1.1 (single)
  GPU1 GeForce 9400M, 1074 MHz, 253 MB VRAM, Compute 1.1 (single) (in use)
Initialization Time
As seen above, the initialization time must be determined. It covers the time needed to set up the spmd environment and to move the data to the workers. The code for the script T_init.m is the following:
matlabpool close force;
matlabpool(2);
Aref = randn(2^17, 300);

%% TIME TO INITIALIZE GPU0 (LABINDEX=1)
tic;
spmd
    if labindex==1,
        R = gsingle(Aref);
    end
end
toc
tic;
spmd
    if labindex==1,
        R = gsingle(Aref);
    end
end
toc

%% TIME TO INITIALIZE GPU1 (LABINDEX=2)
tic;
spmd
    if labindex==2,
        R = gsingle(Aref);
    end
end
toc
tic;
spmd
    if labindex==2,
        R = gsingle(Aref);
    end
end
toc
Running this code provides the initialization time - remember that this changes with the array size.
Results
First, the initialization time tinit is determined. This is done by running the code immediately above with the same array size as will be used in later benchmarking:
>> T_init
Sending a stop signal to all the labs ... stopped.
Did not find any pre-existing parallel jobs created by matlabpool.
Starting matlabpool using the 'local' configuration ... connected to 2 labs.
Elapsed time is 3.163369 seconds.
Elapsed time is 3.055415 seconds.
Elapsed time is 3.261742 seconds.
Elapsed time is 2.911093 seconds.
The first two times are for GPU0 and the last two are for GPU1. This means that tinit ≈ 3.0 seconds. The next thing to do is to run the master_multibenchXspmd.m script in two settings: one where the loading on GPU number 1 (labindex=1) is 100%, and one where the loading on GPU number 2 (labindex=2) is 100% (meaning that the loading on GPU number 1 is 0%). This provides the information needed to compute the relative performance μ of the two GPUs and the optimum loading q. The result of these two runs is the following:
=============================================
SPMD TEST
---------------------------------------------
LAB 1 Load Percentage: 100.0
=============================================
Single GPU -> GPU Time [s]: 45.6
2 Workers Ref.: -> GPU Time [s]: 47.2
2 Workers: -> GPU Time [s]: 47.9
Speed-up; 1-GPU / M-GPU [-]: 1.0
Speed-up Ref.; 1-GPU / M-GPU [-]: 1.0
=============================================

=============================================
SPMD TEST
---------------------------------------------
LAB 1 Load Percentage: 0.0
=============================================
Single GPU -> GPU Time [s]: 46.4
2 Workers Ref.: -> GPU Time [s]: 47.0
2 Workers: -> GPU Time [s]: 107.6
Speed-up; 1-GPU / M-GPU [-]: 0.4
Speed-up Ref.; 1-GPU / M-GPU [-]: 0.4
=============================================
With q = 1 the time T1 can be determined as T1 = t1 − tinit = 47.2 − 3.0 = 44.2. Correspondingly, with q = 0 the time T2 can be determined as T2 = t2 − tinit = 107.6 − 3.0 = 104.6. The relative computational power of the 9600M GT and the 9400M is therefore μ = T2 / T1 = 104.6 / 44.2 = 2.37.
The optimum value of q is therefore q = μ / (1 + μ) = 2.37 / 3.37 ≈ 0.70, and this means that the minimum execution time when using both GPUs can be computed to approximately 34.3 seconds.
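The arithmetic of this section can be retraced with a short plain-MATLAB sketch. It assumes the linear model t1 = tinit + q·T1 and t2 = tinit + (1 − q)·T2 and uses the rounded measurements quoted in the text; the variable names are chosen here for illustration.

```matlab
% Measured values quoted in the text (all in seconds).
t_init = 3.0;      % initialization time
t1_q1  = 47.2;     % two-worker run with q = 1 (all columns on GPU number 1)
t2_q0  = 107.6;    % two-worker run with q = 0 (all columns on GPU number 2)

T1 = t1_q1 - t_init;        % compute-only time of GPU number 1
T2 = t2_q0 - t_init;        % compute-only time of GPU number 2
mu = T2 / T1;               % relative performance of the two GPUs

q_opt = mu / (1 + mu);      % loading of GPU number 1 that gives t1 = t2
t_min = t_init + q_opt*T1;  % predicted execution time at that loading

fprintf('mu = %.2f, q = %.3f, t_min = %.1f s\n', mu, q_opt, t_min);
```

With these rounded inputs the sketch gives μ ≈ 2.37, q ≈ 0.703 and a predicted time of about 34.1 seconds, slightly below the 34.3 seconds quoted in the text; the difference is presumably due to rounding of the inputs or a slightly different value of tinit.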
The following then shows the result when the benchmark is run with a loading of GPU number 1 of q = 0.70, and with 0.5 percentage points above and below that:
=============================================
SPMD TEST
---------------------------------------------
LAB 1 Load Percentage: 70.0
=============================================
Single GPU -> GPU Time [s]: 44.8
2 Workers Ref.: -> GPU Time [s]: 45.7
2 Workers: -> GPU Time [s]: 34.9
Speed-up; 1-GPU / M-GPU [-]: 1.3
Speed-up Ref.; 1-GPU / M-GPU [-]: 1.3
=============================================

=============================================
SPMD TEST
---------------------------------------------
LAB 1 Load Percentage: 69.5
=============================================
Single GPU -> GPU Time [s]: 44.9
2 Workers Ref.: -> GPU Time [s]: 45.7
2 Workers: -> GPU Time [s]: 35.6
Speed-up; 1-GPU / M-GPU [-]: 1.3
Speed-up Ref.; 1-GPU / M-GPU [-]: 1.3
=============================================

=============================================
SPMD TEST
---------------------------------------------
LAB 1 Load Percentage: 70.5
=============================================
Single GPU -> GPU Time [s]: 44.9
2 Workers Ref.: -> GPU Time [s]: 45.9
2 Workers: -> GPU Time [s]: 35.0
Speed-up; 1-GPU / M-GPU [-]: 1.3
Speed-up Ref.; 1-GPU / M-GPU [-]: 1.3
=============================================
As seen above, the computed q = 0.70 agrees quite well with the measurements. The measured computation time is about 35.0 seconds, compared with the 34.3 seconds computed earlier. A better agreement can hardly be expected.
Conclusions
An example of how to use multiple heterogeneous GPUs to solve a specific task has been presented. The distribution of the workload between the GPUs has been considered, and experimental validation has been provided: the computed execution time when using two GPUs was 34.3 seconds, and the experiment showed an execution time of 35.0 seconds.
Further Reading
- J. Dongarra, I. Foster, G. Fox, W. Gropp, K. Kennedy, L. Torczon, and A. White. Sourcebook of Parallel Computing. Elsevier, San Francisco, USA. 2005.