GPU Memory Transfer
From Jacket Wiki
Moving data between the CPU and the GPU is one of the most important issues when using Jacket for computations. The GPU can only operate on data that has been moved to it, and MATLAB can only postprocess results once they have been moved back from the GPU to the CPU. For high-performance computing, these transfers must also be fast.
Methodology and Code
The code to analyze the transfer rates is in principle quite simple. However, there are a few things to note:
- When transferring small arrays in particular, we are timing very short events, which is known to be quite inaccurate. The code therefore repeats each transfer so that, for example, 5 or 10 GB is moved in total (with a cap on the number of repetitions to avoid very long runs).
- Jacket is clever enough to transfer the same array only once (despite the use of geval), so the array must be modified slightly before each transfer to force a new one. Otherwise the results are grossly misleading.
- Warm-up is included to ensure reproducible results.
- Kernel activity is timed separately and subtracted to isolate the transfer and obtain a correct timing for it.
- The random arrays to be moved are generated so that they are exactly the same in repeated runs. This is done by resetting the state of the random number generator.
The data transfer test is implemented in the function file HostDevice.m, which has the following content:
function [] = HostDevice( Fname, GPU, Size, TotalTransfer )
% HostDevice Jacket benchmarking measuring transfer rates between CPU and GPU memory
%
% MEASUREMENT OF TRANSFER RATES CPU-GPU and GPU-CPU - SINGLE PRECISION VECTOR
% Measures the transfer rates for moving arrays from the CPU to the GPU and the other
% way from GPU to CPU. Advanced benchmarking using repeated measurements combined
% with modification of the array just before transfer means that Jacket can't
% apply tricks to artificially increase the transfer rates. Also the code to actually
% perform the transfer is isolated such that only the transfer and nothing else is
% measured. The function creates two figures showing transfer rates and transfer time
% versus vector size. Also a .mat file is saved with all the needed data.
%
% INPUT:
%   Fname:         String containing the filename in which data is stored.
%   GPU:           String containing a short name of the GPU used.
%   Size:          Vector containing the vector sizes to be measured.
%   TotalTransfer: Total amount of data transferred to ensure convergence.
%
% OUTPUT:
%   The function creates a number of files. If the master file calling HostDevice
%   is named "master_HostDevice_GPU_xx.m" then the following files are created:
%
%   master_HostDevice_GPU_xx_time.eps: EPSF figure of transfer time versus vector size.
%   master_HostDevice_GPU_xx_time.jpg: JPEG figure of transfer time versus vector size.
%   master_HostDevice_GPU_xx_rate.eps: EPSF figure of transfer rate versus vector size.
%   master_HostDevice_GPU_xx_rate.jpg: JPEG figure of transfer rate versus vector size.
%   master_HostDevice_GPU_xx.mat:      Data file for vector sizes and results.
%
%   The files are saved in a folder named "benchmarks/GPU" where GPU is the identifier
%   mentioned previously.
%
% EXAMPLE:
%   A master file would typically be created - for example as:
%
%     % GPU name
%     GPU = 'C1060';
%
%     % Array sizes to be analyzed
%     Size = [1:1:9,10:10:250]*1E5;
%
%     % Total amount of data transferred to ensure convergence of results
%     TotalTransfer = 2E9;
%
%     % DO NOT CHANGE - CODE EXECUTION
%     HostDevice( mfilename, GPU, Size, TotalTransfer );
%
%   Since this script reads its own filename, it is important to choose a filename that
%   makes sense. For example use something like: "master_HostDevice_C1060_01.m". The output
%   files created by HostDevice are then named according to that.
%
%
% CREATED BY:
%   Professor Torben Larsen, Aalborg University (tl.jacket@es.aau.dk), 05-DEC-2010
%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% PREPARE MEASUREMENTS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Initialize data (4 bytes per single-precision element)
Size_bytes = 4*Size;
c2g_rate = zeros(length(Size),1);
g2c_rate = zeros(length(Size),1);
c2g_time = zeros(length(Size),1);
g2c_time = zeros(length(Size),1);

% Create folder
if ~exist(['./benchmarks/' GPU],'dir')
    mkdir(['./benchmarks/' GPU]);
end

% Loop over the different array sizes
no = 0;
for sz=Size
    % Update index number
    no = no + 1;

    % Print progress
    fprintf('VecTransfer: %4.0f of %4.0f: ', no, length(Size));

    % Determine averaging factor such that a minimum amount of data is moved.
    % Here TotalTransfer bytes are moved in total - however done such that
    % there is a certain max for the number of repetitions.
    % Earlier alternatives:
    % E = min(round(1E3*max(Size)/sz),1E3);
    % E = min(ceil(TotalTransfer/(4*sz)),1E3);
    E = min(ceil(TotalTransfer/(4*sz)),1E5);

    % Reference vector (generator is reset so repeated runs use identical data)
    reset(RandStream.getDefaultStream);
    a = rand(sz,1,'single');
    a_ = gsingle(a);
    geval(a_);

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %% MOVING DATA FROM CPU TO GPU
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    % Compensation: warm-up and measure time
    for e=1:E, a(1)=e; a; end
    tic; for e=1:E, a(1)=e; a; end; tvec1=toc;

    % Measurement: warm-up and measure time
    for e=1:E, a(1)=e; geval(gsingle(a)); end
    gsync; tic; for e=1:E, a(1)=e; geval(gsingle(a)); end; gsync; tend1=toc;

    % Memory transfer and print result
    c2g_time(no) = (tend1-tvec1)/E;
    c2g_rate(no) = 4*sz/c2g_time(no);
    fprintf(' C>G: %6.3f [GB/s],', c2g_rate(no)/1E9);

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %% MOVING DATA FROM GPU TO CPU
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    % Compensation: warm-up and measure time
    for e=1:E, a_(1)=e; geval(a_); end
    gsync; tic; for e=1:E, a_(1)=e; geval(a_); end; gsync; tvec2=toc;

    % Measurement: warm-up and measure time
    for e=1:E, a_(1)=e; single(a_); end
    gsync; tic; for e=1:E, a_(1)=e; single(a_); end; gsync; tend2=toc;

    % Memory transfer and print result
    g2c_time(no) = (tend2-tvec2)/E;
    g2c_rate(no) = 4*sz/g2c_time(no);
    fprintf(' G>C: %6.3f [GB/s]\n', g2c_rate(no)/1E9);

    %% PLOT DATA EVERY NOW AND THEN
    if rem(no,10)==0
        pltfct(Fname, GPU, Size_bytes(1:no), c2g_rate(1:no), ...
            g2c_rate(1:no), c2g_time(1:no), g2c_time(1:no), TotalTransfer);
    end

    % Release GPU memory
    clear gpu_hook;
end

%% MAKE THE FINAL PLOTS AND SAVE DATA
pltfct(Fname, GPU, Size_bytes, c2g_rate, g2c_rate, c2g_time, g2c_time, TotalTransfer);
end

function [] = pltfct(Fname, GPU, Size_bytes, c2g_rate, g2c_rate, c2g_time, g2c_time, TotalTransfer)
% Figure 1: transfer rate versus array size
figure(1);
plot(Size_bytes/1E6, c2g_rate/1E9, 'b', ...
     Size_bytes/1E6, g2c_rate/1E9, 'r');
grid;
legend('CPU>GPU', 'GPU>CPU');
title(['Memory transfer rate between CPU and GPU (in total moved: ' ...
    num2str(TotalTransfer/1E9) ' GB)']);
xlabel('Array size [MB]');
ylabel('Throughput [GB/s]');
print( gcf, '-depsc2', ['./benchmarks/' GPU '/' Fname '_rate.eps'] );
print( gcf, '-djpeg80', ['./benchmarks/' GPU '/' Fname '_rate.jpg'] );

% Figure 2: transfer time versus array size
figure(2);
plot(Size_bytes/1E6, c2g_time*1E3, 'b', ...
     Size_bytes/1E6, g2c_time*1E3, 'r');
grid;
legend('CPU>GPU', 'GPU>CPU');
title(['Transfer time between CPU and GPU (in total moved: ' ...
    num2str(TotalTransfer/1E9) ' GB)']);
xlabel('Array size [MB]');
ylabel('Transfer time [ms]');
print( gcf, '-depsc2', ['./benchmarks/' GPU '/' Fname '_time.eps'] );
print( gcf, '-djpeg80', ['./benchmarks/' GPU '/' Fname '_time.jpg'] );

% Save data
save(['./benchmarks/' GPU '/' Fname '.mat'], 'Size_bytes', 'c2g_rate', 'g2c_rate', ...
    'c2g_time', 'g2c_time', 'TotalTransfer');
end
A master script is then used to call this function. The master script defines the array sizes to be transferred, the GPU name, and the total amount of data to move. The script file should be named master_HostDevice_GPU_xx.m where xx is a running identifier (e.g. 01).
%% ----------------------------------------------------------------------------------
% Platform: Colfax CXT2000i
% ----------------------------------------------------------------------------------
% CPU: Intel Core i7, 3.33 GHz
% GPU: NVIDIA GeForce GTX580
% Jacket: 1.6.0 (build 9686)
% MATLAB: 7.11.0.584 (R2010b)
% NVIDIA driver: 263.06
% CUDA Toolkit: 3.1
% ----------------------------------------------------------------------------------

% GPU name
GPU = 'GTX580';

% Array sizes to be analyzed
Size = [1:1:9,10:10:2000,2025,2050,2100:100:25000]*1E3/4;

% Total amount of data transferred to ensure convergence of results
TotalTransfer = 20E9;

% DO NOT CHANGE - CODE EXECUTION
HostDevice( mfilename, GPU, Size, TotalTransfer );
GPU is a string with the name of the tested GPU, Size defines the array sizes to be tested (in single-precision elements), and TotalTransfer sets the total amount of data in bytes moved for each array size (capped at 100,000 repetitions).
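To make the repetition rule concrete, the following sketch evaluates the expression for E from HostDevice.m for three of the array sizes in the master script above. It runs in plain MATLAB without Jacket. The variable names bytesPerTransfer and movedGB are introduced here for illustration only and do not appear in the original code.

```matlab
% Sketch: number of repetitions used by HostDevice.m for a few array sizes.
TotalTransfer = 20E9;                  % total bytes per array size (from the master script)
sz = [250, 2.5E5, 6.25E6];             % elements: smallest, a mid-range, and largest size
bytesPerTransfer = 4*sz;               % 4 bytes per single-precision element

% Same rule as in HostDevice.m, vectorized over the sizes
E = min(ceil(TotalTransfer./bytesPerTransfer), 1E5);

% Data actually moved in one timed loop (illustrative helper variable)
movedGB = E.*bytesPerTransfer/1E9;

fprintf('%10.0f bytes: E = %6.0f, moved = %6.2f GB\n', [bytesPerTransfer; E; movedGB]);
```

With TotalTransfer = 20E9 the cap of 1E5 repetitions takes effect for arrays smaller than 200 kB, so for those sizes less than the full 20 GB is moved.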
Some Results
Figs. 1 and 2 below show some measured results for a Colfax CXT2000i computer platform. This has an Asus P6T7 Supercomputer motherboard, an Intel Core i7-975 Extreme CPU, 6 x 2 GB DDR3 memory (1333 MHz), and NVIDIA Quadro 4000 and NVIDIA Tesla C2050 GPUs. The Asus motherboard has room for seven x16-size cards; however, only four of the slots actually have x16 speed. Besides the GPUs mentioned already, I have tested with GeForce GTX465, GeForce GTX580, Quadro FX3800, and Tesla C1060. In all cases, when using PCI-E v2 x16, I saw approximately the same performance for array sizes above approximately 5 MB:
- PCI-E2 x16; Host (CPU) -> Device (GPU): 3.7-3.8 GB/s.
- PCI-E2 x16; Device (GPU) -> Host (CPU): 2.4-2.5 GB/s.
When only x8 speed is available on the PCI-E bus, the result shown in Fig. 2 is obtained. For the x8 bus speed I measured:
- PCI-E2 x8; Host (CPU) -> Device (GPU): 2.4-2.5 GB/s.
- PCI-E2 x8; Device (GPU) -> Host (CPU): 1.9-2.0 GB/s.
The initialization cost was approximately 85 microseconds regardless of array size. Beyond this initialization time, the transfer time scaled linearly with array size.
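The fixed cost plus linear scaling can be illustrated with a simple model, t(n) = t0 + n/B, where n is the array size in bytes. This is a sketch under stated assumptions, not a fit to the measured data: t0 is taken as the 85 microseconds quoted above, and B = 3.75 GB/s is assumed as the midpoint of the host-to-device x16 range. The names t0, B, n and nHalf are chosen here for illustration.

```matlab
% Sketch: effective throughput under an assumed model t(n) = t0 + n/B.
t0 = 85E-6;                            % fixed initialization cost [s] (from the text)
B  = 3.75E9;                           % assumed asymptotic rate [bytes/s], midpoint of 3.7-3.8 GB/s
n  = [1E3, 1E4, 1E5, 1E6, 5E6, 25E6];  % array sizes [bytes]

t    = t0 + n/B;                       % modeled transfer time [s]
rate = n./t;                           % modeled effective throughput [bytes/s]

% Array size at which the effective rate is half of B
nHalf = t0*B;

fprintf('%10.0f bytes: %7.3f ms, %6.3f GB/s\n', [n; t*1E3; rate/1E9]);
fprintf('Half of the asymptotic rate is reached at about %.0f kB\n', nHalf/1E3);
```

Under these assumptions the effective rate reaches half of B at roughly 319 kB and about 94 % of B at 5 MB, which is in line with the plateau observed for array sizes above approximately 5 MB.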