Is Jacket Column or Row Major?

From Jacket Wiki

The question itself is quickly answered: Jacket is column major, just like MATLAB. The more interesting question is how much operations on columns differ in speed from operations on rows. A small example sheds some light on this. The benchmark below is based on SUM computations on a large 7000-by-7000 matrix, held once as a MATLAB array on the CPU and once as a single-precision copy on the GPU. On both the CPU and the GPU, the SUM function is used to compute the sums across rows and down columns.
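As a small illustration of what column major means, the sketch below uses plain MATLAB only (no Jacket calls) on a tiny matrix. The variable names are chosen for this example and do not come from the benchmark.

```matlab
% Column-major storage: the elements of each column are adjacent in memory.
A = [1 2 3;
     4 5 6];                  % 2-by-3 matrix

% A(:) lists the elements in storage order, i.e. column by column.
storage_order = A(:).';       % 1 4 2 5 3 6

% The linear index of element (r,c) in a matrix with R rows is r + (c-1)*R.
R = size(A,1);
r = 2; c = 3;
lin = r + (c-1)*R;            % 2 + 2*2 = 6
same_element = (A(lin) == A(r,c));   % true

% sum(A,1) walks down each column (contiguous elements),
% sum(A,2) walks along each row (elements R apart in memory).
col_sums = sum(A,1);          % [5 7 9]
row_sums = sum(A,2);          % [6; 15]
```

Summing down a column therefore touches neighbouring memory locations, while summing along a row jumps by one column length between elements.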


For comparison, the sums across rows are also computed a second way: take the transpose of the square matrix, do the column-oriented sum, and finally transpose the result. To get reliable timings, each computation is run 100 times, and the reported times are the totals for those 100 runs.
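The two ways of computing row sums give the same result, which can be checked on a small matrix. This is a minimal sketch in plain MATLAB; the names direct and via_transpose are illustrative only.

```matlab
% Small square test matrix.
A = [1 2 3;
     4 5 6;
     7 8 9];

% Direct sum across rows: a column vector with one entry per row.
direct = sum(A,2);                 % [6; 15; 24]

% Transpose, sum down the columns, transpose the result back.
via_transpose = sum(A.',1).';      % [6; 15; 24]

% Both approaches agree exactly for this integer-valued matrix.
agree = isequal(direct, via_transpose);
fprintf('Results agree: %d\n', agree);
```

For floating-point data the two results may differ in the last bits if the summation order differs, so a tolerance-based comparison is the safer check there.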


The source code for the timing is:

clear all;
 
%% Define matrix dimensions
Rows = 7000;
Cols = 7000;
 
 
%% Define reference matrix for CPU and GPU
Acpu = rand(Rows,Cols);
Agpu = gsingle(Acpu);
 
 
%% Here we do the sum down all columns to produce a row vector
% First the CPU
Rcpu_cols = sum(Acpu,1); % Warm up
tstart1 = tic;
for run=1:100
    Rcpu_cols = sum(Acpu,1);
end
Tcpu_cols = toc(tstart1);
 
% Then the GPU
Rgpu_cols = sum(Agpu,1); % Warm up
gforce(Agpu,Rgpu_cols); gforce;
tstart1 = tic;
for run=1:100
    Rgpu_cols = sum(Agpu,1);
    gforce(Rgpu_cols);
end
Tgpu_cols = toc(tstart1);
 
 
%% Here we sum along all rows to produce a column vector
% First the CPU
Rcpu_rows = sum(Acpu,2); % Warm up
tstart1 = tic;
for run=1:100
    Rcpu_rows = sum(Acpu,2);
end
Tcpu_rows = toc(tstart1);
 
% Then the GPU
Rgpu_rows = sum(Agpu,2); % Warm up
gforce(Agpu,Rgpu_rows); gforce;
tstart1 = tic;
for run=1:100
    Rgpu_rows = sum(Agpu,2);
    gforce(Rgpu_rows);
end
Tgpu_rows = toc(tstart1);
 
 
%% Use transpose and column summation to see if that improves things.
%  The matrix is square here, but the expression works for any shape.
% First the CPU
Rcpu_tcols = sum(Acpu.',1).'; % Warm up
tstart1 = tic;
for run=1:100
    Rcpu_tcols = sum(Acpu.',1).';
end
Tcpu_tcols = toc(tstart1);
 
% Then the GPU
Rgpu_tcols = sum(Agpu.',1).'; % Warm up
gforce(Agpu,Rgpu_tcols); gforce;
tstart1 = tic;
for run=1:100
    Rgpu_tcols = sum(Agpu.',1).';
    gforce(Rgpu_tcols);
end
Tgpu_tcols = toc(tstart1);
 
 
%% Print data
fprintf('Sum down columns:\n');
fprintf('>> CPU: %5.2f \n', Tcpu_cols);
fprintf('>> GPU: %5.2f \n\n', Tgpu_cols);
 
fprintf('Sum along rows:\n');
fprintf('>> CPU: %5.2f \n', Tcpu_rows);
fprintf('>> GPU: %5.2f \n\n', Tgpu_rows);
 
fprintf('Sum of transpose down columns and transposed again:\n');
fprintf('>> CPU: %5.2f \n', Tcpu_tcols);
fprintf('>> GPU: %5.2f \n\n', Tgpu_tcols);


The results on a Colfax Custom workstation based on an Intel Core i7 975 Extreme and an FX-3800 GPU are as follows (times in seconds for 100 runs):


>> matrix_sum
Sum down columns:
>> CPU:  1.90 
>> GPU:  0.59 
 
Sum along rows:
>> CPU:  2.17 
>> GPU: 17.27 
 
Sum of transpose down columns and transposed again:
>> CPU: 25.78 
>> GPU:  2.52


The computer system is Reference System #4, which is described in detail here. The results above are quite interesting. First, observe that for MATLAB on the CPU there is no big difference between computing the sum across rows and down columns. The column computations are faster (1.90 s versus 2.17 s), but not by much. First transposing, then using the column-oriented sum, and then transposing again is far slower (25.78 s), roughly 12 times the direct row sum. With such a small difference between column and row sums, the detour is not worth the effort on the CPU. Remember, though, that this has so far been tested only for sum (and a few other functions), so no general conclusion can be drawn yet.


For the GPU things are different. There is a huge difference between summing down columns and summing across rows, with the former being far faster: 0.59 s versus 17.27 s, a factor of around 30. For the square matrix tested here, it is also much faster (2.52 s) to compute row sums by first transposing the matrix, doing the column-oriented sum, and transposing the result. Note, though, that MATLAB's direct sum across rows on the CPU (2.17 s) is a bit faster than the transpose method on the GPU. The difference is so small that it is still better to keep the variables on the GPU if more GPU computations are to come.
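The ratios quoted in this discussion follow directly from the timings printed above. The sketch below redoes that arithmetic in plain MATLAB. The table T is copied from the benchmark output, and the variable names are illustrative only.

```matlab
% Reported timings in seconds for 100 calls, copied from the output above.
%        CPU      GPU
T = [   1.90     0.59 ;    % sum down columns
        2.17    17.27 ;    % sum along rows
       25.78     2.52 ];   % transpose, column sum, transpose

% Average time per call in milliseconds.
ms_per_call = T / 100 * 1000;

% GPU: row sum versus column sum (about 29.3).
gpu_rows_vs_cols = T(2,2) / T(1,2);

% GPU: direct row sum versus the transpose method (about 6.9).
gpu_rows_vs_transpose = T(2,2) / T(3,2);

% CPU: transpose method versus direct row sum (about 11.9).
cpu_transpose_vs_rows = T(3,1) / T(2,1);

% GPU transpose method versus CPU direct row sum (about 1.16).
gpu_transpose_vs_cpu_rows = T(3,2) / T(2,1);

fprintf('GPU rows / GPU columns:          %.1f\n', gpu_rows_vs_cols);
fprintf('GPU rows / GPU transpose method: %.1f\n', gpu_rows_vs_transpose);
fprintf('CPU transpose method / CPU rows: %.1f\n', cpu_transpose_vs_rows);
fprintf('GPU transpose method / CPU rows: %.2f\n', gpu_transpose_vs_cpu_rows);
```

These figures describe this one machine and this one matrix size, so they are best read as an indication rather than a general rule.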


The same code has been tested on a MacBook Pro with similar results.



Go Home: Torben's Corner

