Is Jacket Column or Row Major?
From Jacket Wiki
The question as such is quickly answered: Jacket is column major, just like MATLAB. The more interesting question is how much operations down columns differ in speed from operations across rows. A small example may shed some light on this. The benchmark below applies the SUM function to one large square matrix, stored both in CPU memory and on the GPU, and computes the sums down columns and across rows on each.
For comparison, a second way of computing the sums across rows is also timed: take the transpose of the square matrix, do the column-oriented sum, and then transpose the result again. To get reliable timings, each computation is repeated 100 times, and the reported figures are the totals for those 100 runs.
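As a small illustration of what column major means, the following sketch uses plain MATLAB only (no Jacket functions) on a made-up 2-by-3 matrix. The variable names are chosen just for this example. Linear indexing with `A(:)` walks the elements in storage order, so it shows that the entries of one column sit next to each other in memory, while the entries of one row are a full column length apart.

```matlab
% Plain MATLAB, no Jacket functions needed.
A = [1 2 3;
     4 5 6];                 % 2-by-3 example matrix

% A(:) lists the elements in storage order:
% down column 1, then down column 2, and so on.
linear_order = A(:).';       % 1 4 2 5 3 6

% Neighbouring elements of a column are 1 element apart in memory;
% neighbouring elements of a row are size(A,1) elements apart.
col_stride = 1;
row_stride = size(A, 1);     % 2 for this matrix

disp(linear_order);
fprintf('column stride: %d, row stride: %d\n', col_stride, row_stride);
```

This difference in stride is the usual reason why column-wise and row-wise operations can perform differently.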
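The two ways of computing row sums give the same result, which can be checked on a small square matrix. This is only a sketch in plain MATLAB; `magic(4)` is used as a convenient test matrix and is not part of the benchmark.

```matlab
A = magic(4);                           % small square test matrix

row_sums_direct    = sum(A, 2);         % sum across rows -> column vector
row_sums_transpose = sum(A.', 1).';     % transpose, sum down columns, transpose back

% Integer-valued entries, so an exact comparison is safe here.
assert(isequal(row_sums_direct, row_sums_transpose));
disp(row_sums_direct.');                % 34 34 34 34
```

Only the memory access pattern differs between the two variants, which is what the benchmark measures.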
The source code for the timing is:
    clear all;

    %% Define matrix dimensions
    Rows = 7000;
    Cols = 7000;

    %% Define reference matrix for CPU and GPU
    Acpu = rand(Rows,Cols);
    Agpu = gsingle(Acpu);   % single-precision copy on the GPU

    %% Here we do the sum down all columns to produce a row vector
    % First the CPU
    Rcpu_cols = sum(Acpu,1); % Warm up
    tstart1 = tic;
    for run=1:100
        Rcpu_cols = sum(Acpu,1);
    end
    Tcpu_cols = toc(tstart1);

    % Then the GPU
    Rgpu_cols = sum(Agpu,1); % Warm up
    gforce(Agpu,Rgpu_cols);
    gforce;
    tstart1 = tic;
    for run=1:100
        Rgpu_cols = sum(Agpu,1);
        gforce(Rgpu_cols);
    end
    Tgpu_cols = toc(tstart1);

    %% Here we sum along all rows to produce a column vector
    % First the CPU
    Rcpu_rows = sum(Acpu,2); % Warm up
    tstart1 = tic;
    for run=1:100
        Rcpu_rows = sum(Acpu,2);
    end
    Tcpu_rows = toc(tstart1);

    % Then the GPU
    Rgpu_rows = sum(Agpu,2); % Warm up
    gforce(Agpu,Rgpu_rows);
    gforce;
    tstart1 = tic;
    for run=1:100
        Rgpu_rows = sum(Agpu,2);
        gforce(Rgpu_rows);
    end
    Tgpu_rows = toc(tstart1);

    %% Use transpose and column summation to see if that improves things. Of
    % course only possible directly for square matrices.
    % First the CPU
    Rcpu_tcols = sum(Acpu.',1).'; % Warm up
    tstart1 = tic;
    for run=1:100
        Rcpu_tcols = sum(Acpu.',1).';
    end
    Tcpu_tcols = toc(tstart1);

    % Then the GPU
    Rgpu_tcols = sum(Agpu.',1).'; % Warm up
    gforce(Agpu,Rgpu_tcols);
    gforce;
    tstart1 = tic;
    for run=1:100
        Rgpu_tcols = sum(Agpu.',1).';
        gforce(Rgpu_tcols);
    end
    Tgpu_tcols = toc(tstart1);

    %% Print data
    fprintf('Sum down columns:\n');
    fprintf('>> CPU: %5.2f \n', Tcpu_cols);
    fprintf('>> GPU: %5.2f \n\n', Tgpu_cols);
    fprintf('Sum along rows:\n');
    fprintf('>> CPU: %5.2f \n', Tcpu_rows);
    fprintf('>> GPU: %5.2f \n\n', Tgpu_rows);
    fprintf('Sum of transpose down columns and transposed again:\n');
    fprintf('>> CPU: %5.2f \n', Tcpu_tcols);
    fprintf('>> GPU: %5.2f \n\n', Tgpu_tcols);
The results on a Colfax Custom workstation based on an Intel Core i7 975 Extreme and an FX-3800 GPU are as follows (times are in seconds for 100 runs):
    >> matrix_sum
    Sum down columns:
    >> CPU:  1.90
    >> GPU:  0.59

    Sum along rows:
    >> CPU:  2.17
    >> GPU: 17.27

    Sum of transpose down columns and transposed again:
    >> CPU: 25.78
    >> GPU:  2.52
The computer system is Reference System #4, which is described in detail here. The results above are quite interesting. First of all, observe that for MATLAB on the CPU there is no big difference between computing the sum across rows and down columns. The column computation is faster, but not by much (1.90 s versus 2.17 s). Transposing first, summing down columns, and transposing again is very slow on the CPU: 25.78 s, roughly 12 times the direct row sum. Given the small difference between column and row sums, this is not worth the effort. Remember though that this has so far only been tested for sum (and a few more functions), so no general conclusion can be drawn yet.
On the GPU things are different. Summing down columns is far faster than summing across rows: 0.59 s versus 17.27 s, which makes the column-oriented sum around 30 times faster. For square matrices it is also much faster to compute row sums by first transposing the matrix, doing the column-oriented sum, and taking the transpose again (2.52 s, about 7 times faster than the direct row sum). Note though that MATLAB's direct sum across rows on the CPU (2.17 s) is a bit faster than the transpose method on the GPU (2.52 s). The difference is so small, however, that it is better to keep the variables on the GPU if more computations are to follow.
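The ratios behind these statements can be recomputed directly from the printed timings. The snippet below is a plain MATLAB sketch; the struct `T` and its field names are introduced only for this illustration, and the numbers are copied from the benchmark output listed earlier.

```matlab
% Timings copied from the benchmark output (seconds for 100 runs).
T = struct('cpu_cols',  1.90, 'gpu_cols',   0.59, ...
           'cpu_rows',  2.17, 'gpu_rows',  17.27, ...
           'cpu_tcols', 25.78, 'gpu_tcols',  2.52);

gpu_rows_vs_cols      = T.gpu_rows  / T.gpu_cols;    % GPU: row sum vs column sum
gpu_rows_vs_tcols     = T.gpu_rows  / T.gpu_tcols;   % GPU: row sum vs transpose method
cpu_tcols_vs_rows     = T.cpu_tcols / T.cpu_rows;    % CPU: transpose method vs row sum
gpu_tcols_vs_cpu_rows = T.gpu_tcols / T.cpu_rows;    % GPU transpose method vs CPU row sum

fprintf('GPU rows / GPU cols:             %5.1f\n', gpu_rows_vs_cols);
fprintf('GPU rows / GPU transpose method: %5.1f\n', gpu_rows_vs_tcols);
fprintf('CPU transpose method / CPU rows: %5.1f\n', cpu_tcols_vs_rows);
fprintf('GPU transpose method / CPU rows: %5.2f\n', gpu_tcols_vs_cpu_rows);
```

The first ratio is the factor of roughly 30 quoted above. The last one is close to 1, which is why keeping the data on the GPU and using the transpose method costs little compared with the CPU row sum.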
The same code has been tested on a MacBook Pro with similar results.