Array Indexing In Jacket
From Jacket Wiki
|
Indexing of arrays is very important when considering performance. It is very easy to loose the speedup that we may obtain from fast computations on the GPU. This page takes a look at array indexing and demonstrates with simple examples how indexing could be done to obtain high performance. As some of the examples use the new class concept of Jacket, the following assumes that the reader has studied this blog post, the post on Torben's Corner on MATLAB/Jacket coding and are familiar with CLASS and ISA.
Code
In this case the input to the function is the two vectors a and b. We would like the function to work immediately with MATLAB as well as Jacket. There are five different versions of the code (an overview is provided after the code listings):
FOO1.m
function [ y ] = foo1( a, b ) % foo1 Computes the output r for both MATLAB and Jacket variables. % N is the length of the input vectors a and b; no sanity check for ease N = length(a); % Determine the common class cls = superiorfloat(a,b); % Create d vector; will type wise follow the a and b vectors d = randn(N,1,cls); % Compute the output r = b(1:N-1) .* exp(sin( d(1:N-1) ./ d(2:N) .* a(1:N-1) ./ a(2:N) )) + b(2:N); y = sum(r(:)); end
This version works with MATLAB as well as Jacket input variables without any problems at all. Unfortunately, performance is not the best so we will try another version to see if we can improve things a bit. The idea here is to allocate the index variables as garrays. This is implemented in the following foo2.m function:
FOO2.m
function [ y ] = foo2( a, b ) % foo2 Computes the output r for both MATLAB and Jacket variables. % N is the length of the input vectors a and b; no sanity check for ease N = length(a); % Determine the common class cls = superiorfloat(a,b); % Define scalar '0' and '1' used in colon expansion one = ones(cls); % Create d vector; will type wise follow the a and b vectors d = randn(N,1,cls); % Indexing pointers following the class of the input vectors p1 = one:(N-1)*one; p2 = 2*one:N*one; % Compute the output; r = b(p1) .* exp(sin( d(p1) ./ d(p2) .* a(p1) ./ a(p2) )) + b(p2); y = sum(r(:)); end
FOO3.m
function [ y ] = foo3( a, b ) % foo1 Computes the output r for both MATLAB and Jacket variables. % % max(N) < length(a)-1 % Vector length N = length(a); % Determine the common class cls = superiorfloat(a,b); % Create d vector; will type wise follow the a and b vectors d = randn(N,1,cls); % Compute the output r = b(1:end-1) .* exp(sin( d(1:end-1) ./ d(2:end) .* a(1:end-1) ./ a(2:end) )) + b(2:end); y = sum(r(:)); end
FOO4.m
function [ y ] = foo4( a, b, N ) % foo1 Computes the output r for both MATLAB and Jacket variables. % % max(N) < length(a)-1 % Determine the common class cls = superiorfloat(a,b); % Define scalar '0' and '1' used in colon expansion one = ones(cls); % Create d vector; will type wise follow the a and b vectors d = randn(length(a),1,cls); % Compute the output p1 = one : N-1; p2 = 2*one : N; r = b(p1) .* exp(sin( d(p1) ./ d(p2) .* a(p1) ./ a(p2) )) + b(p2); y = sum(r(:)); end
FOO5.m
function [ y ] = foo5( a, b, N ) % foo1 Computes the output r for both MATLAB and Jacket variables. % % max(N) < length(a)-1 % Determine the common class cls = superiorfloat(a,b); % Create d vector; will type wise follow the a and b vectors d = randn(length(a),1,cls); % Compute the output X = length(a) - N + 1; r = b(1:end-X-1) .* exp(sin( d(1:end-X-1) ./ d(2:end-X) .* a(1:end-X-1) ./ a(2:end-X) )) + b(2:end-X); y = sum(r(:)); end
FOO Overview
All functions generate the random matrix based on the class of the input vector. The differences are:
- FOO1.m: Indexing done like a(1:length(a)-1).
- FOO2.m: Indexing done like a(p1) where the class of p1 is single for MATLAB input variables and gsingle for Jacket input variables.
- FOO3.m: Indexing done like a(1:end-1) - no class derived indexing.
- FOO4.m: Indexing done similar to FOO2.m but here the highest index n used in p1:1:n can be any value, which don't exceed the vector length.
- FOO5.m: As in FOO4.m this function allows indexing with n used in X=length(a)-n; p1:1:end-X;, which can be any value not exceeding the vector length.
So the former three functions utilize the full vector length according to the equation appearing in all functions; the latter two are more flexible in the sense that they can be used for only parts of the input vectors.
MASTER_FOO.m Script
To control the test of the two functions foo1.m, foo2.m, foo3.m, foo4.m, and foo5.m we will make the following master_foo.m script:
% Script file to test the foo.m function. The objective is to show the % difference in performance depending on how indexing of arrays is done. % A computational function is used to illustrate the concepts. % Vector length SZ = 1E7; % Number of repetitions for MATLAB and Jacket, respectively RPTM = 3; RPTJ = 100; %% FOO1 - SINGLE - FOO1 a=randn(SZ,1,'single'); b=randn(SZ,1,'single'); t1=tic; for ii=1:RPTM y1_single_matlab=foo1(a,b); end; t1_single_matlab=toc(t1)/RPTM; a=grandn(SZ,1,'single'); b=grandn(SZ,1,'single'); geval(a,b); gsync; t1=tic; for ii=1:RPTJ y1_single_jacket=foo1(a,b); geval(y1_single_jacket); end; gsync; t1_single_jacket=toc(t1)/RPTJ; Speedup1_single = t1_single_matlab / t1_single_jacket; %% FOO2 - SINGLE - FOO2 a=randn(SZ,1,'single'); b=randn(SZ,1,'single'); t1=tic; for ii=1:RPTM y2_single_matlab=foo2(a,b); end; t2_single_matlab=toc(t1)/RPTM; a=grandn(SZ,1,'single'); b=grandn(SZ,1,'single'); geval(a,b); gsync; t1=tic; for ii=1:RPTJ y2_single_jacket=foo2(a,b); geval(y2_single_jacket); end; gsync; t2_single_jacket=toc(t1)/RPTJ; Speedup2_single = t2_single_matlab / t2_single_jacket; %% FOO3 - SINGLE - FOO3 a=randn(SZ,1,'single'); b=randn(SZ,1,'single'); t1=tic; for ii=1:RPTM y3_single_matlab=foo3(a,b); end; t3_single_matlab=toc(t1)/RPTM; a=grandn(SZ,1,'single'); b=grandn(SZ,1,'single'); geval(a,b); gsync; t1=tic; for ii=1:5*RPTJ y3_single_jacket=foo3(a,b); geval(y3_single_jacket); end; gsync; t3_single_jacket=toc(t1)/(5*RPTJ); Speedup3_single = t3_single_matlab / t3_single_jacket; %% FOO4 - SINGLE - FOO4 a=randn(SZ,1,'single'); b=randn(SZ,1,'single'); t1=tic; for ii=1:RPTM y4_single_matlab=foo4(a,b,SZ-1); end; t4_single_matlab=toc(t1)/RPTM; a=grandn(SZ,1,'single'); b=grandn(SZ,1,'single'); geval(a,b); gsync; t1=tic; for ii=1:RPTJ y4_single_jacket=foo4(a,b,SZ-1); geval(y4_single_jacket); end; gsync; t4_single_jacket=toc(t1)/RPTJ; Speedup4_single = t4_single_matlab / t4_single_jacket; %% FOO5 - SINGLE - FOO5 a=randn(SZ,1,'single'); b=randn(SZ,1,'single'); t1=tic; for ii=1:RPTM y5_single_matlab=foo5(a,b,SZ-1); end; t5_single_matlab=toc(t1)/RPTM; a=grandn(SZ,1,'single'); b=grandn(SZ,1,'single'); geval(a,b); gsync; t1=tic; for ii=1:RPTJ y5_single_jacket=foo5(a,b,SZ-1); geval(y5_single_jacket); end; gsync; t5_single_jacket=toc(t1)/RPTJ; Speedup5_single = t5_single_matlab / t5_single_jacket; %% RESULTS fprintf('===================================\n'); fprintf('SINGLE SINGLE\n'); fprintf('FOO1 - MATLAB [s]: %10.6f\n', t1_single_matlab); fprintf(' - JACKET [s]: %10.6f\n', t1_single_jacket); fprintf(' - Speedup [-]: %10.6f\n', Speedup1_single); fprintf('-----------------------------------\n'); fprintf('FOO2 - MATLAB [s]: %10.6f\n', t2_single_matlab); fprintf(' - JACKET [s]: %10.6f\n', t2_single_jacket); fprintf(' - Speedup [-]: %10.6f\n', Speedup2_single); fprintf('-----------------------------------\n'); fprintf('FOO3 - MATLAB [s]: %10.6f\n', t3_single_matlab); fprintf(' - JACKET [s]: %10.6f\n', t3_single_jacket); fprintf(' - Speedup [-]: %10.6f\n', Speedup3_single); fprintf('-----------------------------------\n'); fprintf('FOO4 - MATLAB [s]: %10.6f\n', t4_single_matlab); fprintf(' - JACKET [s]: %10.6f\n', t4_single_jacket); fprintf(' - Speedup [-]: %10.6f\n', Speedup4_single); fprintf('-----------------------------------\n'); fprintf('FOO5 - MATLAB [s]: %10.6f\n', t5_single_matlab); fprintf(' - JACKET [s]: %10.6f\n', t5_single_jacket); fprintf(' - Speedup [-]: %10.6f\n', Speedup5_single); fprintf('===================================\n');
The master_foo.m script test only for single and gsingle input vectors - the code can easily be expanded to double precision should that be interesting. The script file includes repetition loops to ensure that we measure for sufficiently long to obtain reproducible results.
Results
The platform used is an Intel Xeon X5570 CPU with 48 GB memory and an NVIDIA Tesla C2070 GPU. The specific execution times are not all that critical for this example and therefore other details are omitted - Jacket 1.7 is used though. The results obtained by using the master_foo.m script above (with 10E6 vector elements) is:
>> master_foo =================================== SINGLE SINGLE FOO1 - MATLAB [s]: 0.686132 - JACKET [s]: 0.854018 - Speedup [-]: 0.803416 ----------------------------------- FOO2 - MATLAB [s]: 0.892546 - JACKET [s]: 0.024704 - Speedup [-]: 36.129526 ----------------------------------- FOO3 - MATLAB [s]: 0.488589 - JACKET [s]: 0.007817 - Speedup [-]: 62.507196 ----------------------------------- FOO4 - MATLAB [s]: 0.884938 - JACKET [s]: 0.024538 - Speedup [-]: 36.063305 ----------------------------------- FOO5 - MATLAB [s]: 0.685769 - JACKET [s]: 0.007836 - Speedup [-]: 87.519500 =================================== >>
We can't directly compare the output results from foo1.m and foo2.m as the random function is not identical in the MATLAB and Jacket cases. To test for validity, a deterministic d vector has been used though to ensure that identical results are obtained from the functions foo1.m and foo2.m.
Conclusion
As shown above the indexing is very important and there is one lesson to learn. Whenever possible relate the indexing to end - and if you need to only access part of a vector, say the first n elements then do "X=length(a)-n; ... a(1:end-X)". This procedure maintains almost full speed. The same method should be applied if the place to start in the vector varies; as in "X=length(a)/2; ... a(X+1:end)=a(1:end-X)" - I have performed the tests but not included the files (I believe there are plenty already). The conclusion for MATLAB is basically the same - although here the indexing when only using parts of a vector is not as fast as when using the entire vector.
Go Home: Torben's Corner