Spherical Harmonic Transformation

From Jacket Wiki

Jump to: navigation, search
Back to Jacket By Example

In order to demonstrate the standard workflow for Jacketizing MATLAB code and using Jacket SDK, we provide an example that demonstrates the process. In this example, we show a script that performs Spherical Harmonic Transformation.

Contents

Profile

The first step is to profile the script using the MATLAB Profiler. In the present example, the profiler shows that the function legendre which computes the legendre polynomial, takes up the bulk of the time.

Profiler screenshot before Jacketizing.

We can use Jacket SDK to implement any function which is not already supported by Jacket or would not give a significant advantage by Jacketizing (for example, if the code can not be vectorized). In this example, we have implemented the function legendre.

Writing a Jacket Function

The important steps in developing a custom Jacket function using Jacket SDK are as follows:

jktFunction

jktFunction is the entry point for functions developed using Jacket SDK. The input and output variable are accessible inside the jktFunction as mxArray variables.
1. The first step is to determine the number of input and output arguments and the dimension and class of all the input arguments and check if they are valid. The class and dimensions of the inputs are determined using Jacket SDK functions jkt_class and jkt_dims.
2. Then a new mxArray of appropriate dimensions and class is allocated using jkt_new_array.
3. Device side pointers to the arrays are obtained using the function jkt_mem. These pointers can be passed to a CUDA kernel which can access this memory and perform the necessary computation on the GPU. (Do not attempt to read data from a pointer address returned by jkt_mem outside a CUDA kernel as these are device side pointers. Instead use jkt_mem_host to get a host side pointer to a jacket variable.)
5 The CUDA kernel is called with the appropriate block size and warp size and input parameters. The results are written to the memory address of the output array.

Note: The file jacket.h should be included in the declaration section of the source files. These contain the definitions of the functions provided with Jacket SDK. Refer to Jacket_SDK for detailed information on each function.

CUDA Kernel

The CUDA kernel runs on the GPU in parallel. Each thread determines the data that it needs to operate on, performs the computation and writes the output back in the GPU global memory. In this example, we need to compute the Legendre polynomials at a large number of input points. Each GPU thread computes the value of the polynomial at one input value.
1. Determine the value that the thread needs to operate on using the CUDA variables blockIdx and threadIdx and read the input value.
2. The kernel function computes the value of the polynomial using a simple recurrence relation and writes the output to the appropriate position in the output array.

Compiling

Refer to SDK Instructions, Compilation & Execution for compilation instructions on different platforms.

View source file:legendre.cu

Jacketize

Our implementation of the function legendre gives over a 100x improvement over the original MATLAB version. Next we jacketize the code by changing the datatypes to Jacket's GPU matrix types and replacing zeros, ones with Jacket's equivalent gzeros, gones. In this case, we only Jacketize the code within the for loop. To do this we only need to recast the variables f and u to GPU types, replace a zeros by gzeros and replace legendre(...) by our implementation mylegendre(...).

    8 %-- Allocating memory
--- 9 shc = zeros((2*p+1)*(p+1),1); 
+++ 9 shc = gzeros((2*p+1)*(p+1),1); 
   10 
   11 %-- DFT
   12 f = fft(f,[],2)*pi/p; 
   13 f = circshift(f',p)'; 
   14 
   15 %-- Legendre transform
   16 [el wt]=grule(p+1); 
   17 u = acos(el(:)); 
   18 f=f.*(wt'*([ones(1,p) (-1).^(0:p)]/sqrt(2*pi))); 
   19
+++   f=gsingle(f);
+++   u=gsingle(u);
   20 for n=0:p 
---21     Pn = legendre(n, cos(u),'norm')'; 
+++21     Pn = mylegendre(n, cos(u),'norm'); 
   22     Pn = Pn(:,[n+1:-1:2 1:n+1]); 
   23     for m=-n:n; 
   24         ind = (m+p)*(p+1)+1; 
   25         shc(ind+n) = sum(Pn(:,n+m+1).*f(:,m+p+1)); 
   26     end 
   27 end 
+++   shc=single(shc);

After making these changes, when we again run the profiler, we observe that even though the new function takes a negligible amount of time, other parts of the code have become slower.

Profiler screenshot after Jacketizing.

Optimize

Now the bottleneck is the vector-vector operation on line 27 inside the innermost loop. We can remove this inner loop and replace the vector-vector operations by matrix-matrix operations which are handled more effeciently by Jacket. We replace lines 25-28 in the above code by the following:

   25     m=-n:n; 
   26     ind = (m+p)*(p+1)+1; 
   27     shc(ind+n) = sum(Pn(:,n+m+1).*f(:,m+p+1));
Profiler screenshot after Optimizing.

Now, when we again run the profiler, we find a tremendous improvement in performance. The new code is over 10 times faster than the original MATLAB code.

Views