Lazy Evaluation

From Jacket Wiki

Jacket uses a technique called lazy evaluation (also known as Just-In-Time computation, or JIT) to maximize the speedup available to your programs. Jacket's patent-pending technology builds the GPU kernels for your computations on the fly, as they are needed, and optimizes memory transfers.

Why is Lazy Evaluation important?

The simplest way to perform a computation with the GPU would be to perform each operation on the GPU as the user enters/runs M code. This would work, and might offer some speedup, but not nearly as much as is possible.

Every time Jacket uses your GPU, there is a cost in time. Besides the actual time spent computing values, time is also spent communicating instructions and transferring data to and from the GPU.

As an example, imagine the sum of four matrices of equal size: G = A + B + C + D. This is a very simple operation, but it will help illustrate the point. Let's suppose the cost of transferring one of these matrices to the GPU is T_To, and the cost of transferring it back is T_From. Let's also suppose the overhead to start a kernel execution is K, and the amount of time it takes to add two matrices is E. Lastly, let's assume the amount of time it takes to read a value from GPU memory is R and the time it takes to write it back is W.

The computer naturally breaks this down into three operations, such that the effective computation is:

X = A + B
Y = X + C
G = Y + D

Here X and Y are intermediate results. Based on the above, we can compute the time it takes for each operation, and the sum total of the operations:

Time(X) = 2*T_To + K + R + E + W
Time(Y) = T_To + K + R + E + W
Time(G) = T_To + K + R + E + W + T_From
Time(total) = 4*T_To + 3*K + 3*R + 3*E + 3*W + T_From

If we could execute this computation all at once on the GPU, we could save 2 kernel execution overheads (2*K) and 2 writes (2*W). Additionally, we wouldn't need memory to store the interim results.

G = A + B + C + D

The time for execution of this version would be:

Time = 4*T_To + K + 3*R + 3*E + W + T_From

So, by performing this as one computation, we save 2*K + 2*W. It's hard to give a specific speedup in this case, since this would vary depending on hardware.
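The comparison above can be checked with a small cost model. The sketch below is written in Python purely as a neutral calculator; it is not Jacket code, and the numeric costs are made-up placeholders chosen only to make the arithmetic visible. The `Costs` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Cost symbols from the text, in arbitrary time units."""
    t_to: float    # T_To: transfer one matrix to the GPU
    t_from: float  # T_From: transfer one matrix back
    k: float       # K: overhead to start one kernel
    r: float       # R: read from GPU memory
    e: float       # E: add two matrices
    w: float       # W: write the result to GPU memory


def time_separate(c: Costs, n_inputs: int = 4) -> float:
    """One kernel per addition: every addition pays K, R, E and W."""
    n_adds = n_inputs - 1
    return n_inputs * c.t_to + n_adds * (c.k + c.r + c.e + c.w) + c.t_from


def time_fused(c: Costs, n_inputs: int = 4) -> float:
    """One kernel for the whole sum: K and W are paid only once."""
    n_adds = n_inputs - 1
    return n_inputs * c.t_to + c.k + n_adds * (c.r + c.e) + c.w + c.t_from


# Placeholder values (assumptions, not measurements from any hardware).
costs = Costs(t_to=100.0, t_from=100.0, k=10.0, r=2.0, e=1.0, w=2.0)
separate = time_separate(costs)
fused = time_fused(costs)
saved = separate - fused
print(f"separate={separate}, fused={fused}, saved={saved}")
```

With any choice of costs, `saved` equals 2*K + 2*W for four inputs, which matches the saving derived in the text. The relative benefit depends on how large K and W are compared with the transfer times, which is why it varies with hardware.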

In reality, computations are much more complex, and as a result, the time savings become much more significant!

How does it work?

Jacket keeps track of the computations you are performing. Another way to think of this is that, rather than performing the computation immediately, Jacket saves the formula and the inputs needed to compute a given value. Consider the following bit of code:

I = grand(128);
A = cos(I);
B = A .^ 2;
C = 1 - B;

Behind the scenes, Jacket will store these:

A = cos(I);
B = cos(I) .^ 2;
C = 1 - (cos(I) .^ 2)

It doesn't matter if I gets changed before C is evaluated: Jacket keeps track of the original input, so you'll still get the correct answer.

Once you need the value of C, it will be computed automatically. This saves time compared with launching a kernel on the GPU at each step, because the entire computation of C can be batched into a single kernel rather than executed operation by operation as each is encountered. Seeing the whole expression also lets Jacket optimize the computation as a whole, which is impossible when operations are performed one at a time.

The other great thing about this is that if you run code to compute a value which you never end up actually using, Jacket can skip the computation altogether!
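The idea of storing a formula and its input, then computing only on demand, can be illustrated with a toy model. The Python sketch below is not how Jacket is implemented, and `LazyArray` is a hypothetical class invented for this illustration. It handles only a single chain of elementwise operations on one input, and it counts a simulated "kernel launch" each time a result is actually computed.

```python
import math


class LazyArray:
    """A deferred elementwise computation: a snapshot of the input plus a formula."""

    launches = 0  # number of simulated kernel launches so far

    def __init__(self, data, fn=lambda x: x, formula="I"):
        self._data = tuple(data)  # snapshot, so later changes to the source are ignored
        self._fn = fn             # composed per-element function
        self.formula = formula    # human-readable form of the stored formula

    def _extend(self, step, formula):
        # Record one more operation without computing anything.
        inner = self._fn
        return LazyArray(self._data, lambda x: step(inner(x)), formula)

    def cos(self):
        return self._extend(math.cos, f"cos({self.formula})")

    def __pow__(self, p):
        return self._extend(lambda v: v ** p, f"({self.formula} .^ {p})")

    def __rsub__(self, other):
        return self._extend(lambda v: other - v, f"({other} - {self.formula})")

    def evaluate(self):
        # The whole chain runs in one pass over the data: one simulated launch.
        LazyArray.launches += 1
        return [self._fn(x) for x in self._data]


raw = [0.0, 0.5, 1.0]
I = LazyArray(raw)
A = I.cos()
B = A ** 2
C = 1 - B

raw[0] = 99.0          # changing the original input does not affect C
result = C.evaluate()  # only now is anything computed
print(C.formula, LazyArray.launches)
```

In this sketch A and B are never evaluated on their own, so the chain of three operations costs a single simulated launch, and a value that is never requested costs none.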

It's important to note that if the memory needed to compute C or the complexity of the computation starts to get extreme, Jacket might also proactively compute parts of C on its own. Additionally, not all functions are supported in JIT.

Greater control through GCOMPILE or Jacket SDK

In some cases, you may want to exercise control over the operations contained in a GPU kernel, without relying on Jacket's lazy evaluation system. There are two ways to control GPU kernels:

  1. GCOMPILE - Create your own custom kernel for your computations. Since you are aware of which computations you are doing, you can batch several of them together into a single kernel! Check out our blog post about GCOMPILE as well as a wiki page about the usage of GCOMPILE.
  2. Jacket SDK - Write your own CUDA kernels and compile them to be compatible with Jacket and inherit Jacket's benefits. See the blog post Jacket SDK trumps standalone MEX.

Compare with the Parallel Computing Toolbox™

The Parallel Computing Toolbox™ (PCT) available from MathWorks® does not do lazy GPU evaluation. As a result, every arithmetic operation requires a separate CUDA kernel, and batches of arithmetic result in dozens of kernel executions. Jacket is the only solution which avoids these costly kernel launches. For more information, check out our comparison of Jacket versus PCT, where we show a 44x speedup of Jacket over PCT on the Horner example.
