Fast Computation of Cross Correlation

From Jacket Wiki

Jump to: navigation, search

Contents


Theory

Computation of cross correlation is used in many different scientific areas and it can at times be quite time consuming. Not least if large data sets must be handled. Suppose we have two vectors given by:



  \mathbf{u} \; = \; [u(1), u(2), \ldots , u(K)]^{\rm T}


  \mathbf{v} \; = \; [v(1), v(2), \ldots , v(N)]^{\rm T}


where it is implicitly understood that:



  u(k) \; = \; 0 \quad \mbox{for} \quad k=-\infty,...,-1,0 \quad \And \quad k=K+1,K+2,...,+\infty


  v(n) \; = \; 0 \quad \mbox{for} \quad n=-\infty,...,-1,0 \quad \And \quad n=N+1,N+1,...,+\infty


Say we want to compute the cross correlation, which mathematically can be described as:



  C(w) \; = \; \sum_{n=1}^N u(w+n) \cdot v^{\ast}(n) \quad \mbox{for} \quad w=-N+1,-N+2, \ldots ,-1,0,1, \ldots, K-2, K-1


where it is also known that:



  C(w) \; = \; 0 \quad \mbox{for} \quad w=-\infty,...,-N-1,-N \quad \And \quad w=K,K+1,...,+\infty


where ^* indicates complex conjugate (of the signal v in this case). This means that there are N + K − 1 cross correlation values, C(w), to compute. Although implicitly given by zero we should ensure that all needed u-values are directly given - also the values, which are zero.


Method A

When doing the actual computations by vector-vector products (or matrix-vector products) it is necessary to ensure that the u-vector is defined at all arguments. This means that we need to ensure that:



  u(n) \; = \; 0 \quad \mbox{for} \quad n = -N+2,-N+3,\ldots ,-1,0 \quad \And \quad n=K+1,K+2,\ldots ,K+N-1


This comes from the possible values for w in the expression for C(w). The problem is handled by forming an x-vector defined as:



  \mathbf{x} \; = \; [ \underbrace{0,\ldots,0}_{N-1}, u(1), \ldots ,u(K), \underbrace{0,\ldots,0}_{N-1} ]^{\rm T}


In MATLAB notation we then have:

  x = zeros(K+2*(N-1),1);
  x(N:N+K-1) = u;

The corresponding y-vector is directly given by the v-vector such that:



  \mathbf{y} \; = \; \mathbf{v} \; = \; [ v(1), \ldots ,v(N)]^{\rm T}


It is important to notice that the vector being offset (\mathbf{u}) needs to be defined at more data points that the other vector (\mathbf{v}). Potentially circular handling may also be used - meaning that u(1) is used for u(N + 1) etc. This is, however, most often not the desired kind of computations. Usually it is done so that zero padding is introduced on both x and y - this will be treated later. Notice from the C(w) equation above that y is the signal being "moved" and that w = 0 corresponds to simply correlate at zero offset.


Example 1: Let's take a small example with N = 2 and K = 3. This means that the vectors are given by:

  x = [0 u(1) u(2) u(3) 0]^T
  y = [v(1) v(2)]^T

This gives the following cross correlations:

  C(-1) = u(1)*conj(v(2))
  C(0)  = u(1)*conj(v(1)) + u(2)*conj(v(2))
  C(1)  = u(2)*conj(v(1)) + u(3)*conj(v(2))
  C(2)  = u(3)*conj(v(1))

Again this corresponds to "move the y-vector to the left and then move it right one sample at a time while performing element wise multiplication and addition". END>> Example 1


The computation can be done like:

% Vector sizes
K = 5000;
N = 5000;
 
% Create vectors to correlate
u = randn(K,1);
v = randn(N,1);
 
% Form x-vector and y-vector
x = zeros(K+2*(N-1),1);
x(N:N+K-1) = u;
y = v;
 
%% USE OF LOOP
ts = tic;
C = zeros(N+K-1,1);
for w=-N+1:1:K-1
    C(w+N) = transpose(x(w+N:w+2*N-1)) * conj(y);
end
W1 = (-(N-1):1:K-1);
T_cpu_loop = toc(ts)

It is also possible to avoid the for-end loop in different ways. One approach is to replace the loop structure with:

idx = bsxfun(@plus,repmat((1:N),N+K-1,1),(0:N+K-2).');
C2 = x(idx) * conj(y);
W2 = (-(N-1):1:K-1);

but this is too memory hungry and too slow also. It does give some nice code though. On my MacBook Pro with an Intel Core 2 Duo 2.8 GHz the loop based technique took around 0.25 seconds and the bsxfun based technique took around 2.8 seconds for a K = N = 5000 size problem. This means that the matrix idx uses 100 MB of memory to store the necessary indices. Currently it is unfortunately not possible to use gfor-gend for the loop based technique above.


Method B

Although very simple to understand and implement, the above implementation is computationally demanding. It is fortunately possible to use a technique, which is completely different - namely the same procedure used in the xcorr function in the Signal Processing Toolbox in MATLAB. This technique is based on Fourier series. The reason for this use is that it turns out that the cross correlation can be described by:



  C(k) \; = \; \frac{1}{K+N-1} \sum_{n=1}^{K+N-1} X(n) \; \cdot \; Y^{\ast}(n) \; \cdot \; \exp\left[ j \cdot \frac{2\pi}{K+N-1} \cdot (k-1) \cdot (n-1) \right], \quad k=1,2, \ldots ,K+N-1


where X(n) and Y(n) are the Fourier coefficients at frequency point n, and the time domain vectors must be defined as:



  \mathbf{x} \; = \; [ \underbrace{0,\ldots,0}_{N-1}, u(1), \ldots ,u(K)]^{\rm T}


  \mathbf{y} \; = \; [ v(1), \ldots ,v(N), \underbrace{0,\ldots,0}_{K-1} ]^{\rm T}


This ensures that C(w) has the required N + K − 1 cross correlations. The equation itself is an inverse Fourier transformation. Notice that k = 1,...,K + N − 1, which gives the lags:



  k \; = \; 1,\ldots ,K+N-1 \quad \longrightarrow \quad w \; = \; -(N-1), \ldots ,-1,0,1, \ldots ,K-1


Written in a slightly pseudo code way we have the following:

  C = ifft( fft(x) .* conj(fft(y)) )
  W = (-(N-1):1:K-1)

where C here is the vector containing all the cross correlations, and W contains the lags such that the lag corresponding to C(k) is W(k) where k = 1,...,K + N − 1.


Method C

Method B as described above uses a vector length of K + N − 1. In many cases it is faster to make the vector length as P=2^nextpow2(K+N-1), which gives a very fast computation of the FFTs. The only thing remaining is to figure out how to set up the vectors and what information of the output correlation, which needs to be omitted. As for the vectors they should be like:



  \mathbf{x} \; = \; [ \underbrace{0,\ldots,0}_{N-1}, u(1), \ldots ,u(K), \underbrace{0,\ldots,0}_{P-K-N+1}]^{\rm T}


  \mathbf{y} \; = \; [ v(1), \ldots ,v(N), \underbrace{0,\ldots,0}_{P-N} ]^{\rm T}


When placing the vectors like this it is fortunately easy to extract the relevant correlations. Say we compute:

  Ct = ifft( fft(x) .* conj(fft(y)) )
  C  = Ct(1:K+N-1)

and the lag vector is thus again:

  W = (-(N-1):1:K-1)

When using the MATLAB standard fft and ifft functions, notice that they can be asked to do zero padding. Since these are compiled functions they tend to run faster. An implementation in MATLAB may therefore look like:

  LEN = 2^nextpow2(K+N-1);
  x = gzeros(K+N-1,1,'single');
  x(N:N+K-1) = u;
  y = gzeros(K+N-1,1,'single');
  y(1:N) = v;
 
  Ct = real(ifft( fft(x,LEN) .* conj(fft(y,LEN)) ));
  C = Ct(1:K+N-1);

As seen from this piece of code the vectors x and y are only initially created with K + N − 1 elements. The fft calls then ensures the rest of the zero padding is handled via the LEN variable, which is computed to the next power of two as wished. Also notice that the cross correlation C(w) is forced to be a real - it should be a real seen from a mathematical point of view but sometimes a small residual imaginary part may exist due to numerical inaccuracy. Therefore there correlation is forced to be a real.


A Jacket Enabled Cross Correlation Function

Below there is a proposal for a function to compute the cross correlation based on the FFT technique. Although it is not pretty, it is flexible. Jacket does not yet allow class inheritance. If speed is top priority with many calls to crosscorr below, it is of course possible to make a functions specifically for gsingles and gdouble. The function is:

function [CC, LAGINDX] = crosscorr( u, v )
%CROSSCORR Computes the cross correlation between u and v.
% [CC, LAGINDX] = crosscorr( u, v ) computes the cross correlation between
% vectors u (length K) and y (length N) where both can be real/complex.
% The cross correlation is computed as:
%
%           N
%   C(w) = Sum  x(w+n) * y^*(n),   -(N-1) <= w <= K-1
%          n=1
%
% where ^* indicates complex conjugate (of the signal y in this case), and:
%
%   x = [0,...,0,  u(1),...,u(K), 0,...,0]^T
%        -------   -------------  -------
%          N-1           K        P-K-N+1
%
%   y = [v(1),...,v(N), 0,...,0 ]^T
%        -------------  -------
%              N          P-N
%
% and P=nextpow2(K+N-1). The output from the CROSSCORR function is thus
% given by:
%
%   CC(m) = C(m-N),   m=1,2,...,K+N-1
%   W(m)  = m-N,      m=1,2,...,K+N-1
% 
%
% Author:
%   - Torben Larsen, Aalborg University, Denmark. E-mail: tl@es.aau.dk
% Copyright:
%   - Torben's Corner
%     http://wiki.accelereyes.com/wiki/index.php?title=Torben%27s_Corner
%
% Reference:
%   - W.H. Press, S.A. Teukolsky, W.T. Wetterling, and B.P. Flannery:
%     "Numerical Recipes - The Art of Scientific Computing", Third Edition,
%     Cambridge University Press, 2007 (see ยง13.2 p. 648-649).
%
% The function CROSSCORR is fully Jacket supported.
K = length(u);
N = length(v);
LEN = 2^nextpow2(N+K-1);
 
VarType = class(u);
if strcmp(VarType,'gsingle') == 1
    x = gzeros(N+K-1,1,'single');
    x(N:N+K-1) = u;
    y = gzeros(N+K-1,1,'single');
    y(1:N) = v;
elseif strcmp(VarType,'gdouble') == 1
    x = gzeros(N+K-1,1,'double');
    x(N:N+K-1) = u;
    y = gzeros(N+K-1,1,'double');
    y(1:N) = v;
else
    x = zeros(N+K-1,1,class(u));
    x(N:N+K-1) = u;
    y = zeros(N+K-1,1,class(u));
    y(1:N) = v;
end
 
CCt = real(ifft( fft(x,LEN) .* conj(fft(y,LEN)) ));
CC = CCt(1:K+N-1);
LAGINDX = (-(N-1):1:K-1);
end


Benchmarking

The benchmarking is done for a few different cases to get an idea of the performance. A complete test is not included at this time. The following benchmarks were achieved on an Apple MacBook Pro with an Intel Core 2 Duo 2.8 GHz CPU and an NVIDIA GeForce 9400M GPU. The technique with the for-end loop using a CPU serves as reference. In all cases the full computation of all K+N-1 cross correlations have been computed. The example uses K=N=2^15=32768.

First, the speed-up of the FFT based technique was compared to the for-end loop technique when using the CPU. The FFT based technique shows a speed-up of 510 times. Using the xcorr function from the Signal Processing Toolbox gives a speed-up of approx. 228 times. There is quite a bit of overhead when using xcorr so this difference seems likely.

Turning to the GPU provided a speed-up of only approx. 66 when using the FFT based technique. Here it should be noted that nothing is done to ensure that the computationally efficient radix-2 decimation can be used. The speed-up when using the xcorr function is approx. 645 times - it should be noted that xcorr uses zero-padding to ensure a 2^x vector length.

The final test is when the improved B method above is used on the GPU. This technique takes advantage of the radix-2 decimation technique and uses zero-padding to ensure a 2^x vector length. Using this technique gives a speed-up of more than 1600 times, which is really a huge speed-up - also when comparing to the 645 times from the xcorr technique. The pure FFT based technique is then about 2.5 times faster on the GPU than using the xcorr function - likely to the overhead when using the more generic xcorr function. In many cases, when the application is known, it is better to use a taylor made technique and gain the speed that is potentially there.



Go Home: Torben's Corner


Views
Personal tools