% barREUcuda
% Abdulmajed Dakkak
% June 18th, 2008
The CUDA programming model offers new opportunities in a field regarded by
many as fairly difficult --- parallel programming. Through a simple
framework, comparable to Google's much-hyped **MapReduce**, we can identify
the repetitive parts of a serial program and hand them off to run in
parallel.
The concept behind CUDA is fairly simple: you break your algorithm into small
pieces and hand each thread a little piece of the computation. Since most
algorithms have a loop embedded in them, the code inside the loop body is
what gets handed to each thread. If the iterations are largely independent of
one another, we gain a performance boost, since the threads execute the
subroutine concurrently in lock-stepped groups of 32 (a `warp`) rather than
one iteration at a time.
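The decomposition can be sketched in plain `C`: the loop body is factored out
into a function of the index, which is exactly the piece of work a CUDA
thread would receive. The function name `square_element` and the toy data are
illustrative only.

    ```c
    #include <stdio.h>

    /* The loop body, factored out: this is the unit of work that
     * each CUDA thread would receive as its piece of the computation. */
    void square_element(const float *in, float *out, int i)
    {
        out[i] = in[i] * in[i];
    }

    int main(void)
    {
        float in[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float out[4];
        int i;

        /* Serial version: the loop walks the index space one step at
         * a time. On the GPU, each index gets its own thread instead. */
        for (i = 0; i < 4; i++)
            square_element(in, out, i);

        for (i = 0; i < 4; i++)
            printf("%.1f ", out[i]);
        printf("\n");
        return 0;
    }
    ```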
CUDA accomplishes this by extending `C` only a little. It adds qualifiers
that declare whether a function runs on the device (the `GPU`) or the host
(the `CPU`), and it adds new kinds of memory whose behavior needs to be
studied carefully.
Take, for example, quicksort, which can be defined by the following
functional-style `python` code.

    def quicksort(m):
        if m == []:
            return []
        pivot = m[0]
        p = filter(lambda x: x <= pivot, m[1:])
        q = filter(lambda x: x > pivot, m[1:])
        return quicksort(p) + [pivot] + quicksort(q)
The above implementation does not sort in place, thus requiring $O(n)$ extra space.
The following is a `C` implementation that sorts in place.
    void quicksort(float* m, int start, int end)   /* sorts m[start..end] */
    {
        int i;
        int j = start;
        float pivot, tmp;

        if(start >= end)
            return;

        pivot = m[start];
        for(i = start+1; i <= end; i++)
        {
            if(m[i] < pivot)
            {
                j++;
                tmp = m[i]; m[i] = m[j]; m[j] = tmp;
            }
        }
        tmp = m[start]; m[start] = m[j]; m[j] = tmp;   /* pivot to its final slot */

        quicksort(m, start, j-1);
        quicksort(m, j+1, end);
    }
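As a sanity check, here is a self-contained, runnable version with a small
test harness. Note the base case and the element swaps that an in-place
partition requires, and the name `quicksort_f`, chosen here only to avoid
clashing with the standard library's `qsort`.

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* In-place quicksort over m[start..end] (end inclusive). */
    void quicksort_f(float *m, int start, int end)
    {
        int i, j = start;
        float pivot, tmp;

        if (start >= end)
            return;

        pivot = m[start];
        for (i = start + 1; i <= end; i++)
            if (m[i] < pivot) {
                j++;
                tmp = m[i]; m[i] = m[j]; m[j] = tmp;
            }
        tmp = m[start]; m[start] = m[j]; m[j] = tmp;  /* pivot to final slot */

        quicksort_f(m, start, j - 1);
        quicksort_f(m, j + 1, end);
    }

    int main(void)
    {
        float a[6] = {3.0f, 1.0f, 4.0f, 1.0f, 5.0f, 9.0f};
        int i;

        quicksort_f(a, 0, 5);
        for (i = 1; i < 6; i++)
            assert(a[i - 1] <= a[i]);   /* verify ascending order */
        printf("sorted\n");
        return 0;
    }
    ```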
Recursion in an algorithm is a big hint that the algorithm can be parallelized
--- even though CUDA kernels cannot recurse. Most algorithms can be easily
parallelized. This is especially the case for programs that simulate a
large quantity of relatively independent objects. It so happens that much of
physics can be reduced to simulating many particles' interactions with the
environment, and many abstractions exist to simulate this interaction. Take 1D
rule-based cellular automata, for example. In a serial program one would keep
track of the previous row and write the following `C` code to compute each
next row.
    while(row_count < MAXROWS)
    {
        for(i = 0; i < ROWLENGTH; i++)
        {
            if(i == 0)
            {
                row[i] = rule(0, old_row[i], old_row[i+1]);
            }
            else if(i == ROWLENGTH-1)
            {
                row[i] = rule(old_row[i-1], old_row[i], 0);
            }
            else
            {
                row[i] = rule(old_row[i-1], old_row[i], old_row[i+1]);
            }
        }
        tmp = old_row;
        old_row = row;
        row = tmp;
        row_count++;
    }
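Filling in a concrete rule makes the loop runnable. The sketch below uses
Wolfram's rule 110 purely as an example --- the text above leaves `rule`
unspecified --- and prints each generation as a line of characters.

    ```c
    #include <stdio.h>
    #include <string.h>

    #define ROWLENGTH 16
    #define MAXROWS   8

    /* Example rule: Wolfram's rule 110, chosen only for illustration.
     * The 3-cell neighborhood indexes into the rule's 8-bit table. */
    int rule(int left, int top, int right)
    {
        int index = (left << 2) | (top << 1) | right;
        return (110 >> index) & 1;
    }

    int main(void)
    {
        int old_row[ROWLENGTH] = {0}, row[ROWLENGTH];
        int i, row_count;

        old_row[ROWLENGTH - 1] = 1;          /* single live cell seed */

        for (row_count = 0; row_count < MAXROWS; row_count++) {
            for (i = 0; i < ROWLENGTH; i++) {
                int left  = (i == 0) ? 0 : old_row[i - 1];
                int right = (i == ROWLENGTH - 1) ? 0 : old_row[i + 1];
                row[i] = rule(left, old_row[i], right);
            }
            memcpy(old_row, row, sizeof(row)); /* row becomes old_row */

            for (i = 0; i < ROWLENGTH; i++)
                putchar(row[i] ? '#' : '.');
            putchar('\n');
        }
        return 0;
    }
    ```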
A CUDA version of the previous program might look like the following.

    __device__ int rule(int left, int top, int right)
    {
        // insert automata rule here
        ...
    }

    __global__ void kernel(int* old_row, int* row)
    {
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

        if(i >= ROWLENGTH)   // guard against threads past the row's end
            return;

        if(i == 0)
        {
            row[i] = rule(0, old_row[i], old_row[i+1]);
        }
        else if(i == ROWLENGTH-1)
        {
            row[i] = rule(old_row[i-1], old_row[i], 0);
        }
        else
        {
            row[i] = rule(old_row[i-1], old_row[i], old_row[i+1]);
        }
    }
    void run( )
    {
        // the kernel computes a 1D index, so the launch is 1D:
        // one thread per cell, rounded up to whole blocks
        dim3 block(256);
        dim3 grid((ROWLENGTH + block.x - 1) / block.x);

        while(row_count < MAXROWS)
        {
            kernel<<< grid, block >>>(old_row, row);
            ...
            display(row);
        }
        ...
    }
The above code can easily achieve more than a 20x speedup, and with some
optimization it can run much faster. A similar program can be sketched
for 2D cellular automata such as the Game of Life, and a 3D cellular automaton
could also be implemented (when I find an example of 3D cellular automata).
The CUDA SDK provides a variety of examples showcasing the power of parallel
processing. The SDK's n-body example, for instance, simulates tens of thousands
of particles using the direct method, which calculates the force between two
particles by a slight variation of Newton's two-body formula
$\vec{F}_{ij} = G \, \frac{m_i m_j}{|\vec{r}_{ij}|^2 + \epsilon^2} \cdot \frac{\vec{r}_{ij}}{|\vec{r}_{ij}|}$
Such an algorithm is infeasible on commodity CPUs, so most physicists have
resorted to computationally cheaper, albeit approximate, algorithms for the
n-body problem. CUDA changed the landscape, making supercomputing more
"democratic."
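A serial sketch of the direct method shows what each CUDA thread computes for
its body $i$. This version uses the common $(|\vec{r}|^2 + \epsilon^2)^{3/2}$
softening denominator; $G = 1$, the softening factor, and the particle data
are arbitrary choices for illustration.

    ```c
    #include <assert.h>
    #include <math.h>
    #include <stdio.h>

    #define N   3
    #define EPS 1e-3f   /* softening factor: keeps the force finite at r = 0 */
    #define G   1.0f    /* gravitational constant in code units (assumption) */

    /* Direct all-pairs method: O(N^2) force evaluations per step.
     * Each outer iteration is the work one CUDA thread would do
     * for its own body i. */
    void accumulate_forces(float x[][3], float m[], float f[][3])
    {
        int i, j, k;
        for (i = 0; i < N; i++) {
            f[i][0] = f[i][1] = f[i][2] = 0.0f;
            for (j = 0; j < N; j++) {
                float r[3], dist2 = EPS * EPS, mag;
                if (i == j)
                    continue;
                for (k = 0; k < 3; k++) {
                    r[k] = x[j][k] - x[i][k];
                    dist2 += r[k] * r[k];
                }
                /* F_ij = G m_i m_j r_ij / (|r_ij|^2 + eps^2)^(3/2) */
                mag = G * m[i] * m[j] / (dist2 * sqrtf(dist2));
                for (k = 0; k < 3; k++)
                    f[i][k] += mag * r[k];
            }
        }
    }

    int main(void)
    {
        float x[N][3] = {{0, 0, 0}, {1, 0, 0}, {0, 1, 0}};
        float m[N]    = {1.0f, 1.0f, 1.0f};
        float f[N][3];
        float total = 0.0f;
        int i;

        accumulate_forces(x, m, f);
        for (i = 0; i < N; i++)
            total += f[i][0];
        /* closed system: net force must vanish (Newton's third law) */
        assert(fabsf(total) < 1e-4f);
        printf("net x-force = %g\n", total);
        return 0;
    }
    ```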
Other examples in the SDK include `fluidsGL`, which simulates fluid dynamics
based on the stable fluids method. It, too, achieves speeds unattainable
on a CPU. On the downside, however, the simulation is 2D, and relatively
uninteresting for 3D visualization.
In this project we would like to bring the power of CUDA-enabled GPUs to the
CAVE/CUBE. There are a few reasons to do so:
1. **Price** --- CUDA-enabled chips are reasonably priced and within the
reach of consumers. The lower-end chips cost under `100` dollars, while the
higher-end ones go for over `1,000` dollars.
2. **Abundance** --- NVIDIA claims to have sold 40 million cards that are CUDA
enabled.
3. **Power** --- While price and abundance are important to guarantee that this is
not just hype, the performance of the chips makes the case for their
importance. A CUDA-enabled GPU can easily outperform a CPU at half the price.
CUDA chips are not reserved for visualization alone. Two recent papers
[^gem_aes] [^cuda_aes] implemented the AES cryptography algorithm with
considerable speedup. Others have used the power provided by CUDA for more
sinister purposes [^breaking_md5]: by porting the `md5crack` program to CUDA,
the author was able to brute-force all MD5-hashed passwords of fewer than 8
characters in under 16 minutes.
Take, for example, the University of Antwerp in Belgium, which recently built a
supercomputer with four NVIDIA GPUs for under `4000` dollars. The system,
contained in a single desktop workstation, can outperform a cluster of
hundreds of machines.
[^gem_aes]: Yamanouchi, T.: AES Encryption and Decryption on the GPU. In: GPU Gems 3, Addison-Wesley Professional, Reading (2007)
[^cuda_aes]: Manavski, S.A.: CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography.
[^breaking_md5]: Notes: Cuda md5 hashing experiments. http://majuric.org/software/cudamd