% barREUcuda
% Abdulmajed Dakkak
% June 18th, 2008

The CUDA programming language offers new opportunities in a field that many regard as fairly difficult: parallel programming. Through a simple framework that is comparable to Google's recent (overhyped) **MapReduce** framework, we can identify certain components in a serial program and parallelize repetitive function calls.

The concept behind CUDA is fairly simple. You break your algorithm into small pieces and hand each thread a little piece of the computation. Since most algorithms have a loop embedded in them, the body of that loop can be the code handed to each thread. If the loop is repetitive, i.e. there is little difference between its iterations and they do not depend on one another, then we can gain a performance boost, because the hardware runs the threads together in groups (warps) rather than one after another.

CUDA accomplishes this by extending C (only a bit). It adds qualifiers that declare whether a function runs on the device (the GPU) or the host (the CPU). It also adds new types of memory, which need to be studied more carefully.

Take, for example, quicksort, which is defined by the following Python code, written in a functional style.

    def quicksort(m):
        if len(m) <= 1:                  # base case: nothing left to sort
            return list(m)
        pivot = m[0]
        rest = m[1:]                     # everything except the pivot itself
        p = list(filter(lambda x: x <= pivot, rest))
        q = list(filter(lambda x: x > pivot, rest))
        return quicksort(p) + [pivot] + quicksort(q)

The above implementation does not sort in place, thus requiring $O(n)$ extra space. The following is a C implementation that sorts in place.

    /* Sorts m[start..end) in place; end is exclusive.
       Named quicksort to avoid clashing with qsort from <stdlib.h>. */
    void quicksort(float* m, int start, int end) {
        int i;
        int j = start;                   /* last index of the "< pivot" region */
        float pivot, tmp;
        if (end - start < 2) return;     /* base case: 0 or 1 element */
        pivot = m[start];
        for (i = start + 1; i < end; i++) {
            if (m[i] < pivot) {          /* grow the "< pivot" region by swapping */
                j++;
                tmp = m[j]; m[j] = m[i]; m[i] = tmp;
            }
        }
        m[start] = m[j];                 /* move the pivot into its final slot */
        m[j] = pivot;
        quicksort(m, start, j);          /* left part:  m[start..j) */
        quicksort(m, j + 1, end);        /* right part: m[j+1..end) */
    }

Recursion in an algorithm is a big hint that the algorithm can be parallelized, even though CUDA functions cannot (at the time of writing) be recursive.

Most algorithms can be easily parallelized. This is especially the case for programs that simulate a large quantity of relatively independent objects.
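
To make the loop-to-thread idea concrete, here is a minimal sketch in Python, which only stands in for CUDA here. The names `step`, `run_in_order`, `positions`, `velocities`, and `DT` are hypothetical and chosen for illustration. The body of a serial loop over independent particles is pulled out into a function of the index `i`, which is roughly the role a CUDA kernel plays. Because no iteration reads another iteration's output, running the body in any order gives the same result.

```python
# Sketch only: emulate "one thread per loop iteration" on the CPU.
DT = 0.5  # hypothetical time step


def step(i, positions, velocities, out):
    """Loop body for particle i; it touches only index i, like one GPU thread."""
    out[i] = positions[i] + DT * velocities[i]


def run_in_order(order, positions, velocities):
    """Run the loop body once per index, visiting indices in the given order."""
    out = [0.0] * len(positions)
    for i in order:
        step(i, positions, velocities, out)
    return out


positions = [0.0, 1.0, 2.0, 3.0]
velocities = [1.0, -1.0, 0.5, 0.0]
n = len(positions)

forward = run_in_order(range(n), positions, velocities)
backward = run_in_order(reversed(range(n)), positions, velocities)
```

Both orders produce the same list, which is the property that lets a GPU schedule the iterations however it likes.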
It so happens that most physics can be reduced to simulating many particles' interactions with the environment, and many abstractions exist to simulate this interaction. Take 1D rule-based cellular automata, for example. In a serial program one would keep track of the previous row and would write the following C loop to compute the next row

    while(row_count < MAXROWS) {
        for(i = 0; i < ROWLENGTH; i++) {
            if(i == 0) {
                /* left edge: treat the missing neighbour as 0 */
                row[i] = rule(0, old_row[i], old_row[i+1]);
            } else if(i == ROWLENGTH-1) {
                /* right edge: treat the missing neighbour as 0 */
                row[i] = rule(old_row[i-1], old_row[i], 0);
            } else {
                row[i] = rule(old_row[i-1], old_row[i], old_row[i+1]);
            }
        }
        /* swap the buffers so the new row becomes the old one */
        tmp = old_row;
        old_row = row;
        row = tmp;
        row_count++;
    }

A CUDA version of the previous program might look like the following

    __device__ int rule(int left, int top, int right) {
        // insert automata rule here
        ...
    }

    __global__ void kernel(int* old_row, int* row) {
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        if(i >= ROWLENGTH) return;   // guard threads past the end of the row
        if(i == 0) {
            row[i] = rule(0, old_row[i], old_row[i+1]);
        } else if(i == ROWLENGTH-1) {
            row[i] = rule(old_row[i-1], old_row[i], 0);
        } else {
            row[i] = rule(old_row[i-1], old_row[i], old_row[i+1]);
        }
    }

    void run( ) {
        // the row is one-dimensional, so a 1D launch with one thread per cell suffices
        dim3 block(256);
        dim3 grid((ROWLENGTH + block.x - 1) / block.x);
        while(row_count < MAXROWS) {
            kernel<<< grid, block>>>(old_row, row);
            ...
            display(row);
        }
        ...
    }

The above code should easily achieve more than a 20x speedup, and with some optimization it can run much faster. A similar program can be sketched for 2D cellular automata such as the Game of Life, and a 3D cellular automaton could also be implemented (when I find an example of 3D cellular automata).

The CUDA SDK provides a variety of examples exposing the power of parallel processing.
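
Before turning to those SDK examples, the one-thread-per-cell structure of the kernel above can be emulated on the CPU as a quick sanity check. The following Python sketch is only an illustration. Rule 110 is an arbitrary choice made here, since the text leaves the automaton rule unspecified, and the names `RULE_NUMBER`, `cell`, and `next_row` are hypothetical.

```python
# Sketch only: CPU emulation of the per-cell kernel, assuming Rule 110.
RULE_NUMBER = 110  # illustrative choice; the original leaves the rule open


def rule(left, top, right):
    """Look up the new cell value from the 3-cell neighbourhood."""
    index = (left << 2) | (top << 1) | right
    return (RULE_NUMBER >> index) & 1


def cell(i, old_row):
    """Work done by one 'thread': compute cell i, with zeros beyond the edges."""
    left = old_row[i - 1] if i > 0 else 0
    right = old_row[i + 1] if i < len(old_row) - 1 else 0
    return rule(left, old_row[i], right)


def next_row(old_row):
    """Emulate a kernel launch: every index is computed independently."""
    return [cell(i, old_row) for i in range(len(old_row))]


row = [0, 0, 0, 0, 0, 0, 0, 1]
history = [row]
for _ in range(3):
    row = next_row(row)
    history.append(row)

for r in history:
    print("".join("#" if c else "." for c in r))
```

Each call to `cell` reads only `old_row` and writes nothing shared, which is why the two-buffer scheme (`old_row` and `row`) is needed: a thread must never see a neighbour that has already been updated in the same step.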
The SDK's N-body example, for instance, simulates tens of thousands of particles using the direct method, which calculates the force between two particles by a slight variation of Newton's two-body formula

$$\vec{F}_{ij} = G \frac{m_i m_j}{|\vec{r}|^2 + \epsilon^2} \cdot \frac{\vec{r}}{|\vec{r}|}$$

where $\vec{r}$ is the separation vector between particles $i$ and $j$, and $\epsilon$ is a softening term that keeps the force finite as the two particles approach each other. Because every pair of particles must be evaluated, the cost grows as $O(n^2)$ per time step. Such an algorithm is infeasible on commodity CPUs, and most physicists have used a computationally feasible, albeit undesired, algorithm for the n-body problem. CUDA changed the landscape, making supercomputing more "democratic."

Other examples in the SDK include fluidsGL, which simulates fluid dynamics based on the idea of stable fluids. It too achieves speeds that are unattained on a CPU. On the downside, however, the program is in 2D, and relatively uninteresting for 3D visualization.

In this project we would like to bring the power of the CUDA chip to the CAVE/CUBE. There are a few reasons to do that:

1. **Price**: CUDA-enabled chips are reasonably priced and are within the reach of a consumer. The lower-end chips are under 100 dollars, while the higher-end ones go for over 1,000 dollars.
2. **Abundance**: NVIDIA claims to have sold 40 million cards that are CUDA enabled.
3. **Power**: While price and abundance are important to guarantee that this is not just hype, the performance of the CUDA chip makes the case for its importance. A CUDA-enabled GPU can easily outperform a CPU for half the price.

The CUDA chip is not reserved just for visualization. Two recent papers [^gem_aes] [^cuda_aes] implemented the AES cryptography algorithm with considerable speedup. Other people have used the power provided by CUDA for more sinister purposes [^breaking_md5]. By porting the md5crack program to CUDA, the author was able to crack all MD5 passwords shorter than 8 characters by brute force in under 16 minutes.

As a further illustration of what this hardware makes possible, the University of Antwerp in Belgium recently built a supercomputer with four NVIDIA GPUs for under 4000 dollars.
The system, which is contained in one desktop workstation, can outperform a cluster of hundreds of machines.

[^gem_aes]: T. Yamanouchi. AES Encryption and Decryption on the GPU. In: GPU Gems 3, Addison-Wesley Professional, Reading (2007).

[^cuda_aes]: S.A. Manavski. CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography.

[^breaking_md5]: Notes: Cuda md5 hashing experiments. http://majuric.org/software/cudamd