Title:
Tuned and asynchronous stencil kernels for CPU/GPU systems

Thumbnail Image
Author(s)
Venkatasubramanian, Sundaresan
Authors
Advisor(s)
Vuduc, Richard
Advisor(s)
Editor(s)
Associated Organization(s)
Organizational Unit
Supplementary to
Abstract
We describe heterogeneous multi-CPU and multi-GPU implementations of Jacobi's iterative method for the 2-D Poisson equation on a structured grid, in both single- and double-precision. Properly tuned, our best implementation achieves 98% of the empirical streaming GPU bandwidth (66% of peak) on a NVIDIA C1060. Motivated to find a still faster implementation, we further consider "wildly asynchronous" implementations that can reduce or even eliminate the synchronization bottleneck between iterations. In these versions, which are based on the principle of a chaotic relaxation (Chazan and Miranker, 1969), we simply remove or delay synchronization between iterations, thereby potentially trading off more flops (via more iterations to converge) for a higher degree of asynchronous parallelism. Our relaxed-synchronization implementations on a GPU can be 1.2-2.5x faster than our best synchronized GPU implementation while achieving the same accuracy. Looking forward, this result suggests research on similarly "fast-and-loose" algorithms in the coming era of increasingly massive concurrency and relatively high synchronization or communication costs.
Sponsor
Date Issued
2009-05-18
Extent
Resource Type
Text
Resource Subtype
Thesis
Rights Statement
Rights URI