The Design and Implementation of Ocelot’s Dynamic Binary Translator from PTX to Multi-Core x86
Author(s)
Diamos, Gregory
Abstract
Ocelot is a dynamic compilation framework designed to map the explicitly parallel PTX execution model used by NVIDIA CUDA applications onto diverse many-core architectures. Ocelot includes a dynamic binary translator from PTX to many-core processors that leverages the LLVM code generator to target x86. The binary translator can execute CUDA applications without recompilation, and Ocelot can in fact switch dynamically between execution on an NVIDIA GPU and a many-core CPU. It has been validated against over 100 applications taken from the CUDA SDK [1], the UIUC Parboil benchmarks [2], the Virginia Rodinia benchmarks [3], the GPU-VSIPL signal and image processing library [4], and several domain-specific applications.
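To make concrete what mapping the explicitly parallel PTX model onto a CPU involves, here is a minimal Python sketch. It is not Ocelot's translator, which compiles PTX to x86 through LLVM. It assumes a hypothetical `launch` function and kernels written as Python generators, where each `yield` stands in for a CTA-wide barrier (the role of `bar.sync` in PTX). All CTAs run serially on one core, and within a CTA every logical thread is advanced to the next barrier before any thread continues past it.

```python
from typing import Callable, Generator

# Hypothetical kernel signature (an assumption for this sketch):
# (block_idx, thread_idx, block_dim, shared_mem, global_mem) -> generator
# that yields once at every CTA-wide barrier.
Kernel = Callable[[int, int, int, list, list], Generator[None, None, None]]


def launch(kernel: Kernel, grid_dim: int, block_dim: int, global_mem: list) -> None:
    """Serially execute every CTA of the grid on the calling CPU thread."""
    for block_idx in range(grid_dim):
        shared_mem = [0] * block_dim  # fresh per-CTA shared memory
        live = [kernel(block_idx, t, block_dim, shared_mem, global_mem)
                for t in range(block_dim)]
        # Each pass runs every live thread up to its next barrier (or to the
        # end of the kernel), so no thread crosses a barrier early.
        while live:
            still_running = []
            for thread in live:
                try:
                    next(thread)
                    still_running.append(thread)
                except StopIteration:
                    pass  # this logical thread has finished the kernel
            live = still_running


def reverse_block(block_idx, tid, block_dim, shared, data):
    """Example kernel: reverse each block-sized chunk of `data` in place."""
    base = block_idx * block_dim
    shared[tid] = data[base + tid]
    yield  # barrier: every thread's write to `shared` is now complete
    data[base + tid] = shared[block_dim - 1 - tid]


data = list(range(8))
launch(reverse_block, grid_dim=2, block_dim=4, global_mem=data)
print(data)  # [3, 2, 1, 0, 7, 6, 5, 4]
```

Without the barrier, running the threads of a CTA one after another would let thread 0 read `shared[3]` before thread 3 had written it. The barrier-respecting schedule is what preserves the semantics of the explicitly parallel program when it is serialized onto a single core.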
This paper presents a detailed description of the implementation of our binary translator, highlighting design decisions and trade-offs and showcasing their effect on application performance. We explore several code transformations that are applicable only when translating explicitly parallel applications, and we suggest additional optimization passes that may be useful to this class of applications. We expect this study to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL), as well as future CPU and GPU architectures.
Date
2009
Resource Type
Text
Resource Subtype
Technical Report