Title:
Acceleration and optimization of dynamic parallelism for irregular applications on GPUs

dc.contributor.advisor Yalamanchili, Sudhakar
dc.contributor.author Wang, Jin
dc.contributor.committeeMember Kim, Hyesoon
dc.contributor.committeeMember Vuduc, Richard
dc.contributor.committeeMember Krishna, Tushar
dc.contributor.committeeMember Pande, Santosh
dc.contributor.department Electrical and Computer Engineering
dc.date.accessioned 2017-01-11T14:03:48Z
dc.date.available 2017-01-11T14:03:48Z
dc.date.created 2016-12
dc.date.issued 2016-11-15
dc.date.submitted December 2016
dc.date.updated 2017-01-11T14:03:48Z
dc.description.abstract The objective of this thesis is the development, implementation and optimization of a GPU execution model extension that efficiently supports the time-varying, nested, fine-grained dynamic parallelism occurring in irregular, data-intensive applications. These dynamically formed pockets of structured parallelism can utilize the recently introduced device-side nested kernel launch capabilities on GPUs. However, the low utilization of GPU resources and the high cost of device-side kernel launch still make it difficult to harness dynamic parallelism on GPUs. This thesis therefore presents an extension to the common Bulk Synchronous Parallel (BSP) GPU execution model -- Dynamic Thread Block Launch (DTBL) -- which provides the capability of spawning light-weight thread blocks from GPU threads on demand and coalescing them with natively executing kernels. The finer granularity of a thread block provides effective and efficient control of smaller-scale, dynamically occurring nested pockets of structured parallelism during the computation. Evaluations of DTBL show an average 1.21x speedup over the baseline implementations. The thesis proposes two classes of optimizations of this model. The first is a thread block scheduling strategy that exploits spatial and temporal reference locality between parent kernels and dynamically launched child kernels. The locality-aware thread block scheduler achieves a further 27% increase in overall performance. The second is an energy efficiency optimization that detects SMX occupancy bubbles during the execution of a DTBL application and converts them into SMX idle periods, during which a flexible DVFS technique can be applied to reduce dynamic and leakage power and thereby improve energy efficiency. By presenting the implementations, measurements and key insights, this thesis takes a step toward addressing the challenges and issues in emerging irregular applications.
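The "device-side nested kernel launch capabilities" the abstract refers to are CUDA's dynamic parallelism feature, which DTBL extends. A minimal sketch of that baseline mechanism is shown below; the kernel names and launch configurations are illustrative, not taken from the dissertation, and compiling requires relocatable device code (`nvcc -rdc=true`) on a device of compute capability 3.5 or higher.

```cuda
#include <cstdio>

// Child kernel: handles one dynamically discovered pocket of parallelism.
__global__ void childKernel(int parentBlock) {
    printf("child of parent block %d, thread %d\n", parentBlock, threadIdx.x);
}

// Parent kernel: a GPU thread launches a child grid from device code,
// on demand, when it encounters nested work.
__global__ void parentKernel() {
    if (threadIdx.x == 0) {
        // Device-side nested kernel launch (CUDA dynamic parallelism).
        childKernel<<<1, 4>>>(blockIdx.x);
    }
}

int main() {
    parentKernel<<<2, 32>>>();
    cudaDeviceSynchronize();  // wait for parents and their children
    return 0;
}
```

Each such device-side launch creates a full child grid; the thesis's DTBL model instead spawns individual thread blocks and coalesces them into already-executing kernels, amortizing the launch cost that makes this stock mechanism expensive.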
dc.description.degree Ph.D.
dc.format.mimetype application/pdf
dc.identifier.uri http://hdl.handle.net/1853/56294
dc.language.iso en_US
dc.publisher Georgia Institute of Technology
dc.subject General-purpose GPU
dc.subject Dynamic parallelism
dc.subject Irregular applications
dc.title Acceleration and optimization of dynamic parallelism for irregular applications on GPUs
dc.type Text
dc.type.genre Dissertation
dspace.entity.type Publication
local.contributor.corporatename School of Electrical and Computer Engineering
local.contributor.corporatename College of Engineering
relation.isOrgUnitOfPublication 5b7adef2-447c-4270-b9fc-846bd76f80f2
relation.isOrgUnitOfPublication 7c022d60-21d5-497c-b552-95e489a06569
thesis.degree.level Doctoral
Files

Original bundle
Name: WANG-DISSERTATION-2016.pdf
Size: 4.75 MB
Format: Adobe Portable Document Format

License bundle
Name: LICENSE.txt
Size: 3.86 KB
Format: Plain Text