Runtime Approaches to Improve the Efficiency of Hybrid and Irregular Applications

Bak, Seonmyeong
Sarkar, Vivek
On-node parallelism has increased significantly in high-performance computing systems. Regular parallel applications can exploit this parallelism relatively easily, because straightforward approaches usually suffice to map their computation patterns and data layouts onto the available on-node parallelism. Irregular parallel applications, however, require considerable effort to run well on modern processors with massive amounts of intra-node parallelism. Parallel programming models and runtime approaches have been proposed to help programmers write such applications quickly, but writing efficient irregular parallel applications is still difficult. Two key challenges in mapping irregular applications onto on-node parallelism are load balance and computation-communication overlap. In this thesis proposal, we address these challenges through new runtime approaches and new APIs that enable users to provide minimal information for application-aware scheduling.

First, we introduce new algorithms to improve the scheduling of irregular task graphs that contain a mix of communication and computation tasks with data parallelism and blocking operations. For data-parallel tasks with frequent inter-node and intra-node communication, we combine gang scheduling with work stealing to reduce interference and expensive context switches. We also propose improved victim selection policies for work stealing that improve the load balance and overlap of ready tasks that have child tasks.

Next, we propose an efficient integrated runtime system to handle load balancing of irregular applications written in hybrid parallel programming models. We introduce a unified runtime system that integrates distributed and shared-memory programming, as exemplified by the combination of Charm++ and OpenMP. In this approach, all processing resources (cores) can be used flexibly across both the distributed and shared-memory levels, which enables more efficient load balancing at the intra-node level and reduces waiting times for global synchronization at the inter-node level.

Finally, we propose a set of APIs that enable users to specify the functions used to decompose a target loop into subspaces and to create chunks within each subspace for application-specific load balancing. Our runtime uses the information provided through these APIs to create user-defined chunks and to store balanced groups of chunks in a shared data structure indexed by static loop constructs. In this way, the information stored from one invocation of a loop can be reused in subsequent invocations for an improved initial load balance.
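The loop-chunking idea in the abstract can be illustrated with a minimal Python sketch. It is not the API proposed in the thesis: `balance_loop`, the cache `_balanced_groups`, the callback names, and the greedy largest-first grouping are all assumptions made for illustration only. The user supplies one function that decomposes the iteration space into subspaces, one that cuts a subspace into chunks, and a cost estimate; the sketch groups the chunks per worker and stores the result under a loop identifier so that a later invocation reuses it.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Chunk = Tuple[int, int]  # half-open iteration range [start, end)

# Hypothetical shared structure: balanced chunk groups keyed by a loop identifier.
_balanced_groups: Dict[str, List[List[Chunk]]] = {}


def balance_loop(
    loop_id: str,
    lo: int,
    hi: int,
    decompose: Callable[[int, int], List[Chunk]],
    make_chunks: Callable[[Chunk], List[Chunk]],
    cost: Callable[[Chunk], float],
    num_workers: int,
) -> List[List[Chunk]]:
    """Return one list of chunks per worker, reusing a stored result if present."""
    if loop_id in _balanced_groups:
        return _balanced_groups[loop_id]

    # User-defined decomposition: loop -> subspaces -> chunks.
    chunks = [c for sub in decompose(lo, hi) for c in make_chunks(sub)]

    # Assumed balancing heuristic: hand the most expensive remaining chunk
    # to the currently least-loaded worker.
    heap = [(0.0, w) for w in range(num_workers)]
    groups: List[List[Chunk]] = [[] for _ in range(num_workers)]
    for chunk in sorted(chunks, key=cost, reverse=True):
        load, w = heapq.heappop(heap)
        groups[w].append(chunk)
        heapq.heappush(heap, (load + cost(chunk), w))

    _balanced_groups[loop_id] = groups
    return groups


# Example: an irregular loop in which iteration i costs roughly i units of work.
def halves(lo: int, hi: int) -> List[Chunk]:
    mid = (lo + hi) // 2
    return [(lo, mid), (mid, hi)]


def fixed_chunks(sub: Chunk, size: int = 4) -> List[Chunk]:
    start, end = sub
    return [(s, min(s + size, end)) for s in range(start, end, size)]


def triangular_cost(chunk: Chunk) -> float:
    return float(sum(range(chunk[0], chunk[1])))


groups = balance_loop("loop@example", 0, 32, halves, fixed_chunks, triangular_cost, 2)
```

A second call with the same `loop_id` returns the stored groups without recomputing them, which mirrors the reuse across loop invocations that the abstract describes. A real runtime would also need to refresh the stored groups when measured chunk costs change.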