Compiler analysis and optimization of memory management in modern processors

Barua, Prithayan A.
Sarkar, Vivek
Modern processors, such as CPUs and GPUs, have continued to deliver higher performance through ongoing microarchitectural innovation in recent years. However, it is non-trivial for application developers to achieve anywhere close to peak performance on these platforms. Memory is the most common system performance bottleneck because main memory performance has not kept pace with improvements in processor performance. Caches generally serve as the first level of the memory hierarchy and hide the long latency of main memory accesses, which can be up to 100x slower than cache accesses. But on-chip caches are expensive: they take up a lot of silicon area and consume a significant share of processor power. Hence, efficient use of the data held in cache is one of the most crucial performance optimizations for application developers. This is the first problem addressed in this thesis. We develop a static compiler analysis to model the data cache usage of a given loop nest. We then use the cache analysis in a cost model to guide the unroll-and-jam loop transformation, with the objective of maximizing reuse of the data in cache. This compiler optimization can help developers improve the performance of their applications on modern CPUs.
We consider main memory as the next level in the memory hierarchy and observe that, over the last two decades, main memory access latency has improved by less than 2x, while bandwidth has improved by more than 100x. Modern GPUs use high-bandwidth memory and exploit it as a distinguishing feature to sustain thousands of concurrent threads and hide the long memory access latency. It can be non-trivial for application developers to use this high memory bandwidth efficiently. This thesis proposes a static analytical model of the GPU memory bandwidth utilization of a kernel. We then use the analysis to introduce a new cost model that guides a thread coarsening transformation to improve bandwidth utilization.
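As a rough illustration of the unroll-and-jam (unroll-jam) loop transformation named above, the following C sketch applies it by hand to a matrix-vector product. It is not the thesis's compiler pass or cost model: the function names, the fixed unroll factor of 2, and the row-major layout are assumptions made for this example. The point is only to show where the extra reuse comes from: after the outer loop is unrolled and the two inner-loop copies are fused, each load of x[j] serves two rows instead of one.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: y = A * x for an n x n row-major matrix.
 * Every element x[j] is re-read once per row i. */
void matvec_baseline(const double *a, const double *x, double *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < n; j++)
            acc += a[i * n + j] * x[j];
        y[i] = acc;
    }
}

/* Unroll-and-jam by a factor of 2 (hypothetical, hand-applied):
 * the outer i loop is unrolled twice and the two resulting inner
 * j loops are fused ("jammed") into one, so each x[j] that is
 * loaded is used for rows i and i + 1 before moving on. */
void matvec_unroll_jam2(const double *a, const double *x, double *y, size_t n) {
    assert(n % 2 == 0); /* this sketch omits the remainder loop for odd n */
    for (size_t i = 0; i < n; i += 2) {
        double acc0 = 0.0;
        double acc1 = 0.0;
        for (size_t j = 0; j < n; j++) {
            double xj = x[j];               /* loaded once, used twice */
            acc0 += a[i * n + j] * xj;
            acc1 += a[(i + 1) * n + j] * xj;
        }
        y[i] = acc0;
        y[i + 1] = acc1;
    }
}
```

Choosing the unroll factor is the part a cost model would decide: larger factors increase reuse but also register and cache pressure, which is why the factor here is fixed only for illustration.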
In the memory hierarchy for accelerator devices like GPUs, the host (CPU) memory is the next level in the organization. Any data required by a kernel executing on the GPU must be copied from the CPU main memory to the device memory. This memory copy is one of the slowest operations and can dominate the execution time of many GPU applications. Since device memory usually persists across multiple kernel instances, there is an opportunity to reuse data in device memory across multiple kernel executions. As the final part of this thesis, we develop an intermediate representation to model host-device memory copy operations. We first use it to design an analysis that detects incorrect usage of host-device memory copies in OpenMP applications. We then develop an optimization that removes redundant host-device memory copies. Our compiler tools can improve developer productivity and deliver high performance by automatically managing the GPU memory hierarchy.
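To make the notion of a redundant host-device copy concrete, here is a small hand-written C sketch using standard OpenMP target directives. It does not reproduce the thesis's intermediate representation, analysis, or optimization; the function names and the two-kernel workload are hypothetical. In the first version each target region maps the array tofrom on its own, so the array travels to the device and back twice. In the second, an enclosing target data region keeps the array resident on the device, so the inner map clauses find it already present and no additional transfer is needed between the two kernels.

```c
#include <assert.h>
#include <stddef.h>

/* Naive version: each target region maps v independently, so v is
 * copied host->device and device->host once per kernel (twice in total). */
void scale_then_shift_naive(double *v, size_t n, double s, double d) {
    #pragma omp target teams distribute parallel for map(tofrom: v[0:n])
    for (size_t i = 0; i < n; i++)
        v[i] *= s;

    #pragma omp target teams distribute parallel for map(tofrom: v[0:n])
    for (size_t i = 0; i < n; i++)
        v[i] += d;
}

/* Reuse version: the enclosing target data region performs one copy in
 * and one copy out. Inside it, v[0:n] is already present on the device,
 * so the inner map clauses do not trigger further transfers. */
void scale_then_shift_reuse(double *v, size_t n, double s, double d) {
    #pragma omp target data map(tofrom: v[0:n])
    {
        #pragma omp target teams distribute parallel for map(tofrom: v[0:n])
        for (size_t i = 0; i < n; i++)
            v[i] *= s;

        #pragma omp target teams distribute parallel for map(tofrom: v[0:n])
        for (size_t i = 0; i < n; i++)
            v[i] += d;
    }
}
```

Both functions compute the same result; when compiled without OpenMP offloading support the pragmas are ignored and the loops simply run on the host. The difference only shows up in data movement, which is the kind of property an automated analysis is better placed to track than a developer reading the code.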