Person:
Kim, Hyesoon

Publication Search Results

  • Item
    HPerf: A Lightweight Profiler for Task Distribution on CPU+GPU Platforms
    (Georgia Institute of Technology, 2015) Lee, Joo Hwan ; Nigania, Nimit ; Kim, Hyesoon ; Brett, Bevin
    Heterogeneous computing has emerged as one of the major computing platforms in many domains. Although there have been several proposals to aid programming for heterogeneous computing platforms, optimizing applications on such platforms is not an easy task. Identifying which parallel regions (or tasks) should run on GPUs or CPUs is one of the critical decisions for improving performance. In this paper, we propose a profiler, HPerf, to identify an efficient task distribution on CPU+GPU systems with low profiling overhead. HPerf is a hierarchical profiler: first it performs lightweight profiling, and then, if necessary, it performs detailed profiling to measure caching and data-transfer cost. Compared to a brute-force approach, HPerf reduces the profiling overhead significantly, and compared to a naive decision, it improves the performance of OpenCL applications by up to 25%. (A sketch of this two-phase decision appears after this list.)
  • Item
    Exploration of the energy and thermal behaviors of emerging architectures
    (Georgia Institute of Technology, 2014-09-30) Yalamanchili, Sudhakar ; Kim, Hyesoon
  • Item
    Qameleon: Hardware/software cooperative automated tuning for heterogeneous architectures
    (Georgia Institute of Technology, 2013-08) Kim, Hyesoon ; Vuduc, Richard
    The main goal of this project is to develop a framework that simplifies programming for heterogeneous platforms. The framework consists of (i) a runtime system to generate code that partitions and schedules work among heterogeneous processors, (ii) a general automated tuning mechanism based on machine learning, and (iii) performance- and power-modeling and profiling techniques to aid code generation.
  • Item
    Power Modeling for GPU Architecture Using McPAT
    (Georgia Institute of Technology, 2013) Lim, Jieun ; Lakshminarayana, Nagesh B. ; Kim, Hyesoon ; Song, William ; Yalamanchili, Sudhakar ; Sung, Wonyong
    Graphics Processing Units (GPUs) are very popular for both graphics and general-purpose applications. Since GPUs operate many processing units and manage multiple levels of memory hierarchy, they consume a significant amount of power. Although several power models for CPUs are available, the power consumption of GPUs has not been studied much yet. In this paper, we develop a new power model for GPUs by utilizing McPAT, a CPU power tool. We generate initial power model data from McPAT with a detailed GPU configuration, and then adjust the models by comparing them with empirical data. We use NVIDIA's Fermi architecture for building the power model, and our model estimates GPU power consumption with an average error of 7.7% and 12.8% for the microbenchmarks and Merge benchmarks, respectively. (A sketch of this calibration step appears after this list.)
  • Item
    The AM-Bench: An Android Multimedia Benchmark Suite
    (Georgia Institute of Technology, 2012) Lee, Chayong ; Kim, Euna ; Kim, Hyesoon
    Despite the significant evolution and increased use of mobile devices, few mobile benchmarks have been studied. Even though mobile applications share similar characteristics with traditional desktop-oriented applications, different programming environments and usage patterns present different characteristics. In this paper, we introduce an open-source mobile multimedia benchmark suite for Android platforms (AM-Bench). AM-Bench consists of several multimedia benchmarks running on Android platforms. We explain the characteristics of AM-Bench and compare performance on four Android-based platforms.
  • Item
    Design Space Exploration of On-chip Ring Interconnection for a CPU-GPU Architecture
    (Georgia Institute of Technology, 2012) Lee, Jaekyu ; Li, Si ; Kim, Hyesoon ; Yalamanchili, Sudhakar
    Future chip multiprocessors (CMPs) will only grow in core count and diversity in terms of frequency, power consumption, and resource distribution. Incorporating a GPU architecture, which is more efficient for certain types of applications, into the CMP is the next stage in this trend. This heterogeneous mix of architectures will use an on-chip interconnection to access shared resources such as last-level cache tiles and memory controllers. The configuration of this on-chip network will likely have a significant impact on resource distribution, fairness, and overall performance. The heterogeneity of this architecture inevitably exerts different pressures on the interconnection due to the differing characteristics and requirements of applications running on CPU and GPU cores. CPU applications are sensitive to latency, while GPGPU applications require massive bandwidth. This is due to the difference in the thread-level parallelism of the two architectures: GPUs use many threads to hide the effect of memory latency but require massive bandwidth to supply those threads, whereas CPU cores, typically running only one or two threads concurrently, are very sensitive to latency. This study surveys the impact and behavior of the interconnection network when CPU and GPGPU applications run simultaneously, shedding light for future interconnection studies on CPU-GPU heterogeneous architectures.
  • Item
    Evaluating Scalability of Multi-threaded Applications on a Many-core Platform
    (Georgia Institute of Technology, 2012) Gupta, Vishal ; Kim, Hyesoon ; Schwan, Karsten
    Multicore processors have been effective in scaling application performance by dividing computation among multiple threads running in parallel. However, application performance does not necessarily improve as more cores are added; it can be limited by multiple bottlenecks, including contention for shared resources such as caches and memory. In this paper, we perform a scalability analysis of parallel applications on a 64-threaded Intel Nehalem-EX-based system. We find that applications which scale well on a small number of cores can exhibit poor scalability on a large number of cores. Using hardware performance counters, we show that many performance-limited applications are constrained by memory bandwidth on many-core platforms and exhibit improved scalability when provisioned with higher memory bandwidth. By regulating the number of threads used and applying dynamic voltage and frequency scaling (DVFS) for memory-bandwidth-limited benchmarks, significant energy savings can be achieved. (A sketch of such a counter-driven policy appears after this list.)
  • Item
    A New Temperature Distribution Measurement Method on GPU Architectures Using Thermocouples
    (Georgia Institute of Technology, 2012) Dasgupta, Aniruddha ; Hong, Sunpyo ; Kim, Hyesoon ; Park, Jinil
    In recent years, many-core architectures have seen a rapid increase in the number of on-chip cores with a much slower increase in die area. This has led to very high power densities on the chip. Hence, in addition to power, temperature has become a first-order design constraint for high-performance architectures. However, temperature measurement is largely limited to on-chip temperature sensors, which might not always be available to researchers. In this paper, we propose a new temperature-measurement system using thermocouples for many-core GPU architectures and devise a new method to control GPU scheduling. This system gives us a temperature-distribution heatmap of the chip. In addition to monitoring temperature distribution, our system also monitors run-time power consumption. The results show a strong correlation between the on-chip heatmap patterns and power consumption. Furthermore, we present experimental results showing how TPC utilization and the locations of active TPCs can be chosen to reduce temperature and power consumption.
  • Item
    SD³: A Scalable Approach to Dynamic Data-Dependence Profiling
    (Georgia Institute of Technology, 2011) Kim, Minjang ; Lakshminarayana, Nagesh B. ; Kim, Hyesoon ; Luk, Chi-Keung
    As multicore processors are deployed in mainstream computing, the need for software tools to help parallelize programs is increasing dramatically. Data-dependence profiling is an important technique for exploiting parallelism in programs. More specifically, manual or automatic parallelization can use the outcomes of data-dependence profiling to guide where to parallelize in a program. However, state-of-the-art data-dependence profiling techniques are not scalable, as they suffer from two major issues when profiling large and long-running applications: (1) runtime overhead and (2) memory overhead. Existing data-dependence profilers are either unable to profile large-scale applications or report only very limited information. In this paper, we propose a scalable approach to data-dependence profiling that addresses both runtime and memory overhead in a single framework. Our technique, called SD³, reduces the runtime overhead by parallelizing the dependence profiling step itself. To reduce the memory overhead, we compress memory accesses that exhibit stride patterns and compute data dependences directly in the compressed format. We demonstrate that SD³ reduces the runtime overhead when profiling SPEC 2006 by factors of 4.1× and 9.7× on eight and 32 cores, respectively. For the memory overhead, we successfully profile SPEC 2006 with the reference input, while previous approaches fail even with the train input. In some cases, we observe more than a 20× improvement in memory consumption and a 16× speedup in profiling time when 32 cores are used. (A sketch of the stride-compression idea appears after this list.)
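
The HPerf entry above describes a two-phase (hierarchical) profiling strategy: a lightweight timing pass first, with detailed profiling of caching and data-transfer cost only when the cheap numbers are inconclusive. The Python sketch below illustrates that control flow under stated assumptions; the measurement helpers (time_on_cpu, time_on_gpu, measure_transfer_cost) and the margin value are hypothetical stand-ins, not part of HPerf.

```python
# Hierarchical profiling sketch (assumed structure, not HPerf's actual code).
# Phase 1 is cheap kernel timing; phase 2 (detailed and expensive) runs only
# when the cheap numbers are too close to call.

def choose_device(task, time_on_cpu, time_on_gpu, measure_transfer_cost,
                  margin=0.2):
    """Return 'cpu' or 'gpu' for one parallel region (task)."""
    # Phase 1: lightweight profiling -- raw execution time only.
    t_cpu = time_on_cpu(task)
    t_gpu = time_on_gpu(task)

    # If one device wins by a clear margin, skip detailed profiling.
    if t_gpu < t_cpu * (1 - margin):
        return "gpu"
    if t_cpu < t_gpu * (1 - margin):
        return "cpu"

    # Phase 2: detailed profiling -- add host<->device transfer cost,
    # which often flips the decision for short kernels.
    t_gpu_total = t_gpu + measure_transfer_cost(task)
    return "gpu" if t_gpu_total < t_cpu else "cpu"
```

Relative to a brute-force profiler that runs the detailed pass for every region, this pays the expensive measurement only where the decision is actually in doubt, which is where the claimed overhead reduction comes from.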
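The McPAT power-modeling entry describes generating initial per-component estimates from McPAT and then adjusting the models against empirical measurements. One minimal way to do such an adjustment is a least-squares fit of per-component scale factors, sketched below; the component breakdown and all numbers are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Rows: benchmarks; columns: per-component McPAT power estimates (watts).
# All values below are made up for illustration.
mcpat_estimates = np.array([
    [12.0, 30.0, 8.0],   # e.g. [SM logic, register file/SRAM, NoC]
    [10.0, 45.0, 9.0],
    [15.0, 25.0, 7.0],
])
measured_power = np.array([58.0, 71.0, 55.0])  # empirical GPU power (watts)

# Fit one scale factor per component so that the scaled McPAT output
# best matches the measurements in the least-squares sense.
scale, *_ = np.linalg.lstsq(mcpat_estimates, measured_power, rcond=None)

predicted = mcpat_estimates @ scale
rel_error = np.abs(predicted - measured_power) / measured_power
print("per-component scale factors:", scale)
print("mean relative error: %.1f%%" % (100 * rel_error.mean()))
```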
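The scalability study regulates thread count and applies DVFS once hardware performance counters show that an application is memory-bandwidth-limited. The sketch below shows one way such a policy could be structured; the counter-reading and frequency-setting helpers, the peak-bandwidth figure, and the saturation threshold are all assumptions for illustration, not values from the paper.

```python
# Counter-driven throttling sketch (an assumed policy, not the paper's tool).
# If measured DRAM bandwidth is near the platform peak, extra threads and a
# higher core frequency mostly burn energy, so both are scaled back.

PEAK_BANDWIDTH_GBS = 25.0   # assumed platform peak bandwidth
SATURATION = 0.9            # assumed "bandwidth-limited" threshold

def tune(read_dram_bandwidth_gbs, set_thread_count, set_cpu_freq_ghz,
         max_threads=64):
    bw = read_dram_bandwidth_gbs()          # from hardware counters
    if bw >= SATURATION * PEAK_BANDWIDTH_GBS:
        # Memory-bound: fewer threads and a lower DVFS state save energy
        # with little performance loss.
        set_thread_count(max_threads // 2)
        set_cpu_freq_ghz(1.6)               # illustrative low frequency
    else:
        # Compute-bound: keep all threads at the high DVFS state.
        set_thread_count(max_threads)
        set_cpu_freq_ghz(2.26)              # illustrative high frequency
```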
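SD³'s memory-overhead idea, per the abstract, is to represent strided address streams compactly and to test data dependences directly on the compressed form. The sketch below detects a single stride pattern and runs a conservative overlap test between two compressed streams; it is a simplified illustration, not SD³'s actual algorithm, and the test may over-approximate (it can report a potential dependence that does not exist, but never misses a real one for ascending streams).

```python
from math import gcd

def compress(addresses):
    """Return (base, stride, count) if the stream follows a single stride
    pattern, else None (a real profiler falls back to point lists)."""
    if len(addresses) < 2:
        return None
    stride = addresses[1] - addresses[0]
    for prev, cur in zip(addresses, addresses[1:]):
        if cur - prev != stride:
            return None
    return (addresses[0], stride, len(addresses))

def may_depend(s1, s2):
    """Conservative dependence test on two compressed streams."""
    base1, stride1, n1 = s1
    base2, stride2, n2 = s2
    assert stride1 > 0 and stride2 > 0  # sketch assumes ascending streams
    lo1, hi1 = base1, base1 + stride1 * (n1 - 1)
    lo2, hi2 = base2, base2 + stride2 * (n2 - 1)
    # The address ranges must intersect...
    if max(lo1, lo2) > min(hi1, hi2):
        return False
    # ...and base1 + i*stride1 == base2 + j*stride2 must be solvable in
    # integers, i.e. gcd(stride1, stride2) must divide the base difference.
    return (base2 - base1) % gcd(stride1, stride2) == 0
```

Checking dependences on (base, stride, count) triples instead of on every individual address is what keeps memory consumption bounded on long-running inputs, and independent triple pairs can be checked in parallel, which matches the paper's parallelized profiling step.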