Series
CERCS Technical Report Series

Series Type
Publication Series

Publication Search Results

Now showing 1 - 7 of 7

Energy Introspector: Coordinated Architecture-Level Simulation of Processor Physics

2013 , Song, William J. , Mukhopadhyay, Saibal , Rodrigues, Arun , Yalamanchili, Sudhakar

Increased power and heat dissipation in microprocessors impose limitations on performance scaling. Power and thermal management techniques coupled with workload dynamics cause increasing spatiotemporal variations in electrical and thermal stresses. The coupling between various physical phenomena (e.g., power, temperature, reliability, delay) will be critical to microarchitectural operations in future processors. Thus, we need modeling tools to enable the exploration of such physical interactions and drive development of microarchitectural solutions. This paper introduces a novel framework, Energy Introspector (EI), for the coordinated simulation of microarchitecture and physics models. The EI framework features flexible modeling of the processor component hierarchy, which enables simulating different microarchitecture and package designs. The proposed framework uses a standardized interface to drive different implementations of physics models and capture their interactions. The EI supports parallel computation of models in anticipation of large-scale simulations (e.g., high core-count processors). We present a case study using the EI framework to assess reliability and performance tradeoffs with a full-system cycle-level simulation of an asymmetric chip multiprocessor (ACMP).

Design Space Exploration of On-chip Ring Interconnection for a CPU-GPU Architecture

2012 , Lee, Jaekyu , Li, Si , Kim, Hyesoon , Yalamanchili, Sudhakar

Future chip multiprocessors (CMP) will only grow in core count and diversity in terms of frequency, power consumption, and resource distribution. Incorporating a GPU architecture into CMP, which is more efficient with certain types of applications, is the next stage in this trend. This heterogeneous mix of architectures will use an on-chip interconnection to access shared resources such as last-level cache tiles and memory controllers. The configuration of this on-chip network will likely have a significant impact on resource distribution, fairness, and overall performance. The heterogeneity of this architecture inevitably exerts different pressures on the interconnection due to the differing characteristics and requirements of applications running on CPU and GPU cores. CPU applications are sensitive to latency, while GPGPU applications require massive bandwidth. This is due to the difference in the thread-level parallelism of the two architectures. GPUs use more threads to hide the effect of memory latency but require massive bandwidth to supply those threads. On the other hand, CPU cores typically running only one or two threads concurrently are very sensitive to latency. This study surveys the impact and behavior of the interconnection network when CPU and GPGPU applications run simultaneously. This will shed light on other architectural interconnection studies on CPU-GPU heterogeneous architectures.

Execution Environment Support for Many Core Heterogeneous Accelerator Platforms

2010 , Gupta, Vishakha , Yalamanchili, Sudhakar , Duato, José

We are seeing the advent of large-scale, heterogeneous systems composed of homogeneous general-purpose cores intermingled with customized heterogeneous cores and interconnected to diverse memory hierarchies. The presence of accelerators requires support for new programming abstractions and run-time environments that can efficiently harvest platform resources composed of general-purpose and specialized processing cores, their diverse memory units and memory management support, and the communication links that connect them. This paper describes an execution model and systems infrastructure for modeling and supporting multi-accelerator architectures in general, and experiences with an implementation for an interconnected network of Cell Broadband Engine processors in particular. The primary contributions of this paper are i) a pooled accelerator execution model for orchestrating computations on and data movements across multiple accelerators, ii) an API for implementing the model effectively, and iii) a distributed simulation environment for modeling multiple, communicating Cell/B.E. processors.

Power Modeling for GPU Architecture Using McPAT

2013 , Lim, Jieun , Lakshminarayana, Nagesh B. , Kim, Hyesoon , Song, William , Yalamanchili, Sudhakar , Sung, Wonyong

Graphics Processing Units (GPUs) are very popular for both graphics and general-purpose applications. Since GPUs operate many processing units and manage multiple levels of memory hierarchy, they consume a significant amount of power. Although several power models for CPUs are available, the power consumption of GPUs has not been studied much yet. In this paper, we develop a new power model for GPUs by utilizing McPAT, a CPU power tool. We generate initial power model data from McPAT with a detailed GPU configuration, and then adjust the models by comparing them with empirical data. We use NVIDIA’s Fermi architecture for building the power model, and our model estimates the GPU power consumption with an average error of 7.7% and 12.8% for the microbenchmarks and Merge benchmarks, respectively.

System Impact of 3D Processor-Memory Interconnect: A Limit Study

2011 , Rasquinha, Mitchelle , Hassan, Syed Minhaj , Song, William , Chae, Kwanyeob , Cho, Minki , Mukhopadhyay, Saibal , Yalamanchili, Sudhakar

3D integration with through-silicon-vias (TSVs) can provide enormous bandwidth between processor die and memory die. The central goal of our work is to explore the limits of performance improvement that can be achieved with such integration. Toward this end, we propose a model of the impact of 3D TSVs on system performance. The model leads to several key observations: i) increased miss tolerance (smaller caches) and hence improved core scaling for a fixed die size, ii) higher sustained IPC per core, iii) significantly smaller, energy-efficient DRAM banks, iv) redistribution of system power to the cores and on-die interconnect, and v) TSV utilization is a function of the relationship between reference locality and the bandwidth properties of the intra-die network. These observations are repeated in cycle-level simulations of a 64-tile architecture.

Centralized Buffer Router with Elastic Links and Bubble Flow Control

2013 , Hassan, Syed Minhaj , Yalamanchili, Sudhakar

While router buffers have been used as performance multipliers, they are also major consumers of area and power in on-chip networks. In this paper, we propose the centralized elastic bubble router, a router micro-architecture based on the use of centralized buffers (CB) with elastic buffered (EB) links. At low loads, the CB is power gated, bypassed, and optimized to produce single-cycle operation. A novel extension to bubble flow control enables routing deadlock and message-dependent deadlock to be avoided with the same mechanism, using a constant buffer size per router independent of the number of message types. This solution enables end-to-end latency reduction via high-radix switches with low overall buffer requirements. Comparisons made with other low-latency routers across different topologies show consistent performance improvement: for example, a 26% improvement in no-load latency of a 2D Mesh and a 4X improvement in saturation throughput in a 2D-Generalized Hypercube.

A Power Capping Controller for Multicore Processors

2011 , Almoosa, Nawaf , Song, William , Wardi, Yorai Y. , Yalamanchili, Sudhakar

This paper presents an online controller for tracking power budgets in multicore processors using dynamic voltage-frequency scaling. The proposed control law comprises an integral controller whose gain is adjusted online based on the derivative of the power-frequency relationship. The control law is designed to achieve rapid settling time, and its tracking property is formally proven. Importantly, the controller design does not require off-line analysis of application workloads, making it feasible for emerging heterogeneous and asymmetric multicore processors. Simulation results are presented for controlling power dissipation in multiple cores of an asymmetric multicore processor. Each core is i) equipped with the controller, ii) assigned a power budget, and iii) operates independently in tracking its power budget. We use a cycle-level multicore simulator driven by traces from SPEC2006 benchmarks to demonstrate that the proposed algorithm achieves a faster settling time than examples of a static setting of the controller gain.
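
The control idea in the abstract above can be illustrated with a minimal sketch. It is not the report's actual control law, which is not given here. It assumes a hypothetical cubic power model (`plant_power`), a gain set to the inverse of a finite-difference estimate of dP/df, and illustrative frequency limits and step counts. A fixed-gain variant is included for comparison.

```python
def plant_power(freq_ghz):
    """Hypothetical core power model in watts (assumption, not from the report)."""
    return 2.0 * freq_ghz ** 3 + 1.0


def track_power_budget(budget_w, fixed_gain=None, freq_ghz=1.0, steps=30,
                       f_min=0.5, f_max=3.0, probe=1e-3):
    """Integral controller on frequency; returns a list of (frequency, power).

    With fixed_gain=None the gain is recomputed every step as 1 / (dP/df),
    where dP/df is estimated by a forward finite difference.
    """
    history = []
    for _ in range(steps):
        power = plant_power(freq_ghz)
        history.append((freq_ghz, power))
        if fixed_gain is None:
            # Estimate the slope of the power-frequency curve at this point.
            slope = (plant_power(freq_ghz + probe) - power) / probe
            gain = 1.0 / max(slope, 1e-6)  # guard against a flat curve
        else:
            gain = fixed_gain
        # Integral action: accumulate the scaled tracking error.
        freq_ghz += gain * (budget_w - power)
        freq_ghz = min(max(freq_ghz, f_min), f_max)  # respect DVFS limits
    return history


def settling_step(history, budget_w, tol=0.01):
    """First step from which power stays within tol (relative) of the budget."""
    for k in range(len(history)):
        if all(abs(p - budget_w) <= tol * budget_w for _, p in history[k:]):
            return k
    return None


if __name__ == "__main__":
    budget = 10.0
    adaptive = track_power_budget(budget)
    static = track_power_budget(budget, fixed_gain=0.02)
    print("adaptive gain settles at step", settling_step(adaptive, budget))
    print("static gain settles at step", settling_step(static, budget))
```

In this toy setting the slope-based gain behaves like a Newton-style update, so it adapts its step size to the operating point; a static gain must be chosen conservatively to stay stable at high frequencies, which slows it down at low ones.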