CERCS Technical Report Series

Series Type: Publication Series

Publication Search Results

  • Item
    POD: A Parallel-On-Die Architecture
    (Georgia Institute of Technology, 2007-05) Woo, Dong Hyuk ; Fryman, Joshua Bruce ; Knies, Allan D. ; Eng, Marsha ; Lee, Hsien-Hsin Sean
    As power constraints, complexity, and design verification cost make it difficult to improve single-stream performance, the parallel computing paradigm is taking its place amongst mainstream high-volume architectures. Most current commercial designs focus on MIMD-style CMPs built with rather complex single cores. While such designs provide a degree of generality, they may not be the most efficient way to build processors for applications with inherently scalable parallelism. These designs have been proven to work well for certain classes of applications such as transaction processing, but they have driven the development of new languages and complex architectural features. Instead of building MIMD-CMPs for all workloads, we propose an alternative parallel on-die many-core architecture called POD, based on a large SIMD PE array. POD helps to address the key challenges of on-chip communication bandwidth, area limitations, and energy consumed by routers by factoring out features necessary for MIMD machines and focusing on architectures that match many scalable workloads. In this paper, we evaluate and quantify the advantages of the POD architecture by basing its ISA on a commercially relevant CISC architecture, and we show that it can be as efficient as more specialized array processors based on one-off ISAs. Our single-chip POD is capable of best-in-class scalar performance and up to 1.5 TFLOPS of single-precision floating-point arithmetic. Our experimental results show that in some application domains our architecture can achieve nearly linear speedup on a large number of SIMD PEs, and that this speedup is much greater than the maximum speedup that MIMD-CMPs of the same die size can achieve. Furthermore, owing to its synchronized computation and communication, POD can efficiently suppress the energy consumed by the novel communication method in our interconnection network.
  • Item
    SoftCache: Dynamic Optimizations for Power and Area Reduction in Embedded Systems (II)
    (Georgia Institute of Technology, 2005) Fryman, Joshua Bruce ; Lee, Hsien-Hsin Sean ; Huneycutt, Chad Marcus
    We propose a SoftCache for low power and reduced die area while providing application flexibility. Our implementations demonstrate that the network is a power-efficient means for accessing remote memory. The impact of this work suggests that SoftCache systems may be useful in future consumer electronics. Our results show that die power is reduced by 20%, die area is reduced by 10%, and transferring applications over the network is more energy-delay effective than local DRAM.
  • Item
    Intelligent Cache Management by Exploiting Dynamic UTI/MTI Behavior
    (Georgia Institute of Technology, 2005) Fryman, Joshua Bruce ; Huneycutt, Chad Marcus ; Snyder, Luke Aron ; Loh, Gabriel H. ; Lee, Hsien-Hsin Sean
    This work addresses the problem of the increasing performance disparity between the microprocessor and the memory subsystem. Current L1 caches fabricated in deep submicron processes must either shrink to maintain timing, or suffer higher latencies, exacerbating the problem. We introduce a new classification for the behavior of memory traffic, which we refer to as target behavior. Classification of the target behavior falls into two categories: Uni-Targeted Instructions (UTI) and Multi-Targeted Instructions (MTI). On average, 30% of all dynamic memory LD/ST operations come from execution of UTIs, yet only a few hundred static instructions are actually UTIs. This makes isolation of the UTI targets an avenue for optimization. The addition of a small, fast cache structure that contains only UTI data would ideally reduce MTI pollution of UTI information. By intelligently selecting between larger, slower data caches and our UTI cache, we reduce the latency problem while increasing performance. Our distinct contributions fall in three areas, with implications for many others: (1) we present a new characterization of memory traffic based on the number of targets from LD/ST instructions; (2) we explore the underlying nature of the target division and devise a simple mechanism for exploiting regularity based on a UTI cache; (3) we explore a variety of prediction mechanisms and processor configuration options to determine sensitivity and the performance gains actually attainable under different modern processor configurations. We attain up to 42% IPC improvements on SPEC2000, with a mean improvement of 8%. Our solution also reduces L2 accesses by up to 89% (average 29%), while reducing load-load violation traps by up to 84% (average 13%) and store-load violation traps by up to 43% (average 8%).
  • Item
    SoftCache: A Technique for Power and Area Reduction in Embedded Systems
    (Georgia Institute of Technology, 2003) Fryman, Joshua Bruce ; Lee, Hsien-Hsin Sean ; Huneycutt, Chad Marcus ; Farooqui, Naila F. ; Mackenzie, Kenneth M. ; Schimmel, D. E. (David E.)
    Explicitly software-managed cache systems are postulated as a solution to power considerations in computing devices. The savings expected in a SoftCache lie in the removal of tag storage, associativity logic, comparators, and other hardware dedicated to memory hierarchies. The penalty lies in high cache-miss cost and the additional instructions required to effect a cache model. In this paper, we characterize SoftCaches by placing them in the overall computing landscape, analyzing the energy and space trade-offs. We present results that indicate a SoftCache saves power and space over hardware caches. Based on the TSMC 0.25um process from MOSIS, we use schematic and layout representations of hardware and SoftCache models for comparison. Accounting for the additional instructions executed and the simplification of logic, we examine the high SoftCache miss cost in relation to the overall system. For a 256KB "mode" change every 1.45 hours, the SoftCache exhibits a 1% application slowdown for energy savings of 30% or more in a low-power device such as the SA-110 microprocessor used in PocketPC platforms.
  • Item
    Energy Efficient Network Memory for Ubiquitous Devices
    (Georgia Institute of Technology, 2003) Fryman, Joshua Bruce ; Huneycutt, Chad Marcus ; Lee, Hsien-Hsin Sean ; Mackenzie, Kenneth M. ; Schimmel, D. E. (David E.)
    This paper explores the energy and delay issues that arise when some or all of the local storage is moved out of the embedded device and onto a remote network server. We demonstrate that using the network to access remote storage in lieu of local DRAM results in significant power savings. Mobile applications continually demand additional memory, with traditional designs increasing DRAM to address this problem. Modern devices also incorporate low-power network links to support connected ubiquitous environments. Engineers then attempt to minimize utilization of the network due to its perceived large power consumption. This perception is misleading. For 1KB application "pages", network memory is more power-efficient than one 2MB DRAM part when the mean time between page transfers exceeds 0.69s. During each transfer, the application delay to the user is only 16ms.
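
The POD report in this series argues that a large array of simple SIMD PEs can out-scale a MIMD-CMP occupying the same die area. As a rough illustration only, the C sketch below plugs an Amdahl-style speedup formula into a fixed area budget; the area per core, area per PE, and parallel fraction are hypothetical placeholders, not figures from the report.

```c
/*
 * Illustrative sketch only: a back-of-the-envelope comparison of a few
 * large MIMD cores against many small SIMD PEs under the same die-area
 * budget. All constants below are hypothetical assumptions.
 */
#include <stdio.h>

/* Amdahl-style speedup with n parallel units and parallel fraction f. */
static double speedup(double f, double n)
{
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void)
{
    const double die_area      = 400.0; /* arbitrary area units                */
    const double area_per_core = 25.0;  /* assumed cost of a complex MIMD core */
    const double area_per_pe   = 2.0;   /* assumed cost of a simple SIMD PE    */
    const double f             = 0.99;  /* assumed parallel fraction           */

    double mimd_cores = die_area / area_per_core; /* 16 cores with these numbers */
    double simd_pes   = die_area / area_per_pe;   /* 200 PEs with these numbers  */

    printf("MIMD-CMP: %3.0f cores -> speedup %.1fx\n",
           mimd_cores, speedup(f, mimd_cores));
    printf("SIMD POD: %3.0f PEs   -> speedup %.1fx\n",
           simd_pes, speedup(f, simd_pes));
    return 0;
}
```

Under these made-up constants the PE array fits roughly an order of magnitude more execution units than the CMP, which is the intuition behind the near-linear speedups the abstract reports for scalable workloads.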
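
The UTI/MTI report classifies each static LD/ST instruction by how many distinct data addresses it touches: one target makes it a Uni-Targeted Instruction, more than one makes it Multi-Targeted. The C sketch below shows one minimal way to perform that classification over a toy trace; the trace values, the direct-mapped table, and the lack of collision handling are assumptions made for illustration, not the authors' methodology.

```c
/*
 * Illustrative sketch only: tagging static LD/ST instructions as UTI or
 * MTI from a toy (pc, data address) trace.
 */
#include <stdio.h>
#include <stdint.h>

#define MAX_STATIC 128

struct entry {
    uint64_t pc;      /* static LD/ST instruction address     */
    uint64_t target;  /* first data address seen for this pc  */
    int      seen;    /* has this pc appeared in the trace?   */
    int      multi;   /* 1 once a second distinct target hits */
};

static struct entry table[MAX_STATIC];

static void record(uint64_t pc, uint64_t addr)
{
    struct entry *e = &table[pc % MAX_STATIC]; /* toy direct-mapped table */
    if (!e->seen) {
        e->pc = pc; e->target = addr; e->seen = 1; e->multi = 0;
    } else if (e->pc == pc && e->target != addr) {
        e->multi = 1;                          /* second distinct target  */
    }
    /* collisions between different pcs are ignored in this sketch */
}

int main(void)
{
    /* Hypothetical (pc, data address) trace. */
    const uint64_t trace[][2] = {
        {0x400100, 0x1000}, {0x400100, 0x1000}, /* one target  -> UTI */
        {0x400108, 0x2000}, {0x400108, 0x2040}, /* two targets -> MTI */
        {0x400110, 0x3000},
    };
    const size_t n = sizeof trace / sizeof trace[0];

    for (size_t i = 0; i < n; i++)
        record(trace[i][0], trace[i][1]);

    for (size_t i = 0; i < MAX_STATIC; i++)
        if (table[i].seen)
            printf("pc 0x%llx -> %s\n",
                   (unsigned long long)table[i].pc,
                   table[i].multi ? "MTI" : "UTI");
    return 0;
}
```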
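
The 0.69 s figure in the network-memory report is a break-even interval: the point at which the radio energy of one page transfer, amortized over the mean time between transfers, falls below the power of keeping a local DRAM part alive. The C sketch below reproduces only the shape of that calculation; the link rate, radio power, and DRAM standby power are placeholder assumptions, so the printed numbers will not match the paper's measurements.

```c
/*
 * Illustrative sketch only: break-even interval for fetching 1KB pages
 * over a low-power network link versus keeping local DRAM powered.
 * Every constant below is a hypothetical placeholder.
 */
#include <stdio.h>

int main(void)
{
    const double page_bytes  = 1024.0;  /* 1 KB application page          */
    const double link_bps    = 1.0e6;   /* assumed radio link rate (b/s)  */
    const double radio_watts = 0.090;   /* assumed radio active power (W) */
    const double dram_watts  = 0.001;   /* assumed DRAM standby power (W) */

    /* Time and energy to pull one page across the link. */
    double xfer_s = page_bytes * 8.0 / link_bps;
    double e_xfer = radio_watts * xfer_s;

    /*
     * Network memory wins on energy when one transfer, amortized over the
     * mean time between transfers T, is below the DRAM's standby power:
     *   e_xfer / T < dram_watts   =>   T > e_xfer / dram_watts
     */
    double breakeven_s = e_xfer / dram_watts;

    printf("per-page transfer time : %.3f ms\n", xfer_s * 1e3);
    printf("per-page radio energy  : %.3f mJ\n", e_xfer * 1e3);
    printf("break-even interval    : %.2f s\n", breakeven_s);
    return 0;
}
```

Per-transfer latency is a separate concern from energy, which is why the abstract also reports the 16 ms delay seen by the application during each transfer.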