CERCS Technical Report Series
Series Type: Publication Series
Publication Search Results

Now showing 1 - 10 of 17
  • Item
    Security Refresh: Prevent Malicious Wear-out and Increase Durability for Phase-Change Memory with Dynamically Randomized Address Mapping
    (Georgia Institute of Technology, 2009-11) Seong, Nak Hee ; Woo, Dong Hyuk ; Lee, Hsien-Hsin Sean
    Phase-change Random Access Memory (PRAM) is an emerging memory technology for future computing systems. It is nonvolatile and has a faster read latency and potentially higher storage density than other memory alternatives. Recently, system researchers have studied the trade-off of using PRAM to back up a DRAM cache as a last-level memory or to implement it in a hybrid memory architecture. The main roadblock preventing PRAM from becoming commercially viable, however, is its much lower write endurance. Several recent proposals attempted to address this issue by either reducing PRAM's write frequency or using wear-leveling techniques to evenly distribute PRAM writes. Although the lifetime of PRAM could be extended by these techniques under normal operation of typical applications, most of them do not prevent malicious code deliberately designed to wear it out. Furthermore, all of these prior techniques fail to consider the circumstances in which a compromised OS is present, and the security implications for the overall PRAM design. A compromised OS (e.g., via a simple buffer overflow) will allow adversaries to manipulate all processes and exploit side channels easily, accelerating the wear-out of targeted PRAM blocks and rendering the system dysfunctional. In this paper, we argue that a PRAM design not only has to consider normal wear-out under conventional application behavior but, more importantly, must take the worst-case scenario into account in the presence of malicious exploits and a compromised OS. Such a design consideration addresses both the durability and security issues of PRAM simultaneously. Toward this goal, we propose a novel, low-cost hardware mechanism called Security Refresh. Analogous to DRAM refresh, which protects against charge leakage, Security Refresh prevents information leakage by constantly migrating data to different physical locations (thus "refreshing") inside the PRAM, obfuscating the actual data placement from users and system software. It uses a dynamic randomized address mapping scheme that swaps data between random PRAM blocks using random keys, generated from thermal noise, whenever a refresh is due. The hardware is extremely low-cost and uses no mapping table. We present two implementation alternatives and show their trade-offs and respective wear-out endurance. For a given configuration, we show that the optimal lifetime of a PRAM block (256B) is 8 years. In addition, we show that the performance impact of Security Refresh is mostly negligible.
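The effect of periodic randomized swapping can be illustrated with a highly simplified software model. Note the actual hardware remaps addresses by XOR with random keys and keeps no table at all; the mapping list, random pair selection, and swap interval below are illustrative assumptions, kept only to show how periodic random swaps spread a hammering attack's writes across physical blocks.

```python
import random

class ToySecurityRefresh:
    """Toy model of periodic randomized block swapping (illustrative only:
    the real design remaps via XOR with random keys and uses no table)."""

    def __init__(self, num_blocks=16, swap_interval=4):
        self.map = list(range(num_blocks))  # logical -> physical block
        self.wear = [0] * num_blocks        # write count per physical block
        self.interval = swap_interval
        self.writes = 0

    def write(self, logical):
        self.wear[self.map[logical]] += 1
        self.writes += 1
        if self.writes % self.interval == 0:
            # Swap the physical placement of two random logical blocks;
            # moving the data costs one extra write to each block.
            a, b = random.sample(range(len(self.map)), 2)
            self.wear[self.map[a]] += 1
            self.wear[self.map[b]] += 1
            self.map[a], self.map[b] = self.map[b], self.map[a]

random.seed(1)
mem = ToySecurityRefresh()
for _ in range(10_000):
    mem.write(0)  # malicious code hammering a single logical block
print(max(mem.wear), min(mem.wear))
```

Without the swaps, all 10,000 writes would land on one physical block; with them, the wear spreads across the region at the cost of the extra swap writes.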
  • Item
    Chameleon: Virtualizing Idle Acceleration Cores of A Heterogeneous Multi-Core Processor for Caching and Prefetching
    (Georgia Institute of Technology, 2008) Woo, Dong Hyuk ; Fryman, Joshua B. ; Knies, Allan D. ; Lee, Hsien-Hsin Sean
    Heterogeneous multi-core processors have emerged as an energy- and area-efficient architectural solution for improving performance on domain-specific applications, such as those with a plethora of data-level parallelism. These processors typically contain a large number of small, compute-centric cores for acceleration, while keeping one or two high-performance ILP cores on the die to guarantee single-thread performance. Although a major portion of the transistors is occupied by the acceleration cores, these resources sit idle when running unparallelized legacy code or the sequential parts of an application. To address this under-utilization issue, in this paper we introduce Chameleon, a flexible heterogeneous multi-core architecture that virtualizes these resources to enhance memory performance when running sequential programs. The Chameleon architecture can dynamically virtualize the idle acceleration cores into a last-level cache, a data prefetcher, or a hybrid of the two. In addition, Chameleon can operate in an adaptive mode that dynamically reconfigures the acceleration cores between the hybrid mode and the prefetch-only mode by monitoring the effectiveness of the Chameleon caching scheme. In our evaluation using the SPEC2006 benchmark suite, different levels of performance improvement were achieved in different modes for different applications. In the adaptive mode, Chameleon improves the performance of SPECint06 and SPECfp06 by 33% and 22% on average. When considering only memory-intensive applications, Chameleon improves system performance by 53% and 33%, respectively.
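The adaptive mode can be sketched as a simple epoch-based controller. The epoch length, hit-rate threshold, and method names below are illustrative assumptions, not the paper's exact mechanism.

```python
class ChameleonModeController:
    """Sketch of an adaptive mode: sample the effectiveness of the
    virtualized last-level cache each epoch and fall back to prefetch-only
    when caching is not paying off (all thresholds are assumptions)."""

    HYBRID, PREFETCH_ONLY = "hybrid", "prefetch-only"

    def __init__(self, epoch=1000, min_hit_rate=0.05):
        self.mode = self.HYBRID
        self.epoch = epoch
        self.min_hit_rate = min_hit_rate
        self.accesses = 0
        self.hits = 0

    def record_access(self, hit_in_virtual_cache):
        self.accesses += 1
        self.hits += int(hit_in_virtual_cache)
        if self.accesses == self.epoch:
            # Keep caching only while the virtualized cache is actually hit.
            rate = self.hits / self.accesses
            self.mode = self.HYBRID if rate >= self.min_hit_rate else self.PREFETCH_ONLY
            self.accesses = self.hits = 0

ctl = ChameleonModeController(epoch=10, min_hit_rate=0.2)
for _ in range(10):
    ctl.record_access(False)   # an epoch of pure misses
print(ctl.mode)
```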
  • Item
    POD: A Parallel-On-Die Architecture
    (Georgia Institute of Technology, 2007-05) Woo, Dong Hyuk ; Fryman, Joshua Bruce ; Knies, Allan D. ; Eng, Marsha ; Lee, Hsien-Hsin Sean
    As power constraints, complexity, and design verification cost make it difficult to improve single-stream performance, the parallel computing paradigm is taking its place amongst mainstream high-volume architectures. Most current commercial designs focus on MIMD-style CMPs built with rather complex single cores. While such designs provide a degree of generality, they may not be the most efficient way to build processors for applications with inherently scalable parallelism. These designs have been proven to work well for certain classes of applications such as transaction processing, but they have driven the development of new languages and complex architectural features. Instead of building MIMD CMPs for all workloads, we propose an alternative parallel on-die many-core architecture called POD, based on a large SIMD PE array. POD helps to address the key challenges of on-chip communication bandwidth, area limitations, and energy consumed by routers, by factoring out features necessary for MIMD machines and focusing on architectures that match many scalable workloads. In this paper, we evaluate and quantify the advantages of the POD architecture by basing its ISA on a commercially relevant CISC architecture, and show that it can be as efficient as more specialized array processors based on one-off ISAs. Our single-chip POD is capable of best-in-class scalar performance and up to 1.5 TFLOPS of single-precision floating-point arithmetic. Our experimental results show that in some application domains our architecture can achieve nearly linear speedup on a large number of SIMD PEs, much greater than the maximum speedup that MIMD CMPs of the same die size can achieve. Furthermore, owing to synchronized computation and communication, we show that POD can efficiently suppress the energy consumed by the novel communication method in our interconnection network.
  • Item
    Adaptive Transaction Scheduling for Transactional Memory Systems
    (Georgia Institute of Technology, 2007) Yoo, Richard M. ; Lee, Hsien-Hsin Sean
    Transactional memory systems are expected to enable parallel programming at lower programming complexity, while delivering improved performance over traditional lock-based systems. Nonetheless, we observe that there are situations where transactional memory systems can actually perform worse, and that these situations will become dominant in future workloads as more, larger-scale transactional memory systems become available. Transactional memory systems can outperform locks only when the executing workloads contain sufficient parallelism. When the workload lacks inherent parallelism, blindly launching excessive transactions can result in performance degradation. To quantitatively demonstrate the issues, we introduce the concept of effective transactions in this paper. We show that the effectiveness of a transaction is closely related to a dynamic quantity we call contention intensity. By limiting the contention intensity below a desired level, we can significantly increase transaction effectiveness. Increased effectiveness directly increases the overall performance of a transactional memory system. Based on our study, we implemented a transaction scheduler that not only guarantees that hardware transactional memory systems perform better than locks, but also significantly improves the performance of both hardware and software transactional memory systems.
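A contention-intensity-driven scheduler of this kind can be sketched with an exponential moving average over transaction outcomes. The smoothing factor, threshold, and API below are illustrative assumptions, not the paper's exact formulation.

```python
class AdaptiveTxScheduler:
    """Sketch: track contention intensity (CI) as a moving average of
    transaction outcomes and serialize new transactions when CI is high."""

    def __init__(self, alpha=0.5, threshold=0.5):
        self.ci = 0.0               # contention intensity, in [0, 1]
        self.alpha = alpha
        self.threshold = threshold

    def report_outcome(self, committed):
        # Aborts push CI toward 1, commits pull it toward 0.
        self.ci = self.alpha * self.ci + (1 - self.alpha) * (0.0 if committed else 1.0)

    def should_serialize(self):
        # Above the threshold, route new transactions through a central
        # queue instead of launching them all in parallel.
        return self.ci > self.threshold

sched = AdaptiveTxScheduler()
for _ in range(5):
    sched.report_outcome(committed=False)   # a burst of aborts
print(sched.should_serialize())
```

Once commits resume, CI decays geometrically and parallel dispatch is restored.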
  • Item
    A Floorplan-Aware Dynamic Inductive Noise Controller for Reliable 2D and 3D Microprocessors
    (Georgia Institute of Technology, 2006) Mohamood, Fayez ; Healy, Michael ; Lim, Sung Kyu ; Lee, Hsien-Hsin Sean
    Power delivery is a growing reliability concern in microprocessors as the industry moves toward feature-rich, power-hungry designs. To battle ever-growing power consumption, modern microprocessor designers apply aggressive power-saving techniques in the form of clock gating and/or power gating in order to operate the processor within a given power envelope. However, these techniques often lead to high-frequency current variations, which can stress the power delivery system and jeopardize reliability due to inductive noise (L di/dt) in the power supply network. In addition, with the advent of 3D stacked IC technology, which facilitates the design of processors with much higher module density, designing a low-impedance power-delivery network can be a daunting challenge. To counteract these issues, modern microprocessors are designed to operate under a worst-case current assumption by deploying adequate decoupling capacitance. With lower supply voltages and increased leakage power and current consumption, designing a processor for the worst case is becoming less appealing. In this paper, we propose a new dynamic inductive-noise controlling mechanism at the microarchitectural level that limits the on-die current demand within predefined bounds, regardless of the native power and current characteristics of the running applications. By dynamically monitoring the access patterns of microarchitectural modules, our mechanism can effectively limit the simultaneous switching activity of nearby modules, thereby leveling voltage ringing at local power pins. Compared to prior art, our di/dt controller is the first to take the processor's floorplan, as well as its power-pin distribution, into account to provide finer-grained control with minimal performance degradation. Based on evaluation results using 2D and 3D floorplans, we show that our techniques can significantly reduce inductive noise induced by current demand variation, cutting the average current variability by up to 7 times with an average performance overhead of 4.0% (2D floorplan) and 3.8% (3D floorplan).
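The floorplan-aware throttling idea can be sketched as a per-power-pin current budget. The region map, per-module current values, and the single-cycle budget below are illustrative assumptions, not the paper's actual mechanism.

```python
def allow_switching(requests, region_of, current_of, budget):
    """Grant module wake-ups for this cycle so long as each power-pin
    region's added current stays within a di/dt budget; the rest stall
    until a later cycle (toy sketch, values are assumptions)."""
    delta = {}          # region -> current already granted this cycle
    granted = []
    for module in requests:
        region = region_of[module]
        if delta.get(region, 0.0) + current_of[module] <= budget:
            delta[region] = delta.get(region, 0.0) + current_of[module]
            granted.append(module)
    return granted

# Two high-current modules share a power-pin region, so only one may
# switch on per cycle; a module in another region is unaffected.
region_of = {"alu": "pin1", "fpu": "pin1", "dcache": "pin2"}
current_of = {"alu": 0.6, "fpu": 0.6, "dcache": 0.5}
print(allow_switching(["alu", "fpu", "dcache"], region_of, current_of, budget=1.0))
```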
  • Item
    SoftCache: Dynamic Optimizations for Power and Area Reduction in Embedded Systems (II)
    (Georgia Institute of Technology, 2005) Fryman, Joshua Bruce ; Lee, Hsien-Hsin Sean ; Huneycutt, Chad Marcus
    We propose a SoftCache for low power and reduced die area while providing application flexibility. Our implementations demonstrate that the network is a power-efficient means of accessing remote memory. The impact of this work suggests that SoftCache systems may be useful in future consumer electronics. Our results show that die power is reduced by 20%, die area is reduced by 10%, and transferring applications over the network is more energy-delay effective than local DRAM.
  • Item
    Intelligent Cache Management by Exploiting Dynamic UTI/MTI Behavior
    (Georgia Institute of Technology, 2005) Fryman, Joshua Bruce ; Huneycutt, Chad Marcus ; Snyder, Luke Aron ; Loh, Gabriel H. ; Lee, Hsien-Hsin Sean
    This work addresses the problem of the increasing performance disparity between the microprocessor and the memory subsystem. Current L1 caches fabricated in deep submicron processes must either shrink to maintain timing, or suffer higher latencies, exacerbating the problem. We introduce a new classification for the behavior of memory traffic, which we refer to as target behavior. Target behavior falls into two categories: Uni-Targeted Instructions (UTI) and Multi-Targeted Instructions (MTI). On average, 30% of all dynamic memory LD/ST operations come from the execution of UTIs, yet only a few hundred static instructions are actually UTIs. This makes isolation of the UTI targets an avenue for optimization. The addition of a small, fast cache structure containing only UTI data would ideally reduce MTI pollution of UTI information. By intelligently selecting between larger, slower data caches and our UTI cache, we reduce the latency problem while increasing performance. Our distinct contributions fall in three areas, with implications for many others: (1) we present a new characterization of memory traffic based on the number of targets of LD/ST instructions; (2) we explore the underlying nature of the target division and devise a simple mechanism for exploiting its regularity based on a UTI cache; (3) we explore a variety of prediction mechanisms and processor configuration options to determine sensitivity and the performance gains actually attainable under different modern processor configurations. We attain up to 42% IPC improvement on SPEC2000, with a mean improvement of 8%. Our solution also reduces L2 accesses by up to 89% (average 29%), while reducing load-load violation traps by up to 84% (average 13%) and store-load violation traps by up to 43% (average 8%).
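The UTI/MTI classification itself is straightforward to express over a dynamic LD/ST trace: a static instruction that only ever touches one address is a UTI, anything else is an MTI. The trace format and function name below are illustrative assumptions.

```python
from collections import defaultdict

def classify_uti_mti(trace):
    """Classify static memory instructions by the number of distinct
    addresses they touch: exactly one target -> UTI, several -> MTI.
    `trace` is a list of (pc, address) pairs, a toy stand-in for a
    real dynamic LD/ST trace."""
    targets = defaultdict(set)   # static instruction -> addresses touched
    counts = defaultdict(int)    # static instruction -> dynamic op count
    for pc, addr in trace:
        targets[pc].add(addr)
        counts[pc] += 1
    uti = {pc for pc in targets if len(targets[pc]) == 1}
    uti_dynamic_fraction = sum(counts[pc] for pc in uti) / len(trace)
    return uti, uti_dynamic_fraction

# One load always hits address 100 (UTI); another walks an array (MTI).
trace = [(0x10, 100)] * 3 + [(0x20, a) for a in (200, 204, 208)]
uti, frac = classify_uti_mti(trace)
print(sorted(uti), frac)
```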
  • Item
    High Speed Memory Centric Protection on Software Execution Using One-Time-Pad Prediction
    (Georgia Institute of Technology, 2004-07-23) Shi, Weidong ; Lee, Hsien-Hsin Sean ; Lu, Chenghuai ; Ghosh, Mrinmoy
    This paper presents a new security model for protecting software confidentiality. Unlike previous process-centric systems designed for the same purpose, the new model ties cryptographic properties and security attributes to memory instead of to a user process. The advantages of this memory-centric design over the previous process-centric design are twofold. First, it provides a better security model and access control on software confidentiality, supporting both selective and mixed software encryption. Second, the new model supports and facilitates information sharing in an open software system, where both confidential data and code can be shared by different user processes without the unnecessary duplication required by the process-centric approach. Furthermore, the paper addresses the latency of executing one-time-pad (OTP) encrypted software through a novel OTP prediction technique. One-time-pad based protection of data confidentiality can improve performance over block-cipher based approaches by parallelizing data fetch and OTP generation when the sequence number associated with a missing cache block is cached on-chip. On a sequence number cache miss, however, OTP generation cannot start until the missing sequence number is fetched from memory. Since the latency of OTP generation is on the order of hundreds of CPU cycles, having the OTP ready as soon as possible is performance critical. OTP prediction meets this challenge by using idle decryption-engine cycles to speculatively compute OTPs for memory blocks whose sequence numbers are missing from the cache. Profiling and simulation results show significant performance improvement for speculative OTP over regular OTP under both small (4KB) and large (32KB) sequence number cache settings, owing to the ability of the speculative technique to reduce sequence number misses. The performance improvement ranges from 15% to 25% for seven SPEC2000 benchmarks. The new access control protection and OTP prediction scheme requires only a small amount of additional hardware over existing proposed tamper-resistant systems, while greatly improving performance, protection, flexibility, and inter-operability.
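The overlap that OTP prediction exploits can be sketched in a few lines. The pad function below uses SHA-256 purely to keep the sketch runnable; a real design would run a block cipher such as AES over the (sequence number, address) pair, and the prediction window and callback names are illustrative assumptions.

```python
import hashlib

def otp(seq, addr, key=b"demo-key"):
    """Pad generator stand-in: SHA-256 over (key, seq, addr); a real design
    would use a block cipher such as AES keyed in hardware."""
    return hashlib.sha256(
        key + seq.to_bytes(8, "big") + addr.to_bytes(8, "big")).digest()

def fetch_decrypt(addr, predicted_seq, fetch_seq, fetch_block):
    """On a sequence-number cache miss, speculatively compute pads for a
    few predicted sequence numbers while the true one is fetched from
    memory; if a guess matches, the pad is ready when the data arrives."""
    guesses = {s: otp(s, addr) for s in range(predicted_seq, predicted_seq + 4)}
    true_seq = fetch_seq(addr)          # slow memory fetch (callback here)
    ciphertext = fetch_block(addr)
    pad = guesses.get(true_seq) or otp(true_seq, addr)
    plaintext = bytes(c ^ p for c, p in zip(ciphertext, pad))
    return plaintext, true_seq in guesses

plain = b"secret data here"
ct = bytes(c ^ p for c, p in zip(plain, otp(5, 0x100)))
pt, predicted = fetch_decrypt(0x100, 4, lambda a: 5, lambda a: ct)
print(pt, predicted)
```

When the prediction hits, decryption is a single XOR at data arrival instead of waiting hundreds of cycles for pad generation.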
  • Item
    Architecture Support for High Speed Protection of Memory Integrity and Confidentiality in Symmetric Multiprocessor Systems
    (Georgia Institute of Technology, 2004-06-01) Shi, Weidong ; Lee, Hsien-Hsin Sean ; Ghosh, Mrinmoy ; Lu, Chenghuai ; Zhang, Tao
    Recently there has been growing interest in both the architecture and security communities in creating hardware-based solutions for authenticating system memory. As shown in previous work, such silicon-based memory authentication could become a vital component of future trusted computing environments and digital rights protection. Almost all published work has focused on authenticating memory exclusively owned by one processing unit. In today's computing platforms, however, memory is often shared by multiple processing units that support shared system memory and snoop-bus-based memory coherence. Authenticating shared memory is a new challenge for memory protection. In this paper, we present a secure and fast architectural solution for authenticating shared memory. To incorporate memory authentication into the processor pipeline, we propose a new scheme called Authentication Speculative Execution (ASE). Unlike previous approaches for hiding or tolerating the latency of memory authentication, our scheme does not trade security for performance. The novel ASE scheme is both secure when combined with one-time-pad (OTP) based memory encryption and efficient at tolerating authentication latency. Results using a modified RSIM and the SPLASH2 benchmarks show only 5% performance overhead on dual- and quad-processor platforms. Furthermore, ASE shows an 80% performance advantage on average over conservative, non-speculative authentication. The scheme is of practical use for both symmetric multiprocessor systems and uni-processor systems where memory is shared by the main processor and other co-processors attached to the system bus.
  • Item
    Thermal-aware 3D Microarchitectural Floorplanning
    (Georgia Institute of Technology, 2004) Ekpanyapong, Mongkol ; Healy, Michael ; Ballapuram, Chinnakrishnan S. ; Lim, Sung Kyu ; Lee, Hsien-Hsin Sean ; Loh, Gabriel H.
    Next-generation deep submicron processor designs will need to take many performance-limiting factors into consideration. Flip-flops are inserted to prevent global wire delay from becoming nonlinear, enabling deeper pipelines and higher clock frequencies. The move to 3D ICs will also likely be used to further shorten wirelength, but this will cause thermal issues to become a major bottleneck to performance improvement. In this paper we propose a floorplanning algorithm that takes into consideration both thermal issues and profile-weighted wirelength using mathematical programming. Our profile-driven objective improves performance by 20% over the wirelength-driven one, while the thermal-driven objective improves temperature by 24% on average over the profile-driven case.
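A combined objective of this flavor can be illustrated with a toy cost function. The weights, the Manhattan wirelength term, and the crude thermal proxy below are assumptions for illustration, not the paper's mathematical program.

```python
def floorplan_cost(positions, traffic, power, alpha=1.0, beta=1.0):
    """Toy combined objective: profile-weighted wirelength plus a crude
    thermal penalty that discourages placing high-power modules close
    together (all terms and weights are illustrative assumptions)."""
    wirelength = sum(
        w * (abs(positions[a][0] - positions[b][0]) +
             abs(positions[a][1] - positions[b][1]))
        for (a, b), w in traffic.items())
    thermal = 0.0
    mods = list(positions)
    for i, m in enumerate(mods):
        for n in mods[i + 1:]:
            d = (abs(positions[m][0] - positions[n][0]) +
                 abs(positions[m][1] - positions[n][1]))
            if d < 2.0:                      # nearby modules heat each other
                thermal += power[m] * power[n] / (1.0 + d)
    return alpha * wirelength + beta * thermal

# Two hot modules side by side are penalized; spread apart they are not.
hot_together = floorplan_cost({"fpu": (0, 0), "alu": (1, 0)}, {}, {"fpu": 10, "alu": 10})
hot_apart = floorplan_cost({"fpu": (0, 0), "alu": (3, 0)}, {}, {"fpu": 10, "alu": 10})
print(hot_together, hot_apart)
```

A real floorplanner would minimize such an objective over legal placements; this sketch only evaluates it.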