Organizational Unit:
Center for Experimental Research in Computer Systems


Publication Search Results

Now showing 1 - 10 of 152
  • Item
    Security Refresh: Prevent Malicious Wear-out and Increase Durability for Phase-Change Memory with Dynamically Randomized Address Mapping
    (Georgia Institute of Technology, 2009-11) Seong, Nak Hee ; Woo, Dong Hyuk ; Lee, Hsien-Hsin Sean
    Phase-change Random Access Memory (PRAM) is an emerging memory technology for future computing systems. It is non-volatile and has a lower read latency and potentially higher storage density than other memory alternatives. Recently, system researchers have studied the trade-offs of using PRAM to back up a DRAM cache as a last-level memory or to implement it in a hybrid memory architecture. The main roadblock preventing PRAM from becoming commercially viable, however, is its much lower write endurance. Several recent proposals attempted to address this issue by either reducing PRAM's write frequency or using wear-leveling techniques to distribute PRAM writes evenly. Although these techniques can extend the lifetime of PRAM under the normal operation of typical applications, most of them cannot prevent malicious code deliberately designed to wear it out. Furthermore, none of these prior techniques considers the circumstances in which a compromised OS is present, or the security implications for the overall PRAM design. A compromised OS (e.g., via a simple buffer overflow) will allow adversaries to manipulate all processes and exploit side channels easily, accelerating the wear-out of targeted PRAM blocks and rendering the system dysfunctional. In this paper, we argue that a PRAM design must consider not only normal wear-out under conventional application behavior but, most importantly, the worst-case scenario in the presence of malicious exploits and a compromised OS. Such a design addresses both the durability and the security of PRAM simultaneously. Toward this goal, we propose a novel, low-cost hardware mechanism called Security Refresh. Similar in concept to the refresh that counters charge leakage in DRAM, Security Refresh prevents information leakage by constantly migrating data's physical location (thus "refresh") inside the PRAM, obfuscating the actual data placement from users and system software. It uses a dynamic randomized address mapping scheme that swaps data between random PRAM blocks using random keys, generated from thermal noise, whenever a refresh comes due. The hardware is extremely low-cost and requires no tables. We present two implementation alternatives and show their trade-offs and respective wear-out endurance. For a given configuration, we show that the optimal lifetime of a PRAM block (256 B) is eight years. In addition, we show that the performance impact of Security Refresh is mostly negligible.
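
The core of the mechanism described above is an XOR-based remapping between logical and physical block addresses, with blocks incrementally migrated from the old key's mapping to the new key's as refreshes come due. Below is a minimal Python sketch of that idea for a single region; the class, method names, and the one-migration-per-write trigger are our own illustrative simplifications, not the paper's hardware design.

```python
import secrets

class SecurityRefreshRegion:
    """Toy model of one Security Refresh region (illustrative only)."""

    def __init__(self, num_blocks):
        assert num_blocks & (num_blocks - 1) == 0, "need a power of two"
        self.n = num_blocks
        self.bits = num_blocks.bit_length() - 1
        self.storage = [None] * num_blocks           # physical PRAM blocks
        self.cur_key = 0                             # identity mapping at start
        self.next_key = secrets.randbits(self.bits)  # TRNG (thermal noise) stand-in
        self.ptr = 0                                 # next logical block to migrate

    def _migrated(self, logical):
        # One swap relocates a block *and* its partner under the new key,
        # so a block is migrated once either of the pair has been visited.
        partner = logical ^ self.cur_key ^ self.next_key
        return logical < self.ptr or partner < self.ptr

    def _phys(self, logical):
        key = self.next_key if self._migrated(logical) else self.cur_key
        return logical ^ key

    def _refresh_step(self):
        """Migrate one logical block to its new location. (The paper
        triggers this every fixed number of demand writes; here it runs
        on every write for brevity.)"""
        blk = self.ptr
        partner = blk ^ self.cur_key ^ self.next_key
        if partner >= blk:                           # pair not yet swapped
            a, b = blk ^ self.cur_key, blk ^ self.next_key
            self.storage[a], self.storage[b] = self.storage[b], self.storage[a]
        self.ptr += 1
        if self.ptr == self.n:                       # round done: rotate keys
            self.cur_key = self.next_key
            self.next_key = secrets.randbits(self.bits)
            self.ptr = 0

    def write(self, logical, data):
        self.storage[self._phys(logical)] = data
        self._refresh_step()

    def read(self, logical):
        return self.storage[self._phys(logical)]
```

Because the mapping is pure XOR arithmetic over two key registers and a pointer, no translation table is needed, matching the abstract's "no tables" claim.
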
  • Item
    A Characterization and Analysis of GPGPU Kernels
    (Georgia Institute of Technology, 2009-05-05) Kerr, Andrew ; Diamos, Gregory ; Yalamanchili, Sudhakar
    General-purpose application development for GPUs (GPGPU) has recently gained momentum as a cost-effective approach for accelerating data- and compute-intensive applications, pushed to the forefront by the introduction of C-based programming environments such as NVIDIA's CUDA [1], OpenCL [2], and Intel's Ct [3]. While significant effort has been focused on developing and evaluating applications and software tools, comparatively little has been devoted to the analysis and characterization of applications to assist future work in compiler optimizations, application restructuring, and GPGPU micro-architecture design. This paper proposes a set of metrics for GPGPU workloads and uses these metrics to analyze the behavior of GPGPU programs. We report on an analysis of over 50 kernels and applications, including the full NVIDIA CUDA SDK [4], covering control flow, data flow, parallelism, and memory behavior. The analysis was performed using a full-function emulator we developed that implements the NVIDIA virtual machine referred to as PTX (Parallel Thread eXecution), a machine model and low-level virtual ISA. The emulator can execute compiled kernels from the CUDA compiler, currently supports the full PTX 1.3 specification [5], and has been validated against the full CUDA SDK. The results quantify the importance of optimizations such as those for branch re-convergence, the prevalence of sharing between threads, and the opportunities for additional parallelism.
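
Metrics of this kind can be computed directly from an emulator's dynamic instruction trace. The sketch below shows two hypothetical control-flow metrics computed over per-instruction active-thread masks; the definitions are illustrative stand-ins, not the paper's exact formulations.

```python
from statistics import mean

def activity_factor(trace, warp_size=32):
    """Average fraction of SIMD lanes active per dynamic instruction.
    `trace` is an iterable of integer active-thread masks, one per
    dynamic instruction, as a PTX emulator might emit."""
    return mean(bin(mask).count("1") / warp_size for mask in trace)

def branch_divergence_rate(branch_masks, warp_size=32):
    """Fraction of dynamic branches at which a warp splits, i.e. the
    taken-mask is neither empty nor full."""
    full = (1 << warp_size) - 1
    split = sum(1 for taken in branch_masks if 0 < taken < full)
    return split / len(branch_masks)

# Example on a 4-lane machine where one of two branches diverges:
print(activity_factor([0b1111, 0b0011, 0b1100, 0b1111], warp_size=4))  # 0.75
print(branch_divergence_rate([0b0011, 0b1111], warp_size=4))           # 0.5
```
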
  • Item
    Thermal Field Management for Many-core Processors
    (Georgia Institute of Technology, 2009-04-10) Cho, Minki ; Sathe, Nikhil ; Yalamanchili, Sudhakar ; Mukhopadhyay, Saibal
    This paper first presents an analysis of the global thermal field in many-core processors at deep-nanometer nodes (down to 16 nm) under power and thermal budgets. We show that the thermal field can exhibit significant spatiotemporal non-uniformity along with a high maximum temperature. We propose spatiotemporal power multiplexing as a proactive method to reduce spatial and temporal temperature gradients. Several power-multiplexing policies are evaluated for a 256-core processor at the 16 nm node, demonstrating that simple cyclic core activation can achieve a highly uniform thermal field with a low maximum temperature.
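
The cyclic policy can be stated in a few lines: each epoch, the window of active cores rotates across the die so the heat source never dwells in one place. The sketch below is a hypothetical restatement of the policy's scheduling logic, not the authors' simulation infrastructure.

```python
def cyclic_multiplex(num_cores, num_active, epoch):
    """Cyclic core-activation policy: every epoch the active window
    advances by `num_active` cores, sweeping the power (and heat)
    budget uniformly across the die. Returns the active core IDs."""
    start = (epoch * num_active) % num_cores
    return {(start + i) % num_cores for i in range(num_active)}

# 256-core die whose power budget allows 64 simultaneously active cores:
for e in range(4):
    cores = sorted(cyclic_multiplex(256, 64, e))
    print(f"epoch {e}: cores {cores[0]}..{cores[-1]}")
```
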
  • Item
    Cloud Computing: A Taxonomy of Platform and Infrastructure-level Offerings
    (Georgia Institute of Technology, 2009-04) Hilley, David
  • Item
    Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
    (Georgia Institute of Technology, 2009) Diamos, Gregory ; Kerr, Andrew ; Kesavan, Mukil
    Parallel Thread Execution ISA (PTX) is a virtual instruction set used by NVIDIA GPUs that explicitly expresses hierarchical MIMD- and SIMD-style parallelism in an application. In such a programming model, the programmer and compiler are left with the non-trivial, but not impossible, task of composing applications from parallel algorithms and data structures. Once this has been accomplished, even simple architectures with low hardware complexity can easily exploit the parallelism in an application. With these applications in mind, this paper presents Ocelot, a binary translation framework designed to allow architectures other than NVIDIA GPUs to leverage the parallelism in PTX programs. Specifically, we show how (i) the PTX thread hierarchy can be mapped to many-core architectures, (ii) translation techniques can be used to hide memory latency, and (iii) GPU data structures can be efficiently emulated or mapped to native equivalents. We describe the low-level implementation of our translator, ending with a case study detailing the complete translation process from PTX to the SPU assembly used by the IBM Cell Processor.
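
Point (i), mapping the PTX thread hierarchy onto cores without hardware thread scheduling, is commonly done by serializing each CTA's threads into explicit loops. The sketch below shows that "thread loop" transformation on a toy SAXPY kernel; it is a conceptual illustration (barrier handling, which requires splitting the kernel into regions, is omitted), not Ocelot's actual translator output.

```python
def run_kernel(kernel, grid_dim, cta_dim, *args):
    """Execute a GPU-style kernel on a host CPU by iterating the thread
    hierarchy explicitly: an outer loop over CTAs (distributable across
    cores) and an inner serializing loop over each CTA's threads."""
    for cta in range(grid_dim):
        for tid in range(cta_dim):      # one CTA's threads, serialized
            kernel(cta, tid, cta_dim, *args)

def saxpy(cta, tid, cta_dim, a, x, y):
    i = cta * cta_dim + tid             # reconstruct the global thread id
    if i < len(x):                      # bounds guard, as in the CUDA original
        y[i] = a * x[i] + y[i]

x, y = [1.0] * 10, [2.0] * 10
run_kernel(saxpy, 3, 4, 2.0, x, y)      # 3 CTAs of 4 threads cover 10 elements
print(y)                                # [4.0, 4.0, ..., 4.0]
```
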
  • Item
    Camouflage: Automated Sanitization of Field Data
    (Georgia Institute of Technology, 2009) Clause, James ; Orso, Alessandro
    Privacy and security concerns have adversely affected the usefulness of many types of techniques that leverage information gathered from deployed applications. To address this issue, we present a new approach for automatically sanitizing failure-inducing inputs. Given an input I that causes a failure f, our technique can generate a sanitized input I' that is different from I but still causes f. I' can then be sent to the developers to help them debug f, without revealing the possibly sensitive information contained in I. We implemented our approach in a prototype tool, camouflage, and performed an empirical evaluation. In the evaluation, we applied camouflage to a large set of failure-inducing inputs for several real applications. The results of the evaluation are promising; they show that camouflage is both practical and effective at generating sanitized inputs. In particular, for the inputs that we considered, I and I' shared no sensitive information.
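
To make the idea concrete: the technique replays the failure to collect the path condition over the input, then asks a constraint solver for any *different* input satisfying that condition. The sketch below uses the z3 solver and a made-up one-branch path condition purely as stand-ins; the actual tool targets real programs and is considerably more involved.

```python
from z3 import Int, Solver, sat

x = Int("x")                     # symbolic stand-in for one input field
failing_input = 1000             # the sensitive value inside I

# Hypothetical path condition recorded while replaying the failure:
# the crash requires the (unchecked) length field to exceed 255.
path_condition = x > 255

s = Solver()
s.add(path_condition)            # I' must drive execution down the same path
s.add(x != failing_input)        # ...while differing from the original input
if s.check() == sat:
    print("sanitized input:", s.model()[x])   # still triggers f, reveals less
```
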
  • Item
    SPA: Symbolic Program Approximation for Scalable Path-sensitive Analysis
    (Georgia Institute of Technology, 2009) Harrold, Mary Jean ; Santelices, Raul
    Symbolic execution is a static-analysis technique that has been used for applications such as test-input generation and change analysis. Symbolic execution's path sensitivity makes it difficult to scale. Despite recent advances that reduce the number of paths to explore, the scalability problem remains. Moreover, some applications require the analysis of all paths in a program fragment, which exacerbates the scalability problem. In this paper, we present a new technique, called Symbolic Program Approximation (SPA), that approximates the symbolic execution of all paths between two program points by abstracting away certain symbolic subterms, making the symbolic analysis practical at the cost of some precision. We discuss several applications of SPA, including testing of software changes and static invariant discovery. We also present a tool that implements SPA and an empirical evaluation on change analysis and testing that shows the applicability, effectiveness, and potential of our technique.
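
The central move, trading precision for tractability by replacing selected subterms with fresh symbols, can be illustrated on symbolic terms represented as nested tuples. The depth-based cutoff below is our own stand-in for the paper's abstraction criteria.

```python
import itertools

_fresh = (f"_s{i}" for i in itertools.count())   # fresh symbol supply

def approximate(term, depth_budget):
    """Replace any subterm nested deeper than `depth_budget` with a
    fresh symbol, bounding term size at the cost of precision.
    Terms are nested tuples such as ("+", ("*", "x", "y"), "z")."""
    if not isinstance(term, tuple):
        return term                       # a variable or constant
    if depth_budget == 0:
        return next(_fresh)               # abstract the whole subterm away
    op, *args = term
    return (op, *(approximate(a, depth_budget - 1) for a in args))

t = ("+", ("*", ("-", "x", "1"), "y"), "z")
print(approximate(t, 2))                  # ('+', ('*', '_s0', 'y'), 'z')
```
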
  • Item
    Speculative Execution on Multi-GPU Systems
    (Georgia Institute of Technology, 2009) Diamos, Gregory ; Yalamanchili, Sudhakar
    The lag of parallel programming models and languages behind the advance of heterogeneous many-core processors has left a gap between the computational capability of modern systems and the ability of applications to exploit them. Emerging programming models, such as CUDA and OpenCL, force developers to explicitly partition applications into components (kernels) and assign them to accelerators in order to utilize them effectively. An accelerator is a processor with a different ISA and micro-architecture than the main CPU. These static partitioning schemes are effective when targeting a system with only a single accelerator; however, they are not robust to changes in the number of accelerators or the performance characteristics of future generations of accelerators. In previous work, we presented the Harmony execution model for computing on heterogeneous systems with several CPUs and accelerators. In this paper, we extend Harmony to target systems with multiple accelerators, using control speculation to expose parallelism. We refer to this technique as Kernel Level Speculation (KLS). We argue that dynamic parallelization techniques such as KLS are sufficient to scale applications across several accelerators, based on the intuition that there will be fewer distinct accelerators than cores within each accelerator. We use a complete prototype of the Harmony runtime that we developed to explore the design decisions and trade-offs in the implementation of KLS. We show that KLS improves parallelism to a sufficient degree while retaining a sequential programming model. We accomplish this by demonstrating good scaling of KLS on a highly heterogeneous system with three distinct accelerator types and ten processors.
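
Control speculation over kernels can be pictured as follows: while one kernel executes, the runtime predicts the control decision that depends on its result and speculatively launches the next kernel on an idle accelerator, squashing the speculative result on a misprediction. The sketch below mimics that flow with a thread pool standing in for two accelerators; all names are illustrative, and the real Harmony runtime additionally tracks data dependences and checkpoints state.

```python
from concurrent.futures import ThreadPoolExecutor

def kernel_a():
    return 42                         # produces the value the branch tests

def kernel_b(inp):
    return inp * 2                    # consumes (possibly predicted) state

with ThreadPoolExecutor(max_workers=2) as pool:   # two "accelerators"
    predicted = 42                                # control predictor's guess
    fut_a = pool.submit(kernel_a)
    fut_b = pool.submit(kernel_b, predicted)      # speculative launch

    actual = fut_a.result()
    if actual == predicted:
        print("commit:", fut_b.result())          # speculation paid off
    else:
        _ = fut_b                                 # squash: discard the result
        print("misspeculation; re-run:", kernel_b(actual))
```
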
  • Item
    Redactable Signatures on Data with Dependencies
    (Georgia Institute of Technology, 2009) Bauer, David ; Blough, Douglas M. ; Mohan, Apurva
    The storage of personal information by service providers entails a significant risk of privacy loss due to data breaches. One way to mitigate this problem is to limit the amount of personal information that is provided. Our prior work on minimal disclosure credentials presented a computationally efficient mechanism to facilitate this capability. In that work, personal data was broken into individual claims, which could be released in arbitrary subsets while still being cryptographically verifiable. In expanding the applications for that work, we encountered the problem of connections between different claims, which manifest as dependencies on the release of those claims. In this new work, we provide an efficient way to provide the same selective disclosure, but with cryptographic enforcement of dependencies between claims, as specified by the certifier of the claims. This constitutes a mechanism for redactable signatures on data with release dependencies. Our scheme was implemented and benchmarked over a wide range of input set sizes, and shown to verify thousands of claims in tens to hundreds of milliseconds. We also describe ongoing work in which the approach is being used within a larger system for holding and dispensing personal health records.
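
One simple way to realize redactable signatures with release dependencies is to compute each claim's digest over its own content plus the plaintext of the claims it depends on, then sign the digest list once; a verifier can check any disclosed subset, but a claim whose dependency was withheld cannot be verified. The sketch below is a simplification in that spirit, not the paper's actual construction, and `sign` stands in for any ordinary signature scheme.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def certify(claims, deps, sign):
    """Digest each claim together with the plaintext of its prerequisites
    (deps maps claim index -> list of prerequisite indices), then sign
    the concatenated digests. The digest list travels with the credential."""
    digests = []
    for i, claim in enumerate(claims):
        material = claim + b"".join(claims[j] for j in deps.get(i, []))
        digests.append(h(material))
    return digests, sign(b"".join(digests))

def verify_release(released, deps, digests):
    """released maps claim index -> plaintext for the disclosed subset.
    (The signature over the digest list is assumed already checked.)"""
    for i, claim in released.items():
        try:
            material = claim + b"".join(released[j] for j in deps.get(i, []))
        except KeyError:
            return False              # a required dependency was withheld
        if h(material) != digests[i]:
            return False
    return True

claims = [b"name=Alice", b"dob=1980-01-01", b"over-21=yes"]
deps = {2: [1]}                       # disclosing claim 2 requires claim 1
digests, sig = certify(claims, deps, sign=lambda m: h(m))  # toy "signature"
print(verify_release({2: claims[2]}, deps, digests))                # False
print(verify_release({1: claims[1], 2: claims[2]}, deps, digests))  # True
```
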
  • Item
    Consistency in Real-time Collaborative Editing Systems Based on Partial Persistent Sequences
    (Georgia Institute of Technology, 2009) Wu, Qinyi ; Pu, Calton
    In real-time collaborative editing systems, users create a shared document by issuing insert, delete, and undo operations on their local replicas anytime and anywhere. Data consistency issues arise due to concurrent editing conflicts. Traditional consistency models put restrictions on editing operations that update different portions of a shared document; these restrictions are unnecessary for many editing scenarios and cause the associated view synchronization strategies to become less efficient. To address these problems, we propose a new data consistency model that preserves convergence and synchronizes editing operations only when they access overlapping or contiguous characters. Our view synchronization strategy is implemented by a novel data structure, the partial persistent sequence. A partial persistent sequence is an ordered set of items indexed by persistent and unique position identifiers. It captures the data dependencies of editing operations and encodes them in a way that allows them to be correctly executed on any document replica. As a result, a simple and efficient view synchronization strategy can be implemented.
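
The key property of the position identifiers is that they are totally ordered, never reused, and dense: a new identifier can always be generated between any two existing ones, so concurrent inserts and deletes resolve to the same place on every replica. The toy sketch below uses rationals as identifiers and tombstones for deletion; the paper's identifier scheme differs, so treat this purely as an illustration of the idea.

```python
from fractions import Fraction

class PartialPersistentSequence:
    """Toy partial persistent sequence: characters keyed by persistent,
    totally ordered position identifiers. Deletion only hides an item,
    so every identifier ever issued stays resolvable, which is what
    lets remote operations be replayed on any replica."""

    def __init__(self):
        self.items = {}                    # position id -> (char, visible)

    def id_between(self, left=None, right=None):
        lo = left if left is not None else Fraction(0)
        hi = right if right is not None else Fraction(1)
        return (lo + hi) / 2               # dense: always room in between

    def insert(self, pid, char):
        self.items[pid] = (char, True)

    def delete(self, pid):
        char, _ = self.items[pid]
        self.items[pid] = (char, False)    # tombstone; the id persists

    def text(self):
        return "".join(c for _, (c, vis) in sorted(self.items.items()) if vis)

doc = PartialPersistentSequence()
a = doc.id_between();        doc.insert(a, "H")
b = doc.id_between(a);       doc.insert(b, "y")
m = doc.id_between(a, b);    doc.insert(m, "e")   # insert between H and y
print(doc.text())                                 # "Hey"
```
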