Person:

Schwan, Karsten

Permanent Link

https://hdl.handle.net/1853/71682

Associated Organization(s)

Organizational Unit

College of Computing

Publication Search Results

Data staging on future platforms: Systems management for high performance and resilience

(Georgia Institute of Technology, 2014-05) Schwan, Karsten ; Eisenhauer, Greg S. ; Wolf, Matthew
Design of a Write-Optimized Data Store

(Georgia Institute of Technology, 2013) Amur, Hrishikesh ; Andersen, David G. ; Kaminsky, Michael ; Schwan, Karsten

The WriteBuffer (WB) Tree is a new write-optimized data structure that can be used to implement per-node storage in unordered key-value stores. The WB Tree provides faster writes than the Log-Structured Merge (LSM) Tree that is used in many current high-performance key-value stores. It achieves this by replacing compactions in LSM Trees, which are I/O-intensive, with lightweight spills and splits, along with other techniques. By providing nearly 30x higher write performance than current high-performance key-value stores, while providing comparable read performance (1-2 I/Os per read using 1-2B per key of memory), the WB Tree addresses the needs of a class of increasingly popular write-intensive workloads.
ClusterWatch: Flexible, Lightweight Monitoring for High-end GPGPU Clusters

(Georgia Institute of Technology, 2013) Slawinska, Magdalena ; Schwan, Karsten ; Eisenhauer, Greg

The ClusterWatch middleware provides runtime flexibility in what system-level metrics are monitored, how frequently such monitoring is done, and how metrics are combined to obtain reliable information about the current behavior of GPGPU clusters. Interesting attributes of ClusterWatch are (1) the ease with which different metrics can be added to the system by simply deploying additional “cluster spies,” (2) the ability to filter and process monitoring metrics at their sources, to reduce data movement overhead, (3) flexibility in the rate at which monitoring is done, (4) efficient movement of monitoring data into backend stores for long-term or historical analysis, and most importantly, (5) specific support for monitoring the behavior and use of the GPGPUs used by applications. This paper presents our initial experiences with using ClusterWatch to assess the performance behavior of a larger-scale GPGPU-based simulation code. We report the overheads seen when using ClusterWatch, the experimental results obtained for the simulation, and the manner in which ClusterWatch will interact with infrastructures for detailed program performance monitoring and profiling such as TAU or Lynx. Experiments conducted on the NICS Keeneland Initial Delivery System (KIDS), with up to 64 nodes, demonstrate low monitoring overheads for high-fidelity assessments of the simulation’s performance behavior, for both its CPU and GPU components.
CCM: Scalable, On-Demand Compute Capacity Management for Cloud Datacenters

(Georgia Institute of Technology, 2013) Kesavan, Mukil ; Ahmad, Irfan ; Krieger, Orran ; Soundararajan, Ravi ; Gavrilovska, Ada ; Schwan, Karsten

We present CCM (Cloud Capacity Manager), a prototype system and methods for dynamically multiplexing the compute capacity of cloud datacenters at scales of thousands of machines, for diverse workloads with variable demands. This enables mitigation of resource consumption hotspots and handling of unanticipated demand surges, leading to improved resource availability for applications and better datacenter utilization levels. Extending prior studies primarily concerned with accurate capacity allocation and ensuring acceptable application performance, CCM also focuses on the tradeoffs due to two unavoidable issues in large-scale commodity datacenters: (i) maintaining low operational overhead, and (ii) coping with the increased incidence of management operation failures. CCM is implemented in an industry-strength cloud infrastructure built on top of the VMware vSphere virtualization platform and is currently deployed in a datacenter of 700 physical hosts. Its experimental evaluation uses production workload traces and a suite of representative cloud applications to generate dynamic scenarios. Results indicate that the pragmatic cloud-wide nature of CCM provides up to 25% more resources for workloads and improves datacenter utilization by up to 20%, compared to the alternative approach of multiplexing capacity within multiple smaller datacenter partitions.
Personal Clouds: Sharing and Integrating Networked Resources to Enhance End User Experiences

(Georgia Institute of Technology, 2013) Jang, Minsung ; Schwan, Karsten ; Bhardwaj, Ketan ; Gavrilovska, Ada ; Avasthi, Adhyas

End user experiences on mobile devices with their rich sets of sensors are constrained by limited device battery lives and restricted form factors, as well as by the ‘scope’ of the data available locally. The 'Personal Cloud' distributed software abstractions address these issues by enhancing the capabilities of a mobile device via seamless use of both nearby and remote cloud resources. In contrast to vendor-specific, middleware-based cloud solutions, Personal Cloud instances are created at the hypervisor level, to create for each end user the federation of networked resources best suited for the current environment and use. Specifically, the Cirrostratus extensions of the Xen hypervisor can federate a user’s networked resources to establish a personal execution environment, governed by policies that go beyond evaluating network connectivity to also consider device ownership and access rights, the latter managed in a secure fashion via standard Social Network Services (SNS). Experimental evaluations with both Linux- and Android-based devices, and using Facebook as the SNS, show the approach capable of substantially augmenting a device's innate capabilities, improving application performance and the effective functionality seen by end users.
Evaluating Scalability of Multi-threaded Applications on a Many-core Platform

(Georgia Institute of Technology, 2012) Gupta, Vishal ; Kim, Hyesoon ; Schwan, Karsten

Multicore processors have been effective in scaling application performance by dividing computation among multiple threads running in parallel. However, application performance does not necessarily improve as more cores are added. Application performance can be limited by multiple bottlenecks, including contention for shared resources such as caches and memory. In this paper, we perform a scalability analysis of parallel applications on a 64-threaded Intel Nehalem-EX based system. We find that applications which scale well on a small number of cores exhibit poor scalability on a large number of cores. Using hardware performance counters, we show that many performance-limited applications are limited by memory bandwidth on manycore platforms and exhibit improved scalability when provisioned with higher memory bandwidth. By regulating the number of threads used and applying dynamic voltage and frequency scaling for memory-bandwidth-limited benchmarks, significant energy savings can be achieved.
Memory-Efficient GroupBy-Aggregate using Compressed Buffer Trees

(Georgia Institute of Technology, 2012) Amur, Hrishikesh ; Richter, Wolfgang ; Andersen, David G. ; Kaminsky, Michael ; Schwan, Karsten ; Balachandran, Athula ; Zawadzki, Erik

Memory is rapidly becoming a precious resource in many data processing environments. This paper introduces a new data structure called a Compressed Buffer Tree (CBT). Using a combination of buffering, compression, and lazy aggregation, CBTs can improve the memory efficiency of the GroupBy-Aggregate abstraction which forms the basis of many data processing models like MapReduce and databases. We evaluate CBTs in the context of MapReduce aggregation, and show that CBTs can provide significant advantages over existing hash-based aggregation techniques: up to 2x less memory and 1.5x the throughput, at the cost of 2.5x CPU.
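
As a rough illustration of the GroupBy-Aggregate abstraction described above, the following minimal Python sketch contrasts eager hash-based aggregation with a toy form of buffered, lazy aggregation. It is not the CBT from the paper: it has no tree structure and no compression, and the names `hash_aggregate`, `BufferedAggregator`, and `buffer_limit` are hypothetical, chosen only for this example.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def hash_aggregate(pairs):
    """Baseline: eager hash-based GroupBy-Aggregate (sum of values per key)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


class BufferedAggregator:
    """Toy lazy aggregator: buffer inserts, combine only when the buffer fills.

    Hypothetical illustration only; no tree and no compression are involved.
    """

    def __init__(self, buffer_limit=4):
        self.buffer_limit = buffer_limit
        self.buffer = []    # unaggregated (key, value) pairs
        self.partials = {}  # partially aggregated sums per key

    def insert(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.buffer_limit:
            self._flush()

    def _flush(self):
        # Sort so equal keys are adjacent, then fold each run into partials.
        self.buffer.sort(key=itemgetter(0))
        for key, run in groupby(self.buffer, key=itemgetter(0)):
            self.partials[key] = self.partials.get(key, 0) + sum(v for _, v in run)
        self.buffer.clear()

    def result(self):
        # Aggregate whatever is still buffered before reporting totals.
        self._flush()
        return dict(self.partials)
```

Both approaches return the same per-key totals; the buffered variant simply defers the combining work until a batch of inserts has accumulated.
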
PreDatA - Preparatory Data Analytics on Peta-Scale Machines

(Georgia Institute of Technology, 2010) Zheng, Fang ; Abbasi, Hasan ; Docan, Ciprian ; Lofstead, Jay ; Klasky, Scott ; Liu, Qing ; Parashar, Manish ; Podhorszki, Norbert ; Schwan, Karsten ; Wolf, Matthew

Peta-scale scientific applications running on High End Computing (HEC) platforms can generate large volumes of data. For high performance storage and in order to be useful to science end users, such data must be organized in its layout, indexed, sorted, and otherwise manipulated for subsequent data presentation, visualization, and detailed analysis. In addition, scientists desire to gain insights into selected data characteristics ‘hidden’ or ‘latent’ in the massive datasets while data is being produced by simulations. PreDatA, short for Preparatory Data Analytics, is an approach for preparing and characterizing data while it is being produced by the large-scale simulations running on peta-scale machines. By dedicating additional compute nodes on the peta-scale machine as staging nodes and staging the simulation’s output data through these nodes, PreDatA can exploit their computational power to perform selected data manipulations with lower latency than attainable by first moving data into file systems and storage. Such in-transit manipulations are supported by the PreDatA middleware through RDMA-based data movement to reduce write latency, application-specific operations on streaming data that are able to discover latent data characteristics, and appropriate data reorganization and metadata annotation to speed up subsequent data access. As a result, PreDatA enhances the scalability and flexibility of the current I/O stack on HEC platforms and is useful for data pre-processing, runtime data analysis and inspection, as well as for data exchange between concurrently running simulation models. Performance evaluations with several production peta-scale applications on Oak Ridge National Laboratory’s Leadership Computing Facility demonstrate the feasibility and advantages of the PreDatA approach.
Towards Optimal Power Management: Estimation of Performance Degradation due to DVFS on Modern Processors

(Georgia Institute of Technology, 2010) Amur, Hrishikesh ; Prvulovic, Milos ; Schwan, Karsten

The alarming growth of the power consumption of data centers, coupled with low average utilization of servers, suggests the use of power management strategies. Such strategies, however, require an understanding of the effects of power management actions on the performance of data center applications running on managed platforms. The goal of our research is to accurately estimate power savings and consequent performance degradation from DVFS and thereby better guide the optimization of a performance/power metric of a platform. Towards that end, this paper presents precise performance and power models for DVFS strategies. Precise models are attained by better modeling the performance behavior of modern out-of-order processors, by taking into account, for instance, the effects of cache miss overlapping. Models are validated using benchmarks from the SPEC CPU2006 suite, which show that the observed degradation always falls within the predicted bounds. Also, the upper bound degradation estimates were up to 43% less than those due to a linear degradation model, which allows for the aggressive use of DVFS.
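
To make the contrast with a linear degradation model concrete, the sketch below compares a linear slowdown estimate with a simple textbook-style estimate that treats memory-stall time as independent of core frequency. It does not reproduce the models from the paper; the function names and the `mem_fraction` parameter (share of run time at full frequency spent stalled on memory) are assumptions made only for illustration.

```python
def linear_slowdown(f_max, f):
    """Linear model: run time scales inversely with core frequency."""
    if not 0 < f <= f_max:
        raise ValueError("expected 0 < f <= f_max")
    return f_max / f


def split_slowdown(f_max, f, mem_fraction):
    """Simplified model: only the compute share of run time scales with frequency.

    mem_fraction is a hypothetical input: the fraction of run time at f_max
    spent waiting on memory, assumed here to be unaffected by DVFS.
    """
    if not 0.0 <= mem_fraction <= 1.0:
        raise ValueError("mem_fraction must be in [0, 1]")
    cpu_fraction = 1.0 - mem_fraction
    return cpu_fraction * linear_slowdown(f_max, f) + mem_fraction
```

Under these assumptions, halving the frequency doubles run time in the linear model, while a workload that spends half its time on memory stalls slows down by only a factor of 1.5, so the linear model overestimates degradation for memory-bound workloads.
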
Cellule: Lightweight Execution Environment for Accelerator-based Systems

(Georgia Institute of Technology, 2010) Gavrilovska, Ada ; Gupta, Vishakha ; Schwan, Karsten ; Tembey, Priyanka ; Xenidis, Jimi

The increasing prevalence of accelerators is changing the high performance computing (HPC) landscape to one in which future platforms will consist of heterogeneous multi-core chips comprised of both general purpose and specialized cores. Coupled with this trend is increased support for virtualization, which can abstract underlying hardware to aid in dynamically managing its use by HPC applications while at the same time providing lightweight, efficient, and specialized execution environments (SEEs) for applications to maximally exploit the hardware. This paper describes the Cellule architecture, which uses virtualization to create high performance, low noise SEEs for accelerators. The paper describes important properties of Cellule and illustrates its advantages with an implementation on the IBM Cell processor. With compute-intensive workloads, performance improvements of up to 60% are attained when using Cellule’s SEE vs. the current Linux-based runtime, resulting in a system architecture that is suitable for future accelerators and specialized cores irrespective of whether they are on-chip or off-chip. A key principle, coordinated resource management for accelerator and general purpose resources, is shown to extend beyond Cell, using experimental results obtained on a different accelerator platform.