Title:
PreDatA - Preparatory Data Analytics on Peta-Scale Machines
Authors
Zheng, Fang
Abbasi, Hasan
Docan, Ciprian
Lofstead, Jay
Klasky, Scott
Liu, Qing
Parashar, Manish
Podhorszki, Norbert
Schwan, Karsten
Wolf, Matthew
Abstract
Peta-scale scientific applications running on High
End Computing (HEC) platforms can generate large volumes of
data. For high performance storage and in order to be useful
to science end users, such data must be organized in its layout,
indexed, sorted, and otherwise manipulated for subsequent data
presentation, visualization, and detailed analysis. In addition,
scientists desire to gain insights into selected data characteristics
‘hidden’ or ‘latent’ in the massive datasets while data is being
produced by simulations. PreDatA, short for Preparatory Data
Analytics, is an approach for preparing and characterizing
data while it is being produced by the large scale simulations
running on peta-scale machines. By dedicating additional compute
nodes on the peta-scale machine as staging nodes and
staging the simulation’s output data through these nodes, PreDatA
can exploit their computational power to perform selected data
manipulations with lower latency than attainable by first moving
data into file systems and storage. Such in-transit manipulations
are supported by the PreDatA middleware through RDMA-based
data movement to reduce write latency, application-specific
operations on streaming data that are able to discover latent
data characteristics, and appropriate data reorganization and
metadata annotation to speed up subsequent data access. As a
result, PreDatA enhances the scalability and flexibility of the current
I/O stack on HEC platforms and is useful for data pre-processing,
runtime data analysis and inspection, as well as for data exchange
between concurrently running simulation models. Performance
evaluations with several production peta-scale applications on
Oak Ridge National Laboratory’s Leadership Computing Facility
demonstrate the feasibility and advantages of the PreDatA
approach.
Date Issued
2010
Resource Type
Text
Resource Subtype
Technical Report