Title:
ClusterWatch: Flexible, Lightweight Monitoring for High-end GPGPU Clusters
Authors
Slawinska, Magdalena
Schwan, Karsten
Eisenhauer, Greg
Abstract
The ClusterWatch middleware provides runtime
flexibility in what system-level metrics are monitored, how frequently
such monitoring is done, and how metrics are combined
to obtain reliable information about the current behavior of
GPGPU clusters. Interesting attributes of ClusterWatch are (1)
the ease with which different metrics can be added to the
system—by simply deploying additional “cluster spies,” (2) the
ability to filter and process monitoring metrics at their sources,
to reduce data movement overhead, (3) flexibility in the rate at
which monitoring is done, (4) efficient movement of monitoring
data into backend stores for long-term or historical analysis, and
most importantly, (5) specific support for monitoring the behavior
and use of the GPGPUs used by applications. This paper presents
our initial experiences with using ClusterWatch to assess the performance
behavior of a larger-scale GPGPU-based simulation
code. We report the overheads seen when using ClusterWatch,
the experimental results obtained for the simulation, and the
manner in which ClusterWatch will interact with infrastructures
for detailed program performance monitoring and profiling such
as TAU or Lynx. Experiments conducted on the NICS Keeneland
Initial Delivery System (KIDS), with up to 64 nodes, demonstrate
low monitoring overheads for high-fidelity assessments of the
simulation’s performance behavior, for both its CPU and GPU
components.
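
As a rough illustration of attributes (2) and (3) above, the sketch below shows one way a metric collector could combine an adjustable sampling period with filtering at the source. It is not ClusterWatch's actual interface: the class name `Spy`, its parameters (`read_metric`, `period_s`, `min_delta`), and the record layout are all hypothetical, chosen only to make the idea concrete.

```python
from typing import Callable, Dict, Optional


class Spy:
    """Hypothetical collector for one metric (not the ClusterWatch API).

    It samples at most once per `period_s` seconds and forwards a reading
    only if it differs from the last forwarded value by at least `min_delta`,
    so unchanged readings never leave the node.
    """

    def __init__(self, name: str, read_metric: Callable[[], float],
                 period_s: float, min_delta: float) -> None:
        self.name = name
        self.read_metric = read_metric
        self.period_s = period_s      # sampling period, adjustable at runtime
        self.min_delta = min_delta    # source-side filter threshold
        self._last_sent: Optional[float] = None
        self._next_due = 0.0

    def poll(self, now: float) -> Optional[Dict[str, object]]:
        """Return a record to forward, or None if nothing should be sent."""
        if now < self._next_due:
            return None               # not time to sample yet
        self._next_due = now + self.period_s
        value = self.read_metric()
        if (self._last_sent is not None
                and abs(value - self._last_sent) < self.min_delta):
            return None               # change too small: filtered at the source
        self._last_sent = value
        return {"metric": self.name, "t": now, "value": value}
```

Under these assumptions, a caller would invoke `poll` from its own loop with a monotonic timestamp and forward any non-`None` record to a backend store; changing `period_s` on a live object alters the monitoring rate without redeploying the collector.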
Date Issued
2013
Resource Type
Text
Resource Subtype
Technical Report