U.S. Department of Energy
Office of Scientific and Technical Information

Using Monitoring Data to Improve HPC Performance via Network-Data-Driven Allocation.

Conference

Abstract not provided.

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Sandia National Lab. (SNL-CA), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA), Office of Defense Nuclear Security
DOE Contract Number:
NA0003525
OSTI ID:
1891963
Report Number(s):
SAND2021-10802C; 700889
Resource Relation:
Conference: Proposed for presentation at the IEEE HPEC held September 20-24, 2021
Country of Publication:
United States
Language:
English