Using Monitoring Data to Improve HPC Performance via Network-Data-Driven Allocation.

Zhang, Yijia; Aksar, Burak; Aaziz, Omar; Schwaller, Benjamin; Brandt, James; Leung, Vitus; Egele, Manuel; Coskun, Ayse K.

doi:10.1109/HPEC49654.2021.9622783

Conference · Wed Sep 01 04:00:00 EDT 2021

DOI: https://doi.org/10.1109/HPEC49654.2021.9622783 · OSTI ID: 1891963

Abstract not provided.

Research Organization: Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States); Sandia National Laboratories, Livermore, CA

Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA), Office of Defense Nuclear Security (NA-70)

DOE Contract Number: NA0003525

OSTI ID: 1891963

Report Number(s): SAND2021-10802C; 700889

Country of Publication: United States

Language: English
