DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: An evaluation of the state of time synchronization on leadership class supercomputers

Abstract

We present a detailed examination of time agreement characteristics for nodes within extreme-scale parallel computers. Using a software tool we introduce in this paper, we quantify attributes of clock skew among nodes in three representative high-performance computers sited at three national laboratories. Our measurements detail the statistical properties of time agreement among nodes and how time agreement drifts over typical application execution durations. We discuss the implications of our measurements, why the current state of the field is inadequate, and propose strategies to address observed shortcomings.

Authors:
Jones, Terry [1]; Ostrouchov, George [1]; Koenig, Gregory A. [1]; Mondragon, Oscar H. [2]; Bridges, Patrick G. [3]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Computer Science and Mathematics Division
  2. Univ. Autonoma de Occidente, Cali (Colombia)
  3. Univ. of New Mexico, Albuquerque, NM (United States). Dept. of Computer Science
Publication Date:
2017-10-09
Research Org.:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1432152
Grant/Contract Number:  
AC05-00OR22725; AC02-06CH11357; AC02-05CH11231
Resource Type:
Accepted Manuscript
Journal Name:
Concurrency and Computation. Practice and Experience
Additional Journal Information:
Journal Volume: 30; Journal Issue: 4; Journal ID: ISSN 1532-0626
Publisher:
Wiley
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; clock synchronization; large-scale systems; system software; time service

Citation Formats

Jones, Terry, Ostrouchov, George, Koenig, Gregory A., Mondragon, Oscar H., and Bridges, Patrick G. An evaluation of the state of time synchronization on leadership class supercomputers. United States: N. p., 2017. Web. doi:10.1002/cpe.4341.
Jones, Terry, Ostrouchov, George, Koenig, Gregory A., Mondragon, Oscar H., & Bridges, Patrick G. An evaluation of the state of time synchronization on leadership class supercomputers. United States. https://doi.org/10.1002/cpe.4341
Jones, Terry, Ostrouchov, George, Koenig, Gregory A., Mondragon, Oscar H., and Bridges, Patrick G. 2017. "An evaluation of the state of time synchronization on leadership class supercomputers". United States. https://doi.org/10.1002/cpe.4341. https://www.osti.gov/servlets/purl/1432152.
@article{osti_1432152,
title = {An evaluation of the state of time synchronization on leadership class supercomputers},
author = {Jones, Terry and Ostrouchov, George and Koenig, Gregory A. and Mondragon, Oscar H. and Bridges, Patrick G.},
abstractNote = {We present a detailed examination of time agreement characteristics for nodes within extreme-scale parallel computers. Using a software tool we introduce in this paper, we quantify attributes of clock skew among nodes in three representative high-performance computers sited at three national laboratories. Our measurements detail the statistical properties of time agreement among nodes and how time agreement drifts over typical application execution durations. We discuss the implications of our measurements, why the current state of the field is inadequate, and propose strategies to address observed shortcomings.},
doi = {10.1002/cpe.4341},
journal = {Concurrency and Computation. Practice and Experience},
number = 4,
volume = 30,
place = {United States},
year = {2017},
month = {10}
}
