Title: An evaluation of the state of time synchronization on leadership class supercomputers

Journal Article · Concurrency and Computation. Practice and Experience
DOI: https://doi.org/10.1002/cpe.4341 · OSTI ID: 1432152
[1]; [1]; [1]; [2]; [3]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Computer Science and Mathematics Division
  2. Univ. Autonoma de Occidente, Cali (Colombia)
  3. Univ. of New Mexico, Albuquerque, NM (United States). Dept. of Computer Science

We present a detailed examination of time agreement characteristics for nodes within extreme-scale parallel computers. Using a software tool that we introduce in this paper, we quantify attributes of clock skew among nodes in three representative high-performance computers sited at three national laboratories. Our measurements detail the statistical properties of time agreement among nodes and show how time agreement drifts over typical application execution durations. We discuss the implications of our measurements, explain why the current state of the field is inadequate, and propose strategies to address the observed shortcomings.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number:
AC05-00OR22725; AC02-06CH11357; AC02-05CH11231
OSTI ID:
1432152
Journal Information:
Concurrency and Computation. Practice and Experience, Vol. 30, Issue 4; ISSN 1532-0626
Publisher:
Wiley
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 4 works
Citation information provided by
Web of Science
