OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Further Automate Planned Cluster Maintenance to Minimize System Downtime during Maintenance Windows

Abstract

This report documents the integration and testing of the automated update process of compute clusters in LC to minimize impact to user productivity. Description: A set of scripts will be written and deployed to further standardize cluster maintenance activities and minimize downtime during planned maintenance windows. Completion Criteria: When the scripts have been deployed and used during planned maintenance windows and a timing comparison is completed between the existing process and the new, more automated process, this milestone is complete. This milestone was completed on Aug 23, 2016, on the new CTS1 cluster called Jade when a request to upgrade the version of TOSS 3 was initiated while SWL jobs and normal user jobs were running. Jobs that were running when the update to the system began continued to run to completion. New jobs on the cluster started on the new release of TOSS 3. No system administrator action was required. Current update procedures in TOSS 2 begin by killing all users' jobs. Then all diskful nodes are updated, which can take a few hours. Only after the updates are applied are all nodes rebooted and finally put back into service. A system administrator is required for all steps. In terms of human time spent during a cluster OS update, the TOSS 3 automated procedure on Jade took 0 FTE hours. Doing the same update without the Toss Update Tool would have required 4 FTE hours.

Authors:
Springmeyer, R. [1]
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Publication Date:
2016-09-13
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1325869
Report Number(s):
LLNL-TR-702862
DOE Contract Number:  
AC52-07NA27344
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Springmeyer, R. Further Automate Planned Cluster Maintenance to Minimize System Downtime during Maintenance Windows. United States: N. p., 2016. Web. doi:10.2172/1325869.
Springmeyer, R. Further Automate Planned Cluster Maintenance to Minimize System Downtime during Maintenance Windows. United States. https://doi.org/10.2172/1325869
Springmeyer, R. 2016. "Further Automate Planned Cluster Maintenance to Minimize System Downtime during Maintenance Windows". United States. https://doi.org/10.2172/1325869. https://www.osti.gov/servlets/purl/1325869.
@techreport{osti_1325869,
title = {Further Automate Planned Cluster Maintenance to Minimize System Downtime during Maintenance Windows},
author = {Springmeyer, R.},
abstractNote = {This report documents the integration and testing of the automated update process of compute clusters in LC to minimize impact to user productivity. Description: A set of scripts will be written and deployed to further standardize cluster maintenance activities and minimize downtime during planned maintenance windows. Completion Criteria: When the scripts have been deployed and used during planned maintenance windows and a timing comparison is completed between the existing process and the new, more automated process, this milestone is complete. This milestone was completed on Aug 23, 2016, on the new CTS1 cluster called Jade when a request to upgrade the version of TOSS 3 was initiated while SWL jobs and normal user jobs were running. Jobs that were running when the update to the system began continued to run to completion. New jobs on the cluster started on the new release of TOSS 3. No system administrator action was required. Current update procedures in TOSS 2 begin by killing all users' jobs. Then all diskful nodes are updated, which can take a few hours. Only after the updates are applied are all nodes rebooted and finally put back into service. A system administrator is required for all steps. In terms of human time spent during a cluster OS update, the TOSS 3 automated procedure on Jade took 0 FTE hours. Doing the same update without the Toss Update Tool would have required 4 FTE hours.},
doi = {10.2172/1325869},
url = {https://www.osti.gov/biblio/1325869},
institution = {Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)},
number = {LLNL-TR-702862},
place = {United States},
year = {2016},
month = {9}
}