skip to main content

SciTech ConnectSciTech Connect

Title: Further Automate Planned Cluster Maintenance to Minimize System Downtime during Maintenance Windows

This report documents the integration and testing of the automated update process of compute clusters in LC to minimize impact to user productivity. Description: A set of scripts will be written and deployed to further standardize cluster maintenance activities and minimize downtime during planned maintenance windows. Completion Criteria: When the scripts have been deployed and used during planned maintenance windows and a timing comparison is completed between the existing process and the new more automated process, this milestone is complete. This milestone was completed on Aug 23, 2016 on the new CTS1 cluster called Jade when a request to upgrade the version of TOSS 3 was initiated while SWL jobs and normal user jobs were running. Jobs that were running when the update to the system began continued to run to completion. New jobs on the cluster started on the new release of TOSS 3. No system administrator action was required. Current update procedures in TOSS 2 begin by killing all users jobs. Then all diskfull nodes are updated, which can take a few hours. Only after the updates are applied are all nodes are rebooted, and then finally put back into service. A system administrator is required for allmore » steps. In terms of human time spent during a cluster OS update, the TOSS 3 automated procedure on Jade took 0 FTE hours. Doing the same update without the Toss Update Tool would have required 4 FTE hours.« less
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Publication Date:
OSTI Identifier:
Report Number(s):
DOE Contract Number:
Resource Type:
Technical Report
Research Org:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org:
Country of Publication:
United States