
- Fault Injection Framework for System Resilience Fake Faults for Finding Future Failures
- Design and Development of Prototype Components for the Harness
- System-Level Virtualization Research at Oak Ridge National Laboratory
- An Analysis of HPC Benchmarks in Virtual Machine Environments
- HPC Resiliency Summit: Workshop on Resiliency for Petascale HPC, Los Alamos Computer Science Symposium, Santa Fe, NM, 10/15/2008
- Christian Engelmann1,2 and Al Geist1 1Computer Science and Mathematics Division
- THE UNIVERSITY OF READING Symmetric Active/Active High Availability
- Symmetric Active/Active High Availability for High-Performance Computing System Services: Accomplishments and Limitations
- 16th Mar, 2007 of 27 Bjoern Koenning1 Virtualized Environments
- April 12, 2007 On Programming Models for Service-Level High Availability 1/30 On Programming Models for
- High Availability for Ultra-Scale High-End Scientific Christian Engelmann
- Hybrid Checkpointing for MPI Jobs in HPC Environments Chao Wang1, Frank Mueller1, Christian Engelmann2, Stephen L. Scott2
- Fault Injection Framework for System Resilience Thomas Naughton, Wesley Bland*, Geoffroy Valle,
- October 17, 2006 Towards High Availability for High-Performance Computing System Services
- Super-Scalable Algorithms for Computing on 100,000 Processors
- High Availability for the Lustre File System
- PDP Toulouse, France Feb 2008 System-level Virtualization for High
- June 4, 2007 Advanced Fault Tolerance Solutions for High Performance Computing
- High Performance Computing with Harness over InfiniBand A. Valentini1
- Development and Implementation of a RAS Framework Prototype for HPC Environments
- University of Rome "Tor Vergata" Information Engineering Dept. Software Applied Research &
- THE CASE FOR MODULAR REDUNDANCY IN LARGE-SCALE HIGH PERFORMANCE COMPUTING SYSTEMS
- Symmetric Active/Active High Availability for High-Performance
- Design and Development of Prototype Components for the Harness
- May 22, 2008 Symmetric Active/Active High Availability for High-Performance Computing System Services: Accomplishments and Limitations
- Asymmetric Active-Active High Availability for High-end C. Leangsuksun
- Exploring Process Groups for Reliability, Availability and Serviceability of Terascale Computing Systems
- Scalable System Monitoring Christian Engelmann
- Distributed Real-Time Computing with Emanuele Di Saverio1
- On Programming Models for Service-Level High Availability C. Engelmann1,2, S. L. Scott1, C. Leangsuksun3, X. He4
- Diskless Checkpointing on Super-scale Architectures
- Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments
- Feb. 11, 2008 Advanced Fault Tolerance Solutions for High Performance Computing 1/47 Advanced Fault Tolerance Solutions
- Towards High Availability for High-Performance Computing System Services: Accomplishments and Limitations
- Presenting a thesis on... Simulation of Large Scale Architectures on High
- Sep 26, 2006 K. Uhlemann, C. Engelmann, and S.L. Scott -The University of Reading and Oak Ridge National Laboratory
- Concepts for High Availability in Scientific High-End Computing
- 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP), Weimar, Germany, Feb. 18-20, 2009
- Super-scalable Algorithms Next Generation Supercomputing on
- Facilitating Co-Design for Extreme-Scale Systems Through Lightweight Simulation
- Achieving Computational I/O Efficiency in a High Performance Cluster Using Multicore Processors
- System-Level Virtualization for High Performance Computing Geoffroy Vallee
- Dr. Christian Engelmann and Dr. Stephen L. Scott Computer Science and Mathematics Division
- Proactive Fault Tolerance for HPC with Xen Virtualization Arun Babu Nagarajan1
- A Fast Delivery Protocol for Total Order Broadcasting Xubin (Ben) He
- March 14, 2007 Towards High Availability for High-Performance Computing System Services
- Christian Engelmann, PhD Computer Science Research Group www.csm.ornl.gov/engelman engelmannc@ornl.gov
- High-End Computing Resilience: Analysis of Issues Facing the HEC Community and Path Forward for
- Symmetric Active/Active Metadata Service for High Availability Parallel File Systems
- Symmetric Active/Active High Availability for High-Performance Computing System Services
- MOLAR: Adaptive Runtime Support for High-End Computing Operating and Runtime Systems
- A PROACTIVE FAULT TOLERANCE FRAMEWORK FOR HIGH-PERFORMANCE COMPUTING
- A Proactive Fault Tolerance Framework for High-Performance
- Managed by UT-Battelle for the Department of Energy George Ostrouchov -ostrouchovg@ornl.gov
- Performance Comparison of Two Virtual Machine Scenarios Using an HPC Application
- Evaluating the Shared Root File System Approach for Diskless High-Performance
- Managed by UT-Battelle for the Department of Energy
- Proactive Fault Tolerance Using Preemptive Migration C. Engelmann, G. R. Vallee, T. Naughton, and S. L. Scott
- Proactive Process-Level Live Migration in HPC Environments
- Virtual System Environments Geoffroy Vallee and Thomas Naughton and Hong Ong and Anand Tikotekar and
- European Conference on Parallel and Distributed Computing (Euro-Par) Workshop on Virtualization in High-Performance Cluster and Grid Computing (VHPC)
- An Online Controller Towards Self-Adaptive File System Availability and Performance
- Tennessee Technological University 2008-4-11 1
- Effects of Virtualization on a Scientific Application Running a Hyperspectral Radiative Transfer Code on Virtual Machines
- Symmetric Active/Active Replication for Dependent Services C. Engelmann1,2, S. L. Scott1, C. Leangsuksun3, and X. He4
- A Framework For Proactive Fault Tolerance Geoffroy Vallee
- Virtualized Environments for the Harness High Performance Computing B. Konning1,2, C. Engelmann1,2, S. L. Scott1, and G. A. Geist1
- February 13, 2008 Virtualized Environments for the Harness High Performance Computing Workbench 1/17 Virtualized Environments for the Harness
- SYMMETRIC ACTIVE/ACTIVE METADATA SERVICE FOR HIGHLY AVAILABLE CLUSTER STORAGE SYSTEMS
- A Fast Delivery Protocol for Total Order Broadcasting
- Arun Babu Nagarajan, Frank Mueller Christian Engelmann, Stephen L. Scott
- Middleware in Modern High Performance Computing System Architectures
- May 17, 2007 Transparent Symmetric Active/Active Replication for Service-Level High Availability 1/29 Transparent Symmetric Active/Active Replication
- A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance , Frank Mueller1
- A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance
- Achieving Computational I/O Efficiency in a High Performance Cluster Using Multicore Processors
- JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management
- A Parallel Plug-in Programming Paradigm Ronald Baumann1,2
- Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Jyothish Varma1
- Jyothish Varma1, Chao Wang1, Frank Mueller1, Christian Engelmann2, Stephen L. Scott2
- RMIX: A Dynamic, Heterogeneous, Reconfigurable Communication Framework
- Concepts for High Availability in Scientific High-End Computing C. Engelmann1,2 and S. L. Scott1
- High Availability for Ultra-Scale Scientific High-End Computing
- Asymmetric / Active-Active High-Availability for
- Super-Scalable Algorithms for Computing on 100,000 Processors
- A Lightweight Kernel for the Harness Metacomputing Framework C. Engelmann and G. A. Geist
- Christian Engelmann and Al Geist Oak Ridge National Laboratory
- C. Engelmann, S. L. Scott, G. A. Geist Oak Ridge National Laboratory
- A Highly Available Cluster Storage System Using Scavenging Xubin (Ben) He, Li Ou
- A Diskless Checkpointing Algorithm for Super-scale Architectures Applied to the Fast Fourier Transform
- Diskless Checkpointing on Super-scale Architectures
- Distributed Peer-to-Peer Control in Harness
- Beyond Application-Level Checkpoint Restart Advanced Software Approaches
- Resilience Challenges at the Exascale Christian Engelmann
- Dr. Christian Engelmann Computer Science and Mathematics Division
- Modeling Techniques Towards Christian Engelmann
- 1 Managed by UT-Battelle for the Department of Energy
- High-Performance Computing Research at Oak Ridge National Laboratory
- Dagstuhl Seminar on Fault Tolerance in High-Performance Computing and Grids, Schloss Dagstuhl, Wadern, Germany, May 3-8, 2009.
- Presented by Resiliency for high-performance
- 15 April 2005 Christian Engelmann, Oak Ridge National Laboratory
- Distributed Peer-to-Peer Control Christian Engelmann
- RAS Framework Prototype Real-time Data Reduction of Monitoring Data
- School of Systems Engineering MSc Dissertation Presentation
- Virtualized Environments for the Harness Workbench
- High Availability for High-End Scientific A Dissertation
- System design Active/Active HA Job Scheduler and
- Active/Active Replication for Highly Available HPC System Services C. Engelmann1,2, S. L. Scott1, C. Leangsuksun3, X. He4
- PDP Toulouse, France Feb 2008 A Framework for Proactive Fault
- JCAS -IAA Simulation Efforts at Oak Ridge National Laboratory
- May 28, 2007 Middleware in Modern High Performance Computing System Architectures 1/20 Middleware in Modern High Performance
- C. Engelmann -University of Reading and Oak Ridge National Laboratory High Availability for Ultra-scale Scientific High-End Computing 1/48
- May 12, 2005 Christian Engelmann, Oak Ridge National Laboratory
- Diplomarbeit Zur Erlangung des akademischen Grades eines
- Distributed Peer-to-Peer Control in Harness C. Engelmann, S. L. Scott, G. A. Geist
- High-Performance Computing Research at Oak Ridge National Laboratory
- Christian Engelmann and Stephen L. Scott Computer Science and Mathematics Division
- Facilitating Co-Design for Extreme-Scale Systems Through Lightweight Simulation Christian Engelmann and Frank Lauer
- 10th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), Innsbruck, Austria, Feb. 15-17, 2011
- 3rd Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2009, Nuremberg, Germany, March 30, 2009
- High Availability for Ultra-Scale Scientific High-End Computing
- Nov. 23, 2007 Symmetric Active/Active Metadata Service for Highly Available Cluster Storage Systems 1/14 Symmetric Active/Active Metadata Service for
- Transparent Symmetric Active/Active Replication for Service-Level High Availability
- Nonparametric Multivariate Anomaly Analysis in Support of HPC Resilience G. Ostrouchov, T. Naughton, C. Engelmann, G. Vallee, and S. L. Scott
- RAS Framework Engine Prototype A Dissertation
- September 26, 2005 Christian Engelmann, Oak Ridge National Laboratory
- Proactive Fault Tolerance Using Preemptive Migration
- Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and
- Simulation of Advanced Large-Scale HPC Architectures Simulation of Advanced Large-Scale HPC
- October 10, 2007 Service-Level High Availability in Parallel and Distributed Systems 1/35 Service-Level High Availability in
- April 22, 2006 Active/Active Replication for Highly Available HPC System Services 1/26 Active/Active Replication for Highly
- REDUNDANT EXECUTION OF HPC APPLICATIONS WITH MR-MPI Christian Engelmann and Swen Bohm
- April 17, 2006 C. Engelmann, S.L. Scott -Oak Ridge National Laboratory
- A Parallel Plug-in Programming Paradigm
- Operating System Research at ORNL: System-level Virtualization
- Effects of Virtualization on a Scientific Application Running a Hyperspectral Radiative Transfer Code on Virtual Machines
- CS258 S99 1 NOW Handout Page 1
- March 5 2008 Symmetric Active/Active Replication for Dependent Services 1/27 Symmetric Active/Active Replication
- 27th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), Innsbruck, Austria, Feb. 16-18, 2009
- High Availability through Distributed Control C. Engelmann, S. L. Scott, G. A. Geist
- Proactive Process-Level Live Migration in HPC Environments , Frank Mueller1
- SIMULATION OF LARGE SCALE ARCHITECTURES ON HIGH PERFORMANCE
- High Availability for the Lustre File System A Dissertation
- June 8, 2007 Advanced Fault Tolerance Solutions for High Performance Computing
- Hybrid Checkpointing for MPI Jobs in HPC Environments