
Title: Data Intensive Analysis of Biomolecular Simulations

Conference
DOI: https://doi.org/10.1063/1.2836009 · OSTI ID: 962048

Advances in biomolecular modeling and simulation, made possible by increasingly powerful high performance computing resources, are extending molecular simulations to biologically more relevant system sizes and time scales. At the same time, advances in simulation methodologies are allowing more complex processes to be described more accurately. These developments make a systems approach to computational structural biology feasible, but this will require a focused emphasis on the comparative analysis of the increasing number of molecular simulations that are being carried out for biomolecular systems with more realistic models, in multi-component environments, and for longer simulation times. Just as in the analysis of the large data sources created by the new high-throughput experimental technologies, biomolecular computer simulations contribute to progress in biology through comparative analysis. The continuing increase in available protein structures allows the comparative analysis of the role of structure and conformational flexibility in protein function, and is the foundation of the discipline of structural bioinformatics. This creates the opportunity to derive general findings from the comparative analysis of molecular dynamics simulations of a wide range of proteins, protein-protein complexes, and other complex biological systems. Because of the importance of protein conformational dynamics for protein function, it is essential that the analysis of molecular trajectories be carried out using a novel, more integrative and systematic approach. We are developing a much-needed, rigorous, computer-science-based framework for the efficient analysis of the increasingly large data sets resulting from molecular simulations. Such a suite of capabilities will also provide the required tools for access to and analysis of a distributed library of generated trajectories.
Our research focuses on the following areas: (1) the development of an efficient analysis framework for very large scale trajectories on massively parallel architectures, (2) the development of novel methodologies that allow automated detection of events in these very large data sets, and (3) the efficient comparative analysis of multiple trajectories. The goal of the presented work is the development of new algorithms that will allow biomolecular simulation studies to become an integral tool for addressing the challenges of post-genomic biological research. The strategy for delivering data intensive computing applications that can deal effectively with the volume of simulation data that will become available is to take advantage of large globally addressable memory architectures. The first requirement is the design of a flexible underlying data structure for single large trajectories that will form an adaptable framework for a wide range of analysis capabilities. The typical approach to trajectory analysis is to process trajectories sequentially, time frame by time frame. This is the implementation found in molecular simulation codes such as NWChem. It was designed this way so that the analysis can run on workstations and other architectures whose aggregate memory is too small to hold entire trajectories in core. The consequence of this approach is an I/O-dominated solution that scales very poorly on parallel machines. We are currently developing tools specifically intended for large scale machines with sufficient main memory to hold entire trajectories in core. This greatly reduces the cost of I/O, as trajectories are read only once during the analysis.
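The I/O trade-off described above can be illustrated with a minimal Python sketch. It is not taken from NWChem or from the authors' code. The per-coordinate root-mean-square fluctuation is chosen here only as an example of an analysis that needs two passes over the data, and the names `Frame`, `open_trajectory`, `rmsf_streaming`, and `rmsf_in_core` are hypothetical. The sketch assumes a non-empty trajectory whose frames are flat lists of coordinates.

```python
import math
from typing import Callable, Iterator, List

Frame = List[float]  # hypothetical: flattened coordinates of one time frame


def rmsf_streaming(open_trajectory: Callable[[], Iterator[Frame]]) -> List[float]:
    """Frame-by-frame version: holds one frame at a time, reads the data twice."""
    n_frames = 0
    mean: List[float] = []
    for frame in open_trajectory():  # pass 1: accumulate mean coordinates
        if not mean:
            mean = [0.0] * len(frame)
        for i, x in enumerate(frame):
            mean[i] += x
        n_frames += 1
    mean = [m / n_frames for m in mean]

    squared = [0.0] * len(mean)
    for frame in open_trajectory():  # pass 2: accumulate squared deviations
        for i, x in enumerate(frame):
            squared[i] += (x - mean[i]) ** 2
    return [math.sqrt(s / n_frames) for s in squared]


def rmsf_in_core(open_trajectory: Callable[[], Iterator[Frame]]) -> List[float]:
    """In-core version: reads the data once and keeps every frame in memory."""
    frames = list(open_trajectory())  # the only read of the trajectory
    n_frames = len(frames)
    columns = list(zip(*frames))  # one tuple per coordinate, across all frames
    means = [sum(col) / n_frames for col in columns]
    return [
        math.sqrt(sum((x - m) ** 2 for x in col) / n_frames)
        for col, m in zip(columns, means)
    ]
```

Both functions return the same values. The difference is that the streaming version opens the trajectory once per pass, whereas the in-core version opens it once in total, at the price of memory proportional to the full trajectory.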
In our current Data Intensive Analysis (DIANA) implementation, each processor determines its entry point within the trajectory, which typically will be spread across multiple files, skips to that entry, and reads the appropriate frames independently of all other processors.
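The independent read step can be sketched as follows. This is a minimal illustration and not the DIANA code. The abstract does not say how frames are assigned to processors, so the contiguous block partitioning used here is an assumption, and the function names `frame_range` and `locate` are hypothetical. The sketch also assumes that the number of frames in each file is known in advance.

```python
from typing import List, Tuple


def frame_range(rank: int, nprocs: int, nframes: int) -> Tuple[int, int]:
    """Return the half-open range [start, stop) of global frames for one rank.

    Assumed policy: contiguous blocks, with the first (nframes % nprocs)
    ranks receiving one extra frame.
    """
    base, extra = divmod(nframes, nprocs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def locate(frames_per_file: List[int], start: int, stop: int) -> List[Tuple[int, int, int]]:
    """Map global frames [start, stop) to (file_index, first, last) segments.

    first and last are frame indices local to that file, with last exclusive.
    """
    segments = []
    first_global = 0  # global index of the first frame in the current file
    for file_index, count in enumerate(frames_per_file):
        lo = max(start, first_global)
        hi = min(stop, first_global + count)
        if lo < hi:  # this file holds part of the requested range
            segments.append((file_index, lo - first_global, hi - first_global))
        first_global += count
    return segments


if __name__ == "__main__":
    files = [4, 4, 2]  # hypothetical trajectory: 10 frames in three files
    for rank in range(3):
        start, stop = frame_range(rank, 3, sum(files))
        print(rank, (start, stop), locate(files, start, stop))
```

Each rank computes its own segments from its rank number alone, so no communication between processors is needed before the reads begin.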

Research Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
962048
Report Number(s):
PNNL-SA-55419; KJ0101030; TRN: US200919%%361
Resource Relation:
Conference: COMPUTATION IN MODERN SCIENCE AND ENGINEERING: Proceedings of the International Conference on Computational Methods in Science and Engineering (ICCMSE 2007). AIP Conference Proceedings, 963:1379-1382
Country of Publication:
United States
Language:
English