Input/Output Scalability of Genomic Alignment: How to Configure a Computational Biology Cluster
Many scientific applications are I/O-intensive, which makes optimization and scaling difficult, especially on parallel architectures. The I/O requirements of computational biology applications differ from those of other scientific applications: many computational biology applications are embarrassingly parallel and require repeated read-only access to a large global database. In this paper we examine the scalability of one such embarrassingly parallel application, psLayout, which played a crucial role in the mapping of the human genome. The study was carried out on three architectures: the native UCSC Linux cluster, a Linux cluster at Lawrence Livermore National Laboratory with a faster interconnect and NFS server, and the ASCI Blue-Pacific supercomputer. We show that a cluster equipped with a fast network and either a parallel file system or a scalable NFS server has reasonable I/O scalability. We believe that data replication becomes an important issue when scaling to larger numbers of processors, and we introduce the design of a library for automatic data replication to address it.
- Research Organization:
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
- Sponsoring Organization:
- US Department of Energy (US)
- DOE Contract Number:
- W-7405-ENG-48
- OSTI ID:
- 15006307
- Report Number(s):
- UCRL-JC-145770; TRN: US200407%%203
- Resource Relation:
- Conference: International Parallel and Distributed Processing Symposium, Fort Lauderdale, FL (US), 04/15/2002--04/19/2002; Other Information: PBD: 3 Oct 2001
- Country of Publication:
- United States
- Language:
- English
Similar Records
Design and Implementation of Ceph: A Scalable Distributed File System
A next-generation parallel file system for Linux cluster.