OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Large-scale data mining pilot project in human genome

Abstract

This whitepaper briefly describes a new, aggressive effort in large-scale data mining at Livermore National Labs. The implications of `large-scale` will be clarified in a later section. In the short term, this effort will focus on several mission-critical questions of the Genome project. We will adapt current data mining techniques to the Genome domain, quantify the accuracy of inference results, and lay the groundwork for a more extensive effort in large-scale data mining. A major aspect of the approach is that we will be fully-staffed data warehousing effort in the human Genome area. The long-term goal is a strong applications-oriented research program in large-scale data mining. The tools and skill set gained will be directly applicable to a wide spectrum of tasks involving a for large spatial and multidimensional data. This includes applications in ensuring non-proliferation, stockpile stewardship, enabling Global Ecology (Materials Database Industrial Ecology), advancing the Biosciences (Human Genome Project), and supporting data for others (Battlefield Management, Health Care).

Authors:
Musick, R.; Fidelis, R.; Slezak, T.
Publication Date:
1 May 1997
Research Org.:
Lawrence Livermore National Lab., CA (United States)
Sponsoring Org.:
USDOE, Washington, DC (United States)
OSTI Identifier:
647050
Report Number(s):
UCRL-JC-127338; CONF-9705227-
ON: DE98051372
DOE Contract Number:
W-7405-ENG-48
Resource Type:
Conference
Resource Relation:
Conference: Workshop on research and development opportunities in Federal Information Systems, Arlington, VA (United States), 13-14 May 1997; Other Information: PBD: 1 May 1997
Country of Publication:
United States
Language:
English
Subject:
99 MATHEMATICS, COMPUTERS, INFORMATION SCIENCE, MANAGEMENT, LAW, MISCELLANEOUS; 55 BIOLOGY AND MEDICINE, BASIC STUDIES; HUMAN FACTORS ENGINEERING; HUMAN POPULATIONS; LAWRENCE LIVERMORE NATIONAL LABORATORY; INFORMATION SYSTEMS; DATA PROCESSING; PATTERN RECOGNITION; GENETIC MAPPING; DNA SEQUENCING

Citation Formats

Musick, R., Fidelis, R., and Slezak, T. Large-scale data mining pilot project in human genome. United States: N. p., 1997. Web.
Musick, R., Fidelis, R., & Slezak, T. (1997). Large-scale data mining pilot project in human genome. United States.
Musick, R., Fidelis, R., and Slezak, T. 1997. "Large-scale data mining pilot project in human genome". United States. https://www.osti.gov/servlets/purl/647050.
@article{osti_647050,
title = {Large-scale data mining pilot project in human genome},
author = {Musick, R. and Fidelis, R. and Slezak, T.},
abstractNote = {This whitepaper briefly describes a new, aggressive effort in large-scale data mining at Livermore National Labs. The implications of `large-scale` will be clarified in a later section. In the short term, this effort will focus on several mission-critical questions of the Genome project. We will adapt current data mining techniques to the Genome domain, quantify the accuracy of inference results, and lay the groundwork for a more extensive effort in large-scale data mining. A major aspect of the approach is that we will be fully-staffed data warehousing effort in the human Genome area. The long-term goal is a strong applications-oriented research program in large-scale data mining. The tools and skill set gained will be directly applicable to a wide spectrum of tasks involving a for large spatial and multidimensional data. This includes applications in ensuring non-proliferation, stockpile stewardship, enabling Global Ecology (Materials Database Industrial Ecology), advancing the Biosciences (Human Genome Project), and supporting data for others (Battlefield Management, Health Care).},
doi = {},
journal = {},
number = {},
volume = {},
place = {United States},
year = 1997,
month = 5
}

  • This meeting was held June 10, 1996 at Georgetown University. The purpose of this meeting was to provide a multidisciplinary forum for exchange of state-of-the-art information on the human genome education model. Topics of discussion include the following: psychosocial issues; ethical issues for professionals; legislative issues and update; and education issues.
  • Dothideomycetes is the largest and most diverse class of ascomycete fungi with 23 orders, 110 families, 1300 genera and over 19,000 known species. We present comparative analysis of 70 Dothideomycete genomes including over 50 that we sequenced and are as yet unpublished. This extensive sampling has almost quadrupled the previous study of 18 species and uncovered a 10-fold range of genome sizes. We were able to clarify the phylogenetic positions of several species whose origins were unclear in previous morphological and sequence comparison studies. We analyzed selected gene families including proteases, transporters and small secreted proteins and show that major differences in gene content are influenced by speciation.
  • In this paper, we analyze and optimize the most time-consuming steps of the SWAP-Assembler, a parallel genome assembler, so that it can scale to a large number of cores for huge genomes with the size of sequencing data ranging from terabytes to petabytes. According to the performance analysis results, the most time-consuming steps are input parallelization, k-mer graph construction, and graph simplification (edge merging). For the input parallelization, the input data is divided into virtual fragments with nearly equal size, and the start position and end position of each fragment are automatically separated at the beginning of the reads. In k-mer graph construction, in order to improve the communication efficiency, the message size is kept constant between any two processes by proportionally increasing the number of nucleotides to the number of processes in the input parallelization step for each round. The memory usage is also decreased because only a small part of the input data is processed in each round. With graph simplification, the communication protocol reduces the number of communication loops from four to two and decreases the idle communication time. The optimized assembler is denoted as SWAP-Assembler 2 (SWAP2). In our experiments using a 1000 Genomes project dataset of 4 terabytes (the largest dataset ever used for assembling) on the supercomputer Mira, the results show that SWAP2 scales to 131,072 cores with an efficiency of 40%. We also compared our work with both the HipMer assembler and the SWAP-Assembler. On the Yanhuang dataset of 300 gigabytes, SWAP2 shows a 3X speedup and 4X better scalability compared with the HipMer assembler and is 45 times faster than the SWAP-Assembler. The SWAP2 software is available at https://sourceforge.net/projects/swapassembler.
  • The third planning workshop of the Human Genome Diversity Project was held on the campus of the US National Institutes of Health in Bethesda, Maryland, from February 16 through February 18, 1993. The second day of the workshop was devoted to an exploration of the ethical and human-rights implications of the Project. This open meeting centered on three roundtables, involving 12 invited participants, and the resulting discussions among all those present. Attendees and their affiliations are listed in the attached Appendix A. The discussion was guided by a schedule and list of possible issues, distributed to all present and attached as Appendix B. This is a relatively complete, and thus lengthy, summary of the comments at the meeting. The beginning of the summary sets out as conclusions some issues on which there appeared to be widespread agreement, but those conclusions are not intended to serve as a set of detailed recommendations. The meeting organizer is distributing his recommendations in a separate memorandum; recommendations from others who attended the meeting are welcome and will be distributed by the meeting organizer to the participants and to the Project committee.
  • The history and reasons for launching the Human Genome project and the current uses of genetic human material; identifying and discussing the major issues stemming directly from genetic research and therapy, including genetic discrimination, medical/personal privacy, allocation of government resources and individual finances, and the effect on the way in which we perceive the value of human life; discussing the sometimes hidden ethical, social and legislative implications of genetic research and therapy such as informed consent, screening and preservation of genetic materials, efficacy of medical procedures, the role of the government, and equal access to medical coverage.