U.S. Department of Energy
Office of Scientific and Technical Information

A Lakehouse Architecture for the Management and Analysis of Heterogeneous Data for Biomedical Research and Mega-biobanks

Conference

The data lakehouse is a new paradigm in data architecture that integrates established concepts for the systematic management of disparate, large-scale data: a data lake for heterogeneous data management, open standards for high-performance querying, and systematic maintenance of data "freshness". Besides being new, the data lakehouse is still largely a conceptual construct. Many projects that use the lakehouse still require maturation, empirical study, and concrete implementations. In this paper, we present our implementation of the data lakehouse concept in the biomedical research and health data analytics domain, and we discuss the implementation of some unique and novel features, such as specialized access controls that support HIPAA regulations and IRB protocols, and support for the FAIR standard.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1865733
Country of Publication:
United States
Language:
English
