
A Lakehouse Architecture for the Management and Analysis of Heterogeneous Data for Biomedical Research and Mega-biobanks

Conference

The data lakehouse is a new paradigm in data architectures that integrates established concepts for the systematic management of disparate, large-scale data: a data lake for heterogeneous data management, open standards for high-performance querying, and systematic maintenance of data "freshness". Besides being new, the data lakehouse is still largely a conceptual construct. Many projects that use the lakehouse require maturing, empirical studies, and specific implementations. In this paper, we present our implementation of the data lakehouse concept in the biomedical research and health data analytics domain, and we discuss the implementation of some unique and novel features, such as specialized access controls in support of HIPAA regulations and IRB protocols, and support for the FAIR standard.

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1865733
Resource Relation:
Conference: 2021 IEEE International Conference on Big Data, Orlando, Florida, United States of America, 12/15/2021 to 12/18/2021
Country of Publication:
United States
Language:
English
