A Lakehouse Architecture for the Management and Analysis of Heterogeneous Data for Biomedical Research and Mega-biobanks
- ORNL
The data lakehouse is a new paradigm in data architecture that integrates established concepts for the systematic management of disparate, large-scale data: a data lake for heterogeneous data management, open standards for high-performance querying, and systematic maintenance of data "freshness". Besides being new, the data lakehouse is still largely a conceptual construct. Many projects that use the lakehouse still need to mature, and the concept calls for empirical studies and specific implementations. In this paper, we present our implementation of the data lakehouse concept in the biomedical research and health data analytics domain. We also discuss the implementation of several novel features, including specialized access controls that support HIPAA regulations and IRB protocols, and support for the FAIR standard.
- Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-00OR22725
- OSTI ID: 1865733
- Resource Relation: Conference: 2021 IEEE International Conference on Big Data, Orlando, Florida, United States of America, 12/15/2021 to 12/18/2021
- Country of Publication: United States
- Language: English