OSTI.GOV: U.S. Department of Energy
Office of Scientific and Technical Information

Title: Skluma: An extensible metadata extraction pipeline for disorganized data

Abstract

To mitigate the effects of high-velocity data expansion and to automate the organization of filesystems and data repositories, we have developed Skluma, a system that automatically processes a target filesystem or repository, extracts content- and context-based metadata, and organizes the extracted metadata for subsequent use. Skluma is able to extract diverse metadata, including aggregate values derived from embedded structured data; named entities and latent topics buried within free-text documents; and content encoded in images. Skluma implements an overarching probabilistic pipeline to extract increasingly specific metadata from files. It applies machine learning methods to determine file types, dynamically prioritizes and then executes a suite of metadata extractors, and explores contextual metadata based on relationships among files. The derived metadata, represented in JSON, describes probabilistic knowledge of each file that may be subsequently used for discovery or organization. Skluma's architecture enables it to be deployed locally or used as an on-demand, cloud-hosted service to create and execute dynamic extraction workflows on massive numbers of files. It is modular and extensible, allowing users to contribute their own specialized metadata extractors. Thus far we have tested Skluma on local filesystems, remote FTP-accessible servers, and publicly accessible Globus endpoints. We have demonstrated its efficacy by applying it to a scientific environmental data repository of more than 500,000 files. We show that we can extract metadata from those files with modest cloud costs in a few hours.

Authors:
Skluzacek, Tyler J.; Kumar, Rohan; Chard, Ryan; Harrison, Galen; Beckman, Paul; Chard, Kyle; Foster, Ian T.
Publication Date:
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
University of Chicago; USDOE Office of Science (SC); National Institutes of Health (NIH)
OSTI Identifier:
1558658
DOE Contract Number:  
AC02-06CH11357
Resource Type:
Conference
Resource Relation:
Conference: 14th IEEE International Conference on eScience, 10/29/18 - 11/01/18, Amsterdam, NL
Country of Publication:
United States
Language:
English
Subject:
Metadata extraction; data swamp

Citation Formats

Skluzacek, Tyler J., Kumar, Rohan, Chard, Ryan, Harrison, Galen, Beckman, Paul, Chard, Kyle, and Foster, Ian T. Skluma: An extensible metadata extraction pipeline for disorganized data. United States: N. p., 2018. Web. doi:10.1109/eScience.2018.00040.
Skluzacek, Tyler J., Kumar, Rohan, Chard, Ryan, Harrison, Galen, Beckman, Paul, Chard, Kyle, & Foster, Ian T. Skluma: An extensible metadata extraction pipeline for disorganized data. United States. doi:10.1109/eScience.2018.00040.
Skluzacek, Tyler J., Kumar, Rohan, Chard, Ryan, Harrison, Galen, Beckman, Paul, Chard, Kyle, and Foster, Ian T. Mon . "Skluma: An extensible metadata extraction pipeline for disorganized data". United States. doi:10.1109/eScience.2018.00040.
@inproceedings{osti_1558658,
title = {Skluma: An extensible metadata extraction pipeline for disorganized data},
author = {Skluzacek, Tyler J. and Kumar, Rohan and Chard, Ryan and Harrison, Galen and Beckman, Paul and Chard, Kyle and Foster, Ian T.},
abstractNote = {To mitigate the effects of high-velocity data expansion and to automate the organization of filesystems and data repositories, we have developed Skluma, a system that automatically processes a target filesystem or repository, extracts content- and context-based metadata, and organizes the extracted metadata for subsequent use. Skluma is able to extract diverse metadata, including aggregate values derived from embedded structured data; named entities and latent topics buried within free-text documents; and content encoded in images. Skluma implements an overarching probabilistic pipeline to extract increasingly specific metadata from files. It applies machine learning methods to determine file types, dynamically prioritizes and then executes a suite of metadata extractors, and explores contextual metadata based on relationships among files. The derived metadata, represented in JSON, describes probabilistic knowledge of each file that may be subsequently used for discovery or organization. Skluma's architecture enables it to be deployed locally or used as an on-demand, cloud-hosted service to create and execute dynamic extraction workflows on massive numbers of files. It is modular and extensible, allowing users to contribute their own specialized metadata extractors. Thus far we have tested Skluma on local filesystems, remote FTP-accessible servers, and publicly accessible Globus endpoints. We have demonstrated its efficacy by applying it to a scientific environmental data repository of more than 500,000 files. We show that we can extract metadata from those files with modest cloud costs in a few hours.},
doi = {10.1109/eScience.2018.00040},
booktitle = {14th IEEE International Conference on eScience},
place = {United States},
year = {2018},
month = {10}
}

