Datum: A Scientific Metadata Catalog
The data catalog market is currently flooded with a myriad of different products, but none serve the scientific community well. There are cloud-native tools like Databricks, Snowflake,to on-premise solutions like Collibra and Datahub. The common failing of all these tools however, is their inability to serve the scientific data community directly. Most catalogs are targeted towards financial, health, or user data - not sensor or scientific domain data. They also prioritize integrations that often don’t exist or are just starting to be used in the scientific realm - all while ignoring common scientific tools and file types. Datum is a catalog which targets the scientific data directly, including the tools and networks in which those tools are used. We work with the producers and consumers of the data where they are, targeting cloud and on-premise with a focus on classified networks. Datum is an Erlang/Elixir application. Technical Features Note: The features listed below are still under development and may change, slightly, upon final delivery of the product. File Formats - Datum has the ability to read additional metadata and provides processing pipelines for the following file formats: Plain Text, PDF, LaTeX, HTML, Open Document Format (.odt), XML, CSV/TSV (and other standard delimiters), OpenDocument Database and Spreadsheets, Geo-Referenced TIFF, Common Data Format, HDF/HDF5, LabView TDMS, Excel, DeltaTables, Parquet, Apache Iceberg, Apache Hudi and many others. Metadata Collection - Scanners for the local and networked file systems and cloud storage providers. Network integration with common databases such as MSSQL and MySQL. User Plugin System - Users are able to provide either file processing, metadata extraction, or sampling plugins in the programming language of their choice. Authentication/Authorization -: OIDC integration, SCIM provisioning and EntraID integration out of the box. Full user and group management system with a “least privilege” operating mode. Governance - Customizable data governance platform; dictate and enforce required metadata, enforce data embargos, and enforce user agreements and NDAs before data access. Ability to create health checks on data, rejecting abandoned or poorly curated data and automatically removing it from the search index. Ability for users to submit corrections. Search - Semantic search is a first class citizen. No licenses to expensive, external software required. Integrated use of vectors and vector-based search allows for AI agent integration at all levels of operation. Metadata Model - Display and control data’s lineage and connections to other data and data directories. Data is modeled after a filesystem - an organization instantly recognizable and navigable by most any user. CLI and SDK - Ships with a Command Line Interface (CLI) tool and with a fully-featured Python SDK. This allows for rapid and programmatic use of Datum by every level of user. Minimal Infrastructure - Datum ships as a single executable file and can be run on any operating system and most CPU architectures. Datum has no reliance on external databases, search indexing tools, or other outside services - and it runs equally well on edge computing devices, cloud services, or in a clustered HPC environment.
- Short Name / Acronym:
- Datum
- Software Type:
- Scientific
- License(s):
- MIT License
- Programming Language(s):
- C; Elixir; Erlang; zig
- Research Organization:
- Idaho National Laboratory (INL), Idaho Falls, ID (United States)
- Sponsoring Organization:
- USDOE Office of Nuclear Energy (NE)Primary Award/Contract Number:AC07-05ID14517
- DOE Contract Number:
- AC07-05ID14517
- Code ID:
- 155543
- OSTI ID:
- code-155543
- Country of Origin:
- United States
Similar Records
Applied Parallel Metadata Indexing
Design and Implementation of a Metadata-rich File System