Knowledge Oriented Graph Unified Transformer (KOGUT) v0.1


Abstract

KOGUT — Knowledge Oriented Graph Unified Transformer

KOGUT implements the Relational Graph Transformer (RelGT) architecture for knowledge graph link prediction in biological domains, with a primary focus on microbial growth media prediction. While the original RelGT (arXiv:2505.10960) targets relational tables, time series, and multi-table databases, KOGUT adapts this architecture for heterogeneous biological knowledge graphs, providing first-in-class AI predictive models for microbial cultivation.

Key Adaptations Beyond Original RelGT:
- Knowledge Graph Focus: Applied to biological KGs with semantic node types (taxa, chemicals, media, phenotypes, environments) rather than generic relational database tables; trained on the KG-Microbe knowledge graph (1.3M entities, 2.9M edges, 24 relation types).
- Multimodal Node Encoding: Integrates node labels, categories, descriptions, and synonyms from KG metadata through learned embedding layers, adapting relational column features to graph node attributes with textual semantics.
- Extended K-Hop Subgraph Strategy: Optimized neighborhood sampling (3-hop default, configurable up to 200 nodes) tuned for sparse biological networks, building on the original local-global attention framework with biological relation preservation.
- Biolink Predicate Preservation: Type-specific transformations for 24 biological edge semantics (occurs_in, consumes, produces, has_phenotype, subclass_of) beyond standard relational foreign keys, enabling multi-relation link prediction.
- Inductive Learning Support: Enables zero-shot predictions for novel taxa through feature-based embeddings (temperature, oxygen requirements, gram stain, cell shape), extending the original transductive relational benchmark scope to uncultured microorganisms.

CheapSOTA Performance Optimizations (This Distribution):
- VQ-EMA Centroid Attention: Vector quantization with exponential moving average for improved global context modeling (+5-10% MRR improvement).
- HDF5 Precomputed Data Loading: One-time preprocessing of k-hop subgraphs to eliminate redundant graph traversals (2-5× training speedup).
- Distributed Data Parallel Training: Multi-GPU support for scaling to larger knowledge graphs (tested on 4× NVIDIA A100 GPUs at NERSC Perlmutter).
- Mixed Precision Training: Automatic mixed precision (AMP) for memory efficiency and faster training.

Advantages Over Standard Knowledge Graph Embedding Models: Combines RelGT's proven multi-element tokenization (features, type, hop, structure) with graph-native biological representations, enabling interpretable link prediction across heterogeneous entities that standard embedding models (TransE, RotatE, ComplEx) and table-based transformers cannot directly model. Achieves near-perfect performance on microbial growth media prediction (MRR: 0.9966, Precision@1: 0.9932, Hit@10: 1.0000) while maintaining explainability through attention-based reasoning over biological pathways.

Training Data:
- KG-Microbe merged knowledge graph: 1,379,337 nodes, 2,960,472 edges
- 24 biological relation types, including taxonomic hierarchies, metabolic interactions, phenotype associations, and environmental relationships
- Primary prediction task: growth media suitability for microbial taxa (biolink:occurs_in, 50K edges)
- Multi-relation capability: predicts links for any of the 24 relation types, including chemical consumption/production, phenotype associations, and taxonomic classification

Citation:
- Original RelGT architecture: Dwivedi et al., "Relational Graph Transformer", arXiv:2505.10960, 2025
- KOGUT implementation: Knowledge Oriented Graph Unified Transformer for Microbial Growth Media Prediction; developed at Lawrence Berkeley National Laboratory (LBNL); trained on the NERSC Perlmutter supercomputer
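The extended k-hop subgraph strategy described in the abstract (3-hop default, capped at 200 nodes) can be sketched as a breadth-first expansion with a node budget. This is a minimal illustration under stated assumptions, not KOGUT's actual implementation; the function name, parameters, and toy node identifiers are hypothetical.

```python
from collections import deque

def sample_khop_subgraph(adjacency, seed, k=3, max_nodes=200):
    """Breadth-first k-hop neighborhood sampling with a node budget.

    adjacency: dict mapping node -> iterable of neighbor nodes
    seed: starting node for the subgraph
    k: maximum hop distance (KOGUT's documented default is 3)
    max_nodes: cap on subgraph size (configurable up to 200 in KOGUT)
    Returns a dict of node -> hop distance from the seed.
    """
    visited = {seed: 0}            # node -> hop distance
    queue = deque([seed])
    while queue and len(visited) < max_nodes:
        node = queue.popleft()
        hop = visited[node]
        if hop == k:
            continue               # do not expand beyond k hops
        for nbr in adjacency.get(node, ()):
            if nbr not in visited and len(visited) < max_nodes:
                visited[nbr] = hop + 1
                queue.append(nbr)
    return visited

# Toy heterogeneous graph: a taxon linked to a medium and another taxon.
adj = {"taxon:1": ["medium:A", "taxon:2"],
       "taxon:2": ["medium:B"],
       "medium:B": ["chem:X"]}
sub = sample_khop_subgraph(adj, "taxon:1", k=2)
```

With k=2, the chemical three hops away is excluded while both media and the neighboring taxon are retained; the node cap similarly truncates dense neighborhoods, which matters for hub nodes in sparse biological networks.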
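The VQ-EMA centroid attention listed under the performance optimizations relies on the standard exponential-moving-average codebook update from the vector-quantization literature. The sketch below shows one such update step in NumPy, assuming squared-Euclidean assignment; it illustrates the general technique, not KOGUT's specific code, and all names are hypothetical.

```python
import numpy as np

def vq_ema_update(codebook, counts, sums, tokens, gamma=0.99, eps=1e-5):
    """One EMA update step for a vector-quantization codebook.

    codebook: (K, d) centroid vectors
    counts, sums: EMA accumulators, shapes (K,) and (K, d)
    tokens: (N, d) batch of token embeddings to quantize
    Returns updated (codebook, counts, sums) and the assignments.
    """
    # Assign each token to its nearest centroid (squared Euclidean).
    d2 = ((tokens[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)

    # Per-centroid batch statistics: member counts and vector sums.
    K = codebook.shape[0]
    n = np.bincount(assign, minlength=K).astype(float)
    s = np.zeros_like(codebook)
    np.add.at(s, assign, tokens)

    # Exponential moving average of counts and sums.
    counts = gamma * counts + (1 - gamma) * n
    sums = gamma * sums + (1 - gamma) * s

    # New centroids; eps guards against division by zero for empty codes.
    codebook = sums / (counts[:, None] + eps)
    return codebook, counts, sums, assign

# Two well-separated centroids; one token near each.
cb = np.array([[0.0, 0.0], [10.0, 10.0]])
tokens = np.array([[0.1, -0.1], [9.9, 10.2]])
cb2, counts, sums, assign = vq_ema_update(cb, np.ones(2), cb.copy(), tokens)
```

Because the codebook is updated by moving averages rather than gradients, the centroids that serve as global attention tokens stay stable across minibatches, which is the usual motivation for EMA over direct gradient updates.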
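The reported metrics (MRR, Precision@1, Hit@10) are standard link-prediction ranking measures. For readers unfamiliar with them, the following sketch computes all three from the 1-based rank of the true entity among scored candidates for each test triple; the function name and example ranks are illustrative only.

```python
def link_prediction_metrics(ranks, k=10):
    """Compute MRR, Precision@1, and Hit@k from 1-based ranks of the
    true entity among all scored candidates (one rank per test triple)."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n          # mean reciprocal rank
    p_at_1 = sum(r == 1 for r in ranks) / n        # fraction ranked first
    hit_at_k = sum(r <= k for r in ranks) / n      # fraction in top k
    return {"MRR": mrr, "Precision@1": p_at_1, f"Hit@{k}": hit_at_k}

# Example: three of four test edges ranked first, one ranked fourth.
m = link_prediction_metrics([1, 1, 4, 1])
# MRR = (1 + 1 + 0.25 + 1) / 4 = 0.8125
```

An MRR of 0.9966 with Precision@1 of 0.9932, as reported above, means the correct growth medium is ranked first for almost every test taxon, and Hit@10 of 1.0 means it always appears in the top ten candidates.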
Developers:
Joachimiak, Marcin [1]
  1. Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Release Date:
2025-12-18
Project Type:
Closed Source
Software Type:
Scientific
Sponsoring Org.:
Code ID:
175162
Site Accession Number:
2026-023
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Country of Origin:
United States


Citation Formats

Joachimiak, Marcin. Knowledge Oriented Graph Unified Transformer (KOGUT) v0.1. Computer Software. USDOE. 18 Dec. 2025. Web. doi:10.11578/dc.20260210.3.
Joachimiak, Marcin. (2025, December 18). Knowledge Oriented Graph Unified Transformer (KOGUT) v0.1. [Computer software]. https://doi.org/10.11578/dc.20260210.3.
Joachimiak, Marcin. "Knowledge Oriented Graph Unified Transformer (KOGUT) v0.1." Computer software. December 18, 2025. https://doi.org/10.11578/dc.20260210.3.
@misc{ doecode_175162,
title = {Knowledge Oriented Graph Unified Transformer (KOGUT) v0.1},
author = {Joachimiak, Marcin},
abstractNote = {KOGUT — Knowledge Oriented Graph Unified Transformer KOGUT implements the Relational Graph Transformer (RelGT) architecture for knowledge graph link prediction in biological domains, with a primary focus on microbial growth media prediction. While the original RelGT (arXiv:2505.10960) targets relational tables, time series, and multi-table databases, KOGUT adapts this architecture for heterogeneous biological knowledge graphs, providing first-in-class AI predictive models for microbial cultivation. Key Adaptations Beyond Original RelGT: - Knowledge Graph Focus: Applied to biological KGs with semantic node types (taxa, chemicals, media, phenotypes, environments) versus generic relational database tables, trained on the KG-Microbe knowledge graph (1.3M entities, 2.9M edges, 24 relation types). - Multimodal Node Encoding: Integrates node labels, categories, descriptions, and synonyms from KG metadata through learned embedding layers—adapting relational column features to graph node attributes with textual semantics. - Extended K-Hop Subgraph Strategy: Optimized neighborhood sampling (3-hop default, configurable up to 200 nodes) tuned for sparse biological networks, building on the original local-global attention framework with biological relation preservation. - Biolink Predicate Preservation: Type-specific transformations for 24 biological edge semantics (occurs_in, consumes, produces, has_phenotype, subclass_of) beyond standard relational foreign keys, enabling multi-relation link prediction. - Inductive Learning Support: Enables zero-shot predictions for novel taxa through feature-based embeddings (temperature, oxygen requirements, gram stain, cell shape), extending the original transductive relational benchmark scope to uncultured microorganisms. CheapSOTA Performance Optimizations (This Distribution): - VQ-EMA Centroid Attention: Vector quantization with exponential moving average for improved global context modeling (+5-10% MRR improvement). 
- HDF5 Precomputed Data Loading: One-time preprocessing of k-hop subgraphs to eliminate redundant graph traversals (2-5× training speedup). - Distributed Data Parallel Training: Multi-GPU support for scaling to larger knowledge graphs (tested on 4× NVIDIA A100 GPUs at NERSC Perlmutter). - Mixed Precision Training: Automatic mixed precision (AMP) for memory efficiency and faster training. Advantages Over Standard Knowledge Graph Embedding Models: Combines RelGT's proven multi-element tokenization (features, type, hop, structure) with graph-native biological representations, enabling interpretable link prediction across heterogeneous entities that standard embedding models (TransE, RotatE, ComplEx) and table-based transformers cannot directly model. Achieves near-perfect performance on microbial growth media prediction (MRR: 0.9966, Precision@1: 0.9932, Hit@10: 1.0000) while maintaining explainability through attention-based reasoning over biological pathways. Training Data: - KG-Microbe merged knowledge graph: 1,379,337 nodes, 2,960,472 edges - 24 biological relation types including taxonomic hierarchies, metabolic interactions, phenotype associations, and environmental relationships - Primary prediction task: Growth media suitability for microbial taxa (biolink:occurs_in, 50K edges) - Multi-relation capability: Predicts links for any of the 24 relation types, including chemical consumption/production, phenotype associations, and taxonomic classification Citation: Original RelGT Architecture: Dwivedi et al., "Relational Graph Transformer", arXiv:2505.10960, 2025 KOGUT Implementation: Knowledge Oriented Graph Unified Transformer for Microbial Growth Media Prediction Developed at Lawrence Berkeley National Laboratory (LBNL) Trained on NERSC Perlmutter supercomputer},
doi = {10.11578/dc.20260210.3},
url = {https://doi.org/10.11578/dc.20260210.3},
howpublished = {[Computer Software] \url{https://doi.org/10.11578/dc.20260210.3}},
year = {2025},
month = {dec}
}