---
code_id: 175162
site_ownership_code: "LBNL"
open_source: false
landing_contact: "ipo@lbl.gov"
project_type: "CS"
software_type: "S"
official_use_only: {}
developers:
- email: "MJoachimiak@lbl.gov"
  orcid: ""
  first_name: "Marcin"
  last_name: "Joachimiak"
  middle_name: ""
  affiliations:
  - "Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)"
contributors: []
sponsoring_organizations:
- organization_name: "USDOE"
  funding_identifiers: []
  primary_award: "AC02-05CH11231"
  DOE: true
contributing_organizations: []
research_organizations:
- organization_name: "Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United\
    \ States)"
  DOE: true
related_identifiers: []
award_dois: []
release_date: "2025-12-18"
software_title: "Knowledge Oriented Graph Unified Transformer (KOGUT) v0.1"
acronym: "KOGUT v0.1"
doi: "https://doi.org/10.11578/dc.20260210.3"
description: "KOGUT — Knowledge Oriented Graph Unified Transformer\n\nKOGUT implements\
  \ the Relational Graph Transformer (RelGT) architecture for knowledge graph link\
  \ prediction in biological domains, with a primary focus on microbial growth media\
  \ prediction. While the original RelGT (arXiv:2505.10960) targets relational tables,\
  \ time series, and multi-table databases, KOGUT adapts this architecture for heterogeneous\
  \ biological knowledge graphs, providing first-in-class AI predictive models for\
  \ microbial cultivation.\n\nKey Adaptations Beyond Original RelGT:\n- Knowledge\
  \ Graph Focus: Applied to biological KGs with semantic node types (taxa, chemicals,\
  \ media, phenotypes, environments) versus generic relational database tables; the\
  \ model is trained on the KG-Microbe knowledge graph (1.3M entities, 2.9M edges,\
  \ 24 relation types).\n\
  - Multimodal Node Encoding: Integrates node labels, categories, descriptions, and\
  \ synonyms from KG metadata through learned embedding layers—adapting relational\
  \ column features to graph node attributes with textual semantics.\n- Extended K-Hop\
  \ Subgraph Strategy: Optimized neighborhood sampling (3-hop default, configurable\
  \ up to 200 nodes) tuned for sparse biological networks, building on the original\
  \ local-global attention framework with biological relation preservation.\n- Biolink\
  \ Predicate Preservation: Type-specific transformations for 24 biological edge semantics\
  \ (occurs_in, consumes, produces, has_phenotype, subclass_of) beyond standard relational\
  \ foreign keys, enabling multi-relation link prediction.\n- Inductive Learning Support:\
  \ Enables zero-shot predictions for novel taxa through feature-based embeddings\
  \ (temperature, oxygen requirements, gram stain, cell shape), extending the original\
  \ transductive relational benchmark scope to uncultured microorganisms.\n\nCheapSOTA\
  \ Performance Optimizations (This Distribution):\n- VQ-EMA Centroid Attention: Vector\
  \ quantization with exponential moving average for improved global context modeling\
  \ (+5-10% MRR improvement).\n- HDF5 Precomputed Data Loading: One-time preprocessing\
  \ of k-hop subgraphs to eliminate redundant graph traversals (2-5× training speedup).\n\
  - Distributed Data Parallel Training: Multi-GPU support for scaling to larger knowledge\
  \ graphs (tested on 4× NVIDIA A100 GPUs at NERSC Perlmutter).\n- Mixed Precision\
  \ Training: Automatic mixed precision (AMP) for memory efficiency and faster training.\n\
  \nAdvantages Over Standard Knowledge Graph Embedding Models:\nCombines RelGT's proven\
  \ multi-element tokenization (features, type, hop, structure) with graph-native\
  \ biological representations, enabling interpretable link prediction across heterogeneous\
  \ entities that standard embedding models (TransE, RotatE, ComplEx) and table-based\
  \ transformers cannot directly model. Achieves near-perfect performance on microbial\
  \ growth media prediction (MRR: 0.9966, Precision@1: 0.9932, Hit@10: 1.0000) while\
  \ maintaining explainability through attention-based reasoning over biological pathways.\n\
  \nTraining Data:\n- KG-Microbe merged knowledge graph: 1,379,337 nodes, 2,960,472\
  \ edges\n- 24 biological relation types including taxonomic hierarchies, metabolic\
  \ interactions, phenotype associations, and environmental relationships\n- Primary\
  \ prediction task: Growth media suitability for microbial taxa (biolink:occurs_in,\
  \ 50K edges)\n- Multi-relation capability: Predicts links for any of the 24 relation\
  \ types, including chemical consumption/production, phenotype associations, and\
  \ taxonomic classification\n\nCitation:\nOriginal RelGT Architecture:\n  Dwivedi\
  \ et al., \"Relational Graph Transformer\", arXiv:2505.10960, 2025\n\nKOGUT Implementation:\n\
  \  Knowledge Oriented Graph Unified Transformer for Microbial Growth Media Prediction\n\
  \  Developed at Lawrence Berkeley National Laboratory (LBNL)\n  Trained on NERSC\
  \ Perlmutter supercomputer"
programming_languages: []
country_of_origin: "United States"
project_keywords: []
licenses: []
recipient_org: "LBNL"
site_accession_number: "2026-023"
file_name: "KOGUT.tar.gz"
date_record_added: "2026-02-10"
date_record_updated: "2026-02-10"
is_file_certified: true
last_editor: "agithire@lbl.gov"
is_limited: false
links:
- rel: "citation"
  href: "https://www.osti.gov/doecode/biblio/175162"
