<metadata>
  <codeId>175162</codeId>
  <siteOwnershipCode>LBNL</siteOwnershipCode>
  <openSource>false</openSource>
  <landingContact>ipo@lbl.gov</landingContact>
  <projectType>CS</projectType>
  <softwareType>S</softwareType>
  <officialUseOnly/>
  <releaseDate>2025-12-18</releaseDate>
  <softwareTitle>Knowledge Oriented Graph Unified Transformer (KOGUT) v0.1</softwareTitle>
  <acronym>KOGUT v0.1</acronym>
  <doi>https://doi.org/10.11578/dc.20260210.3</doi>
  <description>KOGUT — Knowledge Oriented Graph Unified Transformer

KOGUT implements the Relational Graph Transformer (RelGT) architecture for knowledge graph link prediction in biological domains, with a primary focus on microbial growth media prediction. While the original RelGT (arXiv:2505.10960) targets relational tables, time series, and multi-table databases, KOGUT adapts the architecture to heterogeneous biological knowledge graphs, providing first-in-class predictive AI models for microbial cultivation.

Key Adaptations Beyond Original RelGT:
- Knowledge Graph Focus: Applied to biological KGs with semantic node types (taxa, chemicals, media, phenotypes, environments) versus generic relational database tables, trained on the KG-Microbe knowledge graph (1.3M entities, 2.9M edges, 24 relation types).
- Multimodal Node Encoding: Integrates node labels, categories, descriptions, and synonyms from KG metadata through learned embedding layers, adapting relational column features to graph node attributes with textual semantics.
- Extended K-Hop Subgraph Strategy: Optimized neighborhood sampling (3-hop default, configurable up to 200 nodes) tuned for sparse biological networks, building on the original local-global attention framework with biological relation preservation.
- Biolink Predicate Preservation: Type-specific transformations for 24 biological edge semantics (occurs_in, consumes, produces, has_phenotype, subclass_of) beyond standard relational foreign keys, enabling multi-relation link prediction.
- Inductive Learning Support: Enables zero-shot predictions for novel taxa through feature-based embeddings (temperature, oxygen requirements, gram stain, cell shape), extending the original transductive relational benchmark scope to uncultured microorganisms.
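
The extended k-hop subgraph strategy above can be sketched as a hop-limited breadth-first search with a hard node budget. The 3-hop default and 200-node cap follow this description; the adjacency-dict graph representation and the function name are illustrative assumptions, not the distribution's actual API.

```python
from collections import deque

def sample_khop_subgraph(adj, seed, max_hops=3, max_nodes=200):
    """Hop-limited BFS around a seed node with a hard node budget.

    adj: dict mapping a node id to an iterable of neighbor ids.
    Returns the set of sampled node ids, including the seed.
    """
    visited = {seed}
    queue = deque([(seed, 0)])          # (node, hop distance from seed)
    while queue:
        node, hop = queue.popleft()
        if hop >= max_hops:             # do not expand beyond the hop limit
            continue
        for nbr in adj.get(node, ()):
            if nbr in visited:
                continue
            if len(visited) >= max_nodes:   # respect the node budget
                return visited
            visited.add(nbr)
            queue.append((nbr, hop + 1))
    return visited
```

On a sparse biological graph this keeps each training token's neighborhood small and bounded, which is what makes the one-time HDF5 precomputation described below practical.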

CheapSOTA Performance Optimizations (This Distribution):
- VQ-EMA Centroid Attention: Vector quantization with exponential moving average for improved global context modeling (+5-10% MRR improvement).
- HDF5 Precomputed Data Loading: One-time preprocessing of k-hop subgraphs to eliminate redundant graph traversals (2-5× training speedup).
- Distributed Data Parallel Training: Multi-GPU support for scaling to larger knowledge graphs (tested on 4× NVIDIA A100 GPUs at NERSC Perlmutter).
- Mixed Precision Training: Automatic mixed precision (AMP) for memory efficiency and faster training.
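
The VQ-EMA centroid update listed above can be sketched in a few lines of NumPy: embeddings are hard-assigned to their nearest centroid, and cluster sizes and sums are tracked as exponential moving averages. The decay value, codebook shapes, and function name are illustrative assumptions, not the distribution's actual hyperparameters.

```python
import numpy as np

def vq_ema_update(codebook, counts, sums, z, decay=0.99, eps=1e-5):
    """One EMA update of a vector-quantization codebook.

    codebook: (K, D) centroid matrix; counts: (K,) EMA cluster sizes;
    sums: (K, D) EMA cluster sums; z: (N, D) batch of token embeddings.
    Returns the hard assignments and the updated codebook state.
    """
    # Assign each embedding to its nearest centroid (squared L2 distance).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (N, K)
    assign = d2.argmin(1)                                        # (N,)
    onehot = np.eye(codebook.shape[0])[assign]                   # (N, K)

    # Exponential moving averages of cluster size and cluster sum.
    counts = decay * counts + (1.0 - decay) * onehot.sum(0)
    sums = decay * sums + (1.0 - decay) * onehot.T @ z

    # Updated centroids are smoothed cluster means (eps avoids divide-by-zero).
    codebook = sums / (counts[:, None] + eps)
    return assign, codebook, counts, sums
```

Because the codebook is updated by moving averages rather than gradients, the centroids used for global context stay stable across mini-batches, which is the usual motivation for EMA-style vector quantization.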

Advantages Over Standard Knowledge Graph Embedding Models:
KOGUT combines RelGT's proven multi-element tokenization (features, type, hop, structure) with graph-native biological representations, enabling interpretable link prediction across heterogeneous entities that standard embedding models (TransE, RotatE, ComplEx) and table-based transformers cannot directly model. It achieves near-perfect performance on microbial growth media prediction (MRR: 0.9966, Precision@1: 0.9932, Hits@10: 1.0000) while maintaining explainability through attention-based reasoning over biological pathways.
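
The reported figures follow the standard definitions of these ranking metrics, computed from the 1-based rank of the true entity for each test query; the example rank list below is purely illustrative.

```python
def ranking_metrics(ranks):
    """Standard link-prediction metrics from 1-based ranks of true entities.

    MRR is the mean reciprocal rank; Precision@1 is the fraction of queries
    where the true entity ranks first; Hits@10 is the fraction ranked in the
    top ten.
    """
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    p_at_1 = sum(1 for r in ranks if r == 1) / n
    hits_10 = sum(1 for r in ranks if r in range(1, 11)) / n
    return mrr, p_at_1, hits_10
```

For example, ranks of [1, 1, 1, 2] over four queries yield MRR 0.875, Precision@1 0.75, and Hits@10 1.0.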

Training Data:
- KG-Microbe merged knowledge graph: 1,379,337 nodes, 2,960,472 edges
- 24 biological relation types including taxonomic hierarchies, metabolic interactions, phenotype associations, and environmental relationships
- Primary prediction task: Growth media suitability for microbial taxa (biolink:occurs_in, 50K edges)
- Multi-relation capability: Predicts links for any of the 24 relation types, including chemical consumption/production, phenotype associations, and taxonomic classification

Citation:
Original RelGT Architecture:
  Dwivedi et al., "Relational Graph Transformer", arXiv:2505.10960, 2025

KOGUT Implementation:
  Knowledge Oriented Graph Unified Transformer for Microbial Growth Media Prediction
  Developed at Lawrence Berkeley National Laboratory (LBNL)
  Trained on NERSC Perlmutter supercomputer</description>
  <countryOfOrigin>United States</countryOfOrigin>
  <recipientOrg>LBNL</recipientOrg>
  <siteAccessionNumber>2026-023</siteAccessionNumber>
  <fileName>KOGUT.tar.gz</fileName>
  <dateRecordAdded>2026-02-10</dateRecordAdded>
  <dateRecordUpdated>2026-02-10</dateRecordUpdated>
  <isFileCertified>true</isFileCertified>
  <lastEditor>agithire@lbl.gov</lastEditor>
  <isLimited>false</isLimited>
  <developers>
    <developer>
      <email>MJoachimiak@lbl.gov</email>
      <orcid></orcid>
      <firstName>Marcin</firstName>
      <lastName>Joachimiak</lastName>
      <middleName></middleName>
      <affiliations>
        <affiliation>Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)</affiliation>
      </affiliations>
    </developer>
  </developers>
  <contributors/>
  <sponsoringOrganizations>
    <sponsoringOrganization>
      <organizationName>USDOE</organizationName>
      <primaryAward>AC02-05CH11231</primaryAward>
      <DOE>true</DOE>
      <fundingIdentifiers/>
    </sponsoringOrganization>
  </sponsoringOrganizations>
  <contributingOrganizations/>
  <researchOrganizations>
    <researchOrganization>
      <organizationName>Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)</organizationName>
      <DOE>true</DOE>
    </researchOrganization>
  </researchOrganizations>
  <relatedIdentifiers/>
  <awardDois/>
  <programmingLanguages/>
  <projectKeywords/>
  <licenses/>
  <links>
    <link rel="citation" href="https://www.osti.gov/doecode/biblio/175162"/>
  </links>
</metadata>
