{"metadata":{"code_id":175162,"site_ownership_code":"LBNL","open_source":false,"landing_contact":"ipo@lbl.gov","project_type":"CS","software_type":"S","official_use_only":{},"developers":[{"email":"MJoachimiak@lbl.gov","orcid":"","first_name":"Marcin","last_name":"Joachimiak","middle_name":"","affiliations":["Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)"]}],"contributors":[],"sponsoring_organizations":[{"organization_name":"USDOE","funding_identifiers":[],"primary_award":"AC02-05CH11231","DOE":true}],"contributing_organizations":[],"research_organizations":[{"organization_name":"Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)","DOE":true}],"related_identifiers":[],"award_dois":[],"release_date":"2025-12-18","software_title":"Knowledge Oriented Graph Unified Transformer (KOGUT) v0.1","acronym":"KOGUT v0.1","doi":"https://doi.org/10.11578/dc.20260210.3","description":"KOGUT — Knowledge Oriented Graph Unified Transformer\n\nKOGUT implements the Relational Graph Transformer (RelGT) architecture for knowledge graph link prediction in biological domains, with a primary focus on microbial growth media prediction. While the original RelGT (arXiv:2505.10960) targets relational tables, time series, and multi-table databases, KOGUT adapts the architecture to heterogeneous biological knowledge graphs, providing first-in-class AI predictive models for microbial cultivation.\n\nKey Adaptations Beyond the Original RelGT:\n- Knowledge Graph Focus: Applied to biological KGs with semantic node types (taxa, chemicals, media, phenotypes, environments) rather than generic relational database tables; trained on the KG-Microbe knowledge graph (1.3M entities, 2.9M edges, 24 relation types).\n- Multimodal Node Encoding: Integrates node labels, categories, descriptions, and synonyms from KG metadata through learned embedding layers, adapting relational column features to graph node attributes with textual semantics.\n- Extended K-Hop Subgraph Strategy: Optimized neighborhood sampling (3-hop default, configurable up to 200 nodes) tuned for sparse biological networks, building on the original local-global attention framework while preserving biological relations.\n- Biolink Predicate Preservation: Type-specific transformations for 24 biological edge semantics (occurs_in, consumes, produces, has_phenotype, subclass_of) beyond standard relational foreign keys, enabling multi-relation link prediction.\n- Inductive Learning Support: Enables zero-shot predictions for novel taxa through feature-based embeddings (temperature, oxygen requirements, gram stain, cell shape), extending the original transductive relational benchmark scope to uncultured microorganisms.\n\nCheapSOTA Performance Optimizations (This Distribution):\n- VQ-EMA Centroid Attention: Vector quantization with exponential moving average for improved global context modeling (+5-10% MRR improvement).\n- HDF5 Precomputed Data Loading: One-time preprocessing of k-hop subgraphs to eliminate redundant graph traversals (2-5× training speedup).\n- Distributed Data Parallel Training: Multi-GPU support for scaling to larger knowledge graphs (tested on 4× NVIDIA A100 GPUs at NERSC Perlmutter).\n- Mixed Precision Training: Automatic mixed precision (AMP) for memory efficiency and faster training.\n\nAdvantages Over Standard Knowledge Graph Embedding Models:\nCombines RelGT's proven multi-element tokenization (features, type, hop, structure) with graph-native biological representations, enabling interpretable link prediction across heterogeneous entities that standard embedding models (TransE, RotatE, ComplEx) and table-based transformers cannot directly model. Achieves near-perfect performance on microbial growth media prediction (MRR: 0.9966, Precision@1: 0.9932, Hit@10: 1.0000) while maintaining explainability through attention-based reasoning over biological pathways.\n\nTraining Data:\n- KG-Microbe merged knowledge graph: 1,379,337 nodes, 2,960,472 edges\n- 24 biological relation types, including taxonomic hierarchies, metabolic interactions, phenotype associations, and environmental relationships\n- Primary prediction task: Growth media suitability for microbial taxa (biolink:occurs_in, 50K edges)\n- Multi-relation capability: Predicts links for any of the 24 relation types, including chemical consumption/production, phenotype associations, and taxonomic classification\n\nCitation:\nOriginal RelGT Architecture:\n  Dwivedi et al., \"Relational Graph Transformer\", arXiv:2505.10960, 2025\n\nKOGUT Implementation:\n  Knowledge Oriented Graph Unified Transformer for Microbial Growth Media Prediction\n  Developed at Lawrence Berkeley National Laboratory (LBNL)\n  Trained on the NERSC Perlmutter supercomputer","programming_languages":[],"country_of_origin":"United States","project_keywords":[],"licenses":[],"recipient_org":"LBNL","site_accession_number":"2026-023","file_name":"KOGUT.tar.gz","date_record_added":"2026-02-10","date_record_updated":"2026-02-10","is_file_certified":true,"last_editor":"agithire@lbl.gov","is_limited":false,"links":[{"rel":"citation","href":"https://www.osti.gov/doecode/biblio/175162"}]}}