Strategies for community-sourced biocuration in bioinformatics: a case study on MIBiG 4.0

Blin, Kai; Loureiro, Catarina; Louwen, Nico L. L.; Navarro-Muñoz, Jorge C.; Gerstmans, Hans; Robinson, Serina L.; Rutz, Adriano; Reitz, Zachary L.; Doering, Drew T.; van der Hooft, Justin J. J.; Weber, Tilmann; Medema, Marnix H.; Zdouc, Mitja M.

doi:10.1093/bib/bbaf659

Journal Article · December 11, 2025 · Briefings in Bioinformatics

DOI: https://doi.org/10.1093/bib/bbaf659 · OSTI ID: 3013841

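Blin, Kai ^[1]; Loureiro, Catarina ^[2]; Louwen, Nico L. L. ^[2]; Navarro-Muñoz, Jorge C. ^[2]; Gerstmans, Hans ^[3]; Robinson, Serina L. ^[4]; Rutz, Adriano ^[5]; Reitz, Zachary L. ^[6]; Doering, Drew T. ^[7]; van der Hooft, Justin J. J. ^[8]; Weber, Tilmann ^[1]; Medema, Marnix H. ^[2]; Zdouc, Mitja M. ^[2]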

[1] Technical Univ. of Denmark, Lyngby (Denmark)
[2] Wageningen Univ. & Research (Netherlands)
[3] Flanders Institute for Biotechnology (VIB), Leuven (Belgium); Katholieke Univ. Leuven, Heverlee (Belgium)
[4] Swiss Federal Institute of Aquatic Science and Technology, Duebendorf (Switzerland)
[5] Eidgenoessische Technische Hochschule (ETH), Zurich (Switzerland)
[6] Univ. of California, Santa Barbara, CA (United States)
[7] Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); USDOE Joint Genome Institute (JGI), Berkeley, CA (United States)
[8] Wageningen Univ. & Research (Netherlands); Univ. of Johannesburg (South Africa)

Biocuration is essential to transform molecular sequence data into standardized, machine-readable resources. Such curated datasets enable comparative analysis, predictive modeling, and data integration across bioinformatics platforms. While professional biocuration is resource-intensive and usually limited to institutional settings, community-driven approaches can mobilize large-scale annotation of specialized datasets and are more resilient to disruptions in scientific funding. Here, we present a model for community-powered curation applied to the Minimum Information about a Biosynthetic Gene Cluster (MIBiG) repository. Through a framework of workflows for metadata capture, annotation validation, and contributor coordination, the MIBiG 4.0 initiative recruited 267 scientists across 178 institutions from 33 countries, volunteering an estimated 4000 h of work. These efforts expanded the MIBiG repository by 22% and enhanced its usability in downstream molecular data analyses in comparative genomic analyses, natural product discovery, and machine learning applications. We provide strategies and actionable lessons for adopting this model, supporting the sustainability of curated bioinformatics resources central to nucleic acid research and related fields.

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: USDOE Office of Science (SC), Basic Energy Sciences (BES). Scientific User Facilities (SUF)

Grant/Contract Number: AC02-05CH11231

OSTI ID: 3013841

Alternate ID(s): OSTI ID: 3013965

Journal Information: Briefings in Bioinformatics, Vol. 26, Issue 6; ISSN 1467-5463; ISSN 1477-4054

Publisher: Oxford University Press

Country of Publication: United States

Language: English
