U.S. Department of Energy
Office of Scientific and Technical Information

Strategies for community-sourced biocuration in bioinformatics: a case study on MIBiG 4.0

Journal Article · Briefings in Bioinformatics
DOI: https://doi.org/10.1093/bib/bbaf659 · OSTI ID: 3013841
 [1];  [2];  [2];  [2];  [3];  [4];  [5];  [6];  [7];  [8];  [1];  [2];  [2]
  1. Technical Univ. of Denmark, Lyngby (Denmark)
  2. Wageningen Univ. & Research (Netherlands)
  3. Flanders Institute for Biotechnology (VIB), Leuven (Belgium); Katholieke Univ. Leuven, Heverlee (Belgium)
  4. Swiss Federal Institute of Aquatic Science and Technology, Duebendorf (Switzerland)
  5. Eidgenoessische Technische Hochschule (ETH), Zurich (Switzerland)
  6. Univ. of California, Santa Barbara, CA (United States)
  7. Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); USDOE Joint Genome Institute (JGI), Berkeley, CA (United States)
  8. Wageningen Univ. & Research (Netherlands); Univ. of Johannesburg (South Africa)
Biocuration is essential to transform molecular sequence data into standardized, machine-readable resources. Such curated datasets enable comparative analysis, predictive modeling, and data integration across bioinformatics platforms. While professional biocuration is resource-intensive and usually limited to institutional settings, community-driven approaches can mobilize large-scale annotation of specialized datasets and are more resilient to disruptions in scientific funding. Here, we present a model for community-powered curation applied to the Minimum Information about a Biosynthetic Gene Cluster (MIBiG) repository. Through a framework of workflows for metadata capture, annotation validation, and contributor coordination, the MIBiG 4.0 initiative recruited 267 scientists across 178 institutions from 33 countries, volunteering an estimated 4000 h of work. These efforts expanded the MIBiG repository by 22% and enhanced its usability in downstream molecular data analyses, including comparative genomics, natural product discovery, and machine learning applications. We provide strategies and actionable lessons for adopting this model, supporting the sustainability of curated bioinformatics resources central to nucleic acid research and related fields.
Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Basic Energy Sciences (BES). Scientific User Facilities (SUF)
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
3013841
Alternate ID(s):
OSTI ID: 3013965
Journal Information:
Briefings in Bioinformatics, Vol. 26, Issue 6; ISSN 1467-5463; ISSN 1477-4054
Publisher:
Oxford University Press
Country of Publication:
United States
Language:
English

Similar Records

MIBiG 4.0: advancing biosynthetic gene cluster curation through global collaboration
Journal Article · Sun Dec 08 19:00:00 EST 2024 · Nucleic Acids Research · OSTI ID:2481018