DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A large-scale analysis of bioinformatics code on GitHub

Journal Article · PLoS ONE
[1]; [1]; [2]; [3]; [1]; [4]
  1. Colorado School of Public Health, Aurora, CO (United States)
  2. National Renewable Energy Lab. (NREL), Golden, CO (United States)
  3. Univ. of Colorado Anschutz Medical Campus, Aurora, CO (United States)
  4. Emory Univ., Atlanta, GA (United States)

In recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around reproducibility of results and usability of software. However, the actual state of the body of bioinformatics software remains largely unknown. Here, we investigate the state of source code in the bioinformatics community, specifically looking at relationships between code properties, development activity, developer communities, and software impact. To investigate these issues, we curated a list of 1,720 bioinformatics repositories on GitHub through their mention in peer-reviewed bioinformatics articles. Additionally, we included 23 high-profile repositories identified by their popularity in an online bioinformatics forum. We analyzed repository metadata, source code, development activity, and team dynamics using data made available publicly through the GitHub API, as well as article metadata. We found key relationships within our dataset, including: certain scientific topics are associated with more active code development and higher community interest in the repository; most of the code in the main dataset is written in dynamically typed languages, while most of the code in the high-profile set is statically typed; developer team size is associated with community engagement, and high-profile repositories have larger teams; the proportion of female contributors decreases for high-profile repositories and with seniority level in author lists; and multiple measures of project impact are associated with the simple variable of whether the code was modified at all after paper publication. In addition to providing what is, to our knowledge, the first large-scale analysis of bioinformatics code, our work will enable future analysis through publicly available data, code, and methods.

Research Organization:
National Renewable Energy Laboratory (NREL), Golden, CO (United States)
Sponsoring Organization:
USDOE Office of Energy Efficiency and Renewable Energy (EERE)
Grant/Contract Number:
AC36-08GO28308
OSTI ID:
1483062
Report Number(s):
NREL/JA--2C00-72836
Journal Information:
PLoS ONE, Vol. 13, Issue 10; ISSN 1932-6203
Publisher:
Public Library of Science
Country of Publication:
United States
Language:
English

Cited By (10)

Bionitio: demonstrating and facilitating best practices for bioinformatics command-line software journal September 2019
Biobtree: A tool to search and map bioinformatics identifiers and special keywords journal January 2019
GitHub Statistics as a Measure of the Impact of Open-Source Bioinformatics Software journal December 2018
No evidence of citation bias as a determinant of STEM gender disparities in US biochemistry, genetics and molecular biology research journal October 2019
Molecular bases of responses to abiotic stress in trees journal November 2019
The ten commandments of translational research informatics journal November 2019
A Systematic Review of Open Source Clinical Software on GitHub for Improving Software Reuse in Smart Healthcare journal January 2019