OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Tympana - Machine Learning Assisted Data Annotation

Technical Report
OSTI ID: 1960310

A key step in making data Artificial Intelligence (AI)-ready is providing machine-readable labels for Machine Learning (ML) algorithms. This mapping of input data to outcomes is the foundation of supervised Machine Learning. Without a correct mapping or a sufficient volume of quality data, supervised learning becomes difficult if not impossible. In scientific domains, the labels come from highly trained individuals who can interpret the data and provide the correct output mapping. As Deep Learning algorithms spread, the demand for ever larger high-quality datasets keeps growing. It can be difficult to find a scientist who is deeply embedded within a domain and also has the algorithmic skillset to apply that domain knowledge at scale with Machine Learning techniques and tools. Building a team with an embedded ML expert is an alternative, but it requires more overhead in team management and an investment in knowledge transfer between domain experts and data scientists that few teams can afford. DataCicada’s solution, a platform named Tympana, puts powerful ML tools in the hands of domain experts. While it is easy to teach users at diverse education levels to recognize a specific object within an image, it takes years of training to properly evaluate a sensor reading from a DNA sequencer, an X-ray, a CT scan, or a particle accelerator. These domains require a body of knowledge and experience that does not lend itself to annotation by the general public. Tympana integrates active learning, explainable AI, and synthetic data generation to create robust models and datasets that can be exported for scientific workflows.
Tympana assists scientists in building high-quality Machine Learning models for their complex data, allowing them to scale their expertise and multiply their efforts. Tympana currently has proof-of-concept use cases in protein sequences, image object detection, and sensor signal detection, which demonstrate the capabilities of the platform. In initial experiments, the active learning strategies used by Tympana reduced the volume of data required, which will save scientists time in achieving comparable results. DataCicada expects to beta launch the tool shortly after the completion of Phase I SBIR funding. Because the platform will have been proven with biological data, the first likely customers are the Federal Government (DOE, NIH, NSF, CDC, FDA), pharmaceutical companies developing therapeutics, vaccine manufacturers, and universities performing biological research. SARS-CoV-2 research alone could benefit greatly from faster turnaround and better results for researchers.
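
The abstract does not say which active learning strategy Tympana uses. As a minimal sketch of one common strategy, least-confidence uncertainty sampling, the following assumes a model that outputs class probabilities for each unlabeled item. The function name `least_confident` and the sample probabilities are hypothetical and are not taken from the platform.

```python
from typing import List, Sequence


def least_confident(probabilities: Sequence[Sequence[float]], batch_size: int) -> List[int]:
    """Pick the pool items the model is least sure about.

    probabilities[i] holds the predicted class probabilities for unlabeled
    item i. Returns the indices of the batch_size items whose highest class
    probability is lowest, ordered from least to most confident.
    """
    # Confidence of each item is the probability of its most likely class.
    confidence = [max(p) for p in probabilities]
    # Rank pool indices by ascending confidence (sorted is stable on ties).
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return ranked[:batch_size]


# Hypothetical model outputs for four unlabeled samples, two classes each.
pool_probs = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],
]

# Indices of the two samples to send to the domain expert next.
to_label = least_confident(pool_probs, batch_size=2)
print(to_label)
```

In a full loop, the expert would label the selected items, the model would be retrained on the enlarged labeled set, and the selection would repeat. Spending expert time only on the most ambiguous items is how such a strategy can lower the amount of labeled data needed.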

Research Organization:
DATACICADA, LLC
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE Office of Science (SC), Biological and Environmental Research (BER)
Contributing Organization:
Envy Labs
DOE Contract Number:
SC0022459
OSTI ID:
1960310
Type / Phase:
SBIR (Phase I)
Report Number(s):
DOE-DC-22459
Country of Publication:
United States
Language:
English