OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Aggregation and Structuring of Materials and Chemicals Data from Diverse Sources

Abstract

The overarching goal of this work was to demonstrate how data-driven materials discovery that uses artificial intelligence can be adopted by both the industrial and academic sectors. Our objectives were fourfold. First, to build the necessary tools, both experimental and computational, to demonstrate how data-driven methods accelerate materials discovery in challenging design spaces for technologically important materials for which no theoretical framework exists relating processing, structure, and properties. Second, to apply these tools to make breakthrough discoveries in two materials classes, namely wear-resistant alloys and highly selective catalysts. Third, to develop the guiding principles and provide the foundational software frameworks that can be used with minimal effort in novel application areas outside the scope of the proposed work. Fourth, to demonstrate how to translate machine learning models, which have previously been regarded as ‘black boxes’, into new human-readable physicochemical insights. To this end we used high-throughput experimentation coupled to data-driven feedback to investigate synthesis routes for two complex material systems that exhibit immediate industrial potential but are challenged by engineering bottlenecks: wear-resistant metallic glasses and crystalline nanoparticle catalysts for hydrocarbon conversion. In both of these systems, current theoretical frameworks have fallen short of providing sufficiently robust predictions for candidate materials and the synthetic routes to creating them. By leveraging massively scalable data-driven models to inform a cycle of synthesis, measurement, and model building, we have compressed the timeline for breakthroughs in these systems by at least an order of magnitude.
Taking advantage of the synchrotron-based measurement capabilities at SSRL and the massively scalable data-driven modeling platform of Citrine Informatics (Citrination), we performed high-throughput experimentation informed by adaptive machine learning models (i.e., sequential learning) that scale from tens to tens of thousands of related experiments. This approach augments traditional combinatorial experiments by including a sequential learning algorithm in the loop to search large design spaces as efficiently as possible, informed by all prior experiments performed. The common design loop performed when these sequential learning algorithms are incorporated into traditional high-throughput combinatorial experiments is as follows. A target set of material properties or structures is specified. A traditional combinatorial experiment is performed to generate an initial data set. A machine learning algorithm is trained on this initial data set. The machine learning algorithm is then queried for candidate materials that are either likely to hit the specified target (i.e., exploit) or likely to significantly improve the predictive power of the algorithm (i.e., explore). New experiments are performed based on these predictions, and the resulting data are contributed back to the model, which is re-queried for the next set of candidate materials. This loop has several advantages over traditional design of experiments coupled to high-throughput combinatorial methods, but the primary advantage is that the sequential learning algorithm guides successive experiments using all prior data to efficiently search the potential design space by sampling only the most impactful regions. The result is significantly accelerated discovery.
In this work we demonstrated that in two years we were able to double the number of metallic glasses discovered over decades of research, as well as reduce the time to discover nanocrystal synthetic routes from 1 year to 3 days. This represents a significant increase in the efficiency of R&D in challenging design spaces. Additionally, the use of these tools resulted in the discovery of an FeBNb alloy with the wear resistance and Young’s modulus of 404 hardened stainless steel and the hardness of silicon carbide, as well as a catalyst that converts propane to propylene with 100% selectivity and negligible degradation in activity after 20 hours on stream. The catalysts discovered are significant because the propane dehydrogenation reaction has been identified as a potential target reaction that can see a 20% reduction in energy inputs if separation is less energy intensive. The high selectivity for the intended product, accompanied by less frequent catalyst regeneration cycles, has the potential to result in significant energy savings. Demonstrating sequential-learning, data-driven discovery required the development of algorithms and software. This included both the continued development of the Citrination platform and the development of freely available open-source software. The Citrination platform was augmented to include newly integrated data sets and new workflows for the uptake of x-ray scattering data. Additionally, the Citrination API was further developed so that it can be used through a Python interface. To facilitate the demonstration of autonomous closed-loop synthetic discovery, two software packages were developed. Xrsdkit (https://github.com/scattering-central/xrsdkit) uses a set of machine learning algorithms to completely automate the interpretation of x-ray scattering data and is extensible to any x-ray scattering data.
The Platform for Automated Workflows by SSRL, paws (https://github.com/slaclab/paws), handles all the machine control for the closed-loop, sequential-learning-driven synthesis and includes a set of sequential learning algorithms for the design of experiments, as well as a client for communication with Citrination that allows paws' built-in sequential learning algorithm to be replaced with Citrine’s design tool. Together these pieces of software ensure that other DOE projects or industry researchers who would benefit from sequential learning or automated materials discovery can do so with minimal effort on the part of the project performers. These tools will function out of the box for a wide array of applications in which the discovery of higher-performing materials is the current bottleneck. Additionally, the Citrination platform has seen broad adoption by industry users over the course of the project period, with a demonstrated track record of significantly compressing R&D timetables and reducing associated costs. The tools and results of this project provide a powerful demonstration of how machine learning can be used to significantly accelerate R&D cycles for the discovery of high-performing materials. In both cases the use of these tools either significantly increased the rate of discovery or significantly reduced the time to discovery. Furthermore, we demonstrated the ability to perform closed-loop automated discovery of complex materials, which to our knowledge is the first such example beyond small-molecule drug discovery. Given these successes we strongly recommend that future projects that include materials discovery as part of the proposed work use these methods in the research design. We have endeavored to build foundational tools that are applicable across application spaces to ensure that they are accessible to researchers and industry alike.
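
The design loop described in the abstract (specify a target, collect an initial data set, train a model, query it for exploit or explore candidates, run the experiments, and feed the results back) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the one-dimensional design space, the `run_experiment` stand-in, the inverse-distance surrogate, and the `kappa` explore/exploit weight are all hypothetical and are not the algorithms used by Citrination or paws.

```python
import random


def run_experiment(x):
    """Hypothetical stand-in for one synthesis-plus-measurement step.

    Returns a made-up property value that peaks at x = 0.7.
    """
    return -(x - 0.7) ** 2


def predict(x, data):
    """Toy surrogate model (an assumption, not the report's model).

    Prediction: inverse-distance-weighted mean of the measured values.
    Uncertainty: distance to the nearest measured point.
    """
    weighted = []
    nearest = float("inf")
    for xi, yi in data:
        d = abs(x - xi)
        if d == 0.0:
            return yi, 0.0  # already measured: exact value, no uncertainty
        weighted.append((1.0 / d ** 2, yi))
        nearest = min(nearest, d)
    total = sum(w for w, _ in weighted)
    mean = sum(w * y for w, y in weighted) / total
    return mean, nearest


def sequential_learning(candidates, target, n_initial=3, n_rounds=10,
                        kappa=0.5, seed=0):
    """Run the loop: initial data set, then repeated query/measure/update."""
    rng = random.Random(seed)
    # Initial "combinatorial" experiment: a random sample of the design space.
    measured = set(rng.sample(candidates, n_initial))
    data = [(x, run_experiment(x)) for x in sorted(measured)]

    for _ in range(n_rounds):
        untested = [x for x in candidates if x not in measured]
        if not untested:
            break

        def score(x):
            mean, uncertainty = predict(x, data)
            # Exploit: predicted value close to the target.
            # Explore: large uncertainty, weighted by kappa.
            return -abs(mean - target) + kappa * uncertainty

        best = max(untested, key=score)
        measured.add(best)
        # Contribute the new measurement back to the model's data set.
        data.append((best, run_experiment(best)))
    return data


if __name__ == "__main__":
    grid = [i / 100 for i in range(101)]
    results = sequential_learning(grid, target=0.0)
    best_x, best_y = max(results, key=lambda pair: pair[1])
    print(f"{len(results)} experiments, best candidate x={best_x}, value={best_y}")
```

Setting `kappa` to zero makes the query purely exploitative, while larger values push the loop toward unexplored regions; a real deployment would replace the surrogate with a trained machine learning model that reports calibrated uncertainties.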

Authors:
Tassone, Christopher [1]; Mehta, Apurva [1]
  1. SLAC National Accelerator Lab., Menlo Park, CA (United States). Stanford Synchrotron Radiation Lightsource (SSRL)
Publication Date:
December 2019
Research Org.:
SLAC National Accelerator Lab., Menlo Park, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
Contributing Org.:
Citrine Informatics, Inc., Redwood City, CA (United States)
OSTI Identifier:
1630122
Report Number(s):
SLAC-R-1140; FWP 100250
DOE Contract Number:  
AC02-76SF00515
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English

Citation Formats

Tassone, Christopher, and Mehta, Apurva. Aggregation and Structuring of Materials and Chemicals Data from Diverse Sources. United States: N. p., 2019. Web. doi:10.2172/1630122.
Tassone, Christopher, & Mehta, Apurva. Aggregation and Structuring of Materials and Chemicals Data from Diverse Sources. United States. doi:10.2172/1630122.
Tassone, Christopher, and Mehta, Apurva. 2019. "Aggregation and Structuring of Materials and Chemicals Data from Diverse Sources". United States. doi:10.2172/1630122. https://www.osti.gov/servlets/purl/1630122.
@article{osti_1630122,
title = {Aggregation and Structuring of Materials and Chemicals Data from Diverse Sources},
author = {Tassone, Christopher and Mehta, Apurva},
doi = {10.2172/1630122},
place = {United States},
year = {2019},
month = {12}
}