Title
A pipeline for data and knowledge extraction from material science literature to accelerate scientific discovery
Author(s)
Shetty, Pranav
Advisor(s)
Ramprasad, Rampi
Zhang, Chao
Editor(s)
Collections
Supplementary to
Permanent Link
Abstract
Scientific literature is growing at an exponential pace, which makes it difficult for scientists to search through it and make effective use of the data it contains. In this work, we developed the methods and data sets needed to extract knowledge and material property data from a corpus of 2.4 million materials science articles. We uniquely identified extracted polymer materials by training supervised clustering models that use parameterized cosine distances with hierarchical agglomerative clustering; these models achieve state-of-the-art results on a benchmark data set of polymer named entity clusters. In addition, we built sequence labeling models that tag property information using an ontology specific to the materials domain. MaterialsBERT, a pre-trained encoder fine-tuned on the aforementioned corpus of materials science papers, was used as the encoder for the sequence labeling model and outperformed the baselines tested on data sets in the materials domain. We developed two pipelines to extract material property records from our corpus of papers: one that combines sequence labeling outputs with heuristic rules, and another that uses prompts to a large language model. The extracted data is publicly available through the interface polymerscholar.org. A subset of the extracted data was used to train machine learning models to predict the power conversion efficiency of polymer solar cells, thus demonstrating an end-to-end pipeline that goes from literature-extracted data to data-driven insights. This work will reduce the time spent in both the search and discovery phases of experimental work, allowing researchers to move beyond an Edisonian trial-and-error approach.
Sponsor
Date Issued
2023-08-16
Extent
Resource Type
Text
Resource Subtype
Dissertation