Information Extraction on Scientific Literature under Limited Supervision

Bai, Fan

Files

BAI-DISSERTATION-2023.pdf (5.64 MB)

Author(s)

Bai, Fan

Advisor(s)

Ritter, Alan

Associated Organization(s)

Organizational Unit

College of Computing

Organizational Unit

School of Interactive Computing

Collections

Theses and Dissertations

Permanent Link

https://hdl.handle.net/1853/73149

Abstract

The exponential growth of scientific literature presents both challenges and opportunities for researchers across disciplines. Effectively extracting pertinent information from this extensive corpus is crucial for advancing knowledge, enhancing collaboration, and driving innovation. However, manual extraction is laborious and time-consuming, underscoring the demand for automated solutions. Information extraction (IE), a sub-field of natural language processing (NLP) focused on automatically extracting structured information from unstructured data sources, plays a crucial role in addressing this challenge. Despite their success, many IE methods require substantial human-annotated data, which may not be readily available, particularly in specialized scientific domains. This highlights the need for adaptable and robust techniques capable of functioning with limited supervision. In this thesis, we study information extraction on scientific literature, focusing on the challenge of limited (human) supervision. Specifically, our work examines four key dimensions of this problem. First, we explore the potential of harnessing easily accessible resources, like knowledge bases, to develop IE systems without direct human supervision. Second, we examine the use of pre-trained language models to create effective and efficient scientific IE systems, experimenting with various fine-tuning architectures and learning strategies. Third, we investigate the trade-off between the labor cost of human annotation and the computational cost of domain-specific pre-training, with the goal of achieving optimal performance under budget constraints. Lastly, we capitalize on the emerging capabilities of large pre-trained language models by showing how information extraction can be performed based solely on a human-crafted data schema.
Through these explorations, this thesis aims to lay a solid foundation for the continued advancement of scientific IE under limited supervision.

Date Issued

2023-12-12

Resource Type

Text

Resource Subtype

Dissertation

Georgia Tech Library
