Topic Modeling Methods for Research Paper Classification

Author(s)
Maheshwari, Vidushi
Abstract
Topic modeling is crucial for document categorization, information retrieval, and database augmentation. This study assesses multilabel hierarchical classification of research papers using different models, datasets, and metrics. We replicate the original OpenAlex Concept Tagger (v1) as a baseline and compare its performance with the updated OpenAlex Topic Classification model (v2), which uses 92% fewer concepts while efficiently mapping v1 concepts. Through experiments involving concept tree pruning, data cleaning, and the use of embeddings, we evaluate the impact of these techniques on classification accuracy. Contingency analysis and sentence similarity assessments allow us to evaluate the unsupervised v2 model's performance against different ground truths. The findings advance our understanding of topic classification methodologies and their real-world applicability. Model versions and code are available for reproducibility.
Resource Type
Text
Resource Subtype
Undergraduate Research Option Thesis