Topic Modeling Methods for Research Paper Classification
Author(s)
Maheshwari, Vidushi
Abstract
Topic modeling is central to document categorization, information retrieval, and database augmentation. This study assesses multilabel hierarchical classification of research papers across different models, datasets, and metrics. We replicate the original OpenAlex Concept Tagger (v1) as a baseline and compare its performance with an updated OpenAlex Topic Classification model (v2) that uses 92% fewer concepts while efficiently mapping v1 concepts. Through experiments involving concept tree pruning, data cleaning, and the use of embeddings, we evaluate the impact of these techniques on classification accuracy. Contingency analysis and sentence similarity assessments allow us to evaluate the unsupervised v2 model's performance against different ground truths. The findings advance our understanding of topic classification methodologies and their real-world applicability. Model versions and code are available for reproducibility.
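As an illustration of the contingency analysis mentioned in the abstract, the following is a minimal sketch and not the thesis code. It assumes one label per paper from each model (the study itself is multilabel), and the function names and toy labels are hypothetical. It counts how often each pair of labels co-occurs and reports, for each label in the first set, the label from the second set it most often coincides with.

```python
from collections import Counter


def contingency_table(labels_a, labels_b):
    """Count how often each (a, b) label pair co-occurs across papers."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    return Counter(zip(labels_a, labels_b))


def dominant_mapping(table):
    """For each label in A, find the B label it co-occurs with most often.

    Returns {a: (b, share)}, where share is the fraction of papers labeled a
    that also carry label b. Ties keep the pair that was seen first.
    """
    totals = Counter()
    for (a, _), n in table.items():
        totals[a] += n
    best = {}
    for (a, b), n in table.items():
        if a not in best or n > best[a][1]:
            best[a] = (b, n)
    return {a: (b, n / totals[a]) for a, (b, n) in best.items()}


# Hypothetical toy labels, invented for illustration only.
v1 = ["Computer science", "Computer science", "Biology",
      "Biology", "Computer science"]
v2 = ["Machine learning", "Machine learning", "Genomics",
      "Machine learning", "Databases"]

table = contingency_table(v1, v2)
mapping = dominant_mapping(table)
for a, (b, share) in mapping.items():
    print(f"{a} -> {b} ({share:.0%} of papers)")
```

A table of this kind makes it possible to see which labels from one scheme concentrate on a single label in the other and which are spread across several, without requiring a shared ground truth.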
Resource Type
Text
Resource Subtype
Undergraduate Research Option Thesis