Can Large Language Models Reliably Extract Dialect Features?

Author(s)
Ramanathan, Vishnesh J.
Abstract
Dialect features are morphosyntactic variations that uniquely characterize a dialect and help distinguish it from other dialects. Although they offer a useful lens for studying dialects from a bottom-up perspective, large-scale extraction of dialect features has long remained a challenge in computational linguistics. This paper investigates the potential of Large Language Models (LLMs) to automatically extract dialect features from text. We find that LLMs can be quite reliable in extracting dialect features and can serve as a powerful alternative to human annotation because of their scalability and cost efficiency. To motivate further work in this area, we release DIA-BENCH, a large-scale corpus of sentences annotated for dialect features across four dialects, and provide a set of best practices for prompting strategies and optimization.
Resource Type
Text
Resource Subtype
Undergraduate Research Option Thesis