Edit-based Language Model Evaluation
Author(s)
Heineman, David
Abstract
Large language models are uniquely capable of producing highly rated generated text, yet current work struggles to build adequate measures of language model ability or to capture a specific taxonomy of errors for individual generation tasks. At the center of this evaluation crisis is a lack of expressivity in current methods: they rely on feedback that is too coarse to isolate the underlying successful behaviors or failure modes. To address these limitations, this work introduces edit-based language model evaluation, a new evaluation paradigm that focuses on annotating aligned pairs of text as a comprehensive measure of all evaluation operations over text generation. To enable edit-based evaluation, I first introduce Thresh, a unified, customizable, and deployable platform for fine-grained evaluation. With a single YAML configuration file, users can build and test an annotation interface for any framework within minutes -- all in one web browser window. Thresh provides a community hub that hosts a collection of fine-grained frameworks and corresponding annotations created and collected by the community, covering a wide range of NLP tasks. Using Thresh, I then demonstrate the efficacy of edit-based language model evaluation by introducing SALSA, a novel taxonomy of successes and errors for the task of text simplification. SALSA consists of twenty-one linguistically grounded edit types, covering the full spectrum of success and failure across the dimensions of conceptual, syntactic, and lexical simplicity. Using SALSA, I reveal discrepancies in the distribution of simplification strategies performed by fine-tuned models, prompted LLMs, and humans, and find that GPT-3.5 performs more high-quality edits than humans but still exhibits frequent errors. Using these fine-grained annotations, I develop LENS-SALSA, a reference-free automatic simplification metric trained to predict sentence- and word-level quality simultaneously.
Thresh is publicly accessible at https://thresh.tools and our data and metric for SALSA are available at https://salsa-eval.com.
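The abstract notes that a Thresh annotation interface is built from a single YAML configuration file. As a purely illustrative sketch (the field names below are assumptions for exposition, not Thresh's actual schema -- see https://thresh.tools for the real format), such a configuration might resemble:

```
# Hypothetical sketch of an edit-based annotation interface config.
# All keys and values here are illustrative, not the Thresh schema.
template_name: simplification-demo
edits:
  - name: deletion          # an edit type to annotate
    label: Deletion
    enable_output: true
    annotation:
      quality:              # a follow-up question per annotated span
        question: Was this deletion successful?
        options: [successful, unsuccessful]
  - name: paraphrase
    label: Paraphrase
    enable_input: true
    enable_output: true
```

The appeal of such a design is that a new fine-grained evaluation framework (e.g., a taxonomy like SALSA) can be deployed by editing a declarative file rather than writing interface code.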
Resource Type
Text
Resource Subtype
Undergraduate Research Option Thesis