Studying Text Revision in Scientific Writing

Author(s)
Jiang, Chao
Associated Organization(s)
School of Interactive Computing
Abstract
Writing is essential for sharing scientific discoveries, and researchers devote significant effort to revising their papers to improve writing quality and incorporate new findings. The revision process encodes valuable knowledge, including logical and structural improvements at the document level and stylistic and grammatical refinements at the sentence and word levels. This dissertation presents a complete computational framework for extracting text revisions at different granularities and analyzing edits made for different purposes. The first contribution is the development of state-of-the-art methods for monolingual sentence alignment. We propose a neural CRF model that captures sequential dependencies and semantic similarity between sentences in parallel documents. The proposed approach outperforms previous methods by a large margin and enables the creation of high-quality text simplification datasets, such as Newsela-Auto and Wiki-Auto. Next, to study fine-grained editing operations, we design a novel neural semi-Markov CRF alignment model for monolingual word alignment. This model unifies word and phrase alignments using variable-length spans and achieves state-of-the-art performance in both in-domain and out-of-domain evaluations. It also demonstrates utility in downstream tasks, such as automatic text simplification and sentence pair classification. We further present arXivEdits, a dataset containing human-annotated sentence alignments and fine-grained span-level edits across multiple versions of 751 research papers. Enabled by this corpus, we perform a detailed analysis of revision strategies in scientific writing, revealing common practices researchers use to improve their papers. Finally, this dissertation explores human revision from a readability perspective through MedReadMe, a new dataset consisting of sentence-level readability ratings and complex span annotations for 4,520 medical sentences.
This dataset supports fine-grained readability analysis and the evaluation of state-of-the-art readability metrics. By incorporating novel features, we significantly improve their correlation with human judgments.
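The monolingual sentence alignment idea described in the abstract can be illustrated with a toy sketch. This is not the dissertation's neural CRF: it substitutes Jaccard token overlap for the learned semantic similarity (the emission score) and a hand-set distance penalty for the learned transition scores, then decodes with Viterbi dynamic programming. All function names and the `trans_weight` parameter are hypothetical, chosen for illustration only.

```python
import math

def similarity(a, b):
    """Jaccard token overlap: a crude stand-in for a neural semantic encoder."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def align(simple_doc, complex_doc, trans_weight=0.1):
    """Viterbi decoding over alignment states.

    State j for simple sentence i means "simple_doc[i] aligns to
    complex_doc[j]". Emission: token-overlap similarity. Transition:
    a penalty on |j - k - 1| that rewards monotone, step-by-one
    alignments (a simplification of a CRF's learned transitions).
    Returns a list of (simple_index, complex_index) pairs.
    """
    n, m = len(simple_doc), len(complex_doc)
    emis = [[similarity(s, c) for c in complex_doc] for s in simple_doc]
    dp = [[0.0] * m for _ in range(n)]    # best score ending in state j
    back = [[0] * m for _ in range(n)]    # backpointers for decoding
    dp[0] = emis[0][:]
    for i in range(1, n):
        for j in range(m):
            best_k, best = 0, -math.inf
            for k in range(m):
                score = dp[i - 1][k] - trans_weight * abs(j - k - 1)
                if score > best:
                    best, best_k = score, k
            dp[i][j] = best + emis[i][j]
            back[i][j] = best_k
    # Backtrace from the highest-scoring final state.
    j = max(range(m), key=lambda col: dp[n - 1][col])
    pairs = [(n - 1, j)]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        pairs.append((i - 1, j))
    return list(reversed(pairs))
```

For example, aligning the two-sentence document `["the cat sat on the mat", "dogs bark loudly"]` against a three-sentence "complex" document whose first and third sentences are wordier versions of those yields `[(0, 0), (1, 2)]`, skipping the unrelated middle sentence. The dissertation's actual models replace both hand-crafted scores with learned neural components and handle null and many-to-one alignments.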
Date
2025-03-19
Resource Type
Text
Resource Subtype
Dissertation