Transformer-Based Language Models' Learning of Irregular Past-Tense Verb Inflection
Author(s)
Sharma, Mihir Rajas
Abstract
During the process of learning English, human children learn that adding "-ed" to a verb's stem forms its past-tense inflection.
Verbs following this rule are known as regular past-tense verbs; however, many English verbs have irregular past-tense inflections. For example, the correct past tense of "go" is "went". As children learn, they begin to overregularize, applying the "-ed" rule even to verbs with irregular inflections: a child might say "goed" instead of "went".
This overregularization phenomenon manifests as a U-shaped learning curve: children initially produce the correct past tense for some irregular verbs, then overregularize, then finally acquire the correct forms. There is much debate over why children demonstrate this U-shaped curve when learning the past tense.
Many prior works have also sought to computationally model the U-shaped learning curve on past-tense inflection.
Today, transformer-based Language Models (LMs) have achieved remarkable success at learning language, growing in size and scale until they frequently produce text indistinguishable from human-written text.
This work investigates whether modern, transformer-based LMs also demonstrate a U-shaped learning curve on past-tense inflection as they learn English.
We simulated human language acquisition in a small LM, BabyBERTa, saving checkpoints during pre-training to simulate human age and learning over time.
To further simulate human language acquisition, we trained a model on a dataset of Child-Directed Speech (CDS). We also compared its performance to that of models trained on data taken from Wikipedia as a baseline for how modern LMs are trained.
To evaluate each checkpoint, we presented it with a series of sentence pairs containing the correct and the overregularized past-tense inflection of each verb. Our metric was the accuracy with which the checkpoint assigned a higher probability to the correct inflection. We repeated this experiment five times and report the averaged results.
We measured model performance on three vocabulary conditions: first, on a subset of the evaluation data whose vocabulary consisted entirely of CDS; second, on a subset where only the verbs were guaranteed to appear in CDS; and third, on a subset where many verbs, as well as other sentence tokens, were infrequently seen by the models.
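The pairwise evaluation described above can be sketched as follows. This is a minimal illustration, not the thesis code: `sentence_logprob` stands in for a real checkpoint's sentence scorer (e.g., a masked-LM pseudo-log-likelihood), and the sentences and scores below are toy values, not thesis data.

```python
def pairwise_accuracy(sentence_pairs, sentence_logprob):
    """Fraction of (correct, overregularized) sentence pairs for which
    the model assigns a higher probability to the correct inflection."""
    hits = sum(
        1 for good, bad in sentence_pairs
        if sentence_logprob(good) > sentence_logprob(bad)
    )
    return hits / len(sentence_pairs)

# Toy stand-in for a checkpoint's sentence log-probabilities.
toy_scores = {
    "Yesterday she went home.": -9.1,
    "Yesterday she goed home.": -13.4,
    "He ate the apple.": -8.0,
    "He eated the apple.": -7.5,  # this pair is scored incorrectly
}
pairs = [
    ("Yesterday she went home.", "Yesterday she goed home."),
    ("He ate the apple.", "He eated the apple."),
]

print(pairwise_accuracy(pairs, toy_scores.get))  # 0.5
```

In the thesis, this accuracy is computed per checkpoint and per vocabulary subset; a dip and recovery in the resulting accuracy-over-training curve is the signature of U-shaped learning.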
There was evidence that the model trained on CDS showed a U-shaped curve on the subset of the evaluation data consisting entirely of CDS, but none for the model trained on adult-written text. There was clear evidence that both the models trained on CDS and the models trained on adult-written text showed a U-shaped learning curve on the subset of data consisting of verbs exclusive to CDS but whose other sentence tokens could be drawn from adult text. Neither model showed any indication of a U-shaped curve on the subset of data containing many verbs and sentence tokens from adult-written speech; additionally, the model trained on CDS, which infrequently saw many of the verbs evaluated in this subset, did not prefer to overregularize the new verbs.
In all, our findings support the idea that LMs behave similarly to human children during language acquisition. Because our model had no explicitly defined rule-based structure yet still showed evidence of the overregularization phenomenon, we conclude that a rule-based system in the human brain is likewise not necessary for language processing. We did not find evidence that Child-Directed Speech as pre-training data increased the likelihood of a U-shaped learning curve. Nor did we find any relationship between model performance and the number of training-data occurrences for sub-classes of irregular past-tense verbs, providing further evidence that the model overregularizes the "-ed" rule.
There are some major differences between the input received by humans and LMs during language acquisition. Human children learn language from considerably less token input than modern LMs. At the same time, children incorporate physical, social, and auditory context into their learning of language; LMs, by contrast, are entirely token-based. In addition, children acquire their first language in a different modality than LMs: spoken rather than written language.
Despite these substantial differences in modality, context, and data efficiency, there is increasing interest in understanding how human and computational language acquisition are similar, and how LMs can benefit from human-like learning. Our experiment tests one such similarity: whether LMs display the overregularization phenomenon characteristic of human language acquisition.
There is much more work to be done in this area. Our experiment considered only one model, which did not perfectly learn the past tense. Future directions may consider re-creating this experiment with other LMs, modified training datasets, and improved evaluation metrics.
Date
2025-04-30
Resource Type
Text
Resource Subtype
Thesis