Title:
Unsupervised Learning for Lexicon-Based Classification
Author(s)
Eisenstein, Jacob
Abstract
In lexicon-based classification, documents are assigned labels
by comparing the number of words that appear from two opposed
lexicons, such as positive and negative sentiment. Creating
such word lists is often easier than labeling instances,
and they can be debugged by non-experts if classification performance
is unsatisfactory. However, there is little analysis
or justification of this classification heuristic. This paper describes
a set of assumptions that can be used to derive a probabilistic
justification for lexicon-based classification, as well
as an analysis of its expected accuracy. One key assumption
behind lexicon-based classification is that all words in each
lexicon are equally predictive. This is rarely true in practice,
which is why lexicon-based approaches are usually outperformed
by supervised classifiers that learn distinct weights on
each word from labeled instances. This paper shows that it is
possible to learn such weights without labeled data, by leveraging
co-occurrence statistics across the lexicons. This offers
the best of both worlds: light supervision in the form of lexicons,
and data-driven classification with higher accuracy than
traditional word-counting heuristics.
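
The word-counting heuristic described in the abstract can be illustrated with a minimal sketch. The lexicons, the tokenizer, and the function name `lexicon_classify` are hypothetical choices made for illustration; they are not taken from the paper, and the sketch covers only the baseline heuristic, not the paper's method for learning word weights.

```python
import re
from collections import Counter

# Hypothetical toy lexicons, chosen only for illustration.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}


def lexicon_classify(text, positive=POSITIVE, negative=NEGATIVE):
    """Label a document by comparing counts of words from two opposed lexicons."""
    # Simple lowercase tokenizer (an assumption; any tokenizer could be used).
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Every lexicon word gets the same weight, which is the equal-predictiveness
    # assumption the abstract identifies.
    pos = sum(counts[w] for w in positive)
    neg = sum(counts[w] for w in negative)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "tie"


print(lexicon_classify("A great cast and an excellent script, despite a bad ending."))
```

Because each matching word contributes exactly one vote, a weakly predictive word counts as much as a strongly predictive one. This is the limitation that per-word weights are meant to address.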
Date Issued
2017
Resource Type
Text
Resource Subtype
Pre-print
Proceedings