Text-classification methods and the mathematical theory of Principal Components

Chen, Jiangning
Matzinger, Heinrich
Lounici, Karim
This thesis studies three topics. First, in text classification, Principal Component Analysis (PCA) can be used as a dimension-reduction technique or, when there are only a few topics, even as an unsupervised classification method. We investigate how useful it is for real-life problems. The difficulty is that the spectrum of the covariance matrix is often estimated incorrectly because the ratio of the number of samples to the feature-space dimension is not large enough. We show how to reconstruct the spectrum of the ground-truth covariance matrix from the spectrum of the estimated covariance matrix for multivariate normal vectors. We then present an algorithm for reconstructing the spectrum in the case of sparse matrices related to text classification. In the second part, we concentrate on schemes of PCA estimators. For the problem of finding the least eigenvalue and corresponding eigenvector of the ground-truth covariance matrix, a famous classical estimator is due to Krasulina. We state Krasulina's convergence proof for the least eigenvalue and the corresponding eigenvector, and then derive their convergence rate. In the last part, we consider the application problem, text classification, from the supervised point of view with the traditional Naive Bayes method. We propose an updated Naive Bayes method with a new loss function, which loses the unbiasedness of the traditional Naive Bayes method but yields an estimator with smaller variance.
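The abstract does not state the update rule behind the Krasulina estimator, so the following is only a minimal sketch of a Krasulina-type stochastic-approximation scheme for the least eigenvalue and its eigenvector. It assumes each sample x moves the current vector w against the gradient of the Rayleigh quotient. The function names, the step-size schedule c / (n + n0), and the diagonal test covariance are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np


def krasulina_least(samples, c=1.0, n0=100.0, seed=0):
    """Krasulina-type sketch: estimate the eigenvector of the least eigenvalue.

    Assumed update (not quoted from the thesis):
        w <- w - gamma_n * ((x.w) x - ((x.w)^2 / |w|^2) w)
    with the assumed step size gamma_n = c / (n + n0).
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(samples.shape[1])  # random starting vector
    for n, x in enumerate(samples, start=1):
        gamma = c / (n + n0)
        xw = x @ w
        rayleigh = xw * xw / (w @ w)  # one-sample Rayleigh quotient
        # Step against the Rayleigh-quotient gradient to seek the minimum.
        w = w - gamma * (xw * x - rayleigh * w)
    return w / np.linalg.norm(w)


def demo():
    """Run the sketch on centred Gaussian data with a known diagonal covariance."""
    rng = np.random.default_rng(1)
    true_var = np.array([3.0, 2.0, 0.5])  # least eigenvalue 0.5, third axis
    samples = rng.standard_normal((20000, 3)) * np.sqrt(true_var)
    w = krasulina_least(samples)
    # Estimate the eigenvalue as the empirical variance along w.
    eig_est = float(np.mean((samples @ w) ** 2))
    return w, eig_est


if __name__ == "__main__":
    w, eig_est = demo()
    print("estimated direction:", w)
    print("estimated least eigenvalue:", eig_est)
```

In this sketch the update direction is orthogonal to w for every sample, so the norm of w never shrinks and only the direction is informative. That is why the result is normalised before it is returned.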