Title:
Text-classification methods and the mathematical theory of Principal Components

dc.contributor.advisor Matzinger, Heinrich
dc.contributor.advisor Lounici, Karim
dc.contributor.author Chen, Jiangning
dc.contributor.committeeMember Popescu, Ionel
dc.contributor.committeeMember Huo, Xiaoming
dc.contributor.committeeMember Bonetto, Federico
dc.contributor.department Mathematics
dc.date.accessioned 2019-08-21T13:50:39Z
dc.date.available 2019-08-21T13:50:39Z
dc.date.created 2019-08
dc.date.issued 2019-04-22
dc.date.submitted August 2019
dc.date.updated 2019-08-21T13:50:39Z
dc.description.abstract This thesis studies three topics. First, in text classification one may use Principal Component Analysis (PCA) as a dimension-reduction technique, or, when there are only a few topics, even as an unsupervised classification method. We investigate how useful it is for real-life problems. The difficulty is that the spectrum of the covariance matrix is often estimated incorrectly because the ratio of sample size to feature-space dimension is not large enough. We show how to reconstruct the spectrum of the ground-truth covariance matrix from the spectrum of the estimated covariance for multivariate normal vectors, and we then present an algorithm for reconstructing the spectrum in the case of the sparse matrices arising in text classification. In the second part, we concentrate on schemes of PCA estimators. Consider the problem of finding the least eigenvalue and corresponding eigenvector of the ground-truth covariance matrix; a famous classical estimator is due to Krasulina. We prove convergence of Krasulina's scheme for the least eigenvalue and corresponding eigenvector, and then establish its convergence rate. In the last part, we consider the application problem, text classification, from the supervised viewpoint with the traditional Naive Bayes method. We derive a modified Naive Bayes method with a new loss function, which gives up the unbiasedness of the traditional Naive Bayes estimator but achieves a smaller variance.
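The Krasulina-type scheme mentioned in the abstract can be illustrated with a minimal sketch: a stochastic iteration that, run as descent on the Rayleigh quotient, drifts toward the least eigenvector of a covariance matrix. This is not the estimator analyzed in the thesis; the toy covariance, step-size schedule, and sample count below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ground-truth covariance: eigenvalues 1, 2, 3; the least
# eigenvector is the first coordinate axis.
Sigma = np.diag([1.0, 2.0, 3.0])
d = Sigma.shape[0]
X = rng.multivariate_normal(np.zeros(d), Sigma, size=200_000)

# Krasulina-type stochastic iteration, run as *descent* on the
# Rayleigh quotient so the iterate converges to the least eigenvector.
w = rng.normal(size=d)
w /= np.linalg.norm(w)
for t, x in enumerate(X, start=1):
    xw = x @ w
    direction = xw * x - xw**2 * w      # Krasulina update direction (for unit w)
    w -= (2.0 / (t + 100)) * direction  # minus sign targets the smallest eigenvalue
    w /= np.linalg.norm(w)              # keep the iterate on the unit sphere

least_val = w @ Sigma @ w               # Rayleigh quotient at the final iterate
```

On this toy example the Rayleigh quotient of the final iterate approaches the least eigenvalue 1, and the iterate aligns with the first coordinate axis; the thesis studies the convergence rate of such schemes rigorously.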
dc.description.degree Ph.D.
dc.format.mimetype application/pdf
dc.identifier.uri http://hdl.handle.net/1853/61686
dc.language.iso en_US
dc.publisher Georgia Institute of Technology
dc.subject Text-classification
dc.subject NLP
dc.subject PCA
dc.subject Online PCA
dc.subject Incremental scheme
dc.subject Naive Bayes
dc.subject Partial labeling
dc.subject KL divergence
dc.title Text-classification methods and the mathematical theory of Principal Components
dc.type Text
dc.type.genre Dissertation
dspace.entity.type Publication
local.contributor.advisor Matzinger, Heinrich
local.contributor.corporatename College of Sciences
local.contributor.corporatename School of Mathematics
relation.isAdvisorOfPublication b5feafcc-8bcd-4cf0-a1fc-25fb479550a4
relation.isOrgUnitOfPublication 85042be6-2d68-4e07-b384-e1f908fae48a
relation.isOrgUnitOfPublication 84e5d930-8c17-4e24-96cc-63f5ab63da69
thesis.degree.level Doctoral
Files
Original bundle
Name: CHEN-DISSERTATION-2019.pdf
Size: 1.04 MB
Format: Adobe Portable Document Format
License bundle
Name: LICENSE.txt
Size: 3.87 KB
Format: Plain Text