Title:
Novel statistical learning and data mining methods for service systems improvement

dc.contributor.advisor Paynabar, Kamran
dc.contributor.author Ranjan, Chitta
dc.contributor.committeeMember Shi, Jianjun (Jan)
dc.contributor.committeeMember Vidakovic, Brani
dc.contributor.committeeMember Mei, Yajun
dc.contributor.department Industrial and Systems Engineering
dc.date.accessioned 2017-01-11T14:03:15Z
dc.date.available 2017-01-11T14:03:15Z
dc.date.created 2016-12
dc.date.issued 2016-11-07
dc.date.submitted December 2016
dc.date.updated 2017-01-11T14:03:15Z
dc.description.abstract This dissertation focuses on improving service systems using newly developed data mining methods. Among the many problems in this realm, it addresses three distinct and critical research topics. The first topic is the classical problem of accurately forecasting patient census, and thereby workloads, for hospital management. Most of the current literature focuses on optimal scheduling of inpatients but largely ignores accurate estimation of the paths patients take through treatment and recovery. As a result, current scheduling models are optimized on inaccurate input data. We develop a Clustering and Scheduling Integrated (CSI) approach to capture patient flows through a network of hospital services. CSI differs from existing approaches by clustering patients into groups based on the similarity of their paths, instead of admit, condition, or other physical attributes. To that end, we develop a novel Semi-Markov model (SMM)-clustering scheme. The methodology is validated by simulation and then applied to real patient data from a partner hospital, where it outperforms current methods. Further, we demonstrate that extant optimization methods achieve significantly better results on key hospital performance measures under CSI than under traditional estimation approaches. From a methodological standpoint, SMM-clustering is a novel approach applicable to temporal-spatial stochastic data, which are prevalent in many industries and application areas. The second topic studies data analysis in a special scenario: longitudinal data with measurement errors but no replicates. Longitudinal data are common across fields and sometimes contain measurement errors, especially when data collection involves several processing stages, as with MRI scans in medicine.
Multiple measurements (replicates) are often taken at the same time to gauge the measurement error and correct the analysis. However, obtaining replicates is sometimes not possible because of cost or associated risks; for instance, MRI scans are taken at long intervals because of their high cost. Inferences derived from such erroneous data can be unreliable and, in medical diagnosis, can be fatal. We therefore devise a new estimation approach, called EM-Variogram, that uses the autocorrelation in longitudinal data to isolate the variance due to measurement errors. This estimation approach enables more reliable data analysis and a powerful statistical test of model parameters. Applying this methodology to data from Alzheimer's disease patients, we could quickly and precisely detect signals of decline in patients' conditions. This can be extremely useful for providing the treatment patients need to improve their conditions. Other possible applications are also discussed. The last topic concerns one of the most common data types: sequences. Sequences are ubiquitous across fields such as the web, healthcare, bioinformatics, and text mining. This has made sequence mining a vital research area. However, sequence mining is particularly challenging because there is no accurate and fast approach for measuring (dis)similarity between sequences. Mainstream data mining methods such as k-means, kNN, and regression have shown distance between data points in a Euclidean space to be the most effective measure of (dis)similarity. However, a distance measure between sequences is not obvious because sequences are unstructured: arbitrary strings of arbitrary length. We therefore propose a new function, called Sequence Graph Transform (SGT), that extracts sequence features and embeds them in a finite-dimensional Euclidean space. It is scalable because of its low computational complexity and is applicable to any sequence problem.
We theoretically show that SGT can capture both short- and long-term patterns in sequences and provides an accurate distance-based measure of (dis)similarity between them. This is also validated experimentally. Finally, we show its real-world applications to clustering, classification, search, and visualization on different sequence problems.
dc.description.degree Ph.D.
dc.format.mimetype application/pdf
dc.identifier.uri http://hdl.handle.net/1853/56281
dc.language.iso en_US
dc.publisher Georgia Institute of Technology
dc.subject Mixture semi-Markov model clustering
dc.subject Expectation-maximization
dc.subject Variogram
dc.subject MRI
dc.subject Sequences
dc.subject Data mining
dc.title Novel statistical learning and data mining methods for service systems improvement
dc.type Text
dc.type.genre Dissertation
dspace.entity.type Publication
local.contributor.advisor Paynabar, Kamran
local.contributor.corporatename H. Milton Stewart School of Industrial and Systems Engineering
local.contributor.corporatename College of Engineering
relation.isAdvisorOfPublication e5b534cb-d155-48bb-a9ed-346168404f15
relation.isOrgUnitOfPublication 29ad75f0-242d-49a7-9b3d-0ac88893323c
relation.isOrgUnitOfPublication 7c022d60-21d5-497c-b552-95e489a06569
thesis.degree.level Doctoral
Files
Original bundle
Name:
RANJAN-DISSERTATION-2016.pdf
Size:
2.75 MB
Format:
Adobe Portable Document Format