Efficient data integration techniques in some modern applications

dc.contributor.advisor Mei, Yajun
dc.contributor.author Liu, Kun
dc.contributor.committeeMember Huo, Xiaoming
dc.contributor.committeeMember Vidakovic, Brani
dc.contributor.committeeMember Chen, Jie
dc.contributor.department Industrial and Systems Engineering
dc.date.accessioned 2019-05-29T13:59:49Z
dc.date.available 2019-05-29T13:59:49Z
dc.date.created 2018-05
dc.date.issued 2018-04-10
dc.date.submitted May 2018
dc.date.updated 2019-05-29T13:59:49Z
dc.description.abstract Data science is changing our society and economy, and complicated data from heterogeneous sources are often collected in industries such as finance, manufacturing, security, and pharmaceuticals. The main challenge is often how to analyze these data. One useful data analysis technique is data integration, which allows one to extract invaluable information from heterogeneous sources and make intelligent decisions at the global level. This dissertation aims to develop efficient data integration techniques for some modern real-world applications. We consider four different contexts: (i) online monitoring of large-scale data streams, (ii) consensus sequential detection over distributed networks, (iii) combining different patients' responses to assess the treatment effects of new drugs, and (iv) robust statistical inference in the presence of contaminated data. Chapter 1 investigates the problem of online monitoring of large-scale data streams, where an undesired event may occur at some unknown time and affect only a few unknown data streams. Existing methods are either statistically inefficient or computationally infeasible. Motivated by parallel and distributed computing, we propose a new information fusion technique, which we call the “SUM-Shrinkage” approach, that is efficient and scalable. The main idea is to run local detection procedures in parallel and to use the sum of the shrinkage transformations of the local detection statistics as a global statistic to make a decision. The proposed shrinkage transformation automatically filters out the unaffected data streams and uses only information from the affected data streams to make the decision. The usefulness of the proposed SUM-Shrinkage approach is illustrated with an example of monitoring large-scale independent normally distributed data streams when the local post-change mean shifts are unknown and can be positive or negative.
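The Chapter 1 idea (run local detectors in parallel, shrink each local statistic, then sum) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the dissertation's own procedure: the local statistic is taken to be a two-sided CUSUM for unit-variance normal data with a hypothetical reference shift `delta`, the shrinkage is taken to be soft-thresholding at a hypothetical level `b`, and `alarm` is a hypothetical global threshold.

```python
import numpy as np

def soft_threshold(w, b):
    """Shrinkage (assumed form): local statistics at or below b contribute zero."""
    return np.maximum(w - b, 0.0)

def sum_shrinkage_monitor(x, delta=1.0, b=5.0, alarm=20.0):
    """Monitor K data streams; x has shape (T, K), pre-change data assumed N(0, 1).

    Returns the first 1-based time at which the global statistic exceeds
    `alarm`, or None if no alarm is raised within the T observations.
    """
    n_steps, n_streams = x.shape
    up = np.zeros(n_streams)    # CUSUM for a mean shift of +delta
    down = np.zeros(n_streams)  # CUSUM for a mean shift of -delta
    for t in range(n_steps):
        # Log-likelihood ratio of N(+/-delta, 1) against N(0, 1), reset at zero.
        up = np.maximum(up + delta * x[t] - delta**2 / 2.0, 0.0)
        down = np.maximum(down - delta * x[t] - delta**2 / 2.0, 0.0)
        local = np.maximum(up, down)  # two-sided local statistic per stream
        # Global statistic: sum of the shrunken local statistics.
        if soft_threshold(local, b).sum() > alarm:
            return t + 1
    return None
```

Each stream updates its own recursion independently, so the per-step work parallelizes across streams; only the shrunken values need to be combined. Streams whose local statistic stays below `b` add nothing to the sum, which is how unaffected streams are filtered out in this sketch.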
In Chapter 2, we consider the consensus sequential detection problem over distributed sensor networks, in which each local sensor can only communicate local information with its immediate neighboring sensors at each time step, and the question is how the sensors can work together to make a quick but accurate decision when testing binary hypotheses on the true raw sensor distributions. An interesting data integration technique is based on weighted local likelihood ratio statistics, which yields the Consensus-Innovation Sequential Probability Ratio Test (CISPRT) algorithm proposed by Sahu and Kar (IEEE Trans. Signal Process., 2016). Our new contribution is to present improved, non-asymptotic properties of the CISPRT algorithm for Gaussian data in terms of network connectivity, no matter how large the number of sensors is. Moreover, we provide sharp upper bounds on the information loss of the CISPRT algorithm relative to the centralized optimal SPRT algorithm in terms of expected sample sizes in the asymptotic regime when the Type I and Type II error probabilities go to 0. Numerical simulations suggest that our results are useful in the practical setting when the number of sensors is moderately large. Chapter 3 aims to develop an efficient method that combines different patients' responses to assess the treatment effects of new drugs. Our research is motivated by Biogen's ongoing Phase 3 clinical trial of a new drug, “Aducanumab”, for Alzheimer's disease (AD), where the primary outcome is the change in the Clinical Dementia Rating-Sum of Boxes (CDR-SB) scores. The current gold standard method is the so-called responder analysis based on the two-sample proportion test, which only uses information at Months 0 and 18.
This might lose detection power for two reasons: (i) not every subject will have CDR-SB scores at Month 18, for reasons such as missed appointments or dropping out; (ii) it does not take advantage of the longitudinal study design, in which the CDR-SB scores are collected multiple times for most subjects (e.g., at Months 0, 6, 12, 18, 24 and 36 after enrollment in the study). We propose to model the CDR-SB scores by the Beta distribution and to use a mixed-effects Beta regression model that combines all observed CDR-SB values to increase the power to detect changes in the CDR-SB scores. The usefulness of our proposed models and methods is demonstrated through the Alzheimer's Disease Neuroimaging Initiative (ADNI) database and simulation studies. In Chapter 4 of the dissertation, we investigate the problem of robust statistical inference in the presence of contaminated data. Corrupted or contaminated data are often a big issue when we integrate data from different sources, and thus it is crucial to have robust local inference before combining different pieces of local information. We present our research on robust point estimation in the mixture model. Our main contribution is to consider an exponential loss function that better mitigates the effect of outliers and to develop an asymptotic theory in a new asymptotic regime in which the outlier means go to infinity at a suitable rate as the proportion of outliers goes to 0.
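The Chapter 4 idea of down-weighting outliers through an exponential-type loss can be illustrated with a small sketch. The abstract does not state the exact loss, so the form used here, 1 - exp(-(x - theta)^2 / (2 h^2)) with a hypothetical scale `h`, is an assumption, as are the function name and the fixed-point solver; this is not presented as the dissertation's estimator.

```python
import numpy as np

def exp_loss_location(x, h=1.0, n_iter=100, tol=1e-10):
    """Location estimate minimizing sum_i [1 - exp(-(x_i - theta)^2 / (2 h^2))].

    The loss form and the scale h are assumptions made for illustration.
    Setting the derivative to zero gives theta = sum(w * x) / sum(w) with
    w_i = exp(-(x_i - theta)^2 / (2 h^2)), which is iterated to convergence.
    """
    x = np.asarray(x, dtype=float)
    theta = np.median(x)  # robust starting point
    for _ in range(n_iter):
        # Points far from the current estimate receive weights near zero.
        w = np.exp(-((x - theta) ** 2) / (2.0 * h**2))
        new_theta = np.sum(w * x) / np.sum(w)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta
```

Because the loss is bounded, an observation far from the bulk of the data has almost no influence on the estimate, whereas under squared-error loss (the sample mean) its influence grows with its distance.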
dc.description.degree Ph.D.
dc.format.mimetype application/pdf
dc.identifier.uri http://hdl.handle.net/1853/61160
dc.language.iso en_US
dc.publisher Georgia Institute of Technology
dc.subject Data science
dc.subject Data integration
dc.subject Change-point
dc.subject CUSUM
dc.subject Parallel computing
dc.subject Quickest detection
dc.subject Sensor networks
dc.subject Distributed learning
dc.subject Network connectivity
dc.subject Sequential detection
dc.subject Beta-regression
dc.subject CDR
dc.subject Generalized linear mixed model
dc.subject ADNI database
dc.subject Robust inference
dc.title Efficient data integration techniques in some modern applications
dc.type Text
dc.type.genre Dissertation
dspace.entity.type Publication
local.contributor.advisor Mei, Yajun
local.contributor.corporatename H. Milton Stewart School of Industrial and Systems Engineering
local.contributor.corporatename College of Engineering
relation.isAdvisorOfPublication 278b2355-ca85-4111-b664-4d7e39f71482
relation.isOrgUnitOfPublication 29ad75f0-242d-49a7-9b3d-0ac88893323c
relation.isOrgUnitOfPublication 7c022d60-21d5-497c-b552-95e489a06569
thesis.degree.level Doctoral