Efficient data integration techniques in some modern applications

Liu, Kun
Mei, Yajun
Data science is changing our society and economy, and complicated data from heterogeneous sources are routinely collected in industries such as finance, manufacturing, security, and pharmaceuticals. The main challenge is often how to analyze such data. One useful data analysis technique is data integration, which allows one to extract invaluable information from heterogeneous sources and make intelligent decisions at the global level. This dissertation aims to develop efficient data integration techniques for some modern real-world applications. We consider four different contexts: (i) online monitoring of large-scale data streams, (ii) consensus sequential detection over distributed networks, (iii) combining different patients' responses to assess the treatment effects of new drugs, and (iv) robust statistical inference in the presence of contaminated data.
Chapter 1 investigates the problem of online monitoring of large-scale data streams, where an undesired event may occur at some unknown time and affect only a few unknown data streams. Existing methods are either statistically inefficient or computationally infeasible. Motivated by parallel and distributed computing, we propose a new information fusion technique, which we call the “SUM-Shrinkage” approach, that is efficient and scalable. The main idea is to run local detection procedures in parallel and to use the sum of shrinkage transformations of the local detection statistics as a global statistic for making a decision. The shrinkage transformation automatically filters out the unaffected data streams, so that only information from the affected data streams is used to make the decision. The usefulness of the proposed SUM-Shrinkage approach is illustrated with an example of monitoring large-scale independent, normally distributed data streams in which the local post-change mean shifts are unknown and can be positive or negative.
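The SUM-Shrinkage idea described for Chapter 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the dissertation's procedure: the abstract does not specify the local statistic, the shrinkage function, or the thresholds. Here we assume N(0, 1) pre-change data, a two-sided CUSUM local statistic tuned to a hypothetical minimal shift size `rho`, a soft-thresholding shrinkage with hypothetical parameter `b`, and a global alarm threshold `a`.

```python
from typing import Iterable, List, Optional, Sequence


def soft_threshold(w: float, b: float) -> float:
    """Assumed shrinkage transformation: statistics below b contribute nothing."""
    return max(w - b, 0.0)


def sum_shrinkage_monitor(
    observations: Iterable[Sequence[float]],
    rho: float,
    b: float,
    a: float,
) -> Optional[int]:
    """Return the first time step (1-indexed) at which the global statistic
    reaches the threshold a, or None if no alarm is raised.

    observations: one sequence per time step, holding one value per stream.
    rho: assumed minimal magnitude of the post-change mean shift.
    b:   assumed soft-threshold (shrinkage) parameter.
    a:   assumed global alarm threshold.
    """
    w_pos: List[float] = []  # local CUSUM statistics for a positive shift
    w_neg: List[float] = []  # local CUSUM statistics for a negative shift
    for n, x_n in enumerate(observations, start=1):
        if not w_pos:
            w_pos = [0.0] * len(x_n)
            w_neg = [0.0] * len(x_n)
        # Local updates are independent across streams, so in practice
        # they could run in parallel.
        for k, x in enumerate(x_n):
            # Log-likelihood ratio of N(+rho, 1) or N(-rho, 1) against N(0, 1).
            w_pos[k] = max(0.0, w_pos[k] + rho * x - rho * rho / 2.0)
            w_neg[k] = max(0.0, w_neg[k] - rho * x - rho * rho / 2.0)
        # Global statistic: sum of shrunken two-sided local statistics.
        global_stat = sum(
            soft_threshold(max(p, q), b) for p, q in zip(w_pos, w_neg)
        )
        if global_stat >= a:
            return n
    return None
```

Because the soft threshold maps small local statistics to zero, streams that stay near their pre-change behavior drop out of the sum, which is the filtering effect the abstract attributes to the shrinkage transformation.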
In Chapter 2, we consider the consensus sequential detection problem over distributed sensor networks, in which each local sensor can communicate local information only with its immediate neighbors at each time step. The question is how the sensors can work together to make a quick but accurate decision when testing binary hypotheses on the true raw sensor distributions. An interesting data integration technique is based on weighted local likelihood ratio statistics, which yields the Consensus-Innovation Sequential Probability Ratio Test (CISPRT) algorithm proposed by Sahu and Kar (IEEE Trans. Signal Process., 2016). Our new contribution is to present improved, non-asymptotic properties of the CISPRT algorithm for Gaussian data in terms of network connectivity, no matter how large the number of sensors is. Moreover, we provide sharp upper bounds on the information loss of the CISPRT algorithm relative to the centralized optimal SPRT algorithm, in terms of expected sample sizes, in the asymptotic regime where the Type I and Type II error probabilities go to 0. Numerical simulations suggest that our results are useful in the practical setting where the number of sensors is moderately large.
Chapter 3 aims to develop an efficient method for combining different patients' responses to assess the treatment effects of new drugs. Our research is motivated by Biogen's ongoing Phase 3 clinical trial of a new drug, “Aducanumab”, for Alzheimer's disease (AD), where the primary outcome is the change in the Clinical Dementia Rating-Sum of Boxes (CDR-SB) score. The current gold standard method is the so-called responder analysis based on the two-sample proportion test, which uses information only at Months 0 and 18.
This might lose detection power for two reasons: (i) not every subject will have CDR-SB scores at Month 18, for reasons such as missed appointments or dropout; (ii) it does not take advantage of the longitudinal study design, in which the CDR-SB scores are collected multiple times for most subjects (e.g., at Months 0, 6, 12, 18, 24, and 36 after enrollment in the study). We propose to model the CDR-SB scores by the Beta distribution and to use a mixed-effects Beta regression model that combines all observed CDR-SB values to increase the power to detect changes in the CDR-SB scores. The usefulness of our proposed models and methods is demonstrated using the Alzheimer's Disease Neuroimaging Initiative (ADNI) database and simulation studies.
In Chapter 4 of the dissertation, we investigate the problem of robust statistical inference in the presence of contaminated data. Corrupted or contaminated data are often a major issue when we integrate data from different sources, and thus it is crucial to have robust local inference before combining different pieces of local information. We present our research on robust point estimation in the mixture model. Our main contribution is to consider an exponential loss function that is better at mitigating the effect of outliers, and to develop an asymptotic theory in a new asymptotic regime in which the outlier means go to infinity at a suitable rate as the proportion of outliers goes to 0.
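To illustrate why an exponential-type loss can mitigate outliers in the Chapter 4 setting, here is a minimal Python sketch of a robust location estimate. The abstract does not state the exact loss, so the form used here, 1 - exp(-r^2 / (2 h^2)) with a hypothetical scale parameter `h`, is an assumption, as is the fixed-point iteration used to minimize it; the function name is likewise illustrative.

```python
import math
from statistics import median
from typing import Sequence


def exp_loss_location(
    xs: Sequence[float],
    h: float = 1.0,
    tol: float = 1e-10,
    max_iter: int = 200,
) -> float:
    """Location estimate minimizing sum_i [1 - exp(-(x_i - mu)^2 / (2 h^2))].

    Setting the derivative to zero gives mu = sum(w_i x_i) / sum(w_i) with
    w_i = exp(-(x_i - mu)^2 / (2 h^2)), which is iterated from the median.
    The loss form and the scale h are assumptions for illustration.
    """
    mu = median(xs)
    for _ in range(max_iter):
        # Points far from the current estimate receive weights near zero.
        weights = [math.exp(-((x - mu) ** 2) / (2.0 * h * h)) for x in xs]
        new_mu = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


# Five inliers near 0 and one gross outlier.
sample = [0.1, -0.2, 0.05, 0.0, 0.15, 50.0]
robust_estimate = exp_loss_location(sample)
plain_mean = sum(sample) / len(sample)
```

In this toy sample the outlier at 50 pulls the plain mean far from the inliers, while its weight under the exponential-type loss is essentially zero, so the robust estimate stays close to the inlier average.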