Some results in high-dimensional statistics

Xiao, Yi
Yuan, Ming
High-dimensional statistics is one of the most active research topics in modern statistics, with applications in many fields such as computer science, biology, and economics. Recent advances in computer technology allow large volumes of data to be collected and stored relatively easily. Combined with advances in the processing and analytical capabilities of computers, this has accelerated growth in research and technology and made our daily lives much easier than ever before. At the same time, the complexity of data, in both size and structure, poses new challenges for statisticians, who must distinguish useful information from noise efficiently and accurately.

A common problem in high-dimensional statistics arises when the number of covariates exceeds the sample size. Most classical approaches are either inapplicable or produce unsatisfactory results in such problems, although extensive research efforts have been made to overcome these difficulties. One of the more popular ways to address the lack of degrees of freedom is to introduce additional assumptions on the data structure that reduce model complexity, such as sparsity of the coefficients in linear regression models and sparsity of the inverse covariance matrix in Gaussian graphical models. It has been shown that, under certain assumptions and with proper regularization of the parameters, reasonably good estimates can be obtained for these models even when the sample size is limited. However, it remains unclear how to justify these assumptions in certain scenarios.

The purpose of this thesis is to narrow the gap between theory and practice in high-dimensional statistics by studying some of the more widely adopted assumptions in the literature and by introducing new testing procedures. More specifically, we cover $l_1$-regularized estimation for time series and testing for the sparse Gaussian graphical model.

In the first chapter, we explore the application of $l_1$-regularized regression methods to Gaussian vector autoregressive processes. We decompose the classical regression model into smaller submodels and obtain sparse solutions by applying $l_1$-penalties. We show that, under mild conditions, the design matrices corresponding to the submodels are generated by $\alpha$-mixing processes. A more general problem is therefore to study the performance of $l_1$-regularized methods for a linear model whose random design matrix is generated by an $\alpha$-mixing Gaussian process with an exponential decay rate. Our main result verifies the restricted eigenvalue assumption for the mixing random design using the generic chaining technique and derives $l_p$ error bounds for the Lasso and Dantzig selectors. We also study sufficient conditions on a VAR(p) model that guarantee a tight error bound for the solutions, and we discuss how to select the order of the model. Finally, we illustrate the variable selection and estimation performance of the Lasso through several sets of simulations.

In the second chapter, we propose a new statistic for testing the decomposable structure of a Gaussian graphical model in the high-dimensional setting. The statistic is based on the eigenvalues of the sample covariance matrix. When the null hypothesis corresponds to a group independence structure, we derive the asymptotic distribution of the proposed statistic and show that it is invariant under non-singular linear transformations within each group. When testing an arbitrary decomposable structure, a simple asymptotic distribution of the statistic is not available, so we suggest a simulation-based method to approximate the null distribution and calculate the corresponding $p$-value. We also study the computational complexity of the proposed methods and give some suggestions on how to improve their performance. In the last section, we give numerical results, including both simulations and an empirical example, to study the proposed testing procedure in different scenarios.
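
As a rough illustration of the decomposition into submodels described for the first chapter, one common way to write a VAR(p) model as separate $l_1$-penalized regressions is sketched below. The notation ($X_t$, $A_k$, $\varepsilon_t$, $Z_t$, $\beta_j$, $\lambda$, $d$, $n$) is assumed here for illustration and is not taken from the thesis, whose exact formulation and normalization may differ.

\[
X_t = \sum_{k=1}^{p} A_k X_{t-k} + \varepsilon_t, \qquad X_t \in \mathbb{R}^{d}, \quad t = p+1, \dots, n.
\]

Writing $Z_t = (X_{t-1}^\top, \dots, X_{t-p}^\top)^\top \in \mathbb{R}^{dp}$ and letting $\beta_j \in \mathbb{R}^{dp}$ denote the $j$-th row of $(A_1, \dots, A_p)$, each coordinate gives its own linear submodel, to which a Lasso-type estimator can be applied:

\[
X_{t,j} = Z_t^\top \beta_j + \varepsilon_{t,j}, \qquad
\hat{\beta}_j \in \arg\min_{\beta \in \mathbb{R}^{dp}} \; \frac{1}{n-p} \sum_{t=p+1}^{n} \bigl( X_{t,j} - Z_t^\top \beta \bigr)^2 + \lambda \|\beta\|_1, \qquad j = 1, \dots, d.
\]

In this form, the rows $Z_t^\top$ of the design matrix are lagged values of the process itself, so they are dependent across $t$, which is why a condition such as $\alpha$-mixing on the design becomes relevant.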