Analysis of High-dimensional Data with Variable Clustering and Selection

Author(s)
Yang, Sheng-Tao
Abstract
Chapter 1 proposes a decision-making procedure, Human-in-the-Loop Clustering and Representative Selection (HITL-CARS), that incorporates users’ domain knowledge into the analysis of high-dimensional datasets. The proposed method simultaneously clusters strongly linearly correlated variables and estimates a linear regression model using only a few selected cluster representatives and independent variables. After users examine the CARS analysis results and offer advice based on their domain knowledge, HITL-CARS refines the analysis to account for their input. To optimize the CARS and HITL-CARS procedures, an algorithm is provided for solving the underlying mixed-integer programming problem based on penalized likelihood. Simulation studies compare the performance of CARS with other two-stage variable clustering and selection methods. A real-life example of brain mapping data shows that HITL-CARS can aid in discovering important brain regions associated with depression symptoms and provide predictive analytics based on cluster representatives. Chapter 2 studies the large-sample properties of an adaptive Clustering and Representative Selection (aCARS) procedure in ultrahigh-dimensional scenarios, where the number of variables increases exponentially with the sample size. This chapter investigates the conditions under which aCARS consistently selects important representatives and variables, consistently recovers the true clusters, and achieves oracle properties for regression parameter estimation. Moreover, because aCARS uses cluster information to reduce the dimensionality of the variable space, the manuscript explores how large the dimensionality can grow while preserving aCARS’ large-sample properties. Lastly, since aCARS selects at most one variable from each cluster, the studies investigate how aCARS relaxes the usual conditions on multicollinearity.
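The clustering-and-representative idea can be illustrated with a simplified two-stage sketch. Note that CARS itself performs clustering and selection simultaneously via a penalized-likelihood mixed-integer program; the greedy correlation grouping, the 0.9 threshold, and the toy data below are all hypothetical illustrations, not the dissertation’s actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 3 latent factors, each spawning a block of
# strongly linearly correlated variables.
n, block = 50, 4
Z = rng.normal(size=(n, 3))
X = np.hstack([Z[:, [k]] + 0.05 * rng.normal(size=(n, block))
               for k in range(3)])              # 12 variables, 3 blocks
y = 2.0 * Z[:, 0] - 1.5 * Z[:, 1] + rng.normal(size=n)

# Stage 1: greedy correlation clustering -- group variables whose
# absolute Pearson correlation with the seed exceeds a threshold.
R = np.abs(np.corrcoef(X, rowvar=False))
unassigned = list(range(X.shape[1]))
clusters = []
while unassigned:
    seed = unassigned.pop(0)
    members = [seed] + [j for j in unassigned if R[seed, j] > 0.9]
    unassigned = [j for j in unassigned if j not in members]
    clusters.append(members)

# Stage 2: one representative per cluster -- the variable most
# correlated with its cluster's mean signal.
reps = []
for members in clusters:
    mean_sig = X[:, members].mean(axis=1)
    corrs = [abs(np.corrcoef(X[:, j], mean_sig)[0, 1]) for j in members]
    reps.append(members[int(np.argmax(corrs))])

# Stage 3: ordinary least squares on the representatives only.
Xr = np.column_stack([np.ones(n), X[:, reps]])
beta, *_ = np.linalg.lstsq(Xr, y, rcond=None)
print("clusters:", clusters, "representatives:", reps)
```

Two-stage pipelines like this one are exactly the baselines CARS is compared against; performing the two stages jointly lets the regression fit inform which representative each cluster contributes.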
Chapter 1 thus develops HITL-CARS, which incorporates users’ domain knowledge to refine CARS for analyzing high-dimensional data with a small sample size, and Chapter 2 investigates the large-sample properties of aCARS by studying its asymptotic behavior as the dimensionality grows exponentially with the sample size. However, the finite-sample properties of CARS and aCARS have not yet been discussed, so practitioners do not know the practical advantages of using CARS to analyze high-dimensional data. Chapter 3 systematically investigates the finite-sample performance of aCARS and CARS, focusing on their ability to handle ultrahigh dimensionality and strong multicollinearity, and on the importance of hyperparameter tuning. We conduct a series of simulation studies and real-world data analyses that compare CARS and aCARS with related, widely used variable selection methods such as the lasso, adaptive lasso, SCAD, and MCP. In particular, the simulation settings focus on ultrahigh-dimensional and strongly multicollinear data, where the number of variables grows exponentially with the sample size and some variables exhibit Pearson correlations exceeding 0.95. Moreover, we provide practical guidance on using the High-dimensional Bayesian Information Criterion (HBIC) to tune hyperparameters efficiently in the aCARS and CARS procedures. To assess performance, we provide evaluation metrics for (i) clustering, (ii) representative and variable selection, and (iii) prediction, so that the simulation results systematically demonstrate the strengths of CARS and aCARS. In summary, Chapter 3 demonstrates the applicability of aCARS and CARS to real-world (finite-sample) data characterized by challenges such as ultrahigh dimensionality, multicollinearity, and hyperparameter tuning, thereby offering valuable insights for statisticians and data analysts.
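HBIC-style tuning on strongly multicollinear data can be sketched as follows. This is a hedged illustration only: the marginal-covariance screening below is a stand-in for the actual CARS/aCARS penalized-likelihood path, and the C_n = log(log n) factor is one common choice from the HBIC literature, not necessarily the dissertation’s exact formula:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setting: p >> n, with one near-duplicate column pair
# whose Pearson correlation exceeds 0.95 (strong multicollinearity).
n, p = 100, 1000
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)    # corr(X0, X1) near 0.995
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(size=n)

def hbic(rss, df):
    # HBIC-style criterion: log(sigma^2_hat) + df * C_n * log(p) / n,
    # with C_n = log(log n) -- an assumed common choice, not necessarily
    # the dissertation's formula.
    return np.log(rss / n) + df * np.log(np.log(n)) * np.log(p) / n

# Rank variables by absolute marginal covariance with y (a stand-in
# for the penalized-likelihood path), then let HBIC pick the size.
score = np.abs(X.T @ (y - y.mean())) / n
order = np.argsort(score)[::-1]
best_crit, best_support = np.inf, None
for k in range(1, 11):
    S = order[:k]
    Xs = np.column_stack([np.ones(n), X[:, S]])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = float(np.sum((y - Xs @ beta) ** 2))
    crit = hbic(rss, k)
    if crit < best_crit:
        best_crit, best_support = crit, set(S.tolist())
print("HBIC-selected support:", sorted(best_support))
```

The log(p)/n penalty grows with the dimensionality, which is what allows a BIC-type criterion to remain selection-consistent when p increases exponentially with n.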
In particular, our studies reveal that other methods struggle when faced with ultrahigh dimensionality and strong multicollinearity. In contrast, aCARS consistently clusters strongly correlated variables, selects important variables, and excludes unimportant variables, resulting in the lowest prediction error.
Date
2023-11-28
Resource Type
Text
Resource Subtype
Dissertation