Analytics and Machine Learning for Healthcare Data

Pujol Mitchell, Toyya A.
Serban, Nicoleta
The volume of data is expected to grow faster in healthcare than in any other industry. This creates demand for rigorous analytics and machine learning methods that can be applied to large health data sets. These data sets come with privacy protections that limit data visibility and release, which can cause unique complications for analysis and restrict the use of out-of-the-box solutions. Healthcare research also has exceptionally high stakes: it can be the difference between life and death, or can have a major impact on an individual’s quality of life. For these reasons, the development of statistically sound technical solutions is all the more critical. This thesis focuses on the application of analytics and machine learning to applied research problems involving healthcare data. Chapter 1 introduces each study in the thesis and presents the research objectives and contributions. The chapter also discusses the value of the methods used in each study and the benefits of using administrative claims data. In Chapter 2, we determine the level of uptake by clinicians of new CDC contraceptive recommendations, the Medical Eligibility Criteria (MEC). Using Medicaid claims data from 14 states, the study included Medicaid-enrolled women of reproductive age during the two years before and the two years after the MEC release. We focused on two outcome measures: (1) overall contraception use and (2) the use of CDC-recommended contraception (i.e., the methods of highest efficacy). We evaluated each outcome for the entire study population and by health condition. The ratio of the after-guideline rate to the before-guideline rate was used to test whether the change in MEC uptake was statistically significant. The results showed an increase in overall contraception use among women with these health conditions, both collectively and for each condition individually. 
However, the results also showed that the use of the highest-efficacy methods increased overall but not for every condition. The chapter gives suggestions for further increasing the use of the highest-efficacy methods within this population. In Chapter 3, we assess the health and wellness outcomes of infants born to adolescent mothers. Our nationwide study assesses the association between adolescent pregnancy and the health and wellness of infants within their first year of life. Each infant in the study group (infants born to adolescent mothers) is matched with an infant in the control group (infants born to adult mothers) based on the mother’s demographics. The outcomes assessed are low birth weight, substance exposure, foster care, health status, mortality, emergency department visits, and wellness visits. The results suggested differences between the two groups, especially for emergency department visits. However, the differences were not as drastic as those found in previous research, which suggests, promisingly, that the gap between these two groups may be closing. The chapter also includes recommendations to support adolescent mothers. In Chapter 4, we assess statistical learning methods for a difference-in-differences (DID) study setting. These analyses rely on parametric statistical models that make strong assumptions about the unknown underlying functional form of the data. In this study, we extend existing statistical machine learning methods to target a DID parameter, defined nonparametrically, while considering a larger nonparametric model space that makes fewer assumptions. We develop a general framework for DID designs that allows researchers to estimate causal or statistical effect quantities using machine learning while providing statistical inference. We demonstrate its performance through a simulation in which we compare it to more traditional methods. The project applies the method to estimate the effects of episode-based bundled payments on perinatal spending. 
In the study, we find that our approach reduces bias by upward of 50% for lower effect sizes. Chapter 5 applies machine learning to the problem of edge weight estimation for social networks. Social network analysis can be used to visualize, quantify, and assess relationships between two entities. Within healthcare, social networks can be used to quantify the impact of social influence on healthcare interventions. Algorithms have been used to predict information on social networks, such as edge existence, or similarity measures, such as common neighbors. However, little research focuses on weighted graphs, and even less on the estimation of their edge weights. Accurate weight estimation can serve as a data quality tool, either to check whether the weights in the data are correct or to indicate where we would expect new stronger (or weaker) relationships to occur next. This study evaluates the performance of three estimators, including an ensemble machine learning approach, for predicting the edge weights of a weighted social network. We use a faculty hiring example to compare the three methods. Chapter 6 concludes the thesis. It discusses the overall impact of the research with respect to healthcare policy and techniques for administrative claims data, and it proposes future work and additional applications.
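For readers unfamiliar with the difference-in-differences design discussed in Chapter 4, the following minimal sketch shows the classical two-group, two-period estimator computed from group means. This is the traditional baseline, not the nonparametric machine learning framework the thesis develops. The function name `did_estimate` and all spending values are hypothetical and chosen only for illustration.

```python
from statistics import mean


def did_estimate(pre_treated, post_treated, pre_control, post_control):
    """Classical 2x2 difference-in-differences on group means.

    Returns (change in treated group) minus (change in control group).
    This is the textbook estimator, assumed here for illustration only.
    """
    treated_change = mean(post_treated) - mean(pre_treated)
    control_change = mean(post_control) - mean(pre_control)
    return treated_change - control_change


# Hypothetical per-episode spending (arbitrary units), not data from the thesis.
pre_treated = [10.0, 12.0, 11.0]    # treated group before the policy
post_treated = [9.0, 10.0, 8.0]     # treated group after the policy
pre_control = [10.0, 11.0, 12.0]    # comparison group before the policy
post_control = [11.0, 12.0, 13.0]   # comparison group after the policy

effect = did_estimate(pre_treated, post_treated, pre_control, post_control)
print(effect)
```

Under the usual parallel-trends assumption, the returned value is read as the effect of the policy on the treated group; a negative value here would indicate lower spending relative to the comparison group's trend.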