Health data mining using tensor factorization: Methods and applications

Perros, Ioakeim
Sun, Jimeng
The increasing volume and availability of healthcare and biomedical data are opening up new opportunities for the use of computational methods to improve health. However, the data are diverse, multidimensional and sparse, posing challenges to the extraction of clinically meaningful relations and interactions. For example, the electronic health records (EHRs) of patients contain time-stamped occurrences of diverse features (e.g., diagnoses, medications, procedures) as well as information about relationships among different types of features (e.g., identifying the subset of medications prescribed to treat a certain diagnosis). Such EHR data can be utilized to identify patient cohorts sharing common conditions without expert supervision, a task known as unsupervised phenotyping. Tensors, which generalize matrices to higher orders, can naturally express the multidimensional data relationships inherent in the EHR. Tensor factorization encompasses a set of tools that can capture the latent correlation structure among diverse feature sets. For example, in the context of phenotyping, tensor factorization can be utilized to identify clinically meaningful patient groups, along with succinct feature profiles distinguishing one group of patients from another. In this dissertation, we show how tensor factorization can be leveraged to tackle several important problems in healthcare and biomedicine. We also identify multiple significant methodological challenges in fully harnessing the capacity of tensor factorization for the problems at hand and develop algorithms to tackle them. In particular, we focus on the following problems:
- Drug-perturbed, tissue-specific gene expression prediction, where we demonstrate how tensor factorization can be used to model the interactions between drugs, genes and tissues in an efficient manner.
- Unsupervised phenotyping through EHRs, in the context of which we advance existing tensor factorization methods so that: a) they are fast and scalable enough for large cohorts of hundreds of thousands of patients; and b) they yield interpretable output that is easy to communicate to a clinical expert.
- Automating understanding of physician desktop work, where we demonstrate how tensor factorization can be used to substantially compress EHR audit logs, offering an intuitive categorization of user actions that can be used for workflow analysis.
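
As a rough illustration of the idea of expressing EHR data as a tensor and factorizing it, the sketch below builds a small patient x diagnosis x medication count tensor and fits a rank-2 non-negative CP (CANDECOMP/PARAFAC) model with generic multiplicative updates. It is not the algorithm developed in the dissertation; the events, the function names (`index`, `gram`, `update`, `error`, `factorize`), and the choice of rank and iteration count are all assumptions made for illustration.

```python
import random

# Hypothetical toy EHR events: (patient, diagnosis, medication) co-occurrences.
events = [
    ("p1", "diabetes", "metformin"),
    ("p1", "diabetes", "insulin"),
    ("p2", "diabetes", "metformin"),
    ("p3", "hypertension", "lisinopril"),
    ("p4", "hypertension", "lisinopril"),
    ("p4", "hypertension", "amlodipine"),
]


def index(values):
    """Map each distinct label to an integer index, in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


pat = index(e[0] for e in events)
dx = index(e[1] for e in events)
med = index(e[2] for e in events)

# Sparse 3-mode count tensor stored as {(patient, diagnosis, medication): count}.
X = {}
for p, d, m in events:
    key = (pat[p], dx[d], med[m])
    X[key] = X.get(key, 0.0) + 1.0


def gram(M, rank):
    """Return the rank x rank Gram matrix M^T M."""
    return [[sum(row[r] * row[s] for row in M) for s in range(rank)]
            for r in range(rank)]


def update(F, others, mode, rank, eps=1e-9):
    """Apply one multiplicative update to factor F of the given mode, in place."""
    O1, O2 = others
    # Numerator: sparse matricized-tensor times Khatri-Rao product.
    num = [[0.0] * rank for _ in F]
    for idx, v in X.items():
        rest = [idx[m] for m in range(3) if m != mode]
        for r in range(rank):
            num[idx[mode]][r] += v * O1[rest[0]][r] * O2[rest[1]][r]
    # Denominator: F times the elementwise product of the other Gram matrices.
    G1, G2 = gram(O1, rank), gram(O2, rank)
    G = [[G1[r][s] * G2[r][s] for s in range(rank)] for r in range(rank)]
    for i, row in enumerate(F):
        den = [sum(row[s] * G[s][r] for s in range(rank)) for r in range(rank)]
        F[i] = [row[r] * num[i][r] / (den[r] + eps) for r in range(rank)]


def error(A, B, C, rank):
    """Squared Frobenius error between X and its CP reconstruction."""
    total = 0.0
    for i in range(len(A)):
        for j in range(len(B)):
            for k in range(len(C)):
                approx = sum(A[i][r] * B[j][r] * C[k][r] for r in range(rank))
                total += (X.get((i, j, k), 0.0) - approx) ** 2
    return total


def factorize(rank=2, iters=200, seed=0):
    """Fit non-negative factors A (patients), B (diagnoses), C (medications)."""
    rng = random.Random(seed)
    A = [[rng.random() + 0.1 for _ in range(rank)] for _ in pat]
    B = [[rng.random() + 0.1 for _ in range(rank)] for _ in dx]
    C = [[rng.random() + 0.1 for _ in range(rank)] for _ in med]
    start = error(A, B, C, rank)
    for _ in range(iters):
        update(A, (B, C), 0, rank)
        update(B, (A, C), 1, rank)
        update(C, (A, B), 2, rank)
    return A, B, C, start, error(A, B, C, rank)


A, B, C, start_err, final_err = factorize()
```

Each column of `A`, `B` and `C` can be read together as one candidate phenotype: the patients, diagnoses and medications with the largest weights in that column. Keeping the factors non-negative is one common way to make such profiles easier to interpret, though it is only one of several possible design choices.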