Title:
Advanced machine learning approaches for characterization of transcriptional regulatory elements and genome-wide associations

dc.contributor.advisor Isbell, Charles L.
dc.contributor.advisor Gibson, Greg
dc.contributor.author Hassanzadeh, Hamid Reza
dc.contributor.committeeMember Qiu, Peng
dc.contributor.committeeMember Dovrolis, Constantine
dc.contributor.committeeMember Tsygankov, Denis
dc.contributor.department Interactive Computing
dc.date.accessioned 2020-05-20T17:00:38Z
dc.date.available 2020-05-20T17:00:38Z
dc.date.created 2020-05
dc.date.issued 2020-03-20
dc.date.submitted May 2020
dc.date.updated 2020-05-20T17:00:38Z
dc.description.abstract The deep learning revolution has initiated a surge of remarkable achievements in diverse research areas where large volumes of data underlying complex processes exist. Despite the successful application of deep models to certain problems in the biomedical and bioinformatics domains, the field has not yet shown promise in solving many other challenging problems that involve genomic complexity. The goal of my Ph.D. research has been to develop advanced machine learning techniques to address two such problems in bioinformatics: the characterization of transcriptional regulatory elements, and the modeling of genome-wide associations and linkage disequilibrium using genomic and evolutionary annotations of variants. The genome codes for almost all biological phenomena that take place inside living cells. One key interaction is the association between transcription factors and a number of degenerate binding sites on DNA that facilitate the initiation of gene transcription. While each protein can potentially bind to any site on the DNA, it is the strength of this binding that plays the key role in the initiation process. Predicting these binding sites and their binding affinities are two interesting yet challenging problems that remain largely unsolved, even though we know that cellular machinery constantly identifies such sites on DNA with near-perfect accuracy. The last two decades have witnessed the development of multiple in-vivo and in-vitro high-throughput technologies for elucidating these interactions. Protein Binding Microarrays (PBMs) have been among the most effective in-vitro technologies developed so far. The results of PBM experiments, however, are not easily interpretable and require advanced downstream analysis tools to discover the binding patterns.
In the first half of my thesis, I develop a series of computational methods that learn such patterns from data generated by this technology, using tools and techniques from the natural language and image processing domains. I also show the superiority of my proposed pipelines in predicting binding patterns and affinity. The second part of my thesis is devoted to developing methods for modeling genome-wide associations and linkage disequilibrium. Both tasks pose similar challenges that restrict our ability to utilize recent advances in deep learning research. Specifically, when dealing with genome-wide association (GWA) studies, we are often constrained by the high dimensionality of variant data, a significant degree of missing information (i.e., missing heritability), highly complex yet weak patterns to learn, and relatively small datasets. As a consequence, the state-of-the-art approaches for GWAS used in practice are variations of linear models. In my thesis, I show that part of the failure to learn higher-capacity models can be attributed to how such models are trained. Specifically, I show that by using Siamese networks and tools from graph theory we can achieve performance higher than or on par with state-of-the-art Bayesian non-parametric approaches. Having succeeded in learning weak relationships with the proposed model, I then extend my approach to show that there is a relationship between variant annotations and their underlying haplotype structure, which was not previously known. The existence of such a relationship can increase the power of GWA models and, if confirmed biologically, will have important implications for population genetics.
dc.description.degree Ph.D.
dc.format.mimetype application/pdf
dc.identifier.uri http://hdl.handle.net/1853/62784
dc.language.iso en_US
dc.publisher Georgia Institute of Technology
dc.subject Deep learning
dc.subject Genome-wide association studies (GWAS)
dc.subject Transcription factor binding site modeling
dc.title Advanced machine learning approaches for characterization of transcriptional regulatory elements and genome-wide associations
dc.type Text
dc.type.genre Dissertation
dspace.entity.type Publication
local.contributor.advisor Isbell, Charles L.
local.contributor.advisor Gibson, Greg
local.contributor.corporatename College of Computing
local.contributor.corporatename School of Interactive Computing
relation.isAdvisorOfPublication 3f357176-4c4b-402c-8b61-ec18ffb083a6
relation.isAdvisorOfPublication 5606ef18-bd5a-4b7b-b3fc-05821bf66602
relation.isOrgUnitOfPublication c8892b3c-8db6-4b7b-a33a-1b67f7db2021
relation.isOrgUnitOfPublication aac3f010-e629-4d08-8276-81143eeaf5cc
thesis.degree.level Doctoral
Files
Original bundle
Name:
HASSANZADEH-DISSERTATION-2020.pdf
Size:
13.55 MB
Format:
Adobe Portable Document Format