Advanced machine learning approaches for characterization of transcriptional regulatory elements and genome-wide associations

Hassanzadeh, Hamid Reza

dc.contributor.advisor	Isbell, Charles L.
dc.contributor.advisor	Gibson, Greg
dc.contributor.author	Hassanzadeh, Hamid Reza
dc.contributor.committeeMember	Qiu, Peng
dc.contributor.committeeMember	Dovrolis, Constantine
dc.contributor.committeeMember	Tsygankov, Denis
dc.contributor.department	Interactive Computing
dc.date.accessioned	2020-05-20T17:00:38Z
dc.date.available	2020-05-20T17:00:38Z
dc.date.created	2020-05
dc.date.issued	2020-03-20
dc.date.submitted	May 2020
dc.date.updated	2020-05-20T17:00:38Z
dc.description.abstract	The deep learning revolution has initiated a surge of remarkable achievements in diverse research areas where large volumes of data that underlie complex processes exist. Despite the successful application of deep models to certain problems in the biomedical and bioinformatics domains, the field has yet to show comparable promise on many other challenging problems that involve genomic complexity. The goal of my Ph.D. research has been to develop advanced machine learning techniques to address two such problems in bioinformatics: the characterization of transcriptional regulatory elements, and the modeling of genome-wide associations and linkage disequilibrium using genomic and evolutionary annotations of variants. The genome codes for almost all biological phenomena that take place inside living cells. One key interaction is the association between transcription factors and a number of degenerate binding sites on DNA, which facilitates the initiation of gene transcription. While each protein can potentially bind to any site on the DNA, it is the strength of this binding that plays the key role in the initiation process. Predicting these binding sites and their binding affinities are two interesting yet challenging problems that remain largely unsolved, even though cellular machinery constantly identifies such sites on DNA with near-perfect accuracy. The last two decades have witnessed the development of multiple in vivo and in vitro high-throughput technologies for elucidating these interactions. Protein Binding Microarrays (PBMs) have been among the most effective in vitro technologies developed so far. The results of PBM experiments, however, are not easily interpretable and require advanced downstream analysis tools to discover binding patterns.
In the first half of my thesis, I develop a series of computational methods that learn such patterns from data generated by this technology, using tools and techniques from the natural language and image processing domains. I also show the superiority of my proposed pipelines in predicting binding patterns and affinities. The second part of my thesis is devoted to developing methods for modeling genome-wide associations and linkage disequilibrium. Both tasks pose similar challenges that restrict our ability to utilize recent advances in deep learning research. Specifically, when dealing with GWA studies, we are often bound by the high dimensionality of variant data, a significant degree of missing information (i.e., missing heritability), weak and highly complex patterns to learn, and relatively small datasets. As a consequence, the state-of-the-art approaches for GWAS used in practice are variations of linear models. In my thesis, I show that part of the failure to learn higher-capacity models can be attributed to how such models are trained. Specifically, I show that, using Siamese networks and tools from graph theory, we can achieve performance higher than or on par with state-of-the-art Bayesian non-parametric approaches. Having succeeded in learning weak relationships with the proposed model, I then extend my approach to show that there is a previously unknown relationship between variant annotations and their underlying haplotype structure. The existence of such a relationship can increase the power of GWA models and, if confirmed biologically, will have important implications for population genetics.
dc.description.degree	Ph.D.
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/1853/62784
dc.language.iso	en_US
dc.publisher	Georgia Institute of Technology
dc.subject	Deep learning
dc.subject	Genome-wide association studies (GWAS)
dc.subject	Transcription factor binding site modeling
dc.title	Advanced machine learning approaches for characterization of transcriptional regulatory elements and genome-wide associations
dc.type	Text
dc.type.genre	Dissertation
dspace.entity.type	Publication
local.contributor.advisor	Isbell, Charles L.
local.contributor.advisor	Gibson, Greg
local.contributor.corporatename	College of Computing
local.contributor.corporatename	School of Interactive Computing
relation.isAdvisorOfPublication	3f357176-4c4b-402c-8b61-ec18ffb083a6
relation.isAdvisorOfPublication	5606ef18-bd5a-4b7b-b3fc-05821bf66602
relation.isOrgUnitOfPublication	c8892b3c-8db6-4b7b-a33a-1b67f7db2021
relation.isOrgUnitOfPublication	aac3f010-e629-4d08-8276-81143eeaf5cc
thesis.degree.level	Doctoral