Title:
Prokaryotic Gene Start Prediction: Algorithms for Genomes and Metagenomes

dc.contributor.advisor Borodovsky, Mark
dc.contributor.author Gemayel, Karl
dc.contributor.committeeMember Catalyurek, Umit
dc.contributor.committeeMember Chau, Duen Horng
dc.contributor.committeeMember Qiu, Peng
dc.contributor.committeeMember Jordan, King
dc.contributor.department Computational Science and Engineering
dc.date.accessioned 2021-01-11T17:12:13Z
dc.date.available 2021-01-11T17:12:13Z
dc.date.created 2020-12
dc.date.issued 2020-12-01
dc.date.submitted December 2020
dc.date.updated 2021-01-11T17:12:13Z
dc.description.abstract Prokaryotic gene prediction is the task of finding genes in archaeal or bacterial DNA sequences. These genomes consist of alternating gene-coding and non-coding regions, meaning the task is solved by determining the start and end points of each gene in the DNA sequence, with gene-start prediction generally considered to be more difficult. The primary focus of this work is to improve gene-start prediction accuracy and our understanding of the biological translation-initiation mechanisms used to mark and determine gene-starts. There are two challenges that characterize this task. First, ground-truth, experimentally verified gene-starts are only available for a very small set of genes, and second, our knowledge of translation-initiation mechanisms is incomplete and quite often misleading. Three motivating questions arise from these challenges and are addressed in this work. First, how can we predict gene-starts in a DNA sequence without relying on ground-truth data and without any prior biological knowledge of that species? I show how simplifying assumptions about translation-initiation mechanisms biased the design of existing gene-finder algorithms, hindering their predictive performance. I present GeneMarkS-2, an algorithm that relaxes those assumptions and learns more accurate representations of these mechanisms, thereby achieving more accurate predictions. Using it, I provide an updated view of the diversity of translation-initiation mechanisms across the prokaryotic domain. GeneMarkS-2 is now used by the National Center for Biotechnology Information (NCBI) to annotate its database of more than two hundred thousand prokaryotic genomes. Second, how can we measure the accuracy of gene-start prediction without access to ground-truth data? I show that the accuracy of existing methods measured on the limited set of verified data does not generalize to the much larger and more diverse set of available genes. This shows that these benchmark sets of verified starts are not representative enough for this task. I describe an alternative method to boost prediction performance for genes outside the ground-truth set by effectively filtering low-certainty predictions. This is done by only selecting gene-start predictions that are corroborated by multiple, independent sources of evidence. As part of this approach, I propose StartLink, a new comparative genomics approach for gene-start prediction; that is, comparing DNA fragments from multiple species rather than relying solely on a single genome. Third, how can we predict gene-starts for metagenomes, i.e., cases where frequently only part of the DNA sequence is available? Here, I describe how the mechanisms for gene-start prediction developed for GeneMarkS-2 can be ported to metagenomes, which often have short DNA fragments that hinder the performance of predictive methods. I present MetaGeneMarkS, and show that it achieves accuracies on metagenomes close to those achieved by GeneMarkS-2 on fully sequenced DNA. Several recurring themes appear throughout this work. Understanding the limits of our knowledge of translation-initiation mechanisms proves essential to designing better models and provides an open field of new exploration of the diversity of these mechanisms. Furthermore, our unhealthy dependence on verified gene-starts for measuring performance has prevented, and continues to prevent, us from accurately portraying the quality of our predictors, despite the >95% average accuracy levels measured on this set. It is therefore critical to restate that gene-start prediction is still an open problem.
dc.description.degree Ph.D.
dc.format.mimetype application/pdf
dc.identifier.uri http://hdl.handle.net/1853/64155
dc.language.iso en_US
dc.publisher Georgia Institute of Technology
dc.subject Machine learning
dc.subject Computational biology
dc.subject Bioinformatics
dc.subject Gene finding
dc.subject Prokaryotes
dc.subject Leaderless transcription
dc.subject RBS
dc.subject Promoter
dc.title Prokaryotic Gene Start Prediction: Algorithms for Genomes and Metagenomes
dc.type Text
dc.type.genre Dissertation
dspace.entity.type Publication
local.contributor.advisor Borodovsky, Mark
local.contributor.corporatename College of Computing
local.contributor.corporatename School of Computational Science and Engineering
relation.isAdvisorOfPublication fa975b84-f807-4cec-93a6-9df633afb791
relation.isOrgUnitOfPublication c8892b3c-8db6-4b7b-a33a-1b67f7db2021
relation.isOrgUnitOfPublication 01ab2ef1-c6da-49c9-be98-fbd1d840d2b1
thesis.degree.level Doctoral
Files
Original bundle
Name:
GEMAYEL-DISSERTATION-2020.pdf
Size:
3.68 MB
Format:
Adobe Portable Document Format
License bundle
Name:
LICENSE.txt
Size:
3.86 KB
Format:
Plain Text