Prokaryotic Gene Start Prediction: Algorithms for Genomes and Metagenomes

Gemayel, Karl

dc.contributor.advisor	Borodovsky, Mark
dc.contributor.author	Gemayel, Karl
dc.contributor.committeeMember	Catalyurek, Umit
dc.contributor.committeeMember	Chau, Duen Horng
dc.contributor.committeeMember	Qiu, Peng
dc.contributor.committeeMember	Jordan, King
dc.contributor.department	Computational Science and Engineering
dc.date.accessioned	2021-01-11T17:12:13Z
dc.date.available	2021-01-11T17:12:13Z
dc.date.created	2020-12
dc.date.issued	2020-12-01
dc.date.submitted	December 2020
dc.date.updated	2021-01-11T17:12:13Z
dc.description.abstract	Prokaryotic gene prediction is the task of finding genes in archaeal or bacterial DNA sequences. These genomes consist of alternating gene-coding and non-coding regions, meaning the task is solved by determining the start and end points of each gene in the DNA sequence, with gene-start prediction generally considered to be more difficult. The primary focus of this work is to improve gene-start prediction accuracy and our understanding of the biological translation-initiation mechanisms used to mark and determine gene-starts. There are two challenges that characterize this task. First, ground-truth, experimentally verified gene-starts are only available for a very small set of genes, and second, our knowledge of translation-initiation mechanisms is incomplete and quite often misleading. Three motivating questions arise from these challenges and are addressed in this work. First, how can we predict gene-starts in a DNA sequence without relying on ground-truth data and without any prior biological knowledge of that species? I show how simplifying assumptions about translation-initiation mechanisms biased the design of existing gene-finder algorithms, hindering their predictive performance. I present GeneMarkS-2, an algorithm that relaxes those assumptions and learns more accurate representations of these mechanisms, thereby achieving more accurate predictions. Using it, I provide an updated view of the diversity of translation-initiation mechanisms across the prokaryotic domain. GeneMarkS-2 is now used by the National Center for Biotechnology Information (NCBI) to annotate its database of more than two hundred thousand prokaryotic genomes. Second, how can we measure the accuracy of gene-start prediction without access to ground-truth data? I show that the accuracy of existing methods measured on the limited set of verified data does not generalize to the much larger and more diverse set of available genes.
This shows that these benchmark sets of verified starts are not representative enough for this task. I describe an alternative method to boost prediction performance for genes outside the ground-truth set by effectively filtering low-certainty predictions. This is done by only selecting gene-start predictions that are corroborated by multiple, independent sources of evidence. As part of this method, I propose StartLink, a new comparative-genomics approach to gene-start prediction that compares DNA fragments from multiple species rather than relying solely on a single genome. Third, how can we predict gene-starts for metagenomes, i.e., cases where frequently only part of the DNA sequence is available? Here, I describe how the mechanisms for gene-start prediction developed for GeneMarkS-2 can be ported to metagenomes, which often have short DNA fragments that hinder the performance of predictive methods. I present MetaGeneMarkS, and show that it achieves accuracies on metagenomes close to those achieved by GeneMarkS-2 on fully sequenced DNA. Several recurring themes appear throughout this work. Understanding the limits of our knowledge of translation-initiation mechanisms proves essential to designing better models and opens a field for new exploration of the diversity of these mechanisms. Furthermore, our unhealthy dependence on verified gene-starts for measuring performance has prevented, and continues to prevent, us from accurately portraying the quality of our predictors, despite the >95% average accuracy levels measured on this set. It is therefore critical to restate that gene-start prediction is still an open problem.
dc.description.degree	Ph.D.
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/1853/64155
dc.language.iso	en_US
dc.publisher	Georgia Institute of Technology
dc.subject	Machine learning
dc.subject	Computational biology
dc.subject	Bioinformatics
dc.subject	Gene finding
dc.subject	Prokaryotes
dc.subject	Leaderless transcription
dc.subject	RBS
dc.subject	Promoter
dc.title	Prokaryotic Gene Start Prediction: Algorithms for Genomes and Metagenomes
dc.type	Text
dc.type.genre	Dissertation
dspace.entity.type	Publication
local.contributor.advisor	Borodovsky, Mark
local.contributor.corporatename	College of Computing
local.contributor.corporatename	School of Computational Science and Engineering
relation.isAdvisorOfPublication	fa975b84-f807-4cec-93a6-9df633afb791
relation.isOrgUnitOfPublication	c8892b3c-8db6-4b7b-a33a-1b67f7db2021
relation.isOrgUnitOfPublication	01ab2ef1-c6da-49c9-be98-fbd1d840d2b1
thesis.degree.level	Doctoral