Tools for interactive analysis of metagenomic read recruitment, sensitive detection and quantification of genes in metagenomes, and rapid, large scale genome relatedness estimation

Author(s)
Gerhardt, Kenji Allen
Associated Organization(s)
Organizational Unit
School of Biological Sciences
School established in 2016 with the merger of the Schools of Applied Physiology and Biology
Abstract
In the modern era of genomics, nearly all genomic research ultimately flows through bioinformatic software. However, the rate at which biological sequence data is generated has outpaced increases in the availability of computational power. This has created an environment where most genome analysis programs must operate in the world of big data. Solving problems at large scales requires a fundamentally different approach and mindset than solving problems at smaller scales, and all modern bioinformatic software must be designed with solutions to this problem in mind. Software methods must be fast and efficient with computational resources to process modern datasets at all, and automation – which by necessity accounts for an ever-growing fraction of the process of genomic analysis – must be robust and able to translate increasingly complex and multifaceted data into clear, transparent, and comprehensible signal for researchers. Multiple categories of solutions exist for these problems. Software can be exceptionally quick and overcome scale with efficiency. Software can implement novel algorithms that achieve goals similar or identical to those of a previous approach but escape its key limitations. Software can limit the scope of the hypotheses it seeks to render testable, eschewing broad utility to serve as a scalpel. Finally, software can summarize vast quantities of data into more digestible forms that are amenable to analyses which would not ordinarily scale well. In this thesis, we present three software tools that explore this solution space and extend the toolkit of genome analysis.
A common way to assess the relative abundance of a genome in a sample is to map reads from a corresponding metagenome to it (so-called read recruitment). This analysis can also reveal whether the reference is a good representative of the natural population in the sample and, if so, provide hints about population structure, e.g., how clonal or diverse the population is. However, existing tools for read recruitment plots are cumbersome to use, can visualize only one genome at a time, and are not interactive. In Chapter 2, we introduce RecruitPlotEasy, an interactive visualization program that allows for the rapid and detailed characterization of populations in metagenomes. RecruitPlotEasy addresses these limitations and empowers a researcher to manually curate many more populations than would otherwise be feasible and to communicate those results in a richly detailed manner.
A longstanding issue in metagenomics is the extraction of functional annotations from environments, using recovered genes to identify key functions. In some cases, this can even boil down to single key genes that are critical for the measurement of an important ecological process. In Chapter 3, we introduce ROCkI/O, a tool for the detection and quantification of target genes within metagenomes. ROCkI/O constructs gene-specific classification models that enable the rapid and extremely accurate retrieval of key genes – for example, environmentally important genes such as those involved in denitrification, or genes with biomedical significance such as antibiotic resistance genes – from metagenomic datasets. ROCkI/O adopts a narrow but tailorable scope in searching large collections of data, enabling searches that are faster, require fewer resources, and are more sensitive than competing methods, as long as the question being asked by researchers is sufficiently specific.
The comparison of novel genome sequences to those in public genome databases using genome similarity metrics such as whole genome aggregate amino acid identity (AAI) is a crucial step in determining the novelty of a genome and, if similar species exist, its function. AAI compares two genomes by averaging the amino acid similarity of their shared proteins and has a long history of use as a method of finding distant relatives (genus level or above). In Chapter 4, we introduce FastAAI v2, which improves upon the ultrafast AAI estimation of the original FastAAI by increasing its robustness to large misestimations of AAI in prokaryotes and by enabling the calculation of AAI in eukaryotes as well. In an era where the scale of public genome databases renders them difficult, slow, and expensive to search, FastAAI v2 overcomes the limitations of traditional AAI calculators through the use of a simple but effective alignment-free metric of genome similarity. FastAAI v2 provides a scalable solution for the estimation of relatedness among genomes of intermediate novelty, including a proactive application to eukaryotic organisms, which have been the targets of intense sequencing effort in recent years, and serves as a new, powerful tool in the toolkit of genome analysis.
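As a rough illustration of the AAI idea described in the abstract (averaging amino acid similarity over the proteins two genomes share), the following toy sketch may help. It is not the FastAAI v2 method, which the abstract describes as alignment-free. Everything here is an assumption made for illustration: the function names, the tiny made-up sequences, the use of percent identity as the similarity measure, and the premise that shared proteins are already matched by name and pre-aligned to equal length.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of matching positions between two pre-aligned, equal-length sequences.

    Assumption: a real AAI calculator would first align the proteins;
    this toy version simply compares position by position.
    """
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("toy example expects non-empty, equal-length sequences")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def toy_aai(genome_a: dict, genome_b: dict) -> float:
    """Average percent identity over proteins present in both genomes.

    Genomes are hypothetical dicts mapping a protein name to its sequence.
    """
    shared = genome_a.keys() & genome_b.keys()
    if not shared:
        raise ValueError("genomes share no proteins")
    total = sum(percent_identity(genome_a[name], genome_b[name]) for name in shared)
    return total / len(shared)


# Made-up sequences: only rpoB and gyrB are shared between the two genomes.
genome_a = {"rpoB": "MKTAYIAK", "gyrB": "MSNSYDSS", "recA": "MAIDENKQ"}
genome_b = {"rpoB": "MKTAYLAK", "gyrB": "MSNTYDAS", "dnaK": "MGKIIGID"}

print(toy_aai(genome_a, genome_b))  # (87.5 + 75.0) / 2 = 81.25
```

Proteins found in only one genome (recA and dnaK here) do not contribute, which is why the average is taken over two proteins rather than four.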
Date
2024-09-06
Resource Type
Text
Resource Subtype
Dissertation