Bioinformatic Software Development For Large-Scale Microbial Genome Analysis with Applications in Disturbance Ecology and SAR11 Species Evolution
Author(s)
Zhao, Jianshu
Abstract
Genome comparison, search, and/or classification is a key step in microbiome studies and has recently become more challenging because the number of available database genomes keeps increasing and traditional methods do not scale well to large databases. By combining k-mer hashing-based probabilistic data structures (i.e., ProbMinHash, SuperMinHash, Densified MinHash, and SetSketch) for estimating genomic distance with a graph-based nearest neighbor search algorithm (Hierarchical Navigable Small World Graphs, or HNSW), I created and implemented a new data structure and developed two associated computer programs, BinDash 2 and GSearch, for genome comparison and search, respectively. In BinDash 2, I implemented b-bit one-permutation rolling MinHash with optimal/faster/re-randomized densification, accelerated with SIMD. BinDash 2 can perform 0.1 trillion (i.e., 10^11) pairwise genome comparisons in about 1.8 hours on a decent computer cluster or in several hours on a personal laptop. BinDash 2 is about 50% faster than the original version with similar accuracy. GSearch is orders of magnitude faster than alternative tools while maintaining high accuracy and low memory usage for genome search. For example, GSearch can classify 8,000 query genomes against all available microbial or viral genomes (n ≈ 318,000 or 3,000,000, respectively) within a few minutes on a personal laptop, using ~6 GB of memory or less (e.g., 2.5 GB using the SetSketch option). Notably, GSearch has O(log N) time complexity and will scale well to billions of database genomes based on a database splitting strategy. Further, GSearch implements a three-step classification strategy that depends on the degree of novelty of the query genomes to maximize specificity and sensitivity. Therefore, GSearch solves a major bottleneck in microbiome studies that require genome search and/or classification.
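To make the sketching idea concrete, the following is a minimal Python sketch of plain one-permutation MinHash over k-mers, under stated assumptions. It is not the BinDash 2 or GSearch implementation: it omits b-bit truncation, rolling hashing, densification, and SIMD, and it handles empty bins with the simple "skip jointly empty bins" estimator instead. The function names, the BLAKE2b hash, and the parameter defaults are illustrative choices, and the Jaccard-to-distance conversion is the commonly used Mash formula, D = -ln(2J / (1 + J)) / k.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Hash a k-mer to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def oph_sketch(seq, k=5, bins=64):
    """One-permutation sketch: split the hash range into equal bins and
    keep the minimum hash seen in each bin (None marks an empty bin)."""
    sketch = [None] * bins
    for km in kmers(seq, k):
        h = hash64(km)
        b = (h * bins) >> 64  # bin index in [0, bins - 1]
        if sketch[b] is None or h < sketch[b]:
            sketch[b] = h
    return sketch


def jaccard_estimate(a, b):
    """Estimate Jaccard similarity as matching bins divided by the number
    of bins that are non-empty in at least one of the two sketches."""
    matches = sum(1 for x, y in zip(a, b) if x is not None and x == y)
    jointly_empty = sum(1 for x, y in zip(a, b) if x is None and y is None)
    used = len(a) - jointly_empty
    return matches / used if used else 0.0


def mash_distance(j, k):
    """Convert a Jaccard estimate into a Mash-style distance in [0, 1]."""
    if j <= 0.0:
        return 1.0
    return max(0.0, min(1.0, -math.log(2.0 * j / (1.0 + j)) / k))
```

A typical use would be to sketch every database genome once, then compare sketches instead of sequences, for example `mash_distance(jaccard_estimate(oph_sketch(a), oph_sketch(b)), k=5)`. In a search setting, distances of this kind would serve as the metric for a nearest-neighbor graph index such as HNSW.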
This new data structure can also be applied to another problem, large-scale sequence or genome visualization/embedding, which I address in a new software package called annembed. Annembed demonstrates competitive accuracy compared to popular UMAP-like algorithms but is more than 10 times faster for large biological datasets. These software packages can be used in many real-world metagenomic and genomic studies to facilitate the investigation of ecological and evolutionary questions related to natural or engineered microbial communities.
Understanding how microbial populations respond to disturbances represents a major goal for microbial ecology. While several theories have been advanced to explain microbial community compositional changes in response to disturbances, appropriate data to test these theories are scarce, especially given the challenges in defining rare vs. abundant taxa and generalists vs. specialists, which are prerequisites for testing the theories. In Chapter 5, I define these two key concepts by employing the patterns of coverage of a target genome by a metagenome to identify rare populations and, by borrowing a concept from macroecology, the proportional similarity index (PS index), to identify generalists. Using these concepts, we found that coastal microbial communities are resilient to major perturbations such as tropical cyclones and (uncommon) cold or warm weather events, partly due to the response of rare populations, providing support for the insurance hypothesis (i.e., the rare biosphere has the buffering capacity to mitigate the effects of disturbances). Generalists appear to contribute proportionally more than specialists to community adaptation to perturbations, supporting the disturbance-specialization hypothesis (i.e., disturbance favors generalists). Taken together, our results advance understanding of the mechanisms governing microbial population dynamics under changing environmental conditions and have potential applications for ecosystem management.
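As an illustration of the proportional similarity idea, the sketch below computes the PS index in its standard macroecological form, PS = 1 - 0.5 * sum(|p_i - q_i|), where p_i is the fraction of a population's abundance found in sample (or habitat) i and q_i is the fraction of the reference availability in that sample. How the thesis defines the two vectors for metagenomic data is not stated in this abstract, so the function name, the inputs, and the interpretation of values near 1 as "generalist" are assumptions made for illustration only.

```python
def proportional_similarity(usage, availability):
    """Proportional similarity (PS) index between two non-negative vectors.

    usage: abundance of one population in each sample (assumed input).
    availability: reference amount for each sample, e.g. equal weights
        or total community abundance (assumed input).
    Returns a value in [0, 1]; values near 1 mean the population is
    spread across samples in proportion to availability.
    """
    if len(usage) != len(availability):
        raise ValueError("usage and availability must have the same length")
    total_u = sum(usage)
    total_a = sum(availability)
    if total_u <= 0 or total_a <= 0:
        raise ValueError("both vectors must have a positive sum")
    p = [u / total_u for u in usage]
    q = [a / total_a for a in availability]
    return 1.0 - 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
```

For example, a population found evenly across four equally weighted samples scores 1, while one confined to a single sample scores 0.25, which is the minimum possible for four equally weighted samples.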
In the final chapter of the thesis, I focus on one of the most abundant microbial groups on Earth, the marine SAR11. I studied the ecology and evolution of SAR11 in the oxygen minimum zone through a variety of cutting-edge 'omics approaches, such as single-cell genomics and metagenomics. Large-scale surveys of natural microbial communities (metagenomics) or isolate genomes have revealed species-like clusters around 95% Average Nucleotide Identity (ANI) of shared genes. That is, members of the same species tend to show greater than 96% ANI among themselves and less than 83% ANI to members of other species, with a clear scarcity (gap) of genome pairs showing between 83% and 96% ANI. In these surveys, members of the marine SAR11 order (Alphaproteobacteria, Pelagibacterales) have always been a conspicuous outlier, sometimes showing indiscrete species boundaries. I found that SAR11 does form sequence-discrete genomospecies, but their ANI gap is shifted to lower identities, i.e., between 86% and 91%, and the intraspecies ANI ranges between 91% and 100%, with a peak at 93%-94%. I then employed a recently developed bioinformatic methodology to measure recent gene exchange among these genomes. The results showed a higher frequency of homologous recombination within the genomospecies than between them (more than twice the effect of diversifying mutation), and showed that the recombination events are randomly distributed across the genome. Further, recombination is so frequent that it causes gene (as opposed to genome) sweeps for metabolic functions under strong selection, such as the respiratory nitrate reductase (NarG). Therefore, these results suggest that high ecological cohesiveness coupled with rampant horizontal gene transfer, mediated by homologous recombination, underlies the SAR11 genomospecies.
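The notion of an "ANI gap" can be illustrated with a small sketch that finds the widest interval containing no genome pairs in a list of pairwise ANI values. This is a simplified stand-in, not the method used in the thesis; the function name is hypothetical, and the values in the usage note are made up to mimic a gap between roughly 86% and 91% and are not data from the thesis.

```python
def widest_ani_gap(ani_values):
    """Return (lower, upper) bounds of the widest interval with no
    observations between consecutive sorted ANI values."""
    vals = sorted(ani_values)
    if len(vals) < 2:
        raise ValueError("need at least two ANI values")
    # Compare each value with its successor and keep the widest jump.
    return max(zip(vals, vals[1:]), key=lambda pair: pair[1] - pair[0])
```

With the made-up values `[80.1, 82.5, 84.0, 85.7, 91.2, 92.8, 93.5, 93.9, 94.4, 97.0]`, the function returns `(85.7, 91.2)`. Real analyses would typically work with a histogram of many thousands of genome pairs and look for a sparsely populated range instead of a strictly empty one.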
In summary, both the software packages and the ecological and evolutionary findings in this thesis can help researchers better study and understand natural and engineered microbial communities.
Date
2024-07-17
Resource Type
Text
Resource Subtype
Dissertation