Unsupervised statistical clustering of environmental shotgun sequences
Abstract
Background: The development of effective environmental shotgun sequence binning methods
remains an ongoing challenge in algorithmic analysis of metagenomic data. While previous methods
have focused primarily on supervised learning involving extrinsic data, a first-principles statistical
model combined with a self-training fitting method has not yet been developed.
Results: We derive an unsupervised, maximum-likelihood formalism for clustering short
sequences by their taxonomic origin on the basis of their k-mer distributions. The formalism is
implemented using a Markov Chain Monte Carlo approach in a k-mer feature space. We introduce
a space transformation that reduces the dimensionality of the feature space and a genomic fragment
divergence measure that strongly correlates with the method's performance. Pairwise analysis of
over 1000 completely sequenced genomes reveals that the vast majority of genomes have sufficient
genomic fragment divergence to be amenable to binning using the present formalism. Using a high-performance
implementation, the binner is able to classify fragments as short as 400 nt with
accuracy over 90% in simulations of low-complexity communities of 2 to 10 species, given sufficient
genomic fragment divergence. The method is available as an open-source package called LikelyBin.
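As an illustration of the k-mer feature space mentioned in the Results, the following minimal Python sketch turns a sequence fragment into a normalized k-mer frequency vector and compares two fragments. It is not LikelyBin's implementation: the function names `kmer_frequencies` and `euclidean_distance` are hypothetical, and plain Euclidean distance is used only as a simple stand-in for the genomic fragment divergence measure, whose definition is not given in this abstract.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequencies(seq, k=2):
    """Return normalized k-mer frequencies of seq over the ACGT alphabet.

    Hypothetical helper for illustration; k-mers containing other
    characters (for example N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return {m: 0.0 for m in kmers}
    return {m: counts[m] / total for m in kmers}


def euclidean_distance(p, q):
    """Euclidean distance between two frequency vectors with the same keys.

    Assumption: a stand-in dissimilarity, not the paper's divergence measure.
    """
    return math.sqrt(sum((p[m] - q[m]) ** 2 for m in p))


if __name__ == "__main__":
    a = kmer_frequencies("ACGTACGTACGTAACC", k=2)
    b = kmer_frequencies("GGGCGCGCCGGGCGCC", k=2)
    print(euclidean_distance(a, b))
```

Fragments from genomes with similar k-mer usage give vectors that lie close together, which is the intuition behind needing sufficient divergence between genomes before fragments can be separated.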
Conclusion: An unsupervised binning method based on statistical signatures of short
environmental sequences is a viable stand-alone binning method for low-complexity samples. For
medium- and high-complexity samples, we discuss the possibility of combining the current method
with other methods as part of an iterative process to enhance the resolving power of sorting reads
into taxonomic and/or functional bins.
Date
2009-10-02
Resource Type
Text
Resource Subtype
Article