Eleonora Rachtman

and 2 more

A key application of phylogenetics in ecological studies is identifying unknown sequences with respect to known ones. This goal can be formalized as assigning taxonomic labels or as inserting sequences into a reference phylogenetic tree (phylogenetic placement). Much attention has been paid to the phylogenetic placement of short fragments used in amplicon sequencing or metagenomics. Placing longer pieces of DNA, such as assembled genomes, contigs, or long reads, is less studied. Long sequences carry more signal than short reads and should therefore be easier to place, but handling larger inputs poses its own challenges, including finding homologs and the computational burden. Here, we explore a phylogenetic placement method that uses k-mer frequencies to measure distances between long query sequences and reference genomes. Our proposed method, kf2vec, requires no alignment and works on any region of the genome without needing marker genes, thus simplifying analysis pipelines. A rich literature exists on using frequencies of short k-mers to measure distances that correlate with phylogeny. Existing methods, however, have had only moderate practical success despite enjoying strong theory. Instead of using predefined metrics, we train a deep neural network to estimate distances from k-mer frequency vectors such that those distances match the path lengths on the reference phylogeny. The trained model is then used to characterize new samples. We demonstrate that kf2vec outperforms existing k-mer-based approaches in distance calculation and allows accurate phylogenetic placement and taxonomic identification of new samples from various types of long sequences.
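To make the two ingredients described above concrete, the following is a minimal sketch, not the kf2vec implementation. It assumes a DNA alphabet of A, C, G, T, counts k-mers on a single strand, and skips k-mers containing ambiguous bases. The function names `kmer_frequencies` and `distance_mismatch` are hypothetical. The mismatch function illustrates one plausible form of the training objective (squared difference between Euclidean distances of model outputs and tree path lengths); the actual loss, network architecture, and value of k are not specified in this abstract.

```python
import math
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return the normalized k-mer frequency vector of seq.

    Entries follow lexicographic order over the alphabet ACGT,
    so the vector has 4**k entries. Assumption: single strand,
    k-mers with characters outside ACGT are skipped.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # j may be 0, so compare against None explicitly
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def distance_mismatch(embeddings, tree_dist):
    """Mean squared gap between embedding distances and tree path lengths.

    embeddings: list of equal-length vectors, one per reference genome
                (in a trained model these would be network outputs).
    tree_dist:  symmetric matrix of path lengths on the reference tree.
    Assumption: Euclidean distance and squared error, chosen for illustration.
    """
    total, pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            d = math.dist(embeddings[i], embeddings[j])
            total += (d - tree_dist[i][j]) ** 2
            pairs += 1
    return total / pairs if pairs else 0.0
```

In this sketch, a model would map each output of `kmer_frequencies` to a vector, and training would adjust the model so that `distance_mismatch` over the reference genomes becomes small. A query would then be compared with the references using the same learned mapping.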