Read mapping stringency and genetic relatedness to the reference genome
significantly impact multispecies population genetic and phylogenetic
analyses
Abstract
Over the past five years, the increasing affordability of whole-genome
resequencing and the increasing number of published reference genomes
have enabled multispecies population genomic and phylogenomic studies on
non-model organisms, but they raise new questions: which reference
genomes should be used for read mapping in comparative studies, and
which mapping methods introduce the most and the least bias in
comparative genomics?
Focusing on Eastern North American white oaks (Quercus sect.
Quercus), which have an estimated 36 Ma divergence, we compared
the effects of mapping resequencing data to four Quercus
reference genomes using three read-mapping methods: Bowtie2
--end-to-end, Bowtie2 --local, and BWA. We analyzed
the reference genomes and read-mapping methods in a fully factorial
design to call variants and invariants for nine Quercus genome
resequencing samples, then used the resulting datasets to test how
different combinations of reference genome and method influence
genotyping accuracy and bias. We found that the genetic distance of
the reference genome from the ingroup samples and the mapping method
together affected sample heterozygosity, tree topology, and tree branch
lengths.
Specifically, only with Bowtie2 --end-to-end was the heterozygosity of
closely related sample/reference genome pairs not significantly
different from the average heterozygosity of samples that match the
reference species. The outgroup reference genome resulted in low base
pair recovery, low heterozygosity, and unbalanced phylogenies. We
concluded that a closely related but not conspecific reference genome
is ideal for minimizing reference bias, and that Bowtie2 --end-to-end
minimizes mismapping, enabling the most accurate calls.
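
The fully factorial design described above can be illustrated with a
minimal sketch. It only enumerates the combinations and does not run any
mapper. The abstract does not name the four reference genomes or the
nine samples, so the identifiers below are hypothetical placeholders;
the method labels are plain strings, not complete command lines.

```python
from itertools import product

# Hypothetical placeholder labels: the abstract does not name the four
# reference genomes or the nine resequenced samples.
references = [f"reference_{i}" for i in range(1, 5)]
samples = [f"sample_{i}" for i in range(1, 10)]

# The three read-mapping methods named in the abstract (labels only).
methods = ["bowtie2 --end-to-end", "bowtie2 --local", "bwa"]

# Fully factorial: every reference genome is crossed with every method,
# and every sample is mapped under each reference/method combination.
combinations = list(product(references, methods))
runs = [
    (reference, method, sample)
    for reference, method in combinations
    for sample in samples
]

print(len(combinations))  # 12 reference/method combinations
print(len(runs))          # 108 sample-level mapping runs
```

Under these assumptions, each of the 12 reference/method combinations
yields one nine-sample dataset that can be compared for heterozygosity,
tree topology, and branch lengths.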