Marie Gurke

and 1 more

Missing genotypes in DNA sequence data are a problem in many evolutionary genomic studies, especially those of non-model organisms. They can be addressed with genotype imputation. However, algorithms that impute accurately without additional reference genotype data, which are often unavailable for non-model taxa, and that can handle large whole-genome data sets are scarce. We therefore developed a new approach to genotype imputation called GenoPop-Impute. It combines a batch-based processing strategy for whole genomes with the existing missing-data imputation algorithm missForest, which is based on the random forest machine learning algorithm. The batch-wise approach exploits linkage disequilibrium to increase imputation accuracy and allows computational parallelization, which improves efficiency. Tests on simulated data demonstrate that linkage disequilibrium between SNPs has a positive effect on imputation accuracy, because linked SNPs are correlated through their shared evolutionary history. In a comparison with four alternative algorithms, GenoPop-Impute was more accurate and was the only one computationally applicable to whole-genome data sets. In addition, we found that GenoPop-Impute increases the accuracy of commonly estimated population genomic metrics and mitigates biases caused by missing data in demographic modeling experiments. We conclude that genotype imputation can be a valuable tool for evolutionary genomic studies of non-model taxa and that GenoPop-Impute is a highly suitable algorithm for this purpose.
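
The batch-wise idea can be illustrated with a minimal sketch. This is not the GenoPop-Impute implementation: a simple nearest-sample donor rule stands in for missForest, and the function names, the `MISSING` marker, and the 0/1/2 genotype coding are assumptions made only for this illustration. The sketch shows how splitting the genome into batches of consecutive SNPs lets each batch be imputed independently, using only the correlation among neighbouring (and therefore likely linked) SNPs.

```python
MISSING = None  # assumed marker for a missing genotype call


def split_into_batches(n_snps, batch_size):
    """Return (start, end) index pairs covering consecutive SNPs."""
    return [(s, min(s + batch_size, n_snps)) for s in range(0, n_snps, batch_size)]


def impute_batch(batch):
    """Stand-in imputer (NOT missForest): copy each missing call from the
    sample that is most similar across the other SNPs in the same batch.

    `batch` is a list of samples, each a list of genotypes coded 0, 1, 2
    or MISSING for the SNPs in this batch.
    """
    def distance(a, b):
        # Mean absolute difference over SNPs observed in both samples.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not MISSING and y is not MISSING]
        if not shared:
            return float("inf")
        return sum(abs(x - y) for x, y in shared) / len(shared)

    imputed = [row[:] for row in batch]
    for i, row in enumerate(batch):
        for j, call in enumerate(row):
            if call is not MISSING:
                continue
            # Candidate donors are other samples with an observed call at SNP j.
            donors = [(distance(row, other), k)
                      for k, other in enumerate(batch)
                      if k != i and other[j] is not MISSING]
            if donors:  # leave the call missing if no donor exists
                imputed[i][j] = batch[min(donors)[1]][j]
    return imputed


def impute_genotypes(genotypes, batch_size):
    """Impute a samples-by-SNPs matrix one batch of consecutive SNPs at a time."""
    n_snps = len(genotypes[0])
    result = [[] for _ in genotypes]
    for start, end in split_into_batches(n_snps, batch_size):
        batch = [row[start:end] for row in genotypes]
        for out_row, filled in zip(result, impute_batch(batch)):
            out_row.extend(filled)
    return result
```

Because no batch depends on another, the loop in `impute_genotypes` could be distributed across worker processes, which is the property that makes a batch-wise strategy parallelizable.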
