Abstract
Reduced-representation sequencing (RRS) techniques have
revolutionized ecological and evolutionary genomics. Accurately
identifying orthologs is a critical challenge for RRS, especially
when a reference genome is absent. The proportion of shared heterozygous
sites across samples is an alternative criterion for filtering paralogs,
as divergent lineages should be less likely to share heterozygosity. In
PYRAD/IPYRAD, the prevailing pipeline for variant calling from RRS data,
maxSharedH is an often-overlooked parameter with implications for
detecting and filtering paralogs according to shared heterozygosity.
Using empirical genotyping-by-sequencing (GBS) data from two primroses
(Primula alpicola Stapf and Primula florindae Ward) and their putative
hybrids, we explore the impact of maxSharedH on paralog filtering and
on downstream analyses. Our study sheds light on the simultaneous
validity and risk of filtering paralogs using maxSharedH, and on its
significant effects on downstream outlier detection, population
assignment, and demographic modelling, underscoring the importance of
attention to detail in bioinformatic processing. The
agreement between the results of population assignment and
demographic modelling in this study suggests that maxSharedH = 0.10
has a potentially excessive and asymmetrical effect, removing truly
shared heterozygous sites as paralogs. These results indicate that
the hybrid-origin hypotheses for the putative hybrids obtained
with maxSharedH = 0.25 and 0.50 are more credible. In
conclusion, we first reveal a critical hazard of filtering paralogs
according to shared heterozygosity, and we therefore propose using
specific protocols, rather than maxSharedH, to filter potential
paralogs in closely related lineages.
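
To make the shared-heterozygosity criterion concrete, the following is a
minimal sketch of how such a filter could work. It is an illustration
under stated assumptions, not the PYRAD/IPYRAD implementation: the
function names, the boolean input layout, and the toy locus are all
hypothetical, and the rule assumed here is that a locus is flagged when
the fraction of samples heterozygous at any single site exceeds the
threshold.

```python
from typing import List


def max_shared_het(locus: List[List[bool]]) -> float:
    """Largest per-site fraction of samples with a heterozygous call.

    locus[s][i] is True when sample s is heterozygous at site i
    (hypothetical input layout chosen for this sketch).
    """
    n_samples = len(locus)
    n_sites = len(locus[0])
    return max(
        sum(sample[i] for sample in locus) / n_samples
        for i in range(n_sites)
    )


def is_putative_paralog(locus: List[List[bool]], max_shared_h: float) -> bool:
    """Flag the locus when shared heterozygosity exceeds the threshold."""
    return max_shared_het(locus) > max_shared_h


# Toy locus (invented data): 10 samples x 2 sites, with two samples
# sharing a heterozygous call at site 0 and none at site 1.
toy_locus = [[True, False], [True, False]] + [[False, False]] * 8

for threshold in (0.10, 0.25, 0.50):
    print(threshold, is_putative_paralog(toy_locus, threshold))
```

In this toy case the locus is flagged only at the strictest threshold,
because two of ten samples share the heterozygous site. If that sharing
reflects common ancestry or hybridization rather than paralogy, a
threshold of 0.10 would discard a genuine ortholog.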