2.2 Read mapping, ploidy identification and variant calling
The raw reads were filtered using fastp 0.20.0 (Chen et al., 2018) (with
parameters -3 -W 4 -M 20 -q 20 -u 40 -c -l 50) to eliminate reads with
low quality or short length. The clean reads were aligned to the recent
chromosome-level goldfish reference genome (Luo et al., 2020) using the
BWA-MEM algorithm from BWA (Li, 2012).The resulting SAM file was
converted to a coordinate-sorted and indexed BAM file using Samtools (Li
et al., 2009). Duplicated read pairs were removed using Sambamba
(Tarasov et al., 2015).
The ploidy level of each sample was determined by the nQuire software
(Weiß et al., 2018) using the BAM file. The denoised input of base
frequencies at biallelic sites generated with default parameters was
utilized to estimate ploidy levels with a Gaussian Mixture Model which
describes the histogram as a mixture of Gaussians distributions between
0 and 1. The expected distributions of read frequencies at biallelic
sites are one Gaussian with mean 0.5 for diploid, two Gaussians with
means 0.33 and 0.67 for triploid. The likelihood of certain assumptions
of di, tri- and tetraploidy based on this model given the empirical data
is maximized using an Expectation-Maximization (EM) algorithm.
Variant calling was carried out for diploids and triploids separately
using the Genome Analysis Toolkit (GATK version 4.2.2.0) (McKenna et
al., 2010), with the parameter of -ploidy being 2 for diploids and 3 for
triploids. Briefly, the genomic variants for each sample (in GVCF
format) were identified with the HaplotypeCaller module. The GVCF files
of all samples were merged into a single GVCF with the CombineGVCFs
module, followed by variant calling to generate a genotype file (in VCF
format) using the GenotypeGVCFs module based on a joint calling
approach. The SNPs were selected using the SelectVariants module and
were initially filtered with the VariantFiltration module using the
parameters ‘QUAL < 30.0 || QD <
2.0 || FS > 60.0 || MQ
< 40.0’. The SNPs were further filtered using the VCFtools
software (Danecek et al., 2011) for diploids and BCFtools software
(Danecek et al., 2021) combined with a custom script for triploids using
the following criteria: (1) SNPs with more than two alleles, (2)
genotype calls with a depth < 4 or > 24, (3)
minor allele frequency (MAF) < 0.05, (4) SNPs with missing
rates > 0.1.