2.2 Read mapping, ploidy identification and variant calling
The raw reads were filtered using fastp 0.20.0 (Chen et al., 2018) (with parameters -3 -W 4 -M 20 -q 20 -u 40 -c -l 50) to eliminate reads with low quality or short length. The clean reads were aligned to the recent chromosome-level goldfish reference genome (Luo et al., 2020) using the BWA-MEM algorithm from BWA (Li, 2012).The resulting SAM file was converted to a coordinate-sorted and indexed BAM file using Samtools (Li et al., 2009). Duplicated read pairs were removed using Sambamba (Tarasov et al., 2015).
The ploidy level of each sample was determined by the nQuire software (Weiß et al., 2018) using the BAM file. The denoised input of base frequencies at biallelic sites generated with default parameters was utilized to estimate ploidy levels with a Gaussian Mixture Model which describes the histogram as a mixture of Gaussians distributions between 0 and 1. The expected distributions of read frequencies at biallelic sites are one Gaussian with mean 0.5 for diploid, two Gaussians with means 0.33 and 0.67 for triploid. The likelihood of certain assumptions of di, tri- and tetraploidy based on this model given the empirical data is maximized using an Expectation-Maximization (EM) algorithm.
Variant calling was carried out for diploids and triploids separately using the Genome Analysis Toolkit (GATK version 4.2.2.0) (McKenna et al., 2010), with the parameter of -ploidy being 2 for diploids and 3 for triploids. Briefly, the genomic variants for each sample (in GVCF format) were identified with the HaplotypeCaller module. The GVCF files of all samples were merged into a single GVCF with the CombineGVCFs module, followed by variant calling to generate a genotype file (in VCF format) using the GenotypeGVCFs module based on a joint calling approach. The SNPs were selected using the SelectVariants module and were initially filtered with the VariantFiltration module using the parameters ‘QUAL < 30.0 || QD < 2.0 || FS > 60.0 || MQ < 40.0’. The SNPs were further filtered using the VCFtools software (Danecek et al., 2011) for diploids and BCFtools software (Danecek et al., 2021) combined with a custom script for triploids using the following criteria: (1) SNPs with more than two alleles, (2) genotype calls with a depth < 4 or > 24, (3) minor allele frequency (MAF) < 0.05, (4) SNPs with missing rates > 0.1.