
An exploration of assembly strategies and quality metrics on the accuracy of the Knightia excelsa (rewarewa) genome.
  • Ann McCartney,
  • Elena Hilario,
  • Seung-Sub Choi,
  • Joseph Guhlin,
  • Jessie Prebble,
  • Gary Houliston,
  • Thomas Buckley,
  • David Chagné
Ann McCartney (Corresponding Author: ann.mccartney@nih.gov)
Elena Hilario
Seung-Sub Choi
Joseph Guhlin
Jessie Prebble
Gary Houliston, Landcare Research New Zealand
Thomas Buckley, Landcare Research
David Chagné, New Zealand Institute for Plant and Food Research Ltd

Abstract

We used long-read sequencing data generated from Knightia excelsa R.Br., a nectar-producing Proteaceae tree endemic to Aotearoa New Zealand, to explore how sequencing data type, volume and workflows can impact final assembly accuracy and chromosome construction. Establishing a high-quality genome for this species has specific cultural importance to Māori, the indigenous people of Aotearoa New Zealand, as well as commercial importance to the country's honey producers. Assemblies were produced by five long-read assemblers using data subsampled by read length, two polishing strategies, and two Hi-C mapping methods. Subsampling the data by read length showed that each assembler performed differently depending on the coverage and the read length of the data. Assemblies that used longer reads (>30 kb) at lower coverage were the most contiguous and the most k-mer and gene complete. The final genome assembly was constructed into pseudo-chromosomes using all available data assembled with FLYE, polished using Racon, Medaka and Pilon combined, scaffolded using SALSA2 and AllHiC, curated using Juicebox, and validated by synteny with Macadamia. We highlight the importance of developing assembly workflows based on the volume and type of sequencing data and of establishing a set of robust quality metrics for generating high-quality assemblies. Scaffolding analyses showed that problems found in the initial assemblies could not be resolved accurately using Hi-C data, and that scaffolded assemblies were more accurate when the underlying contig assembly was of higher accuracy. These findings provide insight into what is required for future high-quality de novo assemblies of non-model organisms.
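As a rough illustration of the trade-off described in the abstract, the sketch below filters a set of read lengths by a minimum-length threshold and reports the resulting coverage and N50. It is not the authors' pipeline: the function names, the toy read lengths and the genome size are all hypothetical, and N50 stands in here for only one of the contiguity metrics a study like this would use.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together contain at least half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def subsample_by_length(read_lengths, min_length):
    """Keep only reads strictly longer than min_length (in bases)."""
    return [n for n in read_lengths if n > min_length]


def coverage(read_lengths, genome_size):
    """Approximate depth of coverage: total sequenced bases / genome size."""
    return sum(read_lengths) / genome_size


# Hypothetical toy data: six read lengths (bases) and an assumed genome size.
TOY_READS = [5000, 12000, 31000, 45000, 8000, 60000]
TOY_GENOME_SIZE = 100_000  # assumption for illustration only

LONG_READS = subsample_by_length(TOY_READS, 30_000)  # the ">30 kb" subset
```

Calling `coverage(LONG_READS, TOY_GENOME_SIZE)` and `coverage(TOY_READS, TOY_GENOME_SIZE)` side by side shows how a length filter lowers coverage while retaining only the longest reads, which is the kind of data subset the abstract reports as giving the most contiguous assemblies.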
10 Dec 2020 Submitted to Molecular Ecology Resources
12 Jan 2021 Submission Checks Completed
12 Jan 2021 Assigned to Editor
12 Jan 2021 Reviewer(s) Assigned
16 Feb 2021 Review(s) Completed, Editorial Evaluation Pending
11 Mar 2021 Editorial Decision: Revise Minor
19 Mar 2021 Review(s) Completed, Editorial Evaluation Pending
19 Mar 2021 1st Revision Received
20 Apr 2021 Editorial Decision: Accept