The value of primary transcripts to the clinical and non-clinical genomics community: survey results and roadmap for improvements.

Authors:

Joannella Morales, Aoife C. McMahon, Jane Loveland, Emily Perry, Adam Frankish, Sarah Hunt, Irina M. Armean, Paul Flicek, Fiona Cunningham.
European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom

Grant numbers

Ensembl receives majority funding from Wellcome Trust (grant number WT108749/Z/15/Z) with additional funding for specific project components. Research reported in this publication was supported by Wellcome Trust [WT200990/Z/16/Z, WT200990/A/16/Z], EMBL and by the National Human Genome Research Institute of the National Institutes of Health under award number 2U41HG007234. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. For the purpose of open access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.

Abstract

Variant interpretation is dependent on transcript annotation and remains time consuming and challenging. There are major obstacles for historical data reuse and for interpretation of new variants. First, both RefSeq and Ensembl/GENCODE produce transcript sets in common use, but there is currently no easy way to translate between the two. Second, the resources often used for variant interpretation (e.g., ClinVar, gnomAD, UniProt) do not use the same transcript set, nor default transcript or protein sequence. Ensembl ran a survey in 2018 to sample attitudes to choosing one default transcript per locus, and to gather data on reference sequences used by the scientific community. This was publicised on the Ensembl and UCSC genome browsers, by email and on social media. We had 788 respondents. Here we report our results and roadmap to create an effective default set of transcripts for resources, and for reporting interpretation of clinical variants.
Keywords: transcript annotation, variant interpretation, survey, default

Introduction

Many advances in biological understanding and genomic medicine are dependent on variant interpretation and the ability to describe a sequence change with respect to a specific annotated transcript. However, in older publications the transcript is rarely recorded, hampering the ability to reuse historical data (e.g., CFTR del-508, BRAF V600E). Occasionally, despite the existence of Human Genome Variation Society (HGVS) guidelines for variant reporting (den Dunnen et al., 2016), no transcript version is specified, legacy numbering is used, or the analysis may have used one historic transcript only. Moreover, interpretation of novel data is hampered by the variety of reference sequences used to gather evidence for variant analysis, and lack of coordination across the resources. There are two commonly used transcript sets for annotation: NCBI’s RefSeq (O’Leary et al., 2016) and EMBL-EBI’s Ensembl/GENCODE (Frankish et al., 2021). Many highly-accessed genomics resources supporting variant interpretation use transcripts from only one set, or default to a single transcript (e.g. ExAC/gnomAD (Karczewski et al., 2020; Lek et al., 2016), Human Cell Atlas (Andersson et al., 2014), GTEx (GTEx Consortium et al., 2015), ClinVar (Landrum et al., 2014), HGMD (Stenson et al., 2020)). None of these are coordinated with UniProt’s principal isoform (Bateman et al., 2017) and comparison of annotation across sets is non-trivial. Additionally, some transcript sequences do not perfectly match the reference genome used for variant calling.
With this in mind, we started to explore how to choose one default transcript for each protein-coding locus, and the merits of such a set. In 2018, we surveyed the community to understand the priorities and attitudes surrounding transcript choice and reporting. The survey results supported RefSeq and Ensembl/GENCODE agreeing on an identical transcript for each locus to be used as a common default across resources. Below we detail our other conclusions.

Methods

To gather input from the scientific community on transcript usage, and attitudes to transcript change, we developed a survey. The survey had four sections: ‘Transcript choice’, ‘Variant interpretation and reporting’, ‘Reference sequence sources’, and one on the demographics of the respondents. We had compulsory questions that required selecting a single answer, and optional questions that were a mixture of multiple-choice questions and open-ended questions. For example, our questions covered:
The examples we chose for picking transcripts were cartoon versions of real loci. We advertised the survey by email, on the Ensembl (Cunningham et al., 2018) and UCSC (Tyner et al., 2017) genome browsers, via social media, and through contacts to ClinGen and NCBI’s Genetic Testing Registry participants.

Results

The survey generated 788 responses (see questions and results here: https://tinyurl.com/embl-ebi-transcript-survey) from 32 different countries: the largest contributors were the USA, UK and Germany (40%, 19% and 5% respectively). We divided respondents into two categories based on their response to the multiple-choice question ‘Where do you work?’. Those who selected ‘clinical diagnostics’ or ‘clinical research’ were labelled ‘clinical’ (N=285; 36%) and those who selected any other option (University/college/academia/non-profit/research; commercial/industry; government; other) were labelled ‘non-clinical’ (N=503; 64%). The results and requirements of these two categories differed. We also asked how transcripts were used across the scientific community (question 14). The most common words in the answers included: variants, analysis, expression, RNA-seq, clinical, reporting, gene and annotation.
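The grouping rule described above can be illustrated with a minimal Python sketch. It is not the analysis code used for the survey: the function names are hypothetical, and because the breakdown across individual answer options is not reported here, a single placeholder option stands in for each group so that only the reported group sizes (285 and 503) are reproduced.

```python
from collections import Counter

# Answers to 'Where do you work?' that were labelled 'clinical' in the text.
CLINICAL_ANSWERS = {"clinical diagnostics", "clinical research"}


def label_respondent(workplace: str) -> str:
    """Map one 'Where do you work?' answer to 'clinical' or 'non-clinical'."""
    if workplace.strip().lower() in CLINICAL_ANSWERS:
        return "clinical"
    return "non-clinical"


def group_summary(workplaces):
    """Return {group: (count, rounded percentage of all respondents)}."""
    counts = Counter(label_respondent(w) for w in workplaces)
    total = sum(counts.values())
    return {group: (n, round(100 * n / total)) for group, n in counts.items()}


# Placeholder answers: only the group totals come from the text; the choice of
# 'clinical diagnostics' and 'government' as stand-ins is an assumption.
answers = ["clinical diagnostics"] * 285 + ["government"] * 503
summary = group_summary(answers)
print(summary)
```

With these totals, the rounded percentages agree with the 36% and 64% reported for the two groups.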
When presented with two choices for a primary transcript, the more abundant or the one with the longest coding sequence, the non-clinical group showed a clear preference for the more abundant transcript (questions 2a and 2b). In contrast, no clear preference emerged in the clinical group (see Figure 1). In question 3a, the choice was between the transcript that covers the most clinically relevant variants, the one that is most abundant, the one that is longest, or the one that is used historically. The clinical group preferred the transcript that covered the most clinically relevant variants (see Figure 2; see also question 3b). In contrast, the non-clinical group showed no obvious preference between these choices in question 3. There was a lower preference for historical transcripts (12% and 14% of respondents in questions 3a and 3b, respectively).
We received >800 additional comments across questions 1-3. Themes that emerged from these included rejecting the value of a primary transcript, stating that all transcripts should be used, and proposing that an artificial transcript be created to cover all exons. Many comments called for ranking and filtering methods in genome browsers and resources, supported by specific data on transcript abundance, tissue-specificity/expressivity, cell-specificity, background conditions, environment, developmental stage and transcript quality metrics. More data was requested to flag transcripts that were computationally determined, predicted, fully functional, validated, chosen by expert consensus as clinically relevant, or rare. The importance of cell/tissue-specificity and the difficulty of assessing abundance or relative expression were often mentioned.
For transcript sequences, respondents were asked to prioritise one of four properties: that a transcript sequence matches the reference assembly, does not contain pathogenic alleles, matches the global major allele, or never changes. Here, the transcript that matches the reference was the priority choice (48%) across all respondents (question 4) (Figure 3). Only a minority considered it important that transcript sequences never change (<10%, questions 4 and 5).
For reporting and interpretation, respondents preferred not to rely on a single transcript (question 6), a view captured by the respondent comment “I wouldn’t use just one transcript for INTERPRETATION unless it was the only one known”. The preferred option for ‘clinical’ respondents was to report on the primary transcript and the affected transcript (39%) rather than across all transcripts (14%). The opposite was true for the ‘non-clinical’ group (18% vs 40% respectively) (question 7).
We surveyed the reference sequences used for reporting in question 8 (Figure 4). In general, ‘clinical’ respondents used RefSeq, Locus Reference Genomic (LRG) (Dalgleish et al., 2010; MacArthur et al., 2014) and GRCh37, rather than Ensembl/GENCODE or GRCh38. In contrast, replies from the ‘non-clinical’ community were more evenly spread across GRCh38 and GRCh37, and across RefSeq and Ensembl/GENCODE, but rarely included LRG.
Results from the survey indicated that having RefSeq and Ensembl/GENCODE agree on one primary transcript per gene would be welcome (54% overall; 67% of ‘clinical’ respondents, question 10). We revisited the question ‘Do you want us to provide one primary transcript’ at the end of the survey requiring a Yes, No or ‘Not sure’ answer. Here 60% of the ‘clinical’ respondents were in favour, compared with 48% of ‘non-clinical’ ones.
Based on these survey results, our conclusions and recommendations are that:
  1. RefSeq and Ensembl/GENCODE collaborate to agree on:
     a. one identical primary transcript per locus that perfectly matches the GRCh38 reference assembly. This is to ensure the community, browsers and resources use a good, consensus choice of transcript for analyses or situations that require only one (e.g., default display per gene).
     b. minimal additional identical transcripts that match the reference assembly required for clinical reporting.
  2. Transcripts are updated from historical exemplars, using modern datasets to choose a representative transcript:
     a. evaluated on predicted functional significance and abundance rather than on being the longest or the first defined (i.e., the historical transcript).
     b. whose sequence is an exact match to the reference genome sequence.
  3. All resources adopt this agreed primary transcript, to the greatest benefit of the scientific community.
  4. Genome browsers and resources consider improvements to their methods of filtering and ranking transcripts to facilitate choosing the appropriate transcript(s). Often, using only the one primary transcript per locus may not be appropriate.
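To make the selection criteria in these recommendations concrete, the following is a minimal Python sketch of one way a ranking and filtering step could work. It is an illustration under stated assumptions, not the method used by RefSeq or Ensembl/GENCODE: the `Transcript` fields, the scores, the example identifiers and the choice to rank functional significance ahead of abundance are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Transcript:
    """Hypothetical per-transcript summary; all fields are illustrative."""
    transcript_id: str
    matches_reference: bool   # exact match to the reference assembly
    functional_score: float   # assumed 'predicted functional significance'
    abundance: float          # assumed expression estimate
    cds_length: int           # recorded, but deliberately not used to rank


def rank_transcripts(transcripts: Sequence[Transcript]) -> List[Transcript]:
    """Keep reference-matching transcripts, best candidate first."""
    eligible = [t for t in transcripts if t.matches_reference]
    # Assumed ordering: functional significance first, then abundance.
    return sorted(
        eligible,
        key=lambda t: (t.functional_score, t.abundance),
        reverse=True,
    )


def pick_default(transcripts: Sequence[Transcript]) -> Optional[Transcript]:
    """Return the top-ranked transcript, or None if none match the reference."""
    ranked = rank_transcripts(transcripts)
    return ranked[0] if ranked else None


# Made-up locus with three made-up transcripts.
locus = [
    Transcript("tx_long", True, 0.6, 5.0, 3000),
    Transcript("tx_abundant", True, 0.9, 40.0, 2100),
    Transcript("tx_mismatch", False, 0.95, 60.0, 2100),
]
default = pick_default(locus)
print(default.transcript_id if default else "no eligible transcript")
```

In this sketch the longest transcript is not chosen simply for its length, and a transcript that does not match the reference is excluded regardless of its scores. Returning the full ranked list alongside the default reflects the point that one transcript per locus may not always be appropriate.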

Discussion

Across the survey results as a whole, there is no agreed method for designating a primary transcript. However, the value of consensus between Ensembl/GENCODE and RefSeq was highlighted as important. There is a history of collaboration between the two groups, for example on the Consensus CDS (CCDS) project (Pujar et al., 2018) and LRG. For many transcripts, the CCDS project has achieved consensus for the exon/intron structure over the protein-coding region, but there remain coding sequence discrepancies and structure differences in the untranslated regions (UTRs). The LRG project focuses on recording historical sequences for variant reporting that should never change, and therefore many of these do not perfectly match the reference assembly. However, the survey demonstrated a tolerance for change (only 6% selected ‘Never update’ in question 5).
Interestingly, many suggested the ideal primary transcript should contain all exons. This ‘meta transcript’ approach has been used for a few LRGs (e.g., LRG_391 for TTN; and LRG_202 for NEB) that represent an inferred transcript model containing all identifiable in-frame coding exons. However, it leads to the creation of primary transcripts that do not reflect biological reality and which are not guaranteed to be comprehensive: they may contain exons that show huge differences in their inclusion rates generally, and in specific tissues; they may include mutually exclusive exons; they cannot include exons in different frames; and they will need to be updated if novel coding exons are subsequently discovered.
The survey showed that many respondents, especially in clinical groups, are still using GRCh37, released in 2009. GRCh38, released in 2013, offers a more complete genome that is being continuously improved by the Genome Reference Consortium (GRC) (Schneider et al., 2017) through a supplemental release model. Ensembl/GENCODE gene annotation is only being updated on GRCh38. Therefore, it is only the annotation on GRCh38 that will benefit from all the improvements supported by the incorporation of new data sets (such as long-read transcriptomic data generated using methods developed by Oxford Nanopore Technologies and Pacific Biosciences), and of tools (such as the PhyloCSF method (Lin, Jungreis, & Kellis, 2011) for identifying regions of the genome with conserved protein-coding potential). Major resources such as gnomAD and DECIPHER are also now using GRCh38.
It is worth noting that many survey comments expressed resistance to the very idea of a default transcript. They rightly pointed out that biology cannot be simplified in this manner, however appealing the concept. We agree completely that genome analysis requires considering multiple transcripts per gene and Ensembl remains absolutely committed to annotating all evidence-based transcripts at every locus. Analysis, including the interpretation of variants identified from clinical sequencing, should always be in relation to the most relevant and abundant isoform(s) for the tissue of interest at the developmental stage of interest and in the correct cell type. In general, we do not yet have the data to determine this. Although projects such as GTEx and Human Cell Atlas have changed and will continue to change the landscape of available transcriptomic data, this critical information is currently lacking for the majority of developmental stages. As a result, in the absence of tissue-specific data, any analysis should consider all transcripts or proteins at the locus. We urge more cooperation between clinical diagnostics and research to use a broader transcript set and thereby remove the bias in reported transcripts.
However, for practical reasons it is sometimes helpful to have only one transcript for sharing and comparing results across experiments, datasets and collaborations. Indeed, many browsers, bioinformatics tools and variant interpretation pipelines have chosen a default transcript, independently from each other. For example, Ensembl and UniProt have had their own ‘canonical’ (available only through the Ensembl API) and ‘principal isoform’ choices, respectively, for default transcripts and proteins for over a decade while RefSeq has a ‘select’ transcript and HGMD has a default RefSeq. Often these have been based on the longest transcript (https://www.ensembl.org/Help/Glossary), or the first sequences published, or most prevalent (https://www.uniprot.org/help/canonical_and_isoforms) but are not necessarily consistent or coordinated with other resources.
It is clear, therefore, that the concept of a default transcript already exists across resources but is uncoordinated. The survey results demonstrated a desire for a default transcript, but in the absence of a consensus choice so far, we see that each genomics resource, scientist and experiment choose a different transcript. Selecting one particular transcript per locus comes with a risk of biasing the scientific community towards ignoring the full transcriptome. However, a collaboration between RefSeq and Ensembl/GENCODE would provide the leadership necessary to unite the community and provide a consensus choice for a set of results and opinions that lack a clear consensus from the survey. This would be a practical and coordinated effort to define one default transcript per locus. There is no overall ‘correct’ choice but the most important and valuable property of a default transcript is that it is consistent, for reporting and to ease use of different resources and tools that require a default transcript. Equally important would be to work with all major browsers and resources (e.g., NCBI, Ensembl, Ensembl’s Variant Effect Predictor, UCSC Genome Browser, gnomAD, DECIPHER, UniProt, Panel App, COSMIC etc.) to ensure adoption of the common default transcript.

Acknowledgements:

We would like to thank the 788 individuals who completed our survey and everyone who helped advertise it. Thank you also to Caroline Wright for useful analysis discussions, and the following for their feedback on the survey design: Deanna Church, Mark Diekhans, Terence Murphy, Heidi Rehm, Magali Ruffier, Andrew Yates.

References

Andersson, R., Gebhard, C., Miguel-Escalada, I., Hoof, I., Bornholdt, J., Boyd, M., … Sandelin, A. (2014). An atlas of active enhancers across human cell types and tissues. Nature, 507(7493), 455–461. doi: 10.1038/nature12787
Bateman, A., Martin, M. J., O’Donovan, C., Magrane, M., Alpi, E., Antunes, R., … Zhang, J. (2017). UniProt: The universal protein knowledgebase. Nucleic Acids Research, 45(D1), D158–D169. doi: 10.1093/nar/gkw1099
Cunningham, F., Achuthan, P., Akanni, W., Allen, J., Amode, M. R., Armean, I. M., … Flicek, P. (2018). Ensembl 2019. Nucleic Acids Research. doi: 10.1093/nar/gky1113
Dalgleish, R., Flicek, P., Cunningham, F., Astashyn, A., Tully, R. E., Proctor, G., … Maglott, D. R. (2010). Locus Reference Genomic sequences: An improved basis for describing human DNA variants. Genome Medicine, 2(4), 24. doi: 10.1186/gm145
den Dunnen, J. T., Dalgleish, R., Maglott, D. R., Hart, R. K., Greenblatt, M. S., McGowan-Jordan, J., … Taschner, P. E. M. (2016). HGVS Recommendations for the Description of Sequence Variants: 2016 Update. Human Mutation, 37(6), 564–569. doi: 10.1002/humu.22981
Frankish, A., Diekhans, M., Jungreis, I., Lagarde, J., Loveland, J. E., Mudge, J. M., … Flicek, P. (2021). GENCODE 2021. Nucleic Acids Research, 49(D1), D916–D923. doi: 10.1093/nar/gkaa1087
GTEx Consortium, T., Ardlie, K. G., Deluca, D. S., Segrè, A. V., Sullivan, T. J., Young, T. R., … Dermitzakis, E. T. (2015). The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348(6235), 648–660. doi: 10.1126/science.1262110
Karczewski, K. J., Francioli, L. C., Tiao, G., Cummings, B. B., Alföldi, J., Wang, Q., … MacArthur, D. G. (2020). The mutational constraint spectrum quantified from variation in 141,456 humans. Nature, 581(7809), 434–443. doi: 10.1038/s41586-020-2308-7
Landrum, M. J., Lee, J. M., Riley, G. R., Jang, W., Rubinstein, W. S., Church, D. M., & Maglott, D. R. (2014). ClinVar: Public archive of relationships among sequence variation and human phenotype. Nucleic Acids Research, 42(D1), D980–D985. doi: 10.1093/nar/gkt1113
Lek, M., Karczewski, K. J., Minikel, E. V., Samocha, K. E., Banks, E., Fennell, T., … Exome Aggregation Consortium. (2016). Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536(7616), 285–291. doi: 10.1038/nature19057
Lin, M. F., Jungreis, I., & Kellis, M. (2011). PhyloCSF: A comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics, 27(13), i275–i282. doi: 10.1093/bioinformatics/btr209
MacArthur, J. A. L., Morales, J., Tully, R. E., Astashyn, A., Gil, L., Bruford, E. A., … Cunningham, F. (2014). Locus Reference Genomic: Reference sequences for the reporting of clinically relevant sequence variants. Nucleic Acids Research, 42(D1), D873–D878. doi: 10.1093/nar/gkt1198
O’Leary, N. A., Wright, M. W., Brister, J. R., Ciufo, S., Haddad, D., McVeigh, R., … Pruitt, K. D. (2016). Reference sequence (RefSeq) database at NCBI: Current status, taxonomic expansion, and functional annotation. Nucleic Acids Research, 44(D1), D733–D745. doi: 10.1093/nar/gkv1189
Pujar, S., O’Leary, N. A., Farrell, C. M., Loveland, J. E., Mudge, J. M., Wallin, C., … Pruitt, K. D. (2018). Consensus coding sequence (CCDS) database: A standardized set of human and mouse protein-coding regions supported by expert curation. Nucleic Acids Research, 46(Database issue), D221–D228. doi: 10.1093/nar/gkx1031
Schneider, V. A., Graves-Lindsay, T., Howe, K., Bouk, N., Chen, H.-C., Kitts, P. A., … Church, D. M. (2017). Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Research, 27(5), 849–864. doi: 10.1101/gr.213611.116
Stenson, P. D., Mort, M., Ball, E. V., Chapman, M., Evans, K., Azevedo, L., … Cooper, D. N. (2020). The Human Gene Mutation Database (HGMD®): Optimizing its use in a clinical diagnostic or research setting. Human Genetics, 139(10), 1197–1207. doi: 10.1007/s00439-020-02199-3
Tyner, C., Barber, G. P., Casper, J., Clawson, H., Diekhans, M., Eisenhart, C., … Kent, W. J. (2017). The UCSC Genome Browser database: 2017 update. Nucleic Acids Research, 45(D1), D626–D634. doi: 10.1093/nar/gkw1134

Figure legends