Figure 2. Workflow of FoxB structure determination. The structure was determined by MR-SAD using the AlphaFold2 model and experimental phases. (A) Anomalous difference map with Se and Fe sites at 2σ. (B) Overall map of FoxB after refinement (2σ). (C) Superposition of the final model (green) and AlphaFold2 model (cyan) shows excellent agreement. Density for heme groups (not present in AlphaFold2 model) is shown.
Model accuracy
The AlphaFold2 model used for this study (T1058TS427_3) is remarkably similar to the final structure17. The overall RMSD is 1.17 Å for all atoms and 0.973 Å for Cα atoms. Not only were all transmembrane helices built and placed in the correct register, but the periplasmic domains, which contain several loops, were also modelled with high accuracy. There was no density for the cytoplasmic loop connecting TM helices 2 and 3 (residues 172-188), and it was therefore omitted from the final model. Molecular replacement succeeded only with the AlphaFold2 model and not with server models from the CASP14 experiment (>30 models tried, many of them with the correct overall fold).
The success of the AlphaFold2 model seems to be due to its “getting the details right”, which was required for a clear MR solution. As one example of its accuracy, the His residues coordinating the two heme groups in FoxB were positioned correctly, although the model did not contain heme groups (as we provided only the protein sequence to CASP14). This fact, however, also highlights a current limitation of the AlphaFold2 model: while it provides an astonishingly good model of the apo protein, it still lacks the functional groups (two heme groups in the case of FoxB) that are responsible for the biological function.
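For readers unfamiliar with the RMSD values quoted above, the following minimal sketch shows how such a number is obtained from two sets of matched atomic coordinates. It assumes the two models have already been superposed (as in Fig. 2C) and that atoms are paired one to one; the function name and the toy coordinates are hypothetical and are not taken from the FoxB structure.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z) points.

    Assumes the coordinate sets are already superposed and listed in the
    same atom order; no fitting is performed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the paired atoms
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical toy coordinates in Å (three Cα-like positions)
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
final = [(0.0, 1.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 1.0)]
print(round(rmsd(model, final), 3))  # 0.816
```

Restricting the input to Cα atoms gives a Cα RMSD and passing every matched atom gives an all-atom RMSD. All-atom values are typically somewhat larger because side chains vary more than the backbone.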
The astounding accuracy of AlphaFold2 models of all subunits of phage AR9 non-virion RNA polymerase (CASP: T1092-T1096) – by AF, MLS and PGL.
From email to the CASP Prediction Center: We are shocked… stunned… by the quality of the model. You would not believe how much effort we have put into getting this structure. Years of work… Both cryo-EM and crystallography… I mean, this is really shocking. Petr Leiman
Brief description of the target
A group of large or “jumbo” bacteriophages, with genomes larger than 200 kbp, encode two distinct DNA-dependent RNA polymerases (RNAPs), allowing these phages to assemble independently from the host RNAP21-24. One of these phage-encoded RNAPs is packaged into the phage capsid and hence is called the virion RNAP (vRNAP). Following attachment to the host cell, the virus injects the vRNAP together with its DNA into the host cytoplasm. After injection, the vRNAP transcribes early phage genes, including those of the second RNAP (the non-virion RNAP, nvRNAP). The latter transcribes late genes, including those that encode the vRNAP, which is then packaged into newly assembled phage particles. The exact mechanism of this temporal and spatial activation/regulation of transcription is unclear, but it is known that v- and nvRNAPs recognize different promoters23.
Both v- and nvRNAPs are distantly related to multi-subunit RNAPs (msRNAPs) of bacteria, eukaryotes, and archaea23. The universally conserved core of cellular msRNAPs contains six subunits α2ββ′ω, and the catalytic cavity is formed by β and β′25. However, neither v- nor nvRNAPs contain homologs of the α or ω subunits, and their β and β′ subunits are split into two or three separate genes that are located in different regions of the phage genome. For sequence-specific initiation of transcription, the phage AR9 nvRNAP core must form a complex with a promoter specificity subunit, gene product 226 (gp226), that shows no sequence similarity to any known bacterial, eukaryotic, or archaeal transcription initiation factor. In fact, the amino acid sequence of gp226 was a singleton in the GenBank database at the time of the CASP14 experiment.
Besides employing a unique transcription factor, the AR9 nvRNAP possesses a number of other distinct properties. Unlike any known msRNAP, the AR9 nvRNAP recognizes the promoter in the template strand of double-stranded DNA and can initiate promoter-specific transcription on single-stranded DNA26. Furthermore, as the genomic DNA of bacteriophage AR9 contains deoxyuridine instead of thymidine21, the AR9 nvRNAP is critically sensitive to the presence of uracils in two key positions of its promoter sequence, and promoters with thymines in these positions are not recognized26. To understand the novel and unusual mechanism of promoter recognition by the AR9 nvRNAP, we decided to determine the structure of this enzyme in various states: in complex with the specificity subunit and without it, and in DNA template-bound and DNA-free forms. For the template, we used a short DNA oligonucleotide that contained a promoter recognized by the AR9 nvRNAP in vivo and in vitro.
How AlphaFold2 models helped solve the structure
The most feature-rich and continuous electron density map of the AR9 nvRNAP was initially obtained by cryo-electron microscopy (cryo-EM) imaging of the nvRNAP holoenzyme (i.e. containing the specificity subunit) in complex with the promoter-containing DNA oligonucleotide. This complex contained five polypeptide chains – the specificity subunit gp226, the N- and C-terminal parts of the β subunit gp105 and gp089 (respectively), and the N- and C-terminal parts of the β′ subunit gp270 and gp154 (respectively) – and the DNA oligonucleotide, the structure of which will be described elsewhere. The cryo-EM reconstruction was calculated using cryoSPARC27 and had a resolution of 3.8 Å.
In parallel, several maps of the AR9 nvRNAP β-β′ core (i.e. without the specificity subunit) of varying quality and resolution were obtained using X-ray crystallography. The dataset that produced the best electron density also extended to 3.8 Å resolution, although this map was significantly worse (poorer connectivity and poorer side chain features) than the cryo-EM map. The phases for this dataset were obtained by eight-fold non-crystallographic averaging28,29 of molecular replacement phases30 calculated with the help of a partial model. The latter was built using a single-wavelength anomalous dispersion map of a dataset with a smaller unit cell31-33.
According to HHpred analysis at the time34, the most similar RNAP with a known atomic structure was that of Mycobacterium tuberculosis (PDB code 5ZX3)35. The AR9 nvRNAP gp089, gp270, and gp154 proteins could all be aligned – with 20-24% sequence identity and 100% probability – to continuous stretches of the M. tuberculosis RNAP β and β′ subunits. Gp105 was a more difficult target, with only its C-terminal half predicted to be similar to a fragment of the M. tuberculosis RNAP β subunit, with an 80% probability and an E value of 2.3. The structure of gp226, which was a unique sequence in the entire GenBank, could not be reliably predicted by any tool.
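As an illustration of what a figure such as "20-24% sequence identity" means, the sketch below computes percent identity from a pairwise alignment. It assumes identity is counted over aligned (non-gap) columns only, which is one common convention; HHpred's own definition and its probability and E-value statistics are not reproduced here. The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned (non-gap) columns in which both residues are identical.

    Both inputs are rows of the same pairwise alignment and must therefore
    have equal length. Columns with a gap in either row are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both sequences have a residue
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Hypothetical toy alignment: 6 aligned columns, 5 identical
print(round(percent_identity("MKT-LLVA", "MRTALL-A"), 1))  # 83.3
```

At identities as low as 20-24%, such a score alone says little about homology, which is why profile-based tools report a probability and an E value alongside it.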
Using both the best cryo-EM and X-ray maps of the AR9 nvRNAP and the structure of the M. tuberculosis RNAP as a chain-tracing guide in stretches of high sequence similarity, we manually built ~90% of the AR9 nvRNAP structure19. Some peripheral domains of gp105, gp154, and gp226 and regions for which no homology models existed were particularly challenging. Fortunately, while we were working on improving the cryo-EM map and X-ray phases to make structure building possible in these regions, the models of all five proteins produced by the AlphaFold2 team were made available to us by the CASP14 organizers. To our amazement, the AlphaFold2 models were of excellent quality and fit the cryo-EM and X-ray maps nearly perfectly almost everywhere, including the no-homology regions (Fig. 3). This made completion of the structure building process nearly trivial.