Figure 8. Scatter plot of evaluation units (EUs) in CASP14 (A, left) and CASP15 (B, right) represented by sequence (HHscore, Y-axis) and structure (LGA_S, X-axis) scores of the top template. Evaluation units in the left panel are marked according to the difficulty categories as manually assigned in CASP14: full squares – TBM-easy; hollow squares – TBM-hard; hollow triangles – TBM/FM; full triangles – FM. Targets of the same difficulty cluster together in the suggested (X,Y) axes. An automatic delineation of EUs into four classes (X+Y<70, red; 70-100, yellow; 100-130, green; >130, blue) based on the results of sequence- and structure-based searches of the PDB is suggested to mimic the CASP14 difficulty categories. The schema is applied to define target prediction classes in CASP15 (right panel).
Two scores, HHscore and LGA_S, describing the sequence- and structure-based relationships of the target with PDB entries, are defined in Methods. They are plotted against each other for all EUs in CASP14 and CASP15 (Figure 8). The classification of the CASP14 data resulting from the previous procedures [10], which were based partly on predictor performance and involved manual intervention, is indicated by symbols in panel A. The plot reveals that TBM-easy and FM EUs cluster in the upper right and lower left corners, respectively, while TBM-hard and TBM/FM EUs predominantly occupy the areas immediately above and below the diagonal, respectively. It can also be seen that all but two of the triangle markers (FM and TBM/FM targets) lie below the diagonal and all but one of the squares (TBM-easy and TBM-hard targets) lie above it. Thus, if we take the diagonal line (HHscore+LGA_S=100) as the boundary between the wider TBM category (TBM-easy and TBM-hard together) and the wider FM category (FM and TBM/FM together), there are only three targets for which the prior CASP14 and the current automated classification schemes disagree.
To further delineate TBM-easy from TBM-hard, and FM from TBM/FM, we draw two lines parallel to the diagonal. These lines were drawn symmetrically so that the areas between them and the diagonal include the majority of the TBM-hard (upper) and TBM/FM (lower) EUs without encroaching deeply into the TBM-easy and FM territory. Based on the CASP14 data, the split lines were drawn at HHscore+LGA_S=70 and HHscore+LGA_S=130. We also experimented with several other splitting schemas (such as rectangular or spherical divisions) and found the linear split to be both the simplest and the best fit to the CASP14 and CASP13 target classifications. When the suggested schema is applied to the classification of CASP15 EUs (Figure 8B), the points in the graph are well separated, with particularly clear clustering in the FM and TBM-easy zones.
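The four-class delineation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the assessors' actual implementation: the function name `classify_eu` and the example scores are hypothetical, and the treatment of points lying exactly on a split line (sums of exactly 70, 100 or 130) is an assumption chosen to agree with the "<70" and ">130" ranges given in the Figure 8 caption.

```python
def classify_eu(hhscore: float, lga_s: float) -> str:
    """Assign an evaluation unit to a prediction class from its two scores.

    Thresholds (70, 100, 130) are the split lines quoted in the text.
    Boundary handling is an assumption: a sum of exactly 70 goes to TBM/FM,
    exactly 100 goes to TBM-hard, and exactly 130 stays in TBM-hard.
    """
    total = hhscore + lga_s
    if total < 70:
        return "FM"
    if total < 100:
        return "TBM/FM"
    if total <= 130:
        return "TBM-hard"
    return "TBM-easy"


# Hypothetical (HHscore, LGA_S) pairs, for illustration only.
examples = [(10.0, 35.0), (40.0, 45.0), (55.0, 60.0), (90.0, 85.0)]
labels = [classify_eu(hh, lga) for hh, lga in examples]
```

Because the split depends only on the sum of the two scores, the class boundaries are straight lines parallel to the diagonal of the plot, which is what makes this schema simpler than rectangular or radial alternatives.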
Using this classification approach, the CASP15 EUs were automatically assigned to four largely homology-based prediction classes (see Figure 1B and Table 1). Forty-seven EUs were assigned to the TBM-easy class, 15 to TBM-hard, 8 to TBM/FM, and 39 (~35%) to FM, the class with the weakest or no evolutionary relation to available folds. These data show that the CASP15 target set was one of the most difficult (homology-wise) in the whole history of CASP. For comparison, the FM class constituted only 24% of all targets in CASP14 and 27% in CASP13. Conceivably, this rise may already illustrate the impact of AF2 on target selection in structural biology: experimentalists may be switching their attention to more structurally novel targets, with which AF2 still struggles.
As discussed in more detail elsewhere [21,29], FM targets comprise the majority of those with which even the top predictive methods struggled, although some FM targets were well predicted. Thus, even though it is well known that AF2 (on which most predictive methods were based) generalizes beyond its training set, the absence of similar structural folds in the PDB still leads to a greater risk of predictive failure. Further factors predisposing a target to less accurate prediction appear to include a shallow multiple sequence alignment (MSA): evolutionary covariance information extracted from MSAs is known to be required for accurate modelling of natural proteins by AF2 [30,31], potentially in order to obtain a sufficiently accurate initial structure estimate. However, particularly given the relatively small number of problematic targets in each CASP, a deeper study of this subject is needed, and deep learning methods could help with this task.