Figure 1: Distributions across PLC pockets of A) experimental resolution, B) the difference between R and Rfree, C) ligand RSCC, and D) the percentage of protein atoms within 6 Å of the ligand that have an RSCC > 0.8. Pockets are divided into categories depending on the number of rotatable bonds of the ligands they contain. In each panel, the black line shows the suggested threshold, and the percentage of pockets passing this criterion is displayed.
In an automated benchmarking setting such as CAMEO, where quality information is not available at the time the targets are selected, filtering out even half of the data after predictions have been generated would be unfortunate, indicating that even the relaxed criteria are too stringent as a post-filter. An alternative would be to take quality into account in the scoring process and downweight low-quality regions of a structure in aggregate scores, without removing the entire target. Ideally, an atom-level weighting would be used, especially for larger ligands, which can display variable levels of quality within the residue itself. Unfortunately, the PDB does not make atom-level quality information available in the validation reports at the time of writing, and the only atom-level information available would be the occupancy values, which are part of the structural data.
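The downweighting idea can be illustrated with a minimal sketch. It is not a scoring scheme proposed here: the function names, the linear ramp, and the `floor` and `full` values are all hypothetical choices made for illustration, and the weights could equally be derived from occupancy or any other quality indicator.

```python
from typing import Sequence


def rscc_weight(rscc: float, floor: float = 0.5, full: float = 0.8) -> float:
    """Map an RSCC value to a weight in [0, 1].

    Hypothetical linear ramp: weight 0 at or below `floor`, weight 1 at or
    above `full`. Both cut-offs are illustrative assumptions.
    """
    if rscc >= full:
        return 1.0
    if rscc <= floor:
        return 0.0
    return (rscc - floor) / (full - floor)


def weighted_aggregate(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Quality-weighted mean of per-ligand (or per-atom) scores."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Two ligands in one target: a well-resolved one and a poorly resolved one.
per_ligand_scores = [0.9, 0.2]
per_ligand_rscc = [0.95, 0.50]
weights = [rscc_weight(r) for r in per_ligand_rscc]
target_score = weighted_aggregate(per_ligand_scores, weights)
```

With such a scheme the poorly resolved ligand contributes little or nothing to the aggregate, while the target itself stays in the benchmark.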
However, analyzing and incorporating validation data is a critical step towards creating a representative dataset for other benchmarking settings. For example, of the 255 small molecule pockets in the PDBBind time-split test-set, 105 do not pass the relaxed criteria, which could bias the results seen in recent benchmarking efforts using this test set. Previous efforts have been made to create high-quality subsets of PDBBind specifically for evaluation purposes28. However, these produced very small test sets, unlikely to be representative of the entire protein-ligand space. The stringent Iridium criteria, the suggested relaxed criteria, and the assessment of novelty and diversity described in the next section form the basis for the creation of a representative benchmark dataset. Indeed, similar efforts to create benchmark sets for PLC are ongoing in the ELIXIR 3D-BioInfo community29. The results of that initiative could be incorporated in this assessment once they are available.

2.2 Is a protein-ligand complex target interesting to assess?

In the context of large-scale structural databases such as the PDB, it is possible to encounter several very similar PLC, or complexes with the same protein and ligand that have been crystallized under different experimental conditions or solved by different experimental methods. When it comes to automated benchmarking of PLC prediction, an important aspect to consider besides the quality of the structure is the novelty of the PLC to assess.
The CASP15 CASP-PLI assessment9 highlighted the superiority of template-based methods for modeling PLC accurately. While most top predictions were produced by human groups rather than automated methods, it is likely that automated methods will in the future also leverage template information to predict PLC. Therefore, when generating a benchmarking dataset for PLC prediction, we need to ensure that PLC are not already represented in the PDB. For a challenge such as CAMEO, the exact protein conformation and the pose of the ligand within the protein complex are unknown. Thus, we use the sequence as a proxy for protein novelty. As very similar ligands can have striking differences in their poses, and we would like to retain as many PLC as possible in the CAMEO pre-filtering setting, we use ligand names as a proxy for the novelty of the ligand pose. To that end, we investigated the novelty of the 236,538 small molecule pockets across 75,065 PLC and 32,273 unique small-molecule ligands described in section 1.1.
We assessed the novelty of PLC released every year in the PDB by verifying whether a particular combination of polymer entities and ligands was present in previously released structures. For that purpose, we performed sequence-based clustering of all polymer entities, followed by the assignment of an identifier to each PLC entry, consisting of the sequence cluster identifiers of each entity and the chemical component codes of the ligands present in the PLC. Using different minimum sequence identity thresholds helps reveal the level of novelty between the entities of a PLC compared to previously seen PLC. Similarly, even for PLC with identical proteins, the combination of ligands seen may differ. The distribution of sequence clusters and ligand combinations seen per year is shown in Figure 2, along with the fraction of PLC that pass the relaxed quality criteria from Section 1. For example, the four different bars for the 70-90% cluster in the year 2022 represent, in order, (1) all PLC released in 2022 where every entity in the PLC has 70-90% identity to every entity in a matching PLC from a previous year but the ligands are not all the same, (2) same as (1) but only the PLC passing the relaxed quality criteria from Section 1, (3) all PLC released in 2022 where every entity has 70-90% identity to every entity in a matching PLC from a previous year and the ligands are all the same, and (4) same as (3) but only the PLC passing the relaxed quality criteria from Section 1.
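The identifier assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the analysis: the entity names, cluster labels, and function names are hypothetical, the cluster lookup is assumed to come from a clustering run at one chosen identity threshold, and duplicate clusters or ligands within a PLC are collapsed, which is a simplification.

```python
from typing import Dict, Iterable, Set, Tuple

ClusterKey = Tuple[str, ...]
LigandKey = Tuple[str, ...]
PlcIdentifier = Tuple[ClusterKey, LigandKey]


def plc_identifier(
    entity_ids: Iterable[str],
    ligand_codes: Iterable[str],
    cluster_of: Dict[str, str],
) -> PlcIdentifier:
    """Build an order-independent identifier for one PLC.

    `cluster_of` maps each polymer entity to its sequence cluster at a given
    identity threshold (hypothetical input, produced by any clustering tool).
    """
    clusters = tuple(sorted({cluster_of[e] for e in entity_ids}))
    ligands = tuple(sorted(set(ligand_codes)))
    return clusters, ligands


def compare_to_history(
    identifier: PlcIdentifier, history: Dict[ClusterKey, Set[LigandKey]]
) -> str:
    """Compare a newly released PLC with identifiers from earlier years."""
    clusters, ligands = identifier
    if clusters not in history:
        return "new cluster combination"
    if ligands in history[clusters]:
        return "same clusters, same ligands"
    return "same clusters, different ligands"


def record(identifier: PlcIdentifier, history: Dict[ClusterKey, Set[LigandKey]]) -> None:
    """Add a PLC identifier to the set of previously seen combinations."""
    clusters, ligands = identifier
    history.setdefault(clusters, set()).add(ligands)


# Toy data: two entities fall in cluster "c17", one in cluster "c42".
cluster_of = {"1abc_1": "c17", "2xyz_1": "c17", "3def_1": "c42"}
history: Dict[ClusterKey, Set[LigandKey]] = {}
record(plc_identifier(["1abc_1"], ["ATP", "MG"], cluster_of), history)

later = plc_identifier(["2xyz_1"], ["MG", "ATP"], cluster_of)
status = compare_to_history(later, history)
```

Repeating the comparison with cluster assignments from several identity thresholds yields the per-threshold breakdown, and the two "same clusters" outcomes correspond to the ligand-combination distinction made between bars (1) and (3).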
We see that, from the protein perspective, 78.85% of PLC (and 71.83% of valid PLC) released in 2022 have at least 30% sequence identity to a matching PLC from previous years (across all entities). However, most of these (79.14%) still have different combinations of ligands, indicating that they may still be interesting to assess for PLC prediction. We consider two different minimum sequence identity thresholds, 30% for creating a diverse dataset and 90% for PLC prediction in CAMEO, and define a PLC as novel if the minimum sequence identity between any of its entities is less than the threshold in all matching PLC, or at least one ligand in the PLC is not seen in matching PLC. With these classification criteria, we found that out of all the PLC released in 2022, 4515 (83.55%) were novel and 889 were redundant at a threshold of 30%, and 4833 (89.43%) were novel and 571 were redundant at a threshold of 90%. Hence, even at 30% sequence identity, 83.55% of all released structures contained some kind of novelty, with at least one previously unseen protein (entity)-ligand combination. Among the PLC that passed the validation criteria, 2202 (86.76%) were novel and 336 were redundant at a threshold of 30%, and 2360 (92.99%) were novel and 178 were redundant at a threshold of 90%.
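The novelty definition and the reported percentages can be illustrated with a short sketch. It rests on assumptions that the text does not settle: each earlier PLC is summarized by a hypothetical pair (minimum entity identity in percent, ligand codes), and "not seen in matching PLC" is read as "absent from the union of ligands over all matching PLC". The function names are illustrative only.

```python
from typing import Iterable, List, Set, Tuple

# (minimum sequence identity to the query PLC in percent, ligand codes)
Match = Tuple[float, Set[str]]


def is_novel(ligands: Iterable[str], matches: Iterable[Match], threshold: float) -> bool:
    """One reading of the novelty rule for a single PLC.

    Novel if no earlier PLC reaches `threshold` on all entities, or if at
    least one ligand is absent from every earlier PLC that does.
    """
    close: List[Set[str]] = [m_ligs for ident, m_ligs in matches if ident >= threshold]
    if not close:
        return True  # novel from the protein perspective
    seen: Set[str] = set().union(*close)
    return any(lig not in seen for lig in ligands)  # novel from the ligand perspective


def novel_percentage(n_novel: int, n_redundant: int) -> float:
    """Percentage of novel PLC, rounded to two decimals."""
    return round(100.0 * n_novel / (n_novel + n_redundant), 2)


# Counts quoted in the text for PLC released in 2022.
all_30 = novel_percentage(4515, 889)    # all PLC, 30% threshold
all_90 = novel_percentage(4833, 571)    # all PLC, 90% threshold
valid_30 = novel_percentage(2202, 336)  # validated PLC, 30% threshold
valid_90 = novel_percentage(2360, 178)  # validated PLC, 90% threshold
```

The novel and redundant counts sum to 5404 PLC overall and 2538 validated PLC at both thresholds, and the resulting percentages agree with the values quoted above.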
Thus, most newly released PLC are novel from either the protein or the ligand perspective. However, 10-20% of the PLC released each year are redundant, and more than half of these are highly redundant (90-100% sequence identity and the same ligands). The PDBBind time-split test-set also suffers from a high degree of redundancy, with 62% of the test-set proteins having >90% sequence identity with other test-set proteins and 59% having >90% identity to proteins in the training-set. This indicates that this set would not be able to accurately represent protein-ligand space, even if all the ligands were chemically dissimilar, which is not the case.