A computational method for predicting guide RNA activity
According to the results above, a considerable fraction of guide RNAs showed moderate or no activity: only about 50% of the guide RNAs in the library had activity scores > 0.1 (lgscore > -1). This indicates that guide RNAs designed without a rational method cannot be relied on for genome editing. Testing each guide RNA before a gene-editing experiment is time- and labor-consuming, so an in silico method for predicting guide RNA efficacy is needed, and several such methods have recently been developed.[ ] Because we were the first to create a dataset of Cpf1 guide RNA activity in prokaryotes, a new prediction method based on our results could complement current studies and test the generalization ability of methods based on eukaryotic datasets.
We filtered our results by removing low-quality data and established a dataset based on selective2. We randomly split the dataset (90% for training and 10% for testing) to avoid overfitting, and used 10-fold cross-validation to re-evaluate the trained model (Fig. 2a). We then defined a set of features, covering the DNA sequence of the protospacer, the PAM and the base fractions, that converts each sequence in our library into more than 350 binary and continuous features used as inputs. Using the features extracted from each guide RNA, we built a regression model to predict its activity (Fig. 2b). We reshuffled and re-split the dataset randomly and repeated the training 10 times in total; the performance of the trained models did not differ significantly, suggesting that our model is robust and that no bias was introduced by splitting the dataset (Fig. 2c, d).
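The featurization and data-splitting steps described above can be sketched as follows. This is a minimal illustration only, not the feature set actually used (which comprises more than 350 features): the function names, the position-specific one-hot encoding, the fixed random seed, the stride-based fold assignment and the toy sequences are all assumptions made for the example.

```python
import random

NUCLEOTIDES = "ACGT"

def featurize(pam, protospacer):
    """Turn one target site into a flat feature vector (hypothetical encoding).

    Binary features: position-specific one-hot nucleotides over PAM + protospacer.
    Continuous features: fraction of each base in the protospacer.
    """
    pam, protospacer = pam.upper(), protospacer.upper()
    features = []
    for base in pam + protospacer:
        features.extend(1.0 if base == n else 0.0 for n in NUCLEOTIDES)
    for n in NUCLEOTIDES:
        features.append(protospacer.count(n) / len(protospacer))
    return features

def split_train_test(items, test_fraction=0.1, seed=0):
    """Shuffle and split into (train, test); 90%/10% by default."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n_items, k=10):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield train, val

# Toy usage: a made-up PAM and a short made-up protospacer.
vector = featurize("TTTA", "ACGT")
train_ids, test_ids = split_train_test(range(100))
```

Repeating `split_train_test` with different `seed` values mirrors the repeated random re-splitting described in the text; any regression model can then be fitted on the resulting feature vectors.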
To evaluate the performance of our model, we compared our predictions with those of deepCpf1, the most-cited work [36]. We found only a weak correlation between the experimental results and the predictions from deepCpf1, whereas our model was much more predictive, with an average Spearman correlation coefficient of 0.80 (Fig. 2c, d). This indicates that models trained on data from mammalian cells provide limited insight into the guide RNA sequence features that contribute to cleavage activity, presumably because of the impact of chromatin structure and the NHEJ repair pathway.
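For reference, the Spearman correlation coefficient used as the evaluation metric here is the Pearson correlation of the ranks of the two variables. The following is a minimal, dependency-free sketch; the function names are ours, and tied values are assigned their average rank, which is the usual convention but an assumption about how the reported values were computed.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(measured, predicted):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(measured), average_ranks(predicted))
```

Because only ranks are compared, the coefficient rewards a model for ordering guide RNAs correctly even if its scores are on a different scale from the measured activities.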
To verify our finding and explore why the two models differ in predictive power, we further tested how well they identify the most efficient guide RNAs as well as the inefficient ones, since selecting guide RNAs for efficient genome editing is in critical demand in research and clinical applications. After processing every sequence in our library with our model and with deepCpf1, each sequence had a prediction score to set against its experimental activity. We then assessed the ability of each model to distinguish efficient guide RNAs from inefficient ones by comparing the experimentally measured activity scores of the sequences with high prediction scores against those of the sequences with low prediction scores (Fig. 2e). With the scores predicted by our model, the measured activities of the predicted high-score and low-score groups differed significantly, whereas with deepCpf1 they did not, confirming that our model has better predictive capacity. Furthermore, we investigated what the improvement of our model may be attributed to. This time we divided the test library into a high-score group and a low-score group according to experimental activity and compared the prediction scores given by each model (Fig. 2f). As expected, our model performed much better than deepCpf1 in predicting both efficient and inefficient guide RNAs. Interestingly, deepCpf1 showed almost no ability to predict efficient guide RNAs and only a weak ability to predict inefficient ones. This reveals that the inability to characterize high-activity guide RNAs is a primary reason why deepCpf1 underperformed in prokaryotes. A possible explanation for why deepCpf1 is insensitive to data from more efficient guide RNAs is discussed later.
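The group comparison described above can be illustrated with a short sketch. The text does not state the group sizes or the statistical test used, so everything here is an assumption for illustration: the 20% cut-off, the function names, and the use of a pairwise win probability (which equals the Mann-Whitney U statistic divided by the number of pairs) as the separation measure.

```python
def split_by_score(ranking_scores, values, fraction=0.2):
    """Group `values` by `ranking_scores`.

    Returns (high, low): the values belonging to the top and bottom
    `fraction` of items when sorted by ranking score. The 20% default
    is a hypothetical choice, not taken from the text.
    """
    order = sorted(range(len(ranking_scores)), key=lambda i: ranking_scores[i])
    n_group = max(1, int(len(order) * fraction))
    low = [values[i] for i in order[:n_group]]
    high = [values[i] for i in order[-n_group:]]
    return high, low

def prob_high_exceeds_low(high, low):
    """Fraction of (high, low) pairs in which the high-group value is larger.

    Ties count as 0.5. A value of 1.0 means perfect separation and 0.5
    means no separation between the two groups.
    """
    wins = 0.0
    for h in high:
        for l in low:
            if h > l:
                wins += 1.0
            elif h == l:
                wins += 0.5
    return wins / (len(high) * len(low))
```

Calling `split_by_score(predicted, measured)` corresponds to grouping by prediction score and comparing measured activities, while `split_by_score(measured, predicted)` corresponds to the reverse analysis, in which sequences are grouped by experimental activity and the prediction scores are compared.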
Overall, our model predicts guide RNA activity better than current approaches and is especially good at predicting efficient guide RNAs, at least in prokaryotes, where it was developed.
We next investigated the sequence features contributing to guide RNA activity. We focused mainly on the sequence composition of the protospacer, alongside other reported factors including GC content and melting temperature. A linear model was used to plot the coefficients of position-dependent dimers and trimers (Fig. 3a, b, c). Results of a T7 endonuclease I (T7EI) assay showed that guide RNA sequence features could also affect genome editing in human cells (Fig. 3d). Notably, dimers and trimers showed approximately equal effects at each position in the seed region of the protospacer when their distance from the PAM was considered, whereas the first single nucleotide was previously known as a stronger factor. We also observed a promoting effect of AH dimers and AHN trimers and an inhibitory effect of GB and TK dimers and GBN trimers at certain positions. Our findings were consistent with the results of another, independent activity-profiling screen, although we used a larger library and provide a more comprehensive and convincing model.[33] These effects may be attributed to the expression level or stability of the guide RNA as well as to the interaction of the Cpf1-crRNA complex with its DNA substrate.
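The dimer and trimer patterns named above (AH, AHN, GB, TK, GBN) appear to use the standard IUPAC degenerate nucleotide codes, in which H stands for A, C or T, B for C, G or T, K for G or T, and N for any base. Under that assumption, position-dependent k-mer features and pattern matches can be sketched as follows; the function names, the toy sequence and the convention that position 1 is the PAM-proximal end of the protospacer are ours.

```python
# Subset of the IUPAC nucleotide codes needed for the patterns in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "H": "ACT",   # not G
    "B": "CGT",   # not A
    "K": "GT",
    "N": "ACGT",  # any base
}

def matches(pattern, kmer):
    """True if a concrete k-mer is covered by a degenerate IUPAC pattern."""
    return len(pattern) == len(kmer) and all(
        base in IUPAC[code] for code, base in zip(pattern, kmer)
    )

def kmer_positions(protospacer, pattern):
    """1-based start positions at which `pattern` matches the protospacer."""
    seq = protospacer.upper()
    k = len(pattern)
    return [i + 1 for i in range(len(seq) - k + 1) if matches(pattern, seq[i:i + k])]

def position_kmer_features(protospacer, k):
    """Binary position-dependent k-mer features as a set of (position, k-mer)."""
    seq = protospacer.upper()
    return {(i + 1, seq[i:i + k]) for i in range(len(seq) - k + 1)}

# Toy usage on a made-up 8-nt sequence.
print(kmer_positions("ATGCACGT", "AH"))  # [1, 5]
print(kmer_positions("ATGCACGT", "GB"))  # [3, 7]
print(kmer_positions("ATGCACGT", "TK"))  # [2]
```

Each (position, k-mer) pair would enter a linear model as one binary input, so that the fitted coefficient of a pair indicates whether that k-mer at that position is associated with higher or lower activity.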