Introduction
The class 2 clustered regularly interspaced short palindromic repeats (CRISPR)-CRISPR-associated protein (Cas) systems, which are derived from prokaryotic immune systems, have been identified as programmable, RNA-guided nucleases.[1-7] Generally, each CRISPR-Cas system is composed of Cas proteins and a guide RNA. In a broad spectrum of eukaryotic and prokaryotic species, CRISPR/Cas9 and CRISPR/Cpf1 can be expressed heterologously with their corresponding guide RNAs to target complementary DNA sequences, which makes them powerful genome editing tools.[8-11] Cpf1 has been reported to differ from Cas9 in several respects: first, Cpf1 processes its own guide RNAs and does not require a tracrRNA; second, the distance between the seed sequence and the cleavage site is longer; third, Cpf1 recognizes a thymidine-rich PAM sequence; fourth, Cpf1 cleavage generates 5′ overhangs.[12,13,14] These features allow Cpf1 to expand the toolkit for genome editing.[15,16,17]
A general issue in the application of Cpf1 is the unpredictable success of guide RNA design.[18,19] Only limited information is available on the relationship between guide RNA sequence and activity. A number of tools and applications have been developed to predict guide RNA performance for Cas9.[20-28] It may seem that guide RNA design for Cpf1 would benefit from this information and these strategies. Recent studies of Cpf1 have attempted to describe the guide RNA sequence-activity relationship and have presented algorithms to predict the activity of Cpf1 guide RNAs.[20-22]
Nevertheless, such approaches were developed in mammalian cell lines, where Cpf1 activity at endogenous sites was found to be affected by chromatin accessibility as well as target sequence composition. In addition, the known preference of the nonhomologous end-joining (NHEJ) pathway for different double-strand break (DSB) substrates may also reshape the guide RNA activity landscape.
To exclude these factors and gain more general insights into the relationship between guide RNA sequence and activity, we performed high-throughput screening experiments and collected large-scale datasets in E. coli cells, in which the NHEJ molecular machinery is entirely absent.
In this paper, we describe a library of >12,500 target sequence and guide RNA pairs and evaluate guide RNA activity in E. coli by associating CRISPR/Cpf1-induced DNA cleavage with cellular lethality. Guide RNA activity showed significant diversity across the library. Notably, current guide RNA activity prediction models showed Spearman correlations of only 0.56 when tested with our data. We therefore propose a computational approach to Cpf1 guide RNA design that predicts efficient and inefficient guide RNAs with improved performance, reaching a Spearman correlation of 0.80. Lastly, our model identifies important guide RNA sequence features that contribute to DNA cleavage.
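For readers less familiar with the metric, the Spearman correlation used above to compare predicted and measured guide RNA activity is the Pearson correlation of the two rank vectors. The following is a minimal sketch of that calculation and not the evaluation code used in this study; the function names `average_ranks` and `spearman`, and the small activity and score lists in the usage note, are hypothetical and purely illustrative.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # mean of 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical example values, not data from this study.
measured_activity = [0.10, 0.35, 0.50, 0.80, 0.95]
predicted_score = [0.30, 0.20, 0.70, 0.60, 0.90]
rho = spearman(measured_activity, predicted_score)
```

Because only ranks enter the calculation, the metric rewards a model for ordering guide RNAs correctly even if its scores are on a different scale from the measured activities.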