Tim Repke and 7 more

Introduction
Priority screening has the potential to reduce the number of records that need to be annotated in systematic literature reviews. So-called technology-assisted reviews (TAR) use machine learning with prior include/exclude annotations to continuously rank unseen records by their predicted relevance, so that relevant records are found earlier. In this article, we present a systematic evaluation of methods to determine when it is safe to stop screening when using prioritisation.

Methods
We implement an open-source evaluation framework that features a novel method to generate rankings and simulate priority screening processes for 86 real-world datasets. We use these simulations to evaluate 12 statistical or rule-based (heuristic) stopping methods, testing a range of hyper-parameters for each.

Results
The work-saving potential and performance of stopping criteria rely heavily on 'good' rankings, which a single ranking algorithm typically does not achieve across the entire screening process. Our evaluation shows that existing stopping methods either fail to stop reliably without missing relevant records or fail to utilise the full potential work-savings. Only one method reliably met the set recall target, but it stops conservatively.

Conclusions
Many digital evidence synthesis tools provide priority screening features that are already used in many research projects. However, the theoretical work-savings demonstrated in retrospective simulations of prioritisation can only be unlocked with safe and reproducible stopping criteria. Our results highlight the need for improved stopping methods and for guidelines on how to use priority screening responsibly. We also urge screening platforms to provide indicators, and authors to transparently report metrics, when automating (parts of) their synthesis.
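To illustrate the kind of process being simulated, the minimal sketch below runs a priority-screening simulation over a labelled dataset with a simple heuristic stopping rule (stop after a fixed run of consecutive irrelevant records). The ranker (TF-IDF features with logistic regression), the batch size, and the stopping threshold are illustrative assumptions for this sketch only; they are not the specific ranking algorithm or the 12 stopping methods evaluated in the article.

```python
# Sketch of a simulated priority-screening run with a heuristic stopping rule.
# All model choices and parameters are illustrative assumptions, not the
# methods evaluated in the article. Labels are assumed to be 0 (exclude) / 1 (include).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def simulate_screening(texts, labels, batch_size=50, consecutive_irrelevant=200, seed=0):
    """Screen records in predicted-relevance order; stop once `consecutive_irrelevant`
    excluded records have been seen in a row. Returns (records screened, recall)."""
    rng = np.random.default_rng(seed)
    X = TfidfVectorizer(max_features=5000).fit_transform(texts)
    y = np.asarray(labels)

    # Start with a random seed batch, then rank the remaining records.
    seen = list(rng.choice(len(y), size=batch_size, replace=False))
    irrelevant_run = 0

    while len(seen) < len(y):
        unseen = np.setdiff1d(np.arange(len(y)), seen)
        if len(set(y[seen])) > 1:  # need both classes before a ranker can be fitted
            model = LogisticRegression(max_iter=1000).fit(X[seen], y[seen])
            order = unseen[np.argsort(-model.predict_proba(X[unseen])[:, 1])]
        else:
            order = rng.permutation(unseen)

        for idx in order[:batch_size]:  # screen the next batch in ranked order
            seen.append(idx)
            irrelevant_run = 0 if y[idx] == 1 else irrelevant_run + 1
            if irrelevant_run >= consecutive_irrelevant:
                recall = y[seen].sum() / max(y.sum(), 1)
                return len(seen), recall  # stopping rule triggered

    return len(seen), 1.0  # rule never triggered: everything was screened
```

The returned pair, records screened versus recall at the stopping point, is the kind of trade-off a retrospective simulation can measure for each stopping method and hyper-parameter setting.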