Introduction

Systematic literature reviews (SLRs) are fundamental to evidence-based medicine, serving to synthesize existing research, inform clinical practice, and guide policy decisions. Most systematic reviews follow standardized guidelines, such as the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA), to maintain consistency and methodological rigor throughout the human review process (Liberati et al., 2009). However, the traditional process of conducting SLRs is notoriously time-consuming and resource-intensive, often spanning months or even years and demanding significant human resources. This labor-intensive nature can delay the timely dissemination of critical findings and impede the swift translation of research into practice (Beller et al., 2013).

In recent years, the emergence of artificial intelligence (AI), particularly large language models (LLMs), has shown considerable promise in alleviating the methodological burden of SLRs. Compared with the conventional human review process, many LLMs have demonstrated comparable, and in some cases superior, performance in key steps such as literature screening and study selection. For instance, Matsui et al. found that GPT-4 achieved specificity of 85–99% and sensitivity of 88–96% across test cases in a systematic review of mental health research (Matsui et al., 2024). In that study, GPT-4 missed none of the key studies when using a "3-layer" prompting strategy that refined output iteratively, placing its performance within the 77–100% sensitivity range of human reviewers (Matsui et al., 2024). Similarly, Cao et al. demonstrated that, with well-engineered prompts specifying review objectives and inclusion criteria, LLMs could reach even higher sensitivity than dual-human reviewers in some contexts, while maintaining similar accuracy and drastically reducing the time required to screen thousands of citations (Cao et al., 2024). Lai et al.
further found that a Claude-based system could accurately extract information and perform risk-of-bias assessments in systematic reviews, achieving ≥95% accuracy in automated data extraction from 107 randomized trials (Lai et al., 2025).

Despite these advances, the use of AI for fully independent systematic review remains associated with a high risk of false negatives and variable results (Guo et al., 2024; Matsui et al., 2024; Khraisha et al., 2024; Li et al., 2024). Low sensitivity is particularly concerning because it indicates false negatives: studies containing data crucial to the research question are being excluded. Therefore, retaining human involvement in the screening and review process is essential to ensure accuracy and methodological integrity (Sanghera et al., 2025). Several widely used platforms, including ASReview, Abstrackr, RobotAnalyst, EPPI-Reviewer, DistillerSR, and Elicit, employ machine learning algorithms to prioritize article screening and iteratively refine predictions based on human reviewer input. These tools exemplify the human-in-the-loop model and have demonstrated effectiveness in systematic reviews (Khalil et al., 2022; van Dijk et al., 2023; Yao et al., 2024).

Even with integrated human oversight, ensuring methodological transparency and reproducibility remains critical. The lack of a well-established, standardized protocol for LLM-assisted systematic review produces inconsistencies across LLM-assisted SLRs, making it difficult to compare results across studies or reliably implement LLM tools in practice. This manuscript introduces the Screening of Title and Abstracts, Re-evaluation, and full-text Review (STARR) protocol, a novel standardized approach to SLR screening using LLMs, designed to address these methodological deficiencies and enhance the reliability and efficiency of evidence synthesis.
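To make the sensitivity and specificity figures discussed above concrete, the following sketch shows how these metrics are computed when an LLM's screening decisions are compared against human reference labels. The function name and the label data are hypothetical illustrations, not drawn from any of the cited studies; sensitivity is the fraction of human-included studies the LLM also includes (missed ones are the false negatives of concern), and specificity is the fraction of human-excluded studies it also excludes.

```python
# Hypothetical sketch: sensitivity and specificity of an LLM screener
# against human reference labels. 1 = include, 0 = exclude.

def screening_metrics(human_labels, llm_labels):
    """Return (sensitivity, specificity) of LLM decisions vs. human labels."""
    pairs = list(zip(human_labels, llm_labels))
    tp = sum(1 for h, m in pairs if h == 1 and m == 1)  # correctly included
    fn = sum(1 for h, m in pairs if h == 1 and m == 0)  # wrongly excluded
    tn = sum(1 for h, m in pairs if h == 0 and m == 0)  # correctly excluded
    fp = sum(1 for h, m in pairs if h == 0 and m == 1)  # wrongly included
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Made-up screening decisions for 10 citations.
human = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
llm   = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
sens, spec = screening_metrics(human, llm)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
# → sensitivity=0.75, specificity=0.83
```

The single missed inclusion (the third citation) drops sensitivity to 0.75, illustrating why even a few false negatives at the screening stage can exclude studies critical to the synthesis.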