Ana M. Barragán et al.
Introduction: Artificial intelligence (AI) is a branch of technology that enables machines to emulate complex human skills; it can also entail problem-solving using bioinspired methods. It is used to automate systematic literature reviews (SLRs), i.e., defining a clinical question, locating relevant literature, preliminary screening, study evaluation, and data extraction and analysis. Preliminary reference screening is one of the most time-consuming and error-prone phases of developing a systematic review. While AI promises to expedite this process, its adoption faces challenges due to concerns about compatibility and transparency. This review aimed to identify current evidence concerning the use of AI during preliminary SLR reference screening; it describes characteristics such as the metrics used for reporting performance and how the different algorithms, pipelines, workflows, and web applications are validated. AI resource users' reflections on SLR screening automation have also been summarised.

Methods: A scoping review was conducted following the Joanna Briggs Institute (JBI) methodology. Its objective was to identify existing evidence regarding the use of AI-based resources for preliminary reference screening, characterising the AI resources, the subject areas of the datasets used, the metrics reported for assessing a tool's performance, each tool's usefulness, and the use of open-source code. Searches were limited to articles published between 2019 and 2025. The results include frequency descriptions, tables and figures, and a decision flowchart reflecting the number of references and articles retrieved, excluded, or included in the final analysis. The article authors' experiences of AI-based preliminary reference screening were also compiled.

Results: The scoping review identified 99 studies published from 2019 to 2025, which were grouped as web applications, model comparisons, generative models, pre-trained models, or pipelines/workflows used for preliminary SLR reference screening. Most studies came from North America and Europe. Evaluation of these tools often relied on retrospective comparisons with human reviewers' work, workload reduction being the most frequently reported metric of utility. Considerations concerning the use of AI resources focused on the need for standardised evaluation metrics, stopping criteria, study designs, and datasets; resource characteristics facilitating usability; best practice; and future research areas.

Conclusion: The findings indicated substantial heterogeneity in the types of AI resources used, considerable variation in the metrics used for reporting performance, differences in how such metrics are defined, and a clear need for standardised reporting methods, study designs, and related procedures.
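The abstract reports workload reduction as the most common utility metric but does not define it. One widely used formalisation in the screening-automation literature is work saved over sampling at a fixed recall, conventionally WSS@95. The sketch below is illustrative only and is not taken from the review: it assumes a retrospective comparison in which each reference has a human include/exclude label and a model prediction, and all counts, names, and thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScreeningCounts:
    """Confusion-matrix counts from a retrospective screening comparison."""
    tp: int  # relevant references the model retained for full-text review
    fp: int  # irrelevant references the model retained
    tn: int  # irrelevant references the model excluded
    fn: int  # relevant references the model wrongly excluded


def recall(c: ScreeningCounts) -> float:
    """Sensitivity: share of truly relevant references the model retained."""
    return c.tp / (c.tp + c.fn)


def precision(c: ScreeningCounts) -> float:
    """Share of model-retained references that were truly relevant."""
    return c.tp / (c.tp + c.fp)


def wss(c: ScreeningCounts, target_recall: float = 0.95) -> float:
    """Work saved over sampling at a fixed recall (WSS@95 by convention):
    the fraction of screening effort avoided, relative to random screening
    that reaches the same recall."""
    n = c.tp + c.fp + c.tn + c.fn
    return (c.tn + c.fn) / n - (1.0 - target_recall)


# Hypothetical counts from comparing a tool against human reviewers.
counts = ScreeningCounts(tp=190, fp=310, tn=4480, fn=10)
print(f"recall:    {recall(counts):.3f}")   # 0.950
print(f"precision: {precision(counts):.3f}")  # 0.380
print(f"WSS@95:    {wss(counts):.3f}")      # 0.850
```

Fixing recall at 0.95 reflects the convention that a screening tool should miss at most 5% of relevant references; the WSS value then expresses how much of the manual screening burden is saved at that recall. Because the review found this and similar metrics defined inconsistently across studies, any reported value should be read alongside the exact formula the authors used.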