Ankita Sood


Introduction

Systematic literature reviews (SLRs) are a vital component of evidence-based research, guiding healthcare decisions and informing policymaking. However, the traditional process of conducting them can be lengthy, labor-intensive, and costly. This motivates more efficient strategies, such as automation with generative artificial intelligence (AI), which could help researchers reduce their workload and streamline the SLR process. The current study investigates the relative efficiency of three generative AI models (Claude Sonnet 3.5, Gemini Flash 1.5, and GPT-4) in the title and abstract screening phase of SLRs.

Methods

Key biomedical databases, including Embase®, Medline®, and Cochrane, were searched to identify relevant randomised controlled trials in patients with schizophrenia. This study used a hybrid approach for systematic reviews, in which one reviewer was a human expert and the other leveraged three large language models (LLMs). A subject matter expert (SME) in conducting SLRs optimized and fine-tuned the final prompt, delivered through a Python application programming interface (API), to identify evidence meeting key inclusion and exclusion criteria. The screening results obtained from the human reviewer and the three AI models were reviewed by the SME. The AI models' performance was evaluated using metrics such as accuracy, sensitivity, specificity, and precision to assess their success in identifying publications included in the final SLR.

Results

All three AI models performed exceptionally well in title and abstract screening. While there were no significant differences in accuracy rates, Gemini Flash 1.5 achieved the highest accuracy at 96.02%, followed by GPT-4 (95.00%) and Claude Sonnet 3.5 (94.69%). In terms of sensitivity, GPT-4 performed best at 95.97%, followed by Gemini Flash 1.5 at 94.63% and Claude Sonnet 3.5 at 88.59%.
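The reported metrics can be derived from a standard screening confusion matrix, where "included" publications are the positive class. The sketch below shows the conventional definitions; all counts are illustrative placeholders, not the study's data.

```python
# Hedged sketch: computing screening metrics from a confusion matrix.
# All counts below are illustrative placeholders, not the study's results.

def screening_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, specificity, and precision."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # overall correct screening decisions
        "sensitivity": tp / (tp + fn),   # truly relevant records correctly kept
        "specificity": tn / (tn + fp),   # irrelevant records correctly excluded
        "precision": tp / (tp + fp),     # kept records that were truly relevant
    }

# Example with made-up counts:
m = screening_metrics(tp=140, fp=10, tn=820, fn=6)
print({k: round(v, 4) for k, v in m.items()})
```

For SLR screening, sensitivity is usually the metric of greatest interest, since a missed relevant publication (false negative) cannot be recovered in later review stages.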
Among the AI models evaluated, GPT-4 demonstrated the highest concordance with the human reviewer at 88.77%, followed closely by Gemini Flash 1.5 at 86.63% and Claude Sonnet 3.5 at 85.81%, indicating a consistently high level of agreement across all models.
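Concordance here can be read as simple percent agreement between each model's include/exclude decisions and the human reviewer's. A minimal sketch of that calculation follows; the decision lists are hypothetical examples, not the study's records.

```python
# Hedged sketch: percent agreement between an AI model's screening
# decisions and a human reviewer's. The decision lists are hypothetical.

def percent_agreement(ai_decisions, human_decisions):
    """Fraction of records where both reviewers made the same call."""
    if len(ai_decisions) != len(human_decisions):
        raise ValueError("decision lists must cover the same records")
    matches = sum(a == h for a, h in zip(ai_decisions, human_decisions))
    return matches / len(ai_decisions)

ai = ["include", "exclude", "include", "exclude", "exclude"]
human = ["include", "exclude", "exclude", "exclude", "exclude"]
print(f"{percent_agreement(ai, human):.2%}")  # 4 of 5 decisions match: 80.00%
```

Percent agreement does not correct for chance agreement; a chance-corrected statistic such as Cohen's kappa would be a stricter measure of inter-rater reliability.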