Objective: To evaluate the accuracy, consistency, and reliability of three large language models (ChatGPT-4.0, DeepSeek-R1, and Gemini-2.0) in answering cervical cancer questions based on the ESGO/ESTRO/ESP guidelines.

Design: Prospective, comparative in silico benchmarking study.

Setting: Fondazione Policlinico Universitario A. Gemelli, Italy.

Population or Sample: Fifty questions derived from the ESGO/ESTRO/ESP Guidelines for Cervical Cancer.

Methods: Each question was submitted to ChatGPT-4.0, DeepSeek-R1, and Gemini-2.0 and then re-entered twice, so that each model answered every question three times, allowing assessment of response repeatability. Accuracy was rated with a Global Quality Score (GQS) from 1 (poor) to 5 (completely accurate). Consistency (intra-model stability across repeated answers) and reliability (alignment with the guidelines) were assessed as binary outcomes.

Main Outcome Measures: Mean GQS, percentage of responses rated GQS 5, consistency between repeated answers, and reliability.

Results: ChatGPT-4.0 achieved the highest performance, with 42% of responses rated GQS 5, followed by Gemini-2.0 (30%) and DeepSeek-R1 (28%). DeepSeek-R1 had a significantly lower mean GQS (3.02 ± 1.67) than ChatGPT-4.0 (3.74 ± 1.41; p = 0.022). Response consistency also differed significantly: both ChatGPT-4.0 and DeepSeek-R1 differed from Gemini-2.0 (p = 0.034 and p = 0.044, respectively). No significant difference was observed in reliability (p = 0.602).

Conclusion: All three models showed suboptimal accuracy in applying the cervical cancer guidelines. ChatGPT-4.0 was the most accurate and consistent, whereas DeepSeek-R1 underperformed. Although reliability was similar across models, expert oversight remains essential to ensure safe clinical application and to prevent misinformation.

Funding: None.
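The abstract does not name the statistical test behind the reported p-values. As a hedged illustration only, the Python sketch below shows one plausible analysis pipeline under the stated design: per-question GQS scores compared between two models with a Mann-Whitney U test (a common choice for ordinal 1-5 ratings), and consistency tallied as a binary flag over three repeated answers per question. All arrays and answer strings are invented placeholders, not the study's data.

```python
# Hypothetical sketch of the GQS comparison and consistency tally.
# Placeholder data only; the abstract does not state which test
# produced p = 0.022, so Mann-Whitney U is an assumption here.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# 50 questions, each rated GQS 1-5 per model (placeholder values).
gqs_chatgpt = rng.integers(1, 6, size=50)
gqs_deepseek = rng.integers(1, 6, size=50)

print(f"Mean GQS ChatGPT-4.0:  {gqs_chatgpt.mean():.2f} ± {gqs_chatgpt.std(ddof=1):.2f}")
print(f"Mean GQS DeepSeek-R1:  {gqs_deepseek.mean():.2f} ± {gqs_deepseek.std(ddof=1):.2f}")

# Ordinal scores: a two-sided nonparametric test is a reasonable choice.
stat, p = mannwhitneyu(gqs_chatgpt, gqs_deepseek, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.3f}")

# Consistency as described in Methods: each question is answered three
# times; a question is "consistent" (binary) if all three answers match.
answers = [["A", "A", "A"], ["B", "C", "B"], ["D", "D", "D"]]  # placeholders
consistent = sum(len(set(triple)) == 1 for triple in answers)
print(f"Consistent questions: {consistent}/{len(answers)}")
```

The same binary tally could feed a chi-square or Fisher exact test for the between-model consistency and reliability comparisons; which test the authors actually used is not stated in the abstract.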