Background: Statistical analysis is central to medical research. Generative artificial intelligence (GenAI) has recently emerged as a potential tool to support data analysis in medical research. Objective: To evaluate the accuracy and reliability of artificial intelligence in performing statistical analyses by comparing its output with previously published statistical results reviewed by professional biostatisticians. Methods: This was an observational, comparative secondary analysis of 14 previously analyzed datasets. The GPT-4o analyses were conducted between May and September 2025 using a standardized prompt chain. A 13-item rubric was used to score each task (2 = identical, 1 = almost identical, and 0 = dissimilar). The dataset-level categories were pre-specified as perfect (100%), good (90%–<100%), and poor (<90%). Results: The GPT-4o model correctly identified the file structure in 13/14 (92.9%) datasets and accurately assessed normality in 12/14 (85.7%). Its errors were mainly schema related: it treated coded categorical fields as numeric and missed some composite recoding, so numeric values sometimes differed even when significance decisions matched the human analyses. Of the 14 datasets, the GPT-4o output was classified as perfect for 2 (14.3%), good for 10 (71.4%), and poor for 2 (14.3%). Conclusion: The GPT-4o model performed well on structured tasks with clear prompts and correct variable typing, often converging with human analyses on significance decisions. However, it struggled with coded categorical data, complex recoding, and publication-ready tables, producing numerical discrepancies in several datasets. While the tested GenAI model could assist early analytical work under expert supervision, further research is warranted, particularly as GenAI models evolve.
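The dataset-level classification described in Methods can be sketched as follows. Note the mapping from the 13 item scores to a percentage (achieved points over the 26-point maximum) is an assumption for illustration, since the abstract does not state how the percentage is derived; the function name `classify_dataset` is likewise hypothetical.

```python
def classify_dataset(item_scores):
    """Classify one dataset from its 13 rubric item scores.

    Each item is scored 2 (identical), 1 (almost identical),
    or 0 (dissimilar). The percentage is assumed to be the
    achieved points over the maximum possible (2 x 13 = 26);
    thresholds follow the pre-specified categories:
    perfect (100%), good (90%-<100%), poor (<90%).
    """
    max_score = 2 * len(item_scores)
    pct = 100 * sum(item_scores) / max_score
    if pct == 100:
        return "perfect", pct
    if pct >= 90:
        return "good", pct
    return "poor", pct

# Example: one "almost identical" item out of 13 gives
# 25/26 = 96.2%, which falls in the "good" band.
category, pct = classify_dataset([2] * 12 + [1])
```

Under these assumptions, a dataset must match the published analysis exactly on all 13 items to be rated perfect, while a single dissimilar item (dropping the score to 24/26 = 92.3%) still leaves it in the good band.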