Thomas Thesen

Developing high-quality pharmacology multiple-choice questions (MCQs) is challenging, largely because therapeutic guidelines evolve continually and the subject requires a complex integration of basic science and clinical medicine. Large language models (LLMs) like ChatGPT-4 have repeatedly demonstrated proficiency in answering medical licensing exam questions, prompting interest in their use for generating high-stakes exam-style questions. This study evaluates the performance of ChatGPT-4o in generating USMLE-style pharmacology questions based on ASPET/AMSPC Knowledge Objectives and assesses the impact of retrieval-augmented generation (RAG) on question accuracy and quality. Using standardized prompts, 50 questions (25 RAG, 25 non-RAG) were generated and subsequently evaluated by expert reviewers. Accuracy was higher for non-RAG questions (88.0% vs 69.2%), though the difference was not statistically significant. No significant differences were observed in the other quality dimensions. These findings suggest that sophisticated LLMs can generate high-quality pharmacology questions efficiently without RAG, though human oversight remains crucial.
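To illustrate how an accuracy difference of this size can fail to reach statistical significance in a small sample, the following is a minimal sketch of a two-sided Fisher exact test on a 2x2 table of accurate versus inaccurate questions. The abstract does not say which test the study used, and it does not report the underlying counts, so both the choice of test and the counts in the usage example are assumptions made purely for illustration. The function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are the two groups (e.g. non-RAG, RAG); columns are
    accurate / inaccurate question counts.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x "accurate" items in row 1,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of every table that is no more likely than
    # the observed one (small tolerance guards against float rounding).
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # ASSUMED, illustrative counts only (not taken from the study):
    # 22 of 25 accurate in one group, 18 of 26 accurate in the other.
    p = fisher_exact_two_sided(22, 3, 18, 8)
    print(f"two-sided p-value: {p:.3f}")
```

With groups of roughly 25 questions each, a difference of a few inaccurate questions is well within what chance alone can produce, which is why a gap of this magnitude need not be significant at the conventional 0.05 level.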