Background
Real-world patient scenarios often involve a composite of text and various clinical images. We explored the performance of multimodal large language models in diagnosing complex infectious disease cases containing both text and clinical images.

Methods
We assessed the performance of two publicly accessible large language models, GPT-4 and Claude Opus, on a series of 25 complex infectious disease scenarios with 67 clinical images sourced from [www.IDImages.org](http://www.IDImages.org). Accuracy in identifying the correct diagnosis, and in including it within the differential diagnosis, was evaluated with a binary scoring system. Image description quality was assessed on a Likert scale by comparing the models' outputs against the descriptions provided with each case. Scoring was conducted by two independent clinicians, and final scores were computed as the average of the two ratings (sketched below).

Results
On a 0-to-1 scale, Opus achieved higher mean accuracy (0.74) than GPT-4 (0.56) in identifying the correct diagnosis. However, both models showed comparably high accuracy in including the primary diagnosis within the differential diagnosis, with scores of 0.90 for GPT-4 and 0.88 for Opus. By image type, both models performed best at interpreting images of physical findings, followed by radiographic images and pathology specimens. No statistically significant differences were found between the two models in their overall ability to describe images, either overall or within any image category.

Conclusions
The two models demonstrated comparable accuracy in handling complex infectious disease scenarios that integrated both text and images, suggesting potential utility in enhancing clinical decision support.
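Below is a minimal sketch, not the study's actual code, of the scoring arithmetic described in the Methods: each case receives a binary rating (1 = correct, 0 = incorrect) from two independent clinicians, the two ratings are averaged within each case, and mean accuracy is the average across cases. The ratings shown are hypothetical placeholders, not data from the study.

```python
from statistics import mean

# Hypothetical per-case binary ratings: one (rater1, rater2) tuple per case.
# 1 = the model identified the correct diagnosis, 0 = it did not.
binary_ratings = {
    "GPT-4": [(1, 1), (0, 1), (1, 0), (0, 0)],
    "Opus":  [(1, 1), (1, 1), (0, 1), (1, 0)],
}

def mean_accuracy(case_ratings):
    """Average the two raters within each case, then average across cases."""
    per_case_scores = [mean(pair) for pair in case_ratings]
    return mean(per_case_scores)

for model, ratings in binary_ratings.items():
    print(f"{model}: mean accuracy = {mean_accuracy(ratings):.2f}")
```

The same two-rater averaging would apply to the Likert-scale image-description ratings, with the binary values replaced by each clinician's Likert score.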