The rapid development of artificial intelligence has produced sophisticated models capable of interpreting and generating visual content. Despite these advances, commercial models such as ChatGPT-4V and Gemini 1.5 Pro Vision remain prone to visual hallucinations, in which generated content does not accurately reflect the input images. Addressing this issue is crucial because it directly affects the reliability and trustworthiness of these models in applications that require precise visual understanding. This article presents a comprehensive evaluation framework for assessing visual hallucinations in these models both quantitatively and qualitatively. The methodology comprises dataset preparation, automated annotation, prompt design, response collection, automated comparison, error analysis, and statistical analysis. The findings characterize the types of hallucinations observed, their frequencies, and the factors contributing to their occurrence. The results underscore the need for improved training techniques, advanced model architectures, and robust evaluation metrics to enhance the accuracy and contextual understanding of visual content generated by these models. By providing a detailed analysis of visual hallucinations, the study contributes to ongoing efforts to develop more reliable and accurate AI systems and to support their safe and effective integration into critical applications.
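To make the automated-comparison step concrete, the sketch below illustrates one simple way such a check could work: the objects a model's response mentions are compared against ground-truth annotations for the image, and any mention with no corresponding annotation is flagged as a hallucination. This is an illustrative assumption about the pipeline, not the article's exact implementation; the names `Annotation`, `extract_mentions`, and `check_response`, and the example data, are hypothetical.

```python
# Hypothetical sketch of an automated comparison between a model response and
# ground-truth image annotations; not the evaluated models' APIs.
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Ground-truth object labels for one image (e.g., from automated annotation)."""
    image_id: str
    objects: set = field(default_factory=set)


def extract_mentions(response: str, vocabulary: set) -> set:
    """Naively detect which known object labels appear in the model's text."""
    text = response.lower()
    return {label for label in vocabulary if label in text}


def check_response(response: str, annotation: Annotation, vocabulary: set) -> dict:
    """Compare mentioned objects with ground truth and report hallucinations."""
    mentioned = extract_mentions(response, vocabulary)
    hallucinated = mentioned - annotation.objects   # mentioned but not present
    missed = annotation.objects - mentioned         # present but not mentioned
    return {
        "image_id": annotation.image_id,
        "hallucinated": sorted(hallucinated),
        "missed": sorted(missed),
        "hallucination_rate": len(hallucinated) / max(len(mentioned), 1),
    }


if __name__ == "__main__":
    vocab = {"dog", "cat", "frisbee", "bench"}
    gt = Annotation(image_id="0001", objects={"dog", "frisbee"})
    reply = "A dog and a cat are playing with a frisbee in the park."
    print(check_response(reply, gt, vocab))
    # {'image_id': '0001', 'hallucinated': ['cat'], 'missed': [], 'hallucination_rate': 0.333...}
```

In practice the string-matching step would be replaced by a more robust matcher (for example, phrase grounding or an LLM-based judge), and the per-image results would feed the error and statistical analyses described above.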