Lili Quan

Large language models (LLMs) have recently achieved significant success across various application domains, garnering substantial attention from different communities. Unfortunately, even state-of-the-art LLMs still suffer from many faults, i.e., inputs on which they produce incorrect predictions. Such faults harm the usability of LLMs in general and could introduce safety issues in reliability-critical systems such as autonomous driving systems. Quickly revealing these faults in the real-world datasets an LLM may face is important but challenging, mainly because ground-truth labels are required and the labeling process is expensive in both time and human effort. To address this problem, the conventional deep learning testing field has proposed test selection methods that evaluate models efficiently by prioritizing likely faults. However, despite their importance, the usefulness of these methods on LLMs remains unclear and underexplored. In this paper, we conduct the first empirical study investigating the effectiveness of existing test selection methods for LLMs. Experimental results on four different tasks (covering both code tasks and natural language processing tasks) and four LLMs (e.g., LLaMA3 and GPT4) demonstrate that simple methods such as Margin perform well on LLMs, but there is still considerable room for improvement. Based on this study, we further propose MuCS, a prompt Mutation-based prediction Confidence Smoothing framework, to boost test selection capability. Concretely, we propose multiple prompt mutation techniques to collect diverse outputs for confidence smoothing. The results show that our framework significantly enhances existing methods, improving test relative coverage by up to 70.53%.
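To make the two ideas in the abstract concrete, the sketch below illustrates (a) Margin-based test prioritization, which ranks inputs by the gap between the top two predicted class probabilities (a small gap signals low confidence and a likely fault), and (b) confidence smoothing, which averages the probability vectors obtained from several mutated variants of the same prompt before scoring. This is a minimal illustration, not the paper's implementation; the function names and the toy probabilities are hypothetical.

```python
def margin_scores(probs):
    """probs: list of per-input probability vectors (softmax outputs).
    Margin = top-1 probability minus top-2 probability; lower means
    the model is less confident on that input."""
    scores = []
    for row in probs:
        top = sorted(row, reverse=True)
        scores.append(top[0] - top[1])
    return scores

def prioritize(probs):
    """Indices sorted ascending by margin: least confident inputs first,
    so likely faults are labeled and inspected earliest."""
    scores = margin_scores(probs)
    return sorted(range(len(scores)), key=lambda i: scores[i])

def smooth(prob_sets):
    """Confidence smoothing sketch: prob_sets holds one probability
    vector per prompt mutant for the same input; element-wise
    averaging yields the smoothed vector used for scoring."""
    n = len(prob_sets)
    return [sum(col) / n for col in zip(*prob_sets)]

# Toy example: three inputs, two classes.
probs = [
    [0.55, 0.45],  # ambiguous -> small margin, ranked first
    [0.99, 0.01],  # confident -> large margin, ranked last
    [0.70, 0.30],
]
print(prioritize(probs))                      # [0, 2, 1]
print(smooth([[0.9, 0.1], [0.5, 0.5]]))       # [0.7, 0.3]
```

In this sketch, prompt mutation (e.g., paraphrasing or reformatting the instruction) supplies the multiple probability vectors passed to `smooth`; the smoothed vectors then replace the raw ones when computing margins.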