Citation: Ying Xing, Jia-Qi Huang, Hai-Jiao Zhao, Long Zhang, Wen-Jin Li. Safety Evaluation for Large Language Models with Metamorphic Testing: An Empirical Study[J]. Journal of Computer Science and Technology. DOI: 10.1007/s11390-026-5705-z

Safety Evaluation for Large Language Models with Metamorphic Testing: An Empirical Study

Large language models (LLMs) have seen widespread deployment across numerous applications, yet their potential to generate harmful or illicit content poses significant safety risks. Evaluating such risks requires both high-quality benchmarking datasets and effective evaluation methodologies. In this work, we introduce LLMSafetyChoice, a multilingual benchmark for content safety evaluation, consisting of 11,911 multiple-choice questions in both Chinese and English, covering four safety domains and eight categories per language. We further introduce metamorphic testing as a systematic approach for evaluating LLMs' safety by defining 7 metamorphic relations on LLMSafetyChoice. We conduct an extensive empirical study across 1,408 evaluation scenarios (11 LLMs × (8 categories × 2 languages) × (1 baseline + 7 transformations)). Our results reveal key insights into model behavior under safety-critical conditions and demonstrate the effectiveness of metamorphic testing in uncovering subtle safety vulnerabilities. The benchmark and evaluation results are publicly available at https://anonymous.4open.science/r/LLMMetamorphic-08C9/.
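As a rough illustration of how the 1,408 evaluation scenarios are counted, the sketch below enumerates the Cartesian product of the factors named in the abstract: 11 models, 8 safety categories in each of 2 languages, and 8 conditions per question (the untransformed baseline plus 7 metamorphic relations). The model, category, and relation identifiers here are placeholders, not the actual names used in the study or the benchmark.

```python
from itertools import product

# Placeholder identifiers for illustration only; the real model list,
# category labels, and metamorphic relations are defined in the paper.
models = [f"model_{i}" for i in range(11)]                     # 11 LLMs under test
categories = [f"category_{i}" for i in range(8)]               # 8 safety categories per language
languages = ["zh", "en"]                                       # Chinese and English subsets
conditions = ["baseline"] + [f"MR_{i}" for i in range(1, 8)]   # 1 baseline + 7 metamorphic relations

# One evaluation scenario per combination of factors.
scenarios = list(product(models, categories, languages, conditions))
print(len(scenarios))  # 11 * 8 * 2 * 8 = 1408
```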
