Citation: Ying Xing, Jia-Qi Huang, Hai-Jiao Zhao, Long Zhang, Wen-Jin Li. Safety Evaluation for Large Language Models with Metamorphic Testing: An Empirical Study[J]. Journal of Computer Science and Technology. DOI: 10.1007/s11390-026-5705-z

Safety Evaluation for Large Language Models with Metamorphic Testing: An Empirical Study

Large language models (LLMs) have seen widespread deployment across numerous applications, yet their potential to generate harmful or illicit content poses significant safety risks. Evaluating such risks requires both high-quality benchmarking datasets and effective evaluation methodologies. In this work, we introduce LLMSafetyChoice, a multilingual benchmark for content safety evaluation, consisting of 11,911 multiple-choice questions in both Chinese and English, covering four safety domains and eight categories per language. We further introduce metamorphic testing as a systematic approach for evaluating LLMs' safety by defining 7 metamorphic relations on LLMSafetyChoice. We conduct an extensive empirical study across 1,408 evaluation scenarios (11 LLMs × (8 categories × 2 languages) × (1 baseline + 7 transformations)). Our results reveal key insights into model behavior under safety-critical conditions and demonstrate the effectiveness of metamorphic testing in uncovering subtle safety vulnerabilities. The benchmark and evaluation results are publicly available at https://anonymous.4open.science/r/LLMMetamorphic-08C9/.
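As a rough illustration of how the 1,408 evaluation scenarios are counted, the sketch below enumerates the Cartesian product of the factors named in the abstract: 11 models, 8 safety categories in each of 2 languages, and 8 conditions per question (the untransformed baseline plus 7 metamorphic relations). The model, category, and relation identifiers here are placeholders, not the actual names used in the study or the benchmark.

```python
from itertools import product

# Placeholder identifiers for illustration only; the real model list,
# category labels, and metamorphic relations are defined in the paper.
models = [f"model_{i}" for i in range(11)]                     # 11 LLMs under test
categories = [f"category_{i}" for i in range(8)]               # 8 safety categories per language
languages = ["zh", "en"]                                       # Chinese and English subsets
conditions = ["baseline"] + [f"MR_{i}" for i in range(1, 8)]   # 1 baseline + 7 metamorphic relations

# One evaluation scenario per combination of factors.
scenarios = list(product(models, categories, languages, conditions))
print(len(scenarios))  # 11 * 8 * 2 * 8 = 1408
```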
