Large language models (LLMs) have seen widespread deployment across numerous applications, yet their potential to generate harmful or illicit content poses significant safety risks. Evaluating such risks requires both high-quality benchmark datasets and effective evaluation methodologies. In this work, we introduce LLMSafetyChoice, a multilingual benchmark for content safety evaluation consisting of 11,911 multiple-choice questions in Chinese and English, covering four safety domains and eight categories per language. We further introduce metamorphic testing as a systematic approach to evaluating LLM safety by defining seven metamorphic relations on LLMSafetyChoice. We conduct an extensive empirical study across 1,408 evaluation scenarios (11 LLMs × (8 categories × 2 languages) × (1 baseline + 7 transformations)). Our results reveal key insights into model behavior under safety-critical conditions and demonstrate the effectiveness of metamorphic testing in uncovering subtle safety vulnerabilities. The benchmark and evaluation results are publicly available at https://anonymous.4open.science/r/LLMMetamorphic-08C9/.
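The seven relations themselves are not enumerated in this abstract. As a minimal sketch of the general idea, assume one plausible relation states that permuting the answer options of a multiple-choice question must not change which option content the model selects. The names below (shuffle_options, violates_relation, and the ask_model callable) are hypothetical illustrations, not the benchmark's actual API.

```python
import random

def shuffle_options(options: dict[str, str], seed: int = 0):
    """Permute MCQ option texts across the existing labels.

    Returns (shuffled, back_map), where shuffled maps each label to
    its new option text and back_map maps each new label to the
    original label that carried the same text.
    """
    labels = sorted(options)            # e.g. ["A", "B", "C", "D"]
    perm = labels[:]
    random.Random(seed).shuffle(perm)   # perm[i] = original label shown under labels[i]
    shuffled = {new: options[old] for new, old in zip(labels, perm)}
    back_map = dict(zip(labels, perm))  # new label -> original label
    return shuffled, back_map

def violates_relation(ask_model, question: str, options: dict[str, str], seed: int = 0) -> bool:
    """Check the option-permutation relation: the *content* of the
    model's chosen option should survive a reordering of the options.

    ask_model(question, options) is assumed to return a label ("A"-"D").
    """
    base_answer = ask_model(question, options)
    shuffled, back_map = shuffle_options(options, seed)
    followup_answer = ask_model(question, shuffled)
    # Map the follow-up label back to the original layout and compare.
    return back_map[followup_answer] != base_answer
```

Because the relation compares a model's answer on a transformed input against its own answer on the original, violations can be flagged without appealing to gold labels, which is the usual appeal of metamorphic testing in settings where ground truth is hard to pin down.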