

KnowBench: Evaluating the Knowledge Alignment on Large Visual Language Models

  • Abstract:
    Background Large visual language models (LVLMs) have become a research focus in multimodal artificial intelligence, demonstrating strong capabilities in tasks such as image captioning and visual question answering. Their excellent practical performance continues to drive academia and industry to explore their performance limits and internal mechanisms. An important way to evaluate LVLM performance is to construct targeted benchmarks and test models against them. However, current LVLM evaluation methods mainly measure whether a model can provide the correct answer to a visual question, paying little attention to the knowledge underlying that choice.
    Objective Existing benchmarks focus mainly on whether a model's answer is correct; few studies examine why the model selects a particular option. This focus on correctness rather than causality limits researchers' insight into models' reasoning processes and their ability to integrate comprehensive knowledge. To address this problem, the paper proposes KnowBench, a novel and comprehensive benchmark designed to rigorously evaluate the precision and reliability of the knowledge on which LVLMs base their answers to knowledge-based questions.
    Methods The paper carefully selects question-answer data involving world knowledge to ensure rigor and difficulty, and supplements each answer option with a corresponding knowledge statement, creating a test set of 1081 samples called KnowBench, available in both Chinese and English versions. Answers and knowledge are evaluated separately using five metrics: knowledge accuracy (Know.), answer accuracy (Ans.), joint accuracy (Both, requiring both the knowledge and the answer to be selected correctly), knowledge inconsistency (K-Incon.), and answer inconsistency (A-Incon.). K-Incon. denotes cases where the knowledge is selected correctly but the answer is wrong; A-Incon. denotes the reverse.
    Results The experiments show that all models exhibit inconsistency between knowledge and answers, and most models perform noticeably differently in Chinese and English. GPT-4o achieves a Know. score of 92.1% and an Ans. score of 88.2%, yet its Both accuracy is only 85.8%; InternVL-2 (26B) reaches 90.1% Know. and 84.7% Ans., with a slightly lower Both accuracy of 81.7%. The K-Incon. and A-Incon. metrics further highlight these discrepancies. For example, GPT-4o's A-Incon. is 9.9%, indicating that the model can provide a correct answer while failing to select the corresponding accurate knowledge.
    Conclusion The paper presents KnowBench, a new Chinese-English evaluation benchmark for large visual language models (LVLMs) that assesses both answer accuracy and the consistency of the knowledge behind the answers. Experiments on KnowBench show that although knowledge mastery correlates positively with answer accuracy, knowledge consistency varies across models, and performance differs across languages. Current LVLMs therefore need to improve the consistency between knowledge and reasoning, and KnowBench can serve as a benchmark for studying this problem.
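The five metrics above can be illustrated with a small sketch. This is not the paper's released evaluation code; it is a minimal reading of the metric definitions as stated, assuming each sample yields two boolean flags (knowledge choice correct, answer choice correct):

```python
def knowbench_metrics(know_correct, ans_correct):
    """Compute the five KnowBench-style metrics from per-sample flags.

    know_correct[i] -- model picked the correct knowledge option for sample i
    ans_correct[i]  -- model picked the correct answer option for sample i
    """
    assert len(know_correct) == len(ans_correct)
    n = len(know_correct)
    pairs = list(zip(know_correct, ans_correct))
    return {
        "Know.": sum(know_correct) / n,                      # knowledge accuracy
        "Ans.": sum(ans_correct) / n,                        # answer accuracy
        "Both": sum(k and a for k, a in pairs) / n,          # both correct
        "K-Incon.": sum(k and not a for k, a in pairs) / n,  # knowledge right, answer wrong
        "A-Incon.": sum(a and not k for k, a in pairs) / n,  # answer right, knowledge wrong
    }

# Toy example with four samples:
m = knowbench_metrics([True, True, False, True], [True, False, True, True])
# m == {"Know.": 0.75, "Ans.": 0.75, "Both": 0.5, "K-Incon.": 0.25, "A-Incon.": 0.25}
```

Under this reading, Know. = Both + K-Incon. and Ans. = Both + A-Incon.; the paper may normalize the inconsistency metrics differently, so this decomposition is an assumption of the sketch.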

     

    Abstract: Large visual language models (LVLMs) have revolutionized the multimodal domain, demonstrating exceptional performance in tasks requiring fusing visual and textual information. However, the current evaluation benchmarks fail to adequately assess the knowledge alignment between images and text, focusing primarily on answer accuracy rather than the reasoning processes behind them. To address this gap and enhance the understanding of LVLMs’ capabilities, we introduce KnowBench, a novel benchmark designed to assess the alignment of knowledge between images and text for LVLMs. KnowBench comprises 1 081 image-question pairs, each with four options and four pieces of corresponding knowledge across 11 major categories. We evaluate mainstream LVLMs on KnowBench, including proprietary models like Gemini, Claude, and GPT, and open-source models like LLaVA, Qwen-VL, and InternVL. Our experiments reveal a notable discrepancy in the models’ abilities to select correct answers and corresponding knowledge whether the models are open-source or proprietary. This indicates that there is still a significant gap in the current LVLMs’ knowledge alignment between images and text. Furthermore, our further analysis shows that model performance on KnowBench improves with increased parameters and version iterations. This indicates that scaling laws have a significant impact on multimodal knowledge alignment, and the iteration of the model by researchers also has a positive effect. We anticipate that KnowBench will foster the development of LVLMs and motivate researchers to develop more reliable models. We have made our dataset publicly available at https://doi.org/10.57760/sciencedb.29672.

     
