Large Visual Language Models (LVLMs) have revolutionized the multimodal domain, demonstrating exceptional performance on tasks that require fusing visual and textual information. However, current evaluation benchmarks fail to adequately assess the knowledge alignment between images and text, focusing primarily on answer accuracy rather than on the reasoning processes behind the answers.
To address this gap and deepen the understanding of LVLMs' capabilities, we introduce KnowBench, a novel benchmark designed to assess the alignment of knowledge between images and text for LVLMs. KnowBench comprises 1,081 image-question pairs spanning 11 major categories, each pair accompanied by four answer options and four pieces of corresponding knowledge.
We evaluated mainstream LVLMs on KnowBench, including proprietary models such as Gemini, Claude, and GPT, and open-source models such as LLaVA, Qwen-VL, and InternVL. Our experiments reveal a notable discrepancy between the models' ability to select the correct answer and their ability to select the corresponding knowledge, regardless of whether the models are open-source or proprietary. This indicates that current LVLMs still exhibit a significant gap in knowledge alignment between images and text.
Furthermore, our analysis shows that model performance on KnowBench improves with larger parameter counts and successive version iterations, indicating that scaling laws also apply to multimodal knowledge alignment and that continued iteration by model developers has a positive effect.
We anticipate that KnowBench will foster the development of LVLMs and motivate researchers to build more reliable models. We have made our dataset publicly available at
https://huggingface.co/datasets/dale19/KnowBench.
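As a minimal sketch of how the released dataset might be inspected with the Hugging Face `datasets` library (the split name and field names such as `question`, `options`, `knowledge`, and `category` are assumptions for illustration, not the verified schema; consult the dataset card for the actual layout):

```python
# Minimal sketch: load and inspect KnowBench via the Hugging Face `datasets` library.
# The split name and field names below are assumptions; check the dataset card.
from datasets import load_dataset

ds = load_dataset("dale19/KnowBench", split="test")  # split name assumed

example = ds[0]
print(example.get("question"))   # question text (field name assumed)
print(example.get("options"))    # four answer options (assumed)
print(example.get("knowledge"))  # four corresponding pieces of knowledge (assumed)
print(example.get("category"))   # one of the 11 major categories (assumed)
```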