

Learning a Mixture of Conditional Gating Blocks for Visual Question Answering

  • Abstract (structured):
    Background: Visual question answering (VQA) requires a model to infer the answer from a given image and a textual question. To do so, the model must correctly understand the semantics of the image and reason over the task defined by the question. Humans can comprehensively perceive a visual scene and, depending on the question type, perform object classification, attribute recognition, and inference about spatial positions or inter-object relations. Designing a single general-purpose deep learning model that can answer questions of arbitrary types about a visual scene remains a major challenge.
    Objective: Our goal is to bring the adaptability of dynamic networks into VQA models, so that the model can instantiate customized network modules for different question types and thereby reason about and answer different kinds of questions.
    Methods: Following the idea of dynamic networks, we introduce question-guided conditional gating layers into both the multi-head attention module of the Transformer and the convolutional network module. By applying conditionally gated convolutional blocks at the low-level semantic stage and conditionally gated Transformer blocks at the high-level semantic stage, we propose McG (Mixture of Conditional Gating blocks), a VQA model composed of these conditional modules. McG brings question-guided dynamic networks into both the convolutional network and the Transformer, giving the model translation invariance, the ability to model global dependencies, and dynamic adaptability (a minimal sketch of the gating idea follows this abstract).
    Results: McG performs strongly on the CLEVR and VQA-Abstract datasets, reaching accuracies of 99.70% and 70.80%, respectively.
    Conclusions: McG combines the advantages of dynamic parameters, the Transformer, and convolutional networks. Dynamic parameters improve the model's adaptability, capacity, and interpretability; the Transformer models global feature dependencies; the convolutional blocks provide local modeling ability and translation invariance. Through its conditional gating layers, McG lets the question features influence the visual network at both the low-level and high-level semantic stages, yielding stronger adaptability to questions.
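
    The abstract describes the conditional gating layer only at a high level, so the following is a minimal PyTorch sketch of a question-guided gated self-attention block in the spirit of cMHSA. The class name ConditionalMHSA, the per-channel sigmoid gating on the attended features, and parameters such as q_dim are illustrative assumptions rather than the paper's actual implementation.

    import torch
    import torch.nn as nn

    class ConditionalMHSA(nn.Module):
        """Question-gated multi-head self-attention (a cMHSA-style sketch)."""
        def __init__(self, dim: int, num_heads: int, q_dim: int):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # Assumption: the pooled question embedding is projected to
            # per-channel sigmoid gates that modulate the attended features.
            self.gate_proj = nn.Linear(q_dim, dim)

        def forward(self, x: torch.Tensor, q_emb: torch.Tensor) -> torch.Tensor:
            # x:     (B, N, dim)  visual tokens
            # q_emb: (B, q_dim)   pooled question embedding
            attn_out, _ = self.attn(x, x, x)              # (B, N, dim)
            gates = torch.sigmoid(self.gate_proj(q_emb))  # (B, dim)
            return attn_out * gates[:, None, :] + x       # gated residual connection

    # Toy usage: 2 images with 49 visual tokens each, conditioned on 2 questions.
    block = ConditionalMHSA(dim=512, num_heads=8, q_dim=128)
    tokens = torch.randn(2, 49, 512)
    questions = torch.randn(2, 128)
    print(block(tokens, questions).shape)                 # torch.Size([2, 49, 512])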

     

    Abstract: As a Turing test for multimedia, visual question answering (VQA) aims to answer a textual question about a given image. Recently, the "dynamic" property of neural networks has been explored as one of the most promising ways to improve the adaptability, interpretability, and capacity of neural network models. Unfortunately, despite the prevalence of dynamic convolutional neural networks, exploiting dynamics in the Transformers of VQA models across all stages in an end-to-end manner remains relatively untouched and highly nontrivial. Typically, owing to the large computational cost of Transformers, researchers tend to apply them only to the extracted high-level visual features for downstream vision-and-language tasks. To this end, we introduce a question-guided dynamic layer into the Transformer, which effectively increases model capacity and requires fewer Transformer layers for the VQA task. In particular, we name this dynamic module the conditional multi-head self-attention block (cMHSA). Furthermore, our question-guided cMHSA is compatible with the conditional ResNeXt block (cResNeXt). Thus a novel model, the Mixture of Conditional Gating blocks (McG), is proposed for VQA, which keeps the best of the Transformer, the convolutional neural network (CNN), and dynamic networks. The pure conditional gating CNN model and the conditional gating Transformer model can be viewed as special cases of McG. We quantitatively and qualitatively evaluate McG on the CLEVR and VQA-Abstract datasets. Extensive experiments show that McG achieves state-of-the-art performance on these benchmark datasets.
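
    Similarly, the cResNeXt block is only named in the abstract, so below is a hedged sketch, in the same style, of a question-gated grouped-convolution block applied to a low-level feature map. The per-channel sigmoid gating after a grouped 3x3 convolution, the class name ConditionalConvBlock, and the chosen hyperparameters (groups=32, channels=64) are assumptions for illustration, not the authors' exact block design.

    import torch
    import torch.nn as nn

    class ConditionalConvBlock(nn.Module):
        """Question-gated grouped convolution (a cResNeXt-style sketch)."""
        def __init__(self, channels: int, q_dim: int, groups: int = 32):
            super().__init__()
            self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                                  padding=1, groups=groups)
            self.bn = nn.BatchNorm2d(channels)
            # Assumption: the question embedding yields per-channel sigmoid gates.
            self.gate_proj = nn.Linear(q_dim, channels)

        def forward(self, x: torch.Tensor, q_emb: torch.Tensor) -> torch.Tensor:
            # x:     (B, C, H, W)  low-level visual feature map
            # q_emb: (B, q_dim)    pooled question embedding
            gates = torch.sigmoid(self.gate_proj(q_emb))   # (B, C)
            out = torch.relu(self.bn(self.conv(x)))
            return out * gates[:, :, None, None] + x       # gated residual connection

    # Toy usage: a batch of 2 feature maps conditioned on 2 question embeddings.
    block = ConditionalConvBlock(channels=64, q_dim=128, groups=32)
    feats = torch.randn(2, 64, 14, 14)
    questions = torch.randn(2, 128)
    print(block(feats, questions).shape)                   # torch.Size([2, 64, 14, 14])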

     

