Learning a Mixture of Conditional Gating Blocks for Visual Question Answering

Qiang Sun; Yan-Wei Fu; Xiang-Yang Xue

doi:10.1007/s11390-024-2113-0

Sun Q, Fu YW, Xue XY. Learning a mixture of conditional gating blocks for visual question answering. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 39(4): 912−928 July 2024. DOI: 10.1007/s11390-024-2113-0.

Abstract

As a Turing test in multimedia, visual question answering (VQA) aims to answer a textual question about a given image. Recently, the “dynamic” property of neural networks has been explored as one of the most promising ways of improving the adaptability, interpretability, and capacity of neural network models. Unfortunately, despite the prevalence of dynamic convolutional neural networks, exploiting dynamics in the transformers of VQA tasks through all stages in an end-to-end manner remains relatively unexplored and highly nontrivial. Typically, due to the large computation cost of transformers, researchers tend to apply transformers only to the extracted high-level visual features for downstream vision and language tasks. To this end, we introduce a question-guided dynamic layer to the transformer, as it can effectively increase the model capacity and require fewer transformer layers for the VQA task. In particular, we name the dynamics in the Transformer the Conditional Multi-Head Self-Attention block (cMHSA). Furthermore, our question-guided cMHSA is compatible with the conditional ResNeXt block (cResNeXt). Thus, a novel model, the mixture of conditional gating blocks (McG), is proposed for VQA, which keeps the best of the Transformer, the convolutional neural network (CNN), and dynamic networks. The pure conditional gating CNN model and the conditional gating Transformer model can be viewed as special cases of McG. We quantitatively and qualitatively evaluate McG on the CLEVR and VQA-Abstract datasets. Extensive experiments show that McG achieves state-of-the-art performance on these benchmark datasets.
