Deep Multimodal Reinforcement Network with Contextually Guided Recurrent Attention for Image Question Answering

Ai-Wen Jiang; Bo Liu; Ming-Wen Wang

doi:10.1007/s11390-017-1755-6

Ai-Wen Jiang, Bo Liu, Ming-Wen Wang. Deep Multimodal Reinforcement Network with Contextually Guided Recurrent Attention for Image Question Answering[J]. Journal of Computer Science and Technology, 2017, 32(4): 738-748. DOI: 10.1007/s11390-017-1755-6

Abstract

Image question answering (IQA) has emerged as a promising interdisciplinary topic spanning computer vision and natural language processing. In this paper, we propose a contextually guided recurrent attention model for IQA. The model is a multimodal recurrent neural network trained with deep reinforcement learning. Guided by compositional contextual information, it recurrently decides where to look using a reinforcement learning strategy. Unlike traditional "static" soft attention, it can be regarded as a kind of "dynamic" attention whose objective is built on reinforcement rewards designed specifically for IQA. The compositional information that is finally learned incorporates both global context and local informative details, which is shown to benefit answer generation. The proposed method is compared with several state-of-the-art methods on two public IQA datasets, COCO-QA and VQA, both drawn from MS COCO. The experimental results show that the proposed model outperforms those methods.
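The contrast the abstract draws between "static" soft attention and "dynamic", reward-driven attention can be illustrated with a small sketch. This is not the authors' architecture: the function names, the dot-product scoring, the averaging context update, and the scalar reward are all assumptions chosen for illustration, and the gradient shown is the generic REINFORCE estimator for a softmax policy over image regions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(regions, scores):
    """Static soft attention: a weighted average over ALL region features."""
    weights = softmax(scores)
    dim = len(regions[0])
    return [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]


def hard_attention_step(regions, scores, rng):
    """Dynamic attention: sample ONE region to look at.

    Returns the sampled index, its feature vector, and log p(index).
    """
    probs = softmax(scores)
    idx = rng.choices(range(len(regions)), weights=probs, k=1)[0]
    return idx, regions[idx], math.log(probs[idx])


def reinforce_score_gradient(scores, idx, reward, baseline=0.0):
    """REINFORCE estimate of d E[reward] / d scores for one sampled glimpse.

    For a softmax policy, d log p(idx) / d score_j = 1[j == idx] - p_j.
    """
    probs = softmax(scores)
    advantage = reward - baseline
    return [advantage * ((1.0 if j == idx else 0.0) - p)
            for j, p in enumerate(probs)]


def recurrent_glimpses(regions, context, steps, rng):
    """Repeatedly choose where to look, guided by the current context.

    Assumptions (hypothetical, for illustration only): regions are scored by
    a dot product with the context, and the context is updated by averaging
    in the glimpsed feature.
    """
    log_probs = []
    for _ in range(steps):
        scores = [sum(c * x for c, x in zip(context, r)) for r in regions]
        _, glimpse, logp = hard_attention_step(regions, scores, rng)
        context = [0.5 * c + 0.5 * g for c, g in zip(context, glimpse)]
        log_probs.append(logp)
    return context, log_probs
```

In a trained model, the reward would typically depend on whether the generated answer is correct, and the accumulated log-probabilities would be weighted by that reward to update the attention policy; soft attention, by contrast, is differentiable end to end and needs no sampling.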
