Deep Multimodal Reinforcement Network with Contextually Guided Recurrent Attention for Image Question Answering

doi:10.1007/s11390-017-1755-6

Abstract

Image question answering (IQA) has emerged as a promising interdisciplinary topic spanning computer vision and natural language processing. In this paper, we propose a contextually guided recurrent attention model for IQA, implemented as a multimodal recurrent neural network trained with deep reinforcement learning. Guided by compositional contextual information, the model recurrently decides where to look for question-relevant visual content using a reinforcement learning strategy. Unlike traditional "static" soft attention, it can be regarded as a "dynamic" attention mechanism whose learning objective is optimized from reinforcement reward signals tied to the question answering task. The compositional information that is finally learned incorporates both global context and local informative details, which is shown to benefit answer generation. The proposed method is compared with several state-of-the-art methods on two public IQA datasets, COCO-QA and the VQA dataset built on MS COCO. The experimental results show that the proposed model outperforms those methods.
