[1] Ren M Y, Kiros R, Zemel R. Image question answering:A visual semantic embedding model and a new dataset. arXiv:1505.02074, 2015. https://arxiv.org/abs/1505.02074v1, June 2017.[2] Gao H Y, Mao J H, Zhou J, Huang Z H, Wang L, Xu W. Are you talking to a machine? Dataset and methods for multilingual image question answering. arXiv:1505.05612, 2015. https://arxiv.org/abs/1505.05612, June 2017.[3] Antol S, Agrawal A, Lu J S, Mitchell M, Batra D, Zitnick L, Parikh D. VQA:Visual question answering. In Proc. IEEE Int. Conf. Computer Vision, December 2015, pp.2425-2433.[4] Malinowski M, Rohrbach M, Fritz M. Ask your neurons:A deep learning approach to visual question answering. arXiv:1605.02697, 2016. https://arxiv.org/abs/1605.02697, June 2017.[5] Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel R, Bengio Y. Show, attend and tell:Neural image caption generation with visual attention. In Proc. the 32nd IEEE Int. Conf. Machine Learning, February 2015, pp.2048-2057.[6] Yang Z C, He X D, Gao J F, Deng L, Smola A. Stacked attention networks for image question answering. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2016, pp.21-29.[7] Xu H J, Saenko K. Ask, attend and answer:Exploring question-guided spatial attention for visual question answering. arXiv:1511.05234, 2015. https://arxiv. org/abs/1511.05234, June 2017.[8] Chen K, Wang J, Chen L C, Gao H Y, Xu W, Nevatia R. ABC-CNN:An attention based convolutional neural network for visual question answering. arXiv:1511.05960, 2015. https://arxiv.org/abs/1511.05960, June 2017.[9] Shih K J, Singh S, Hoiem D. Where to look:Focus regions for visual question answering. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2016, pp.4613-4621.[10] Zhu Y K, Groth O, Bernstein M, Li F F. Visual7W:Grounded question answering in images. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2016, pp.4995-5004.[11] Ilievski I, Yan S C, Feng J S. A focused dynamic attention model for visual question answering. arXiv:1604.01485, 2016. https://arxiv.org/abs/1604.01485, June 2017.[12] Kumar A, Irsoy O, Ondruska P, Iyyer M, Bradbury J, Gulrajani I, Zhong V, Paulus R, Socher R. Ask me anything:Dynamic memory networks for natural language processing. In Proc. the 33rd Int. Conf. Machine Learning, June 2016, pp.1378-1387.[13] Xiong C M, Merity S, Socher R. Dynamic memory networks for visual and textual question answering. In Proc. the 33rd Int. Conf. Machine Learning, June 2016, pp.2397-2406.[14] Lu J S, Yang JW, Batra D, Parikh D. Hierarchical questionimage co-attention for visual question answering. In Proc. Advances in Neural Information Processing System, Dec. 2016.[15] Fukui A, Park D H, Yang D, Rohrbach A, Darrell T, Rohrbach M. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv:1606.01847, 2016. https://arxiv.org/abs/1606.01847, June 2017.[16] Noh H, Seo P H, Han B. Image question answering using convolutional neural network with dynamic parameter prediction. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2016, pp.30-38.[17] Kim J H, Lee S W, Kwak D H, Heo M, Kim J, Ha J W, Zhang B T. Multimodal residual learning for visual QA. In Proc. the 30th Conf. Neural Information Processing System, Dec. 2016.[18] Andreas J, Rohrbach M, Darrell T, Klein D. Neural module networks. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2016, pp.39-48.[19] Wang P, Wu Q, Shen C H, van den Hengel A, Dick A. Explicit knowledge-based reasoning for visual question answering. arXiv:1511.02570, 2015. https://arxiv.org/abs/1511.02570v2, June 2017.[20] Ma L, Lu Z D, Li H. Learning to answer questions from image using convolutional neural network. In Proc. the 30th AAAI Conf. Artificial Intelligence, March 2016, pp.3567-3573.[21] Mnih V, Heess N, Graves A, Kavukcuoglu K. Recurrent models of visual attention. In Proc. Advances in Neural Information Processing Systems, Dec. 2014.[22] Ba J, Mnih V, Kavukcuoglu K. Multiple object recognition with visual attention. arXiv:1412.7755, 2015. https://arxiv.org/abs/1412.7755, June 2017.[23] Li J N, Wei Y C, Liang X D, Dong J, Xu T F, Feng J S, Yan S C. Attentive contexts for object detection. arXiv:1603.07415, 2016. https://arxiv.org/abs/1603.07415, June 2017.[24] Chung K, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014. https://arxiv.org/abs/14-12.3555, June 2017.[25] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013. https://arxiv.org/abs/1301.3781, June 2017.[26] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2015. https://arxiv.org/abs/1409.1556, June 2017.[27] Williams R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992, 8(3/4):229-256.[28] Kiros R, Zhu Y K, Salakhutdinov R, Zemel R, Torralba A, Urtasun R, Fidler S. Skip-thought vectors. arXiv:1506.06726, 2015. https://arxiv.org/abs/1506.06726, June 2017.[29] Zhou B L, Tian Y D, Sukhbaatar S, Szlam A, Fergus R. Simple baseline for visual question answering. arXiv:1512.02167, 2015. https://arxiv.org/abs/1512.02167, June 2017.[30] Malinowski M, Fritz M. A multi-world approach to question answering about real-world scenes based on uncertain input. In Proc. the 27th Int. Conf. Neural Information Processing Systems, Dec. 2014, pp.1682-1690.[31] Wu Z B, Palmer M. Verbs semantics and lexical selection. In Proc. the 32nd Annual Meeting on Association for Computational Linguistics, June 1994, pp.133-138.[32] Miller G A. WordNet:A lexical database for English. Communications of the ACM, 1995, 38(11):39-41.[33] Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S E, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. arXiv:1409.4842, 2014. https://arxiv.org/abs/1409.4842, June 2017.[34] He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2016, pp.770-778. |