Journal of Computer Science and Technology
Indexed by SCIE, EI ...
Bimonthly, since 1986
Journal of Computer Science and Technology, 2017, Vol. 32, Issue 3: 480-493. DOI: 10.1007/s11390-017-1738-7
Special Section of CVM 2017
Captioning Videos Using Large-Scale Image Corpus
Xiao-Yu Du (1,2), Member, CCF; Yang Yang (3,4), Member, CCF, ACM, IEEE; Liu Yang (1,5); Fu-Min Shen (3,4), Member, CCF, ACM, IEEE; Zhi-Guang Qin (1), Senior Member, CCF, Member, ACM, IEEE; Jin-Hui Tang (1,6,*), Senior Member, CCF, IEEE, Member, ACM
1. School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China;
2. School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China;
3. Center for Future Media, University of Electronic Science and Technology of China, Chengdu 611731, China;
4. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China;
5. Sichuan University West China Hospital of Stomatology, Chengdu 610041, China;
6. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

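Summary: With the rapid growth of online video data, automatic video captioning has become increasingly important. Video captioning briefly describes the content of a video; it is a form of video processing that carries richer semantics and is closer to human cognition. It not only helps users quickly understand and find relevant videos, but also assists in managing video information. However, the scarcity of datasets severely limits the development of video captioning. This paper therefore proposes a method that applies image captioning corpora to video captioning, addressing two main problems: 1) integrating an image captioning corpus effectively into video captioning; and 2) executing efficiently on massive amounts of data. To achieve these two goals, we improve the retrieval (lookup) model used in image captioning, apply it to video captioning, and use hashing to manage the data so that queries over massive data run faster. Experiments verify the effectiveness of the method across a variety of hashing algorithms. Compared with the conventional retrieval model, the method needs only 1/256 of the memory and 1/64 of the time, so it can scale to much larger data.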
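The 1/256 memory figure can be made concrete with a back-of-the-envelope ratio. Storing one d-dimensional single-precision (32-bit) feature vector per image costs 32d bits, whereas a b-bit hash code costs b bits. The abstract does not state d or b; the values below (d = 4096, b = 512) are purely an illustrative assumption that happens to yield 1/256, not the paper's reported configuration.

```latex
\frac{\text{memory(hash codes)}}{\text{memory(float features)}}
  = \frac{b}{32d}
  = \frac{512}{32 \times 4096}
  = \frac{512}{131072}
  = \frac{1}{256}
```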
Abstract: Video captioning is the task of assigning complex high-level semantic descriptions (e.g., sentences or paragraphs) to video data. Unlike earlier video analysis techniques such as video annotation, video event detection, and action recognition, video captioning is much closer to human cognition, with a smaller semantic gap. However, the scarcity of captioned video data severely limits the development of video captioning. In this paper, we propose a novel video captioning approach that describes videos by leveraging a freely available image corpus rich in textual knowledge. Our approach has two key aspects: 1) an effective integration strategy that bridges videos and images, and 2) high efficiency in handling ever-increasing training data. To achieve these goals, we adopt sophisticated visual hashing techniques to efficiently index and search large-scale image collections for relevant captions, which makes the approach readily extensible to evolving data and the corresponding semantics. Extensive experiments on various real-world visual datasets show the effectiveness of our approach with different hashing techniques, e.g., LSH (locality-sensitive hashing), PCA-ITQ (principal component analysis iterative quantization), and supervised discrete hashing, compared with state-of-the-art methods. Notably, the empirical computational cost of our approach is much lower than that of an existing method: it requires 1/256 of the memory and 1/64 of the time of the method of Devlin et al.
Keywords: video captioning, hashing, image captioning
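The core idea in the abstract, hashing a large image corpus and looking up captions of the nearest images for a video, can be illustrated with a minimal sketch. It assumes random-hyperplane LSH (one LSH family for cosine similarity; the paper also evaluates PCA-ITQ and supervised discrete hashing), mean pooling of frame features into a single video descriptor, and direct transfer of the nearest image's caption. The function names `encode`, `hamming`, and `caption_video` are hypothetical, random vectors stand in for real CNN features, and the authors' adapted retrieval model (built on the method of Devlin et al.) is not reproduced here.

```python
import numpy as np


def encode(features: np.ndarray, planes: np.ndarray) -> np.ndarray:
    """Random-hyperplane LSH: bit j is 1 when a feature lies on the positive side of plane j.

    Bits are packed 8 per byte, so a b-bit code occupies ceil(b / 8) bytes.
    """
    bits = features @ planes.T > 0        # (n, b) boolean matrix
    return np.packbits(bits, axis=1)      # (n, ceil(b / 8)) uint8 matrix


def hamming(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Hamming distance from one packed code to every packed code in the database."""
    diff = np.bitwise_xor(db_codes, query_code)        # broadcasts over database rows
    return np.unpackbits(diff, axis=1).sum(axis=1)     # count differing bits per row


def caption_video(frame_feats, planes, db_codes, db_captions, k=1):
    """Return captions of the k images closest to the video in Hamming space.

    Hypothetical pipeline: frame features are mean-pooled into one video
    descriptor, hashed with the same hyperplanes as the image corpus, and the
    captions of the nearest images are returned as candidates.
    """
    video_feat = frame_feats.mean(axis=0, keepdims=True)   # (1, d)
    query = encode(video_feat, planes)[0]                  # (ceil(b / 8),)
    dists = hamming(query, db_codes)
    nearest = np.argsort(dists, kind="stable")[:k]
    return [db_captions[i] for i in nearest]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, b, n = 128, 64, 1000
    image_feats = rng.standard_normal((n, d))      # stand-ins for image features
    planes = rng.standard_normal((b, d))           # b random hyperplanes
    db_codes = encode(image_feats, planes)         # n x 8 bytes instead of n x d floats
    captions = [f"caption of image {i}" for i in range(n)]

    # A toy "video" whose 10 frames are noisy copies of image 42.
    frames = image_feats[42] + 0.05 * rng.standard_normal((10, d))
    print(caption_video(frames, planes, db_codes, captions, k=3))
```

Packing the bits is what produces the memory saving: each image is stored as b/8 bytes rather than d floating-point numbers, and distances reduce to XOR plus a bit count instead of floating-point arithmetic.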
Received: 2016-12-26
Funding:

This work was partially supported by the National Basic Research 973 Program of China under Grant No. 2014CB347600, the National Natural Science Foundation of China under Grant Nos. 61522203, 61572108, 61632007, and 61502081, the National Ten-Thousand Talents Program of China (Young Top-Notch Talent), the National Thousand Young Talents Program of China, the Fundamental Research Funds for the Central Universities of China under Grant Nos. ZYGX2014Z007 and ZYGX2015J055, and the Natural Science Foundation of Jiangsu Province of China under Grant No. BK20140058.

Corresponding author: Jin-Hui Tang, Email: jinhuitang@njust.edu.cn
About the author: Xiao-Yu Du is currently a lecturer in the School of Software Engineering of Chengdu University of Information Technology, Chengdu, and a Ph.D. candidate at the University of Electronic Science and Technology of China, Chengdu. He received his M.E. degree in computer software and theory in 2011 and his B.S. degree in computer science and technology in 2008, both from Beijing Normal University, Beijing. His research interests include multimedia analysis and retrieval, computer vision, and machine learning.
Cite this article:
Xiao-Yu Du, Yang Yang, Liu Yang, Fu-Min Shen, Zhi-Guang Qin, Jin-Hui Tang. Captioning Videos Using Large-Scale Image Corpus[J]. Journal of Computer Science and Technology, 2017, 32(3): 480-493.
Link to this article:
http://jcst.ict.ac.cn:8080/jcst/CN/10.1007/s11390-017-1738-7