2017, Vol. 32, Issue (3): 443-456. doi: 10.1007/s11390-017-1735-x

Special Topic: Artificial Intelligence and Pattern Recognition; Computer Graphics and Multimedia



Temporally Consistent Depth Map Prediction Using Deep CNN and Spatial-temporal Conditional Random Field

Xu-Ran Zhao, Xun Wang*, Senior Member, CCF, Member, ACM, IEEE, Qi-Chao Chen   

  1. School of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou 310018, China
  • Received: 2016-12-23; Revised: 2017-03-20; Online: 2017-05-05; Published: 2017-05-05
  • Contact: Xun Wang, E-mail: wx@zjgsu.edu.cn
  • About the author: Xu-Ran Zhao is currently an assistant professor at the School of Computer Science and Information Engineering, Zhejiang Gongshang University, Hangzhou. He received his B.S. degree in electronic and information technologies from Shanghai University, Shanghai, in 2006, and his M.S. degree in electrical and computer engineering from Georgia Institute of Technology, Atlanta, in 2010. He received his Ph.D. degree from Telecom ParisTech, Paris, in 2013. From 2014 to 2016, he worked as a postdoctoral researcher on machine learning in the School of Computer Science at Aalto University, Helsinki. His current research interests include pattern recognition, computer vision, and biometric recognition.
  • Supported by:

    This work is supported in part by the Natural Science Foundation of Zhejiang Province of China under Grant No. LQ17F030001, the National Natural Science Foundation of China under Grant No. U1609215, the Qianjiang Talent Program of Zhejiang Province of China under Grant No. QJD1602021, the National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant No. 2014BAK14B01, and the Beihang University Virtual Reality Technology and System National Key Laboratory Open Project under Grant No. BUAA-VR-16KF-17.


Abstract: Deep convolutional neural network (DCNN) based methods have recently kept setting new records on the task of predicting depth maps from monocular images. When dealing with video-based applications such as 2D-to-3D video conversion, however, these approaches tend to produce temporally inconsistent depth maps, since their CNN models are optimized over single frames. In this paper, we address this problem by introducing a novel spatial-temporal conditional random field (CRF) model into the DCNN architecture, which is able to enforce temporal consistency between depth map estimations over consecutive video frames. In our approach, temporally consistent superpixel (TSP) segmentation is first applied to an image sequence to establish correspondences between targets in consecutive frames. A DCNN is then used to regress the depth value of each temporal superpixel, followed by a spatial-temporal CRF layer that models the relationships among the estimated depths in both the spatial and the temporal domains. The parameters of the DCNN and the CRF model are jointly optimized with back-propagation. Experimental results show that our approach not only significantly enhances the temporal consistency of the estimated depth maps over existing single-frame-based approaches, but also improves depth estimation accuracy in terms of various evaluation metrics.
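
To make the structure of the model concrete, the spatial-temporal CRF described above can be viewed as a quadratic energy over per-superpixel depth values. The form below is only an illustrative sketch in the spirit of deep convolutional neural fields [12, 13]; the symbols z_p, w_{pq}, v_{pq}, \lambda_s and \lambda_t are introduced here for exposition and need not match the paper's exact formulation:

    E(\mathbf{d}) = \sum_{p} (d_p - z_p)^2 + \lambda_s \sum_{(p,q) \in \mathcal{N}_s} w_{pq} (d_p - d_q)^2 + \lambda_t \sum_{(p,q) \in \mathcal{N}_t} v_{pq} (d_p - d_q)^2

Here d_p is the depth assigned to temporal superpixel p, z_p is the depth regressed for p by the DCNN, \mathcal{N}_s collects pairs of spatially adjacent superpixels within a frame, \mathcal{N}_t collects pairs of temporally corresponding superpixels in consecutive frames, and w_{pq}, v_{pq}, \lambda_s, \lambda_t weight the spatial and temporal smoothness terms. Because such a Gaussian CRF energy is quadratic in d, its minimizer has a closed form, which is what allows a CRF layer of this kind to be differentiated and trained jointly with the DCNN by back-propagation, as in [12, 13].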

[1] Saxena A, Sun M, Ng A. Learning 3-D scene structure from a single still image. In Proc. the 11th IEEE International Conference on Computer Vision, October 2007.

[2] Shotton J, Sharp T, Kipman A et al. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 2013, 56(1): 116-124.

[3] Cheng K L, Ju X, Tong R F, Tang M, Chang J, Zhang J J. A linear approach for depth and colour camera calibration using hybrid parameters. Journal of Computer Science and Technology, 2016, 31(3): 479-488.

[4] Fanello S R, Keskin C, Izadi S, Kohli P, Kim D, Sweeney D, Criminisi A, Shotton J, Kang S B, Paek T. Learning to be a depth camera for close-range human capture and interaction. ACM Transactions on Graphics, 2014, 33(4): 86:1-86:11.

[5] Zhang L, Vázquez C, Knorr S. 3D-TV content creation: Automatic 2D-to-3D video conversion. IEEE Transactions on Broadcasting, 2011, 57(2): 372-383.

[6] Zhang G F, Jia J, Wong T T, Bao H J. Consistent depth maps recovery from a video sequence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(6): 974-988.

[7] Tsai Y M, Chang Y L, Chen L G. Block-based vanishing line and vanishing point detection for 3D scene reconstruction. In Proc. International Symposium on Intelligent Signal Processing and Communications, December 2006, pp.586-589.

[8] Zhang R, Tsai P S, Cryer J E, Shah M. Shape-from-shading: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1999, 21(8): 690-706.

[9] Eigen D, Puhrsch C, Fergus R. Depth map prediction from a single image using a multi-scale deep network. In Proc. Advances in Neural Information Processing Systems, December 2014, pp.2366-2374.

[10] Eigen D, Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proc. IEEE International Conference on Computer Vision, December 2015, pp.2650-2658.

[11] Li L, Shen C H, Dai Y C, van den Hengel A, He M. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp.1119-1127.

[12] Liu F, Shen C, Lin G. Deep convolutional neural fields for depth estimation from a single image. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp.5162-5170.

[13] Liu F, Shen C H, Lin G S, Reid I. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38(10): 2024-2039.

[14] Chang J, Wei D, Fisher J. A video representation using temporal superpixels. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2013, pp.2051-2058.

[15] Azarbayejani A, Pentland A P. Recursive estimation of motion, structure, and focal length. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995, 17(6): 562-575.

[16] Pollefeys M, van Gool L V, Vergauwen M, Verbiest F, Cornelis K, Tops J, Koch R. Visual modeling with a hand-held camera. International Journal of Computer Vision, 2004, 59(3): 207-232.

[17] Zhang G F, Jia J, Hua W, Bao H J. Robust bilayer segmentation and motion/depth estimation with a handheld camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(3): 603-617.

[18] Saxena A, Chung S, Ng A Y. 3-D depth reconstruction from a single still image. International Journal of Computer Vision, 2008, 76(1): 53-69.

[19] Saxena A, Sun M, Ng A. Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(5): 824-840.

[20] Krizhevsky A, Sutskever I, Hinton G. ImageNet classification with deep convolutional neural networks. In Proc. the 26th Advances in Neural Information Processing Systems, December 2012, pp.1106-1114.

[21] Zhu Z, Liang D, Zhang S, Huang X, Li B L, Hu S M. Traffic-sign detection and classification in the wild. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp.2110-2118.

[22] Nakajima Y, Saito H. Robust camera pose estimation by viewpoint classification using deep learning. Computational Visual Media, 2016.

[23] Karsch K, Liu C, Kang S B. Depth extraction from video using non-parametric sampling. In Proc. European Conference on Computer Vision, October 2012, pp.775-788.

[24] Karsch K, Liu C, Kang S B. Depth transfer: Depth extraction from video using non-parametric sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(11): 2144-2158.

[25] Farabet C, Couprie C, Najman L, LeCun Y. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(8): 1915-1929.

[26] Zheng S, Jayasumana S, Romera-Paredes B, Vineet V, Su Z, Du D L, Huang C, Torr P. Conditional random fields as recurrent neural networks. In Proc. the IEEE International Conference on Computer Vision, December 2015, pp.1529-1537.

[27] Achanta R, Shaji A, Smith K, Lucchi A, Fua P, Süsstrunk S. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(11): 2274-2282.

[28] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. https://arxiv.org/abs/1409.1556, March 2017.

[29] Vedaldi A, Lenc K. MatConvNet: Convolutional neural networks for MATLAB. In Proc. the 23rd ACM International Conference on Multimedia, October 2015, pp.689-692.

[30] Silberman N, Hoiem D, Kohli P, Fergus R. Indoor segmentation and support inference from RGBD images. In Proc. the 12th European Conference on Computer Vision, October 2012, pp.746-760.

[31] Liu M M, Salzmann M, He X. Discrete-continuous depth estimation from a single image. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2014, pp.716-723.

[32] Fehn C, de la Barré R, Pastoor S. Interactive 3-D TV: Concepts and key technologies. Proceedings of the IEEE, 2006, 94(3): 524-538.

[33] Cao X, Li Z, Dai Q H. Semi-automatic 2D-to-3D conversion using disparity propagation. IEEE Transactions on Broadcasting, 2011, 57(2): 491-499.

[34] Phan R, Androutsos D. Robust semi-automatic depth map generation in unconstrained images and video sequences for 2D to stereoscopic 3D conversion. IEEE Transactions on Multimedia, 2014, 16(1): 122-136.

[35] Mikolov T, Kombrink S, Burget L, Cernocky J, Khudanpur S. Extensions of recurrent neural network language model. In Proc. the IEEE International Conference on Acoustics, Speech and Signal Processing, May 2011, pp.5528-5531.

[36] Graves A, Mohamed A, Hinton G. Speech recognition with deep recurrent neural networks. In Proc. International Conference on Acoustics, Speech and Signal Processing, May 2013, pp.6645-6649.