Abstract Methods based on deep convolutional neural networks (DCNNs) have recently set new records on the task of predicting depth maps from monocular images. In video-based applications such as 2D-to-3D video conversion, however, these approaches tend to produce temporally inconsistent depth maps, because their CNN models are optimized over single frames. In this paper, we address this problem by introducing a novel spatial-temporal conditional random field (CRF) model into the DCNN architecture, which enforces temporal consistency between the depth maps estimated for consecutive video frames. In our approach, a temporally consistent superpixel (TSP) algorithm is first applied to an image sequence to establish the correspondence of targets across consecutive frames. A DCNN is then used to regress the depth value of each temporal superpixel, followed by a spatial-temporal CRF layer that models the relationship between the estimated depths in both the spatial and temporal domains. The parameters of the DCNN and CRF models are jointly optimized with back-propagation. Experimental results show that our approach not only significantly enhances the temporal consistency of the estimated depth maps compared with existing single-frame-based approaches, but also improves depth estimation accuracy in terms of various evaluation metrics.
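The role of the spatial-temporal CRF layer can be illustrated with a minimal sketch. It assumes a quadratic energy over superpixel depths, a form common in continuous CRFs for depth estimation; the abstract does not state the exact potentials, so the energy, the fixed weights `w_s` and `w_t`, and the function name `crf_map_depth` are all hypothetical. In the approach described above the CRF parameters are learned jointly with the DCNN rather than set by hand.

```python
import numpy as np

def crf_map_depth(z, spatial_edges, temporal_edges, w_s=1.0, w_t=1.0):
    """MAP depths for an ASSUMED quadratic spatial-temporal CRF energy.

    E(y) = sum_p (y_p - z_p)^2
         + w_s * sum_{(p,q) spatial}  (y_p - y_q)^2
         + w_t * sum_{(p,q) temporal} (y_p - y_q)^2

    z holds per-superpixel depths (here standing in for DCNN outputs).
    Setting the gradient to zero gives the linear system (I + L) y = z,
    where L is the weighted graph Laplacian of the edge set.
    """
    z = np.asarray(z, dtype=float)
    n = z.shape[0]
    W = np.zeros((n, n))
    for edges, w in ((spatial_edges, w_s), (temporal_edges, w_t)):
        for p, q in edges:
            W[p, q] += w
            W[q, p] += w
    L = np.diag(W.sum(axis=1)) - W  # graph Laplacian D - W
    return np.linalg.solve(np.eye(n) + L, z)

# Toy example: two frames with two superpixels each.
# Nodes 0 and 1 belong to frame t, nodes 2 and 3 to frame t+1.
z = np.array([2.0, 3.0, 2.6, 3.1])   # hypothetical per-frame depth regressions
spatial = [(0, 1), (2, 3)]           # neighbours within a frame
temporal = [(0, 2), (1, 3)]          # the same temporal superpixel across frames
y = crf_map_depth(z, spatial, temporal, w_s=0.1, w_t=2.0)
```

With a large temporal weight, the depths of corresponding superpixels in consecutive frames are pulled toward each other, while a small spatial weight leaves depth discontinuities between neighbouring superpixels largely intact.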
This work is supported in part by the Natural Science Foundation of Zhejiang Province of China under Grant No. LQ17F030001, the National Natural Science Foundation of China under Grant No. U1609215, the Qianjiang Talent Program of Zhejiang Province of China under Grant No. QJD1602021, the National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant No. 2014BAK14B01, and the Beihang University Virtual Reality Technology and System National Key Laboratory Open Project under Grant No. BUAA-VR-16KF-17.
Corresponding Author: Xun Wang
About author: Xu-Ran Zhao is currently an assistant professor at the School of Computer Science and Information Engineering, Zhejiang Gongshang University, Hangzhou. He received his B.S. degree in electronic and information technologies from Shanghai University, Shanghai, and his M.S. degree in electrical and computer engineering from Georgia Institute of Technology, Atlanta, in 2006 and 2010, respectively. He received his Ph.D. degree from Telecom ParisTech, Paris, in 2013. From 2014 to 2016, he worked as a postdoctoral researcher on machine learning in the School of Computer Science at Aalto University, Helsinki. His current research interests include pattern recognition, computer vision, and biometric recognition.
Cite this article:
Xu-Ran Zhao, Xun Wang, Qi-Chao Chen. Temporally Consistent Depth Map Prediction Using Deep CNN and Spatial-temporal Conditional Random Field[J]. Journal of Computer Science and Technology, 2017, 32(3): 443-456