2017, Vol. 32, Issue (3): 443-456. doi: 10.1007/s11390-017-1735-x
Topics: Artificial Intelligence and Pattern Recognition; Computer Graphics and Multimedia
• Special Section on Selected Paper from NPC 2011 •
Xu-Ran Zhao, Xun Wang*, Senior Member, CCF, Member, ACM, IEEE, Qi-Chao Chen
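Methods based on deep convolutional neural networks (CNNs) have in recent years repeatedly set new accuracy records for depth recovery from monocular images. However, in video-based depth recovery applications, such as 2D-to-3D conversion of film and television content, the recovered depth maps often exhibit temporal discontinuities, because existing methods optimize the network on single frames. This paper proposes a novel spatial-temporal consistent conditional random field (CRF) model that imposes temporal constraints on the estimated depth maps of adjacent frames and is integrated into the deep neural network. We first use temporally consistent superpixel segmentation to establish correspondences between objects in adjacent frames, and then use a CNN to regress a single depth value for each superpixel. Next, we propose the spatial-temporal consistent CRF model, which simultaneously constrains the continuity of these regressed superpixel depths in both the temporal and spatial domains. The parameters of the CNN and the CRF can be updated jointly by back-propagation, enabling end-to-end learning. Experimental results show that, compared with single-frame methods, the proposed method not only clearly improves the temporal consistency of depth recovery but also improves its accuracy.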
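The abstract does not state the CRF energy, so the following is only a minimal sketch of the general idea, assuming a quadratic (Gaussian) CRF in which a unary term keeps each superpixel depth close to its CNN-regressed value and pairwise terms penalize depth differences between spatially adjacent superpixels and between temporally corresponding superpixels in neighboring frames. The function name `smooth_depths`, the edge lists, and the weights are hypothetical and chosen for illustration; the paper's actual potentials and learned parameters may differ.

```python
import numpy as np

def smooth_depths(z, edges, weights):
    """Closed-form minimizer of an assumed quadratic CRF energy.

    Energy (an assumption, not taken from the paper):
        E(y) = sum_i (y_i - z_i)^2 + sum_{(i,j) in edges} w_ij * (y_i - y_j)^2
    where z holds per-superpixel depths regressed by a CNN and the edges
    link spatial neighbors and temporal correspondences.
    Setting the gradient to zero gives (I + L) y = z, with L the weighted
    graph Laplacian of the edge set.
    """
    z = np.asarray(z, dtype=float)
    n = z.shape[0]
    W = np.zeros((n, n))
    for (i, j), w in zip(edges, weights):
        W[i, j] += w
        W[j, i] += w
    L = np.diag(W.sum(axis=1)) - W          # weighted graph Laplacian
    return np.linalg.solve(np.eye(n) + L, z)

if __name__ == "__main__":
    # Superpixels 0 and 1 belong to frame t; 2 and 3 are their
    # counterparts in frame t+1 (hypothetical toy values).
    z = [2.0, 3.0, 2.4, 3.0]
    temporal = [(0, 2), (1, 3)]
    spatial = [(0, 1), (2, 3)]
    y = smooth_depths(z, temporal + spatial, [1.0, 1.0, 0.1, 0.1])
    print(y)
```

Because this energy is quadratic, the solution is a linear function of the regressed depths `z`, which is one reason a CRF of this form can be trained jointly with a CNN by back-propagation.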
Copyright © Editorial Office of the Journal of Computer Science and Technology