Citation: Li YZ, Zheng SJ, Tan ZX et al. Self-supervised monocular depth estimation by digging into uncertainty quantification. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 38(3): 510−525 May 2023. DOI: 10.1007/s11390-023-3088-y.

Self-Supervised Monocular Depth Estimation by Digging into Uncertainty Quantification

Funds: This work was supported by the National Natural Science Foundation of China under Grant No. 61972298, CAAI-Huawei MindSpore Open Fund, and the Xinjiang Bingtuan Science and Technology Program of China under Grant No. 2019BC008.
More Information
  • Author Bio:

    Yuan-Zhen Li received her B.S. degree in mathematics from Baoji University of Arts and Sciences, Baoji, in 2015. She received her M.S. degree in mathematics from Yunnan University, Kunming, in 2018. Currently, she is working toward her Ph.D. degree in computer science and technology at the School of Computer Science, Wuhan University, Wuhan. Her research interests include depth estimation and 3D reconstruction.

    Sheng-Jie Zheng received his B.S. degree in software engineering from the School of Computer Science, Dalian Maritime University, Dalian, in 2020. He is working toward his M.S. degree in computer science and technology at the School of Computer Science, Wuhan University, Wuhan. His research interests are monocular depth estimation and 3D vision.

    Zi-Xin Tan is working toward her B.S. degree in software engineering at the School of Computer Science, Wuhan University, Wuhan. Her research interests are monocular depth estimation and 3D vision.

    Tuo Cao received his B.S. degree in electronic information engineering from the School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, in 2015. He received his M.S. degree in physical electronics from the School of Electronic Information, Wuhan University, Wuhan, in 2018. He is working toward his Ph.D. degree in computer science and technology at the School of Computer Science, Wuhan University, Wuhan. His research interests are 3D vision, SLAM, and object pose estimation.

    Fei Luo received his Ph.D. degree in computer science and technology from Wuhan University, Wuhan, in 2011. He worked as a research assistant at the School of Computer Engineering of Nanyang Technological University, Singapore, in 2009. From 2011 to 2013, he worked as a postdoctoral researcher at the Human Polymorphism Study Center, Paris. He is now an assistant professor at the School of Computer Science, Wuhan University, Wuhan. His research interests include computer vision, computer graphics, and data mining.

    Chun-Xia Xiao received his B.S. and M.S. degrees in mathematics from Hunan Normal University, Changsha, in 1999 and 2002, respectively. He received his Ph.D. degree in applied mathematics from the State Key Lab of CAD&CG of Zhejiang University, Hangzhou, in 2006. He became an assistant professor at Wuhan University, Wuhan, in 2006, and became a professor in 2011. From October 2006 to April 2007, he worked as a postdoctoral researcher at the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong. From February 2012 to February 2013, he was a visiting scholar at the University of California, Davis. Currently, he is a professor at the School of Computer Science, Wuhan University, Wuhan. His main interests include computer graphics, computer vision, virtual reality, and augmented reality. He is a member of CCF and IEEE.

  • Corresponding author:

    (This work was co-supervised by Chun-Xia Xiao and Fei Luo)

    cxxiao@whu.edu.cn

  • Co-First Authors

  • Received Date: January 10, 2023
  • Accepted Date: May 21, 2023
  • Self-supervised monocular depth estimation has made great progress thanks to well-designed network architectures and objective functions. However, existing methods lack a specific mechanism that makes the network focus on regions containing moving objects or occlusions, and they therefore tend to produce poor results in such regions. We propose an uncertainty quantification method that improves the performance of existing depth estimation networks without changing their architectures. Our method consists of three steps: uncertainty measurement, uncertainty-guided learning, and uncertainty-based adaptive determination of the final depth. First, using the Snapshot and Siam learning strategies, we measure the degree of uncertainty as the variance of the predictions of pre-converged epochs or of twin networks during training. Second, we use this uncertainty to guide the network to strengthen its learning on regions with higher uncertainty. Finally, we use the uncertainty to adaptively produce the final depth estimation, balancing accuracy and robustness. To demonstrate the effectiveness of our uncertainty quantification method, we apply it to two state-of-the-art models, Monodepth2 and Hints. Experimental results show that our method improves depth estimation performance on seven evaluation metrics over both baseline models and exceeds the existing uncertainty method.
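    The three-step pipeline described in the abstract (measure uncertainty, use it to guide learning, use it to fuse the final output) can be illustrated with a short sketch. The following PyTorch code is a minimal illustration under our own assumptions, not the authors' implementation: the snapshot models, the (1 + u) loss weighting, and the threshold tau are placeholders.

    ```python
    # Minimal sketch of the measure/guide/fuse idea (not the authors' code).
    import torch

    def snapshot_uncertainty(models, image):
        """Step 1: estimate per-pixel uncertainty as the variance of the depth
        maps predicted by several pre-converged snapshots (or Siamese twins)."""
        with torch.no_grad():
            preds = torch.stack([m(image) for m in models])  # (S, B, 1, H, W)
        return preds.mean(dim=0), preds.var(dim=0)           # mean depth, variance

    def uncertainty_weighted_loss(per_pixel_loss, uncertainty, eps=1e-6):
        """Step 2: guide learning by up-weighting the photometric loss on
        high-uncertainty regions (e.g., moving objects and occlusions)."""
        u = uncertainty / (uncertainty.max() + eps)          # normalize to [0, 1]
        return ((1.0 + u) * per_pixel_loss).mean()

    def adaptive_depth(mean_depth, single_depth, uncertainty, tau=0.05):
        """Step 3: adaptively produce the final depth, trusting the snapshot
        average where uncertainty is high and a single model elsewhere."""
        mask = (uncertainty > tau).float()
        return mask * mean_depth + (1.0 - mask) * single_depth
    ```

    How the uncertainty actually weights the loss and how the final depth is selected are defined in the paper itself; this sketch only mirrors the structure of the three steps.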

  • [1]
    Zhao X R, Wang X, Chen Q C. Temporally consistent depth map prediction using deep convolutional neural network and spatial-temporal conditional random field. Journal of Computer Science and Technology, 2017, 32(3): 443–456. DOI: 10.1007/s11390-017-1735-x.
    [2]
    Fang F, Luo F, Zhang H P, Zhou H J, Chow A L H, Xiao C X. A comprehensive pipeline for complex text-to-image synthesis. Journal of Computer Science and Technology, 2020, 35(3): 522–537. DOI: 10.1007/s11390-020-0305-9.
    [3]
    Cao T, Luo F, Fu Y P, Zhang W X, Zheng S J, Xiao C X. DGECN: A depth-guided edge convolutional network for end-to-end 6D pose estimation. In Proc. the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2022, pp.3773–3782. DOI: 10.1109/CVPR52688.2022.00376.
    [4]
    Fu Y P, Yan Q G, Liao J, Xiao C X. Joint texture and geometry optimization for RGB-D reconstruction. In Proc. the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2020, pp.5949–5958. DOI: 10.1109/CVPR42600.2020.00599.
    [5]
    Fu Y P, Yan Q G, Yang L, Liao J, Xiao C X. Texture mapping for 3D reconstruction with RGB-D sensor. In Proc. the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2018, pp.4645–4653. DOI: 10.1109/CVPR.2018.00488.
    [6]
    Fu Y P, Yan Q G, Liao J, Zhou H J, Tang J, Xiao C X. Seamless texture optimization for RGB-D reconstruction. IEEE Trans. Visualization and Computer Graphics, 2023, 29(3): 1845–1859. DOI: 10.1109/TVCG.2021.3134105.
    [7]
    Garg R, Vijay Kumar B G, Carneiro G, Reid I. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proc. the 14th European Conference on Computer Vision, Oct. 2016, pp.740–756. DOI: 10.1007/978-3-319-46484-8_45.
    [8]
    Godard C, Aodha O M, Firman M, Brostow G. Digging into self-supervised monocular depth estimation. In Proc. the 2019 IEEE/CVF International Conference on Computer Vision, Oct. 27–Nov. 2, 2019, pp.3827–3837. DOI: 10.1109/ICCV.2019.00393.
    [9]
    Bian J W, Zhan H Y, Wang N Y, Li Z C, Zhang L, Shen C H, Cheng M M, Reid I. Unsupervised scale-consistent depth learning from video. International Journal of Computer Vision, 2021, 129(9): 2548–2564. DOI: 10.1007/s11263-021-01484-6.
    [10]
    Klingner M, Termöhlen J A, Mikolajczyk J, Fingscheidt T. Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance. In Proc. the 16th European Conference on Computer Vision, Aug. 2020, pp.582–600. DOI: 10.1007/978-3-030-58565-5_35.
    [11]
    Guizilini V, Ambruș R, Pillai S, Raventos A, Gaidon A. 3D packing for self-supervised monocular depth estimation. In Proc. the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2020, pp.2482–2491. DOI: 10.1109/CVPR42600.2020.00256.
    [12]
    Ramamonjisoa M, Firman M, Watson J, Lepetit V, Turmukhambetov D. Single image depth prediction with wavelet decomposition. In Proc. the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2021, pp.11084–11093. DOI: 10.1109/CVPR46437.2021.01094.
    [13]
    Li Y Z, Luo F, Xiao C X. Self-supervised coarse-to-fine monocular depth estimation using a lightweight attention module. Computational Visual Media, 2022, 8(4): 631–647. DOI: 10.1007/s41095-022-0279-3.
    [14]
    Watson J, Firman M, Brostow G, Turmukhambetov D. Self-supervised monocular depth hints. In Proc. the 2019 IEEE/CVF International Conference on Computer Vision, Oct. 27–Nov. 2, 2019, pp.2162–2171. DOI: 10.1109/ICCV.2019.00225.
    [15]
    Asai A, Ikami D, Aizawa K. Multi-task learning based on separable formulation of depth estimation and its uncertainty. In Proc. the 2019 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Jun. 2019, pp.21–24.
    [16]
    Mertan A, Sahin Y H, Duff D J, Unal G. A new distributional ranking loss with uncertainty: Illustrated in relative depth estimation. In Proc. the 2020 International Conference on 3D Vision, Nov. 2020, pp.1079–1088. DOI: 10.1109/3DV50981.2020.00118.
    [17]
    Teixeira L, Oswald M R, Pollefeys M, Chli M. Aerial single-view depth completion with image-guided uncertainty estimation. IEEE Robotics and Automation Letters, 2020, 5(2): 1055–1062. DOI: 10.1109/LRA.2020.2967296.
    [18]
    Choi H, Lee H, Kim S, Kim S, Kim S, Sohn K, Min D B. Adaptive confidence thresholding for monocular depth estimation. In Proc. the 2021 IEEE/CVF International Conference on Computer Vision, Oct. 2021, pp.12788–12798. DOI: 10.1109/ICCV48922.2021.01257.
    [19]
    Poggi M, Aleotti F, Tosi F, Mattoccia S. On the uncertainty of self-supervised monocular depth estimation. In Proc. the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2020, pp.3224–3234. DOI: 10.1109/CVPR42600.2020.00329.
    [20]
    Godard C, Aodha O M, Brostow G J. Unsupervised monocular depth estimation with left-right consistency. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Jul. 2017, pp.6602–6611. DOI: 10.1109/CVPR.2017.699.
    [21]
    Zhou T H, Brown M, Snavely N, Lowe D G. Unsupervised learning of depth and ego-motion from video. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Jul. 2017, pp.6612–6619. DOI: 10.1109/CVPR.2017.700.
    [22]
    Yin Z C, Shi J P. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proc. the 2018 IEEE/CVF conference on Computer Vision and Pattern Recognition, Jun. 2018, pp.1983–1992. DOI: 10.1109/CVPR.2018.00212.
    [23]
    Zou Y L, Luo Z L, Huang J B. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In Proc. the 15th European Conference on Computer Vision, Sept. 2018, pp.38–55. DOI: 10.1007/978-3-030-01228-1_3.
    [24]
    Abdar M, Pourpanah F, Hussain S, Rezazadegan D, Liu L, Ghavamzadeh M, Fieguth P, Cao X C, Khosravi A, Acharya U R, Makarenkov V, Nahavandi S. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 2021, 76: 243–297. DOI: 10.1016/j.inffus.2021.05.008.
    [25]
    Liu C, Gu J W, Kim K, Narasimhan S G, Kautz J. Neural RGB®D sensing: Depth and uncertainty from a video camera. In Proc. the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2019, pp.10978–10987. DOI: 10.1109/CVPR.2019.01124.
    [26]
    Song C X, Qi C Y, Song S X, Xiao F. Unsupervised monocular depth estimation method based on uncertainty analysis and retinex algorithm. Sensors, 2020, 20(18): 5389. DOI: 10.3390/s20185389.
    [27]
    Shen Y C, Zhang Z L, Sabuncu M R, Sun L. Real-time uncertainty estimation in computer vision via uncertainty-aware distribution distillation. In Proc. the 2021 IEEE Winter Conference on Applications of Computer Vision, Jan. 2021, pp.707–716. DOI: 10.1109/WACV48630.2021.00075.
    [28]
    Huang G, Li Y X, Pleiss G, Liu Z, Hopcroft J E, Weinberger K Q. Snapshot ensembles: Train 1, get M for free. arXiv: 1704.00109, 2017. https://doi.org/10.48550/arXiv.1704.00109, May 2023.
    [29]
    Koch G, Zemel R, Salakhutdinov R. Siamese neural networks for one-shot image recognition. In Proc. the 32nd International Conference on Machine Learning, Jul. 2015.
    [30]
    Zhao H, Gallo O, Frosio I, Kautz J. Loss functions for image restoration with neural networks. IEEE Trans. Computational Imaging, 2017, 3(1): 47–57. DOI: 10.1109/TCI.2016.2644865.
    [31]
    Wang Z, Bovik A C, Sheikh H R, Simoncelli E P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Processing, 2004, 13(4): 600–612. DOI: 10.1109/TIP.2003.819861.
    [32]
    Hirschmüller H. Accurate and efficient stereo processing by semi-global matching and mutual information. In Proc. the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Jun. 2005, pp.807–814. DOI: 10.1109/CVPR.2005.56.
    [33]
    Geiger A, Lenz P, Urtasun R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proc. the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2012, pp.3354–3361. DOI: 10.1109/CVPR.2012.6248074.
    [34]
    Eigen D, Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proc. the 2015 IEEE International Conference on Computer Vision, Dec. 2015, pp.2650–2658. DOI: 10.1109/ICCV.2015.304.
    [35]
    Ranjan A, Jampani V, Balles L, Kim K, Sun D Q, Wulff J, Black M J. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proc. the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2019, pp.12232–12241. DOI: 10.1109/CVPR.2019.01252.
    [36]
    Casser V, Pirk S, Mahjourian R, Angelova A. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In Proc. the 33rd AAAI Conference on Artificial Intelligence, Jan. 27–Feb. 1, 2019, pp.8001–8008. DOI: 10.1609/aaai.v33i01.33018001.
    [37]
    Johnston A, Carneiro G. Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In Proc. the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2020, pp.4755–4764. DOI: 10.1109/CVPR42600.2020.00481.
    [38]
    Petrovai A, Nedevschi S. Exploiting pseudo labels in a self-supervised learning framework for improved monocular depth estimation. In Proc. the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2022, pp.1568–1578. DOI: 10.1109/CVPR52688.2022.00163.
    [39]
    Mehta I, Sakurikar P, Narayanan P J. Structured adversarial training for unsupervised monocular depth estimation. In Proc. the 2018 International Conference on 3D Vision, Sept. 2018, pp.314–323. DOI: 10.1109/3DV.2018.00044.
    [40]
    Poggi M, Tosi F, Mattoccia S. Learning monocular depth estimation with unsupervised trinocular assumptions. In Proc. the 2018 International Conference on 3D Vision, Sept. 2018, pp.324–333. DOI: 10.1109/3DV.2018.00045.
