Journal of Computer Science and Technology ›› 2022, Vol. 37 ›› Issue (3): 719-730. DOI: 10.1007/s11390-021-1311-2

Special Section: Artificial Intelligence and Pattern Recognition



6D Object Pose Estimation in Cluttered Scenes from RGB Images

Xiao-Long Yang1,2 (杨小龙), Xiao-Hong Jia1,2,* (贾晓红), Member, CCF, Yuan Liang3 (梁缘), and Lu-Bin Fan3 (樊鲁宾)        

  1. Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
    2. University of Chinese Academy of Sciences, Beijing 100049, China
    3. Alibaba DAMO Academy, Alibaba Group, Hangzhou 311121, China
  • Received: 2021-01-22 Revised: 2021-05-26 Accepted: 2021-08-31 Online: 2022-05-30 Published: 2022-05-30
  • Contact: Xiao-Hong Jia E-mail: xhjia@amss.ac.cn
  • About author: Xiao-Hong Jia is a professor at the Key Laboratory of Mathematics Mechanization, Academy of Mathematics and Systems Science, Chinese Academy of Sciences (CAS), Beijing. She received her Bachelor's and Ph.D. degrees in mathematics from the University of Science and Technology of China, Hefei, in 2004 and 2009, respectively. Her research interests include computer graphics, computer-aided geometric design, and computational algebraic geometry.
  • Supported by:
    This work was partially supported by the National Key Research and Development Program of China under Grant No. 2021YFB1715900, the National Natural Science Foundation of China under Grant Nos. 12022117 and 61802406, the Beijing Natural Science Foundation under Grant No. Z190004, the Beijing Advanced Discipline Fund under Grant No. 115200S001, and Alibaba Group through the Alibaba Innovative Research Program.

6D object pose estimation is essential for many real-world vision and graphics applications, such as robotic grasping and manipulation, autonomous navigation, and augmented/mixed reality. Ideally, a good solution should handle objects with deformations or diverse textures, be robust to severe occlusion, sensor noise, and varying lighting conditions, and run at real-time speed. Many RGB-D based algorithms can accurately infer the pose of texture-less objects, but they place restrictions on applications, such as requiring an RGB-D sensor, and they add computational burden, which limits their broad use in everyday scenarios. Traditional methods that rely on RGB data alone cope poorly with severe occlusion and drastic illumination changes and thus struggle to deliver accurate pose estimates. To this end, we propose a fusion network that combines geometric and texture features so as to minimize the impact of heavy occlusion on feature extraction. We embed this fusion network into a two-stream network consisting of a segmentation stream and a regression stream: the former performs high-accuracy semantic segmentation for object detection, while the latter applies an efficient PnP algorithm to regress 3D-2D coordinate pairs. Finally, we place an iterative refinement module after the main network, which further improves pose accuracy through self-correction. We conduct comparative experiments on two public, widely used, and challenging datasets, YCB-Video and Occluded-LineMOD; our method leads in both accuracy and speed, which demonstrates its effectiveness. We also discuss potential improvements, pointing out a research direction for the still-unsolved case of heavy occlusion combined with texture-less surfaces, and we extend our method to further applications such as advertisement replacement and wall-decoration recommendation.

Keywords: 6D pose estimation, two-stream network, fusion feature
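
To make the two-stream design above concrete, the following is a minimal, hypothetical sketch in PyTorch: a shared encoder feeds a segmentation head, while a fusion block couples the geometric (spatial-embedding) features with texture features from the image crop before keypoint regression. The module names, channel widths, and layer choices are illustrative assumptions only, not the authors' actual network definition.

import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    # Concatenate geometric features with texture features from the image
    # crop along the channel axis, then mix them with a 1x1 convolution.
    def __init__(self, geo_ch, tex_ch, out_ch):
        super().__init__()
        self.mix = nn.Conv2d(geo_ch + tex_ch, out_ch, kernel_size=1)

    def forward(self, geo_feat, tex_feat):
        return torch.relu(self.mix(torch.cat([geo_feat, tex_feat], dim=1)))

class TwoStreamNet(nn.Module):
    # Segmentation stream and regression stream around a shared encoder.
    def __init__(self, num_classes=22, num_keypoints=8):
        super().__init__()
        self.encoder = nn.Sequential(                    # placeholder backbone
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
        self.seg_head = nn.Conv2d(128, num_classes, 1)   # segmentation stream
        self.fusion = FusionBlock(128, 128, 128)
        self.reg_head = nn.Conv2d(128, 2 * num_keypoints, 1)  # 2D keypoints

    def forward(self, image, crop_feat):
        feat = self.encoder(image)            # shared features
        seg = self.seg_head(feat)             # per-pixel object labels
        fused = self.fusion(feat, crop_feat)  # couple geometry and texture
        kpts = self.reg_head(fused)           # 2D keypoint coordinates
        return seg, kpts

# Forward pass with random tensors of matching spatial size.
net = TwoStreamNet()
seg, kpts = net(torch.randn(1, 3, 64, 64), torch.randn(1, 128, 64, 64))

The regressed 2D keypoints, paired with their known 3D model coordinates, are what the regression stream hands to the PnP solver described in the abstract below.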

Abstract: In this study, we propose a feature-fusion network that estimates 6D object pose directly from RGB images without any depth information. First, we introduce a two-stream architecture consisting of segmentation and regression streams. The segmentation stream processes the spatial embedding features and obtains the corresponding image crop; these features are then coupled with the image crop in the fusion network. Second, we use an efficient perspective-n-point (E-PnP) algorithm in the regression stream to extract robust spatial features between 3D and 2D keypoints. Finally, we perform iterative refinement with an end-to-end mechanism to improve the estimation performance. We conduct experiments on two public datasets: YCB-Video and the challenging Occluded-LineMOD. The results show that our method outperforms state-of-the-art approaches in both speed and accuracy.

Key words: two-stream network, 6D pose estimation, fusion feature
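
As a hedged illustration of the PnP step that follows the regression stream, the snippet below recovers a 6D pose from 3D-2D keypoint pairs with the EPnP solver of Lepetit et al. [43], using OpenCV's off-the-shelf implementation rather than the paper's own pipeline. The cube keypoints, camera intrinsics, and ground-truth pose are made-up values used only to synthesize correspondences; a RANSAC wrapper [10] adds robustness to outlier keypoints.

import numpy as np
import cv2

# Eight corners of a 10 cm cube centred at the origin: the 3D keypoints.
object_points = np.array(
    [[x, y, z] for x in (-0.05, 0.05)
               for y in (-0.05, 0.05)
               for z in (-0.05, 0.05)], dtype=np.float32)

# Hypothetical pinhole intrinsics (fx = fy = 600, principal point at 320, 240).
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])

# Synthesize 2D keypoints by projecting with a known pose; these stand in
# for the 3D-2D coordinate pairs regressed by the network.
rvec_gt = np.array([0.1, -0.2, 0.3])
tvec_gt = np.array([0.02, -0.01, 0.5])
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, None)

# Recover the pose with EPnP inside a RANSAC loop.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    object_points, image_points, K, None, flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)   # 3x3 rotation; tvec is the translation
print(ok, rvec.ravel(), tvec.ravel())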

[1] Brachmann E, Krull A, Michel F, Gumhold S, Shotton J, Rother C. Learning 6D object pose estimation using 3D object coordinates. In Proc. the 13th European Conference on Computer Vision, Sept. 2014, pp.536-551. DOI: 10.1007/978-3-319-10605-2.

[2] Hinterstoisser S, Holzer S, Cagniart C, Ilic S, Konolige K, Navab N, Lepetit V. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In Proc. the 2011 IEEE International Conference on Computer Vision, Nov. 2011, pp.858-865. DOI: 10.1109/ICCV.2011.6126326.

[3] Hinterstoisser S, Lepetit V, Ilic S, Holzer S, Bradski G, Konolige K, Navab N. Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Proc. the 11th Asian Conference on Computer Vision, Nov. 2012, pp.548-562. DOI: 10.1007/978-3-642-37331-2.

[4] Kehl W, Milletari F, Tombari F, Ilic S, Navab N. Deep learning of local RGB-D patches for 3D object detection and 6D pose estimation. In Proc. the 14th European Conference on Computer Vision, Oct. 2016, pp.205-220. DOI: 10.1007/978-3-319-46487-9.

[5] Rios-Cabrera R, Tuytelaars T. Discriminatively trained templates for 3D object detection: A real time scalable approach. In Proc. the 2013 IEEE International Conference on Computer Vision, Dec. 2013, pp.2048-2055. DOI: 10.1109/ICCV.2013.256.

[6] Tejani A, Tang D, Kouskouridas R, Kim T K. Latent-class hough forests for 3D object detection and pose estimation. In Proc. the 13th European Conference on Computer Vision, Sept. 2014, pp.462-477. DOI: 10.1007/978-3-319-10599-4.

[7] Wohlhart P, Lepetit V. Learning descriptors for object recognition and 3D pose estimation. In Proc. the 2015 IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp.3109-3118. DOI: 10.1109/CVPR.2015.7298930.

[8] Cao Y, Ju T, Xu J, Hu S. Extracting sharp features from RGB-D images. Computer Graphics Forum, 2017, 36(8): 138-152. DOI: 10.1111/cgf.13069.

[9] Wang C, Xu D, Zhu Y, Martin R, Lu C, Li F, Savarese S. DenseFusion: 6D object pose estimation by iterative dense fusion. In Proc. the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2019, pp.3343-3352. DOI: 10.1109/CVPR.2019.00346.

[10] Fischler M A, Bolles R C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981, 24(6): 381-395. DOI: 10.1145/358669.358692.

[11] Xiang Y, Schmidt T, Narayanan V, Fox D. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. In Proc. the 14th Robotics: Science and Systems, June 2018. DOI: 10.15607/RSS.2018.XIV.019.

[12] Krull A, Brachmann E, Michel F, Yang M Y, Gumhold S, Rother C. Learning analysis-by-synthesis for 6D pose estimation in RGB-D images. In Proc. the 2015 IEEE International Conference on Computer Vision, Dec. 2015, pp.954-962. DOI: 10.1109/ICCV.2015.115.

[13] Qi C R, Su H, Mo K, Guibas L J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.652-660. DOI: 10.1109/CVPR.2017.16.

[14] Hu Y, Hugonot J, Fua P, Salzmann M. Segmentation-driven 6D object pose estimation. In Proc. the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2019, pp.3385-3394. DOI: 10.1109/CVPR.2019.00350.

[15] Xu D, Anguelov D, Jain A. PointFusion: Deep sensor fusion for 3D bounding box estimation. In Proc. the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2018, pp.244-253. DOI: 10.1109/CVPR.2018.00033.

[16] Qi C R, Liu W, Wu C, Su H, Guibas L J. Frustum PointNets for 3D object detection from RGB-D data. In Proc. the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2018, pp.7918-7927. DOI: 10.1109/CVPR.2018.00102.

[17] Yang X L, Jia X H. 6D pose estimation with two-stream net. In Proc. the 2020 ACM SIGGRAPH, Aug. 2020, Article No. 40. DOI: 10.1145/3388770.3407423.

[18] Song S, Xiao J. Sliding shapes for 3D object detection in depth images. In Proc. the 13th European Conference on Computer Vision, Sept. 2014, pp.634-651. DOI: 10.1007/978-3-319-10599-4.

[19] Song S, Xiao J. Deep sliding shapes for Amodal 3D object detection in RGB-D images. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp.808-816. DOI: 10.1109/CVPR.2016.94.

[20] Geiger A, Lenz P, Urtasun R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proc. the 2012 IEEE Conference on Computer Vision and Pattern Recognition, June 2012, pp.3354-3361. DOI: 10.1109/CVPR.2012.6248074.

[21] Aubry M, Maturana D, Efros A A, Russell B C, Sivic J. Seeing 3D chairs: Exemplar part-based 2D-3D alignment using a large dataset of CAD models. In Proc. the 2014 IEEE Conference on Computer Vision and Pattern Recognition, June 2014, pp.3762-3769. DOI: 10.1109/CVPR.2014.487.

[22] Collet A, Martinez M, Srinivasa S S. The MOPED framework: Object recognition and pose estimation for manipulation. International Journal of Robotics Research, 2011, 30(10): 1284-1306. DOI: 10.1177/0278364911401765.

[23] Ferrari V, Tuytelaars T, Gool L V. Simultaneous object recognition and segmentation from single or multiple model views. International Journal of Computer Vision, 2006, 67(2): 159-188. DOI: 10.1007/s11263-005-3964-7.

[24] Rothganger F, Lazebnik S, Schmid C, Ponce J. 3D object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. International Journal of Computer Vision, 2006, 66(3): 231-259. DOI: 10.1007/s11263-005-3674-1.

[25] Zhu M, Derpanis K G, Yang Y, Brahmbhatt S, Zhang M, Phillips C, Lecce M, Daniilidis K. Single image 3D object detection and pose estimation for grasping. In Proc. the 2014 IEEE International Conference on Robotics and Automation, May 31-June 7, 2014, pp.3936-3943. DOI: 10.1109/ICRA.2014.6907430.

[26] Nakajima Y, Saito H. Robust camera pose estimation by viewpoint classification using deep learning. Computational Visual Media, 2017, 3(2): 189-198. DOI: 10.1007/s41095-016-0067-z.

[27] Suwajanakorn S, Snavely N, Tompson J J, Norouzi M. Discovery of latent 3D keypoints via end-to-end geometric reasoning. In Proc. the 2018 Annual Conference on Neural Information Processing Systems, Dec. 2018, pp.2059-2070.

[28] Tekin B, Sinha S N, Fua P. Real-time seamless single shot 6D object pose prediction. In Proc. the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2018, pp.292-301. DOI: 10.1109/CVPR.2018.00038.

[29] Tremblay J, To T, Sundaralingam B, Xiang Y, Fox D, Birchfield S. Deep object pose estimation for semantic robotic grasping of household objects. In Proc. the 2nd Conference on Robot Learning, Oct. 2018, pp.306-316.

[30] Schwarz M, Schulz H, Behnke S. RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features. In Proc. the 2015 IEEE International Conference on Robotics and Automation, May 2015, pp.1329-1335. DOI: 10.1109/ICRA.2015.7139363.

[31] Tulsiani S, Malik J. Viewpoints and keypoints. In Proc. the 2015 IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp.1510-1519. DOI: 10.1109/CVPR.2015.7298758.

[32] Mousavian A, Anguelov D, Flynn J, Kosecka J. 3D bounding box estimation using deep learning and geometry. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.7074-7082. DOI: 10.1109/CVPR.2017.597.

[33] Sundermeyer M, Marton Z C, Durner M, Brucker M, Triebel R. Implicit 3D orientation learning for 6D object detection from RGB images. In Proc. the 15th European Conference on Computer Vision, Sept. 2018, pp.699-715. DOI: 10.1007/978-3-030-01231-1.

[34] Billings G, Johnson-Roberson M. SilhoNet: An RGB method for 6D object pose estimation. IEEE Robotics and Automation Letters, 2019, 4(4): 3727-3734. DOI: 10.1109/LRA.2019.2928776.

[35] Park K, Patten T, Vincze M. Pix2Pose: Pixel-wise coordinate regression of objects for 6D pose estimation. In Proc. the 2019 IEEE/CVF International Conference on Computer Vision, Oct. 27-Nov. 2, 2019, pp.7668-7677. DOI: 10.1109/ICCV.2019.00776.

[36] Castro P, Armagan A, Kim T K. Accurate 6D object pose estimation by pose conditioned mesh reconstruction. arXiv:1910.10653, 2019. https://arxiv.org/pdf/1910.10653.pdf, Jan. 2022.

[37] Kalchbrenner N, Grefenstette E, Blunsom P. A convolutional neural network for modelling sentences. arXiv:1404.2188, 2014. https://arxiv.org/pdf/1404.2188.pdf, Jan. 2022.

[38] Badrinarayanan V, Kendall A, Cipolla R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495. DOI: 10.1109/TPAMI.2016.2644615.

[39] Li C, Bai J, Hager G D. A unified framework for multi-view multi-class object pose estimation. In Proc. the 15th European Conference on Computer Vision, Sept. 2018, pp.254-269. DOI: 10.1007/978-3-030-01270-0.

[40] Redmon J, Farhadi A. YOLOv3: An incremental improvement. arXiv:1804.02767, 2018. https://arxiv.org/pdf/1804.02767.pdf, Jan. 2022.

[41] Bochkovskiy A, Wang C, Liao H M. YOLOv4: Optimal speed and accuracy of object detection. arXiv:2004.10934, 2020. https://arxiv.org/pdf/2004.10934.pdf, Jan. 2022.

[42] Lin T, Goyal P, Girshick R, He K, Dollar P. Focal loss for dense object detection. In Proc. the 2017 IEEE International Conference on Computer Vision, Oct. 2017, pp.2980-2988. DOI: 10.1109/ICCV.2017.324.

[43] Lepetit V, Moreno-Noguer F, Fua P. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 2009, 81(2): Article No. 155. DOI: 10.1007/s11263-008-0152-6.

[44] Rad M, Lepetit V. BB8: A scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In Proc. the 2017 IEEE International Conference on Computer Vision, Oct. 2017, pp.3828-3836. DOI: 10.1109/ICCV.2017.413.

[45] Oberweger M, Rad M, Lepetit V. Making deep heatmaps robust to partial occlusions for 3D object pose estimation. In Proc. the 15th European Conference on Computer Vision, Sept. 2018, pp.119-134. DOI: 10.1007/978-3-030-01267-0.

[46] Peng S, Liu Y, Huang Q, Zhou X, Bao H. PVNet: Pixel-wise voting network for 6DoF pose estimation. In Proc. the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2019, pp.4561-4570. DOI: 10.1109/CVPR.2019.00469.

[47] Li Z, Wang G, Ji X. CDPN: Coordinates-based disentangled pose network for real-time RGB-based 6-DoF object pose estimation. In Proc. the 2019 IEEE/CVF International Conference on Computer Vision, Oct. 27-Nov. 2, 2019, pp.7678-7687. DOI: 10.1109/ICCV.2019.00777.

[48] Everingham M, Van Gool L, Williams C K I, Winn J, Zisserman A. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 2010, 88(2): 303-338. DOI: 10.1007/s11263-009-0275-4.

[49] Liang Y, Fan L, Ren P, Xie X, Hua X. DecorIn: An automatic method for plane-based decorating. IEEE Transactions on Visualization and Computer Graphics, 2021, 27(8): 3438-3450. DOI: 10.1109/TVCG.2020.2972897.
