Journal of Computer Science and Technology ›› 2022, Vol. 37 ›› Issue (3): 652-665. doi: 10.1007/s11390-022-2189-3
Special Topic: Artificial Intelligence and Pattern Recognition; Computer Graphics and Multimedia
Yan Tao (陶琰), Yi-Teng Zhang (张翼腾), and Xue-Jin Chen* (陈雪锦), Senior Member, CCF, Member, ACM, IEEE
1. Context: Facade parsing aims to identify and label the various elements of a building facade (e.g., windows, sills, doors, and chimneys) in a facade image and to extract geometric parameters such as the position and size of each element, yielding a structured representation of the facade. Facade parsing has wide applications across computer vision and graphics, including building reconstruction, procedural modeling, urban planning, augmented/virtual reality, and urban navigation. By providing rich semantic and geometric detail of facades, it plays an important role in improving the visual quality and realism of 3D building models. Traditional facade parsing methods typically rely on hand-crafted priors and elaborate rules and processing pipelines, and lack robustness on complex facade scenes. In recent years, deep-learning-based methods have been applied to facade parsing and have substantially improved the semantic segmentation of facades. However, most of these methods do not fully exploit the structural priors of building facades, including the rectangular shape and symmetry of facade elements and the regularity of facade layouts; these priors are crucial for obtaining well-arranged and structurally complete parsing results.
2. Objective: This work aims to introduce the structural priors of building facades into the facade parsing task. Built on an object detection network, it models the arrangement context among facade elements, exploits the spatial correlations of the facade in the horizontal and vertical directions, and aggregates local element features with global facade features, so as to improve the accuracy of element recognition and localization and produce well-arranged, structurally complete parsing results.
3. Method: Since most facade elements are rectangular, our facade parsing network adopts an object detection network as its backbone, and we design a new Element-Arrangement Context Network (EACNet). To fully exploit facade layout context, we design a novel Element-Arrangement Context Module (EACM), in which two parallel unidirectional attention branches collect column-wise and row-wise spatial context, respectively. The two context streams are then aggregated with local image features and fed to a detector that predicts the semantics and geometry of facade elements. This dual-branch unidirectional context aggregation mechanism fully exploits the regular spatial arrangement and appearance similarity of facade elements. Experiments on four public datasets (Graz50, ECP, CMP, and eTRIMS) achieve state-of-the-art performance, validating the effectiveness and robustness of the proposed method.
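The dual-branch context aggregation described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the learned unidirectional attention of EACM is replaced by uniform averaging along each axis for brevity, and the function name `eac_context` is hypothetical.

```python
import numpy as np

def eac_context(feat):
    """Simplified sketch of element-arrangement context aggregation.

    Two parallel branches gather column-wise and row-wise context
    from a feature map, which is then fused with the local features.
    (EACM uses learned unidirectional attention; plain averaging
    stands in for it here.)

    feat: (C, H, W) feature map.
    Returns a (3*C, H, W) fused map: local + column + row context.
    """
    C, H, W = feat.shape
    # Column branch: aggregate along the vertical axis -> one context
    # vector per image column, shared by all rows of that column.
    col_ctx = np.broadcast_to(feat.mean(axis=1, keepdims=True), (C, H, W))
    # Row branch: aggregate along the horizontal axis -> one context
    # vector per image row, shared by all columns of that row.
    row_ctx = np.broadcast_to(feat.mean(axis=2, keepdims=True), (C, H, W))
    # Fuse local features with both context streams (channel concat);
    # a detector head would consume this fused map.
    return np.concatenate([feat, col_ctx, row_ctx], axis=0)

fused = eac_context(np.random.rand(8, 16, 16))
print(fused.shape)  # (24, 16, 16)
```

Because facade elements such as windows repeat along rows and columns, sharing context along each axis lets the detector see the arrangement regularity that a purely local receptive field would miss.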
4. Result & Findings: We evaluate the proposed facade parsing network EACNet extensively on four public datasets (Graz50, ECP, CMP, and eTRIMS). Compared with state-of-the-art segmentation-based facade parsing methods, EACNet achieves the highest mean pixel accuracy on Graz50 and the highest intersection-over-union (IoU) on ECP. On CMP, extensive ablation studies and comparisons with existing attention-based detection methods validate the effectiveness of EACNet. For the oblique-view street images of eTRIMS, combined with an established perspective rectification method, EACNet improves mean parsing accuracy by nearly 8% over the best existing segmentation-based facade parsing methods.
5. Conclusions: The proposed element-arrangement-context facade parsing network (EACNet) fully exploits structural priors of facade elements such as symmetry and layout regularity. EACNet guides the parsing network to attend to the horizontal and vertical correlations among facade elements, effectively improving parsing accuracy and demonstrating the importance of introducing facade structural priors into deep networks for improving facade parsing results.