Journal of Computer Science and Technology ›› 2022, Vol. 37 ›› Issue (3): 641-651. DOI: 10.1007/s11390-022-2146-1

Special Topic: Artificial Intelligence and Pattern Recognition; Computer Graphics and Multimedia


Learn Robust Pedestrian Representation Within Minimal Modality Discrepancy for Visible-Infrared Person Re-Identification

Yu-Jie Liu (刘玉杰), Member, CCF, Wen-Bin Shao* (邵文斌), and Xiao-Rui Sun (孙晓瑞)        

  1. College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
  • Received:2022-01-06 Revised:2022-04-14 Accepted:2022-04-14 Online:2022-05-30 Published:2022-05-30
  • Contact: Wen-Bin Shao E-mail: wbShao@s.upc.edu.cn
  • About author: Wen-Bin Shao received his B.S. degree in computer science and technology from Shandong University of Technology, Zibo, in 2020. He is currently a master's student at the College of Computer Science and Technology, China University of Petroleum (East China), Qingdao. His research interests include person re-identification, computer vision, and multimedia.
  • Supported by:
    This work was supported by the National Key Research and Development Program of China under Grant No. 2019YFF0301800, the National Natural Science Foundation of China under Grant No. 61379106, and the Shandong Provincial Natural Science Foundation under Grant Nos. ZR2013FM036 and ZR2015FM011.

1. Context: Person re-identification is an image retrieval problem: given a query image of a pedestrian, it aims to retrieve images of the same identity from a gallery captured by multiple different cameras. Appearance variations caused by pose changes, viewpoint changes, and occlusion make the task highly challenging. Despite these difficulties, with the development of deep learning, RGB-based person re-identification has made great progress and achieved high accuracy. However, RGB surveillance cameras cannot capture clear pedestrian images under low-light conditions, which limits the application scenarios of single-modality re-identification. In real-world deployments, smart surveillance cameras switch automatically between the RGB mode and the infrared mode according to the illumination, and the resulting cross-modality (visible-infrared) person re-identification problem has attracted wide attention from the academic community. Because of the huge modality discrepancy between RGB and infrared images, previous single-modality methods do not carry over to the cross-modality setting, and the modality discrepancy has become the central concern of this research area.
2. Objective: Existing work either uses generative adversarial networks to synthesize fake cross-modality images for image-level alignment, or carefully designs network architectures based on metric learning and representation learning to extract modality-shared features that alleviate the modality discrepancy. However, these methods overlook a computationally trivial yet effective way to shrink the visual modality discrepancy: converting RGB images directly to grayscale. In this paper, we transform the cross-modality matching task from infrared-versus-RGB to infrared-versus-grayscale (a sketch of the conversion follows below). Most existing representation learning and metric learning methods take the features of the last convolutional layer as the final pedestrian representation; such features are highly semantic but lack detail, and detailed cues are decisive for identity in gray-infrared re-identification. In addition, pose changes, viewpoint changes, and occlusion demand robust image features. To address these issues, we propose a pyramid feature integration network that mines discriminative refined features of pedestrian images within the minimal modality discrepancy and fuses them with high-level semantic features to build a robust global pedestrian representation.
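The direct conversion mentioned above is a per-pixel weighted sum of the three color channels. A minimal sketch, assuming the standard ITU-R BT.601 luma weights (the paper only states that three multiplications are needed, so the exact coefficients here are our assumption):

```python
import numpy as np

def rgb_to_gray(img: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 RGB image to grayscale with three multiplications
    per pixel (assumed ITU-R BT.601 weights), then replicate the channel so
    the result still fits a standard 3-channel backbone."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = img.astype(np.float32) @ weights      # weighted sum over the channel axis
    return np.repeat(gray[..., None], 3, axis=-1)
```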
3. Method: In this paper, we transform cross-modality pedestrian matching from infrared-versus-RGB to infrared-versus-grayscale. Compared with RGB, the modality discrepancy between grayscale and infrared images is drastically reduced, and the two modalities are visually very similar, so we call this state the minimal visual modality discrepancy. Compared with methods based on generative adversarial networks, the direct conversion is almost free, requiring only three multiplications per pixel; moreover, the converted images are more natural and of higher quality, effectively preserving identity-discriminative information. Despite these advantages, a grayscale image loses information relative to RGB, such as color, and under this condition existing feature extraction architectures are insufficient to capture identity-discriminative features. To address this, we propose a pyramid-structured feature integration network that mines discriminative refined features within the minimal modality discrepancy and fuses them with high-level semantic features to build a robust global representation. After conversion, the input image passes through a pyramid-structured information modeling module that performs fine-to-coarse feature extraction and top-down semantic transfer to produce multi-scale feature maps. Each scale's feature map is fed into its own discriminative-region response module, which uses a spatial attention mechanism to highlight identity-discriminative regions. The multi-scale feature maps are then fused into the final robust global pedestrian representation (see the sketch below).
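No code accompanies this abstract, so the following PyTorch sketch is only our reading of the pipeline just described: a ResNet-50 backbone for fine-to-coarse extraction, FPN-style lateral connections with top-down semantic transfer, a CBAM-style spatial attention block standing in for the discriminative-region response module, and concatenation of pooled per-scale features as the fused representation. All module names and dimensions are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: pool over channels, predict a response map."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)            # channel-average map
        mx, _ = x.max(dim=1, keepdim=True)           # channel-max map
        mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * mask                              # emphasize discriminative regions

class PyramidFusionNet(nn.Module):
    """Sketch: fine-to-coarse extraction, top-down transfer, attention, fusion."""
    def __init__(self, dim: int = 256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        self.laterals = nn.ModuleList(
            [nn.Conv2d(c, dim, 1) for c in (256, 512, 1024, 2048)])
        self.attn = nn.ModuleList([SpatialAttention() for _ in range(4)])

    def forward(self, x):
        feats, h = [], self.stem(x)
        for stage in self.stages:                    # fine-to-coarse feature extraction
            h = stage(h)
            feats.append(h)
        p = [lat(f) for lat, f in zip(self.laterals, feats)]
        for i in range(len(p) - 2, -1, -1):          # top-down semantic transfer
            p[i] = p[i] + F.interpolate(p[i + 1], size=p[i].shape[-2:], mode="nearest")
        pooled = [self.attn[i](p[i]).mean(dim=(2, 3)) for i in range(4)]
        return torch.cat(pooled, dim=1)              # fused global representation
```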
4. Results & Findings: The proposed pyramid-structured feature integration network outperforms the best existing methods by a large margin. Under the multi-shot and indoor-search evaluation mode it already reaches the level of single-modality re-identification, achieving a rank-1 accuracy of 91.53% and an mAP of 86.82%; this is the first time the rank-1 accuracy exceeds 90%, which shows the significance of this work for visible-infrared person re-identification research. The proposed method surpasses the current state of the art, MPANet, under all evaluation modes; in particular, the rank-1 accuracy improves by 11.8% under the single-shot and indoor-search mode, and the mAP improves by 11.58% under the multi-shot and all-search mode.
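For readers unfamiliar with the reported metrics, the toy sketch below shows how rank-1 accuracy and mAP are conventionally computed in re-identification. It assumes L2-normalized feature matrices and omits the SYSU-MM01 camera-filtering rules, so it is illustrative only, not the paper's evaluation code.

```python
import numpy as np

def rank1_and_map(qf, gf, q_ids, g_ids):
    """Toy CMC rank-1 / mAP; qf, gf are L2-normalized query/gallery features,
    q_ids, g_ids the corresponding identity labels (NumPy arrays)."""
    sim = qf @ gf.T                              # cosine similarity, queries x gallery
    order = np.argsort(-sim, axis=1)             # gallery sorted by decreasing similarity
    rank1_hits, average_precisions = [], []
    for i in range(len(qf)):
        matches = (g_ids[order[i]] == q_ids[i])
        if not matches.any():
            continue                             # query identity absent from gallery
        rank1_hits.append(float(matches[0]))
        hit_positions = np.flatnonzero(matches)  # 0-indexed ranks of true matches
        precision_at_hits = (np.arange(len(hit_positions)) + 1) / (hit_positions + 1)
        average_precisions.append(precision_at_hits.mean())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))
```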
5. Conclusions: This paper proposes a method for learning robust pedestrian representations within minimal modality discrepancy. First, by converting RGB images directly to grayscale, the cross-modality matching task is transformed from RGB-versus-infrared to grayscale-versus-infrared; we call this new state the minimal modality discrepancy. A pyramid-structured feature integration network is then proposed to capture effective identity-discriminative information within the minimal modality discrepancy and build robust pedestrian representations. Experimental results demonstrate the effectiveness of the proposed method, and in particular of the minimal modality discrepancy, a solution that researchers have long overlooked even though it is computationally trivial and greatly alleviates the modality discrepancy. We hope the proposed method draws the attention of the visible-infrared person re-identification community, offers the field a new perspective, fosters more excellent work, and further advances the practical application of this technology.


Abstract: Visible-infrared person re-identification has attracted extensive attention from the community due to its great application potential in video surveillance. There are huge modality discrepancies between visible and infrared images caused by the different imaging mechanisms. Existing studies alleviate the modality discrepancies by aligning modality distributions or extracting modality-shared features from the original images. However, they ignore a key solution: converting visible images directly to gray images, which is efficient and effective for reducing the modality discrepancies. In this paper, we transform the cross-modality person re-identification task from visible-infrared matching to gray-infrared matching, a state we name the minimal modality discrepancy. In addition, we propose a pyramid feature integration network (PFINet) that mines the discriminative refined features of pedestrian images and fuses them with high-level, semantically strong features to build a robust pedestrian representation. Specifically, PFINet first performs feature extraction from concrete to abstract and top-down semantic transfer to obtain multi-scale feature maps. Second, the multi-scale feature maps are fed into the discriminative-region response module, which emphasizes identity-discriminative regions through a spatial attention mechanism. Finally, the pedestrian representation is obtained by feature integration. Extensive experiments demonstrate the effectiveness of PFINet, which achieves a rank-1 accuracy of 81.95% and an mAP of 74.49% under the multi-all evaluation mode of the SYSU-MM01 dataset.

Key words: person re-identification, modality discrepancy, discriminative feature

[1] Ye M, Shen J, Lin G, Xiang T, Shao L, Hoi S C. Deep learning for person re-identification: A survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence. DOI: 10.1109/TPAMI.2021.3054775.

[2] Zeng M, Yao B, Wang Z J, Shen Y, Li F, Zhang J, Lin H, Guo M. CATIRI: An efficient method for content-and-text based image retrieval. Journal of Computer Science and Technology, 2019, 34(2): 287-304. DOI: 10.1007/s11390-019-1911-2.

[3] Sun Y, Zheng L, Yang Y, Tian Q, Wang S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proc. the 15th European Conference on Computer Vision, Sept. 2018, pp.480-496. DOI: 10.1007/978-3-030-01225-0.

[4] Zhang X, Luo H, Fan X, Xiang W, Sun Y, Xiao Q, Jiang W, Zhang C, Sun J. AlignedReID: Surpassing human-level performance in person re-identification. arXiv:1711.08184, 2017. https://arxiv.org/pdf/1711.08184.pdf, Jan. 2022.

[5] Zhong Z, Zheng L, Cao D, Li S. Re-ranking person re-identification with k-reciprocal encoding. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.1318-1327. DOI: 10.1109/CVPR.2017.389.

[6] Wu A, Zheng W S, Yu H X, Gong S, Lai J. RGB-infrared cross-modality person re-identification. In Proc. the 2017 IEEE International Conference on Computer Vision, Oct. 2017, pp.5380-5389. DOI: 10.1109/ICCV.2017.575.

[7] Dai P, Ji R, Wang H, Wu Q, Huang Y. Cross-modality person re-identification with generative adversarial training. In Proc. the 27th International Joint Conference on Artificial Intelligence, July 2018, pp.677-683. DOI: 10.24963/ijcai.2018/94.

[8] Wang G A, Zhang T, Cheng J, Liu S, Yang Y, Hou Z. RGB-infrared cross-modality person re-identification via joint pixel and feature alignment. In Proc. the 2019 IEEE/CVF International Conference on Computer Vision, Oct. 27-Nov. 2, 2019, pp.3623-3632. DOI: 10.1109/ICCV.2019.00372.

[9] Wang G A, Zhang T, Yang Y, Cheng J, Chang J, Liang X, Hou Z G. Cross-modality paired-images generation for RGB-infrared person re-identification. In Proc. the 34th AAAI Conference on Artificial Intelligence, Feb. 2020, pp.12144-12151. DOI: 10.1609/aaai.v34i07.6894.

[10] Zhao Z, Liu B, Chu Q, Lu Y, Yu N. Joint color-irrelevant consistency learning and identity-aware modality adaptation for visible-infrared cross modality person re-identification. In Proc. the 35th AAAI Conference on Artificial Intelligence, Feb. 2021, pp.3520-3528.

[11] Chen Y, Wan L, Li Z, Jing Q, Sun Z. Neural feature search for RGB-infrared person re-identification. In Proc. the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2021, pp.587-597. DOI: 10.1109/CVPR46437.2021.00065.

[12] Lu Y, Wu Y, Liu B, Zhang T, Li B, Chu Q, Yu N. Cross-modality person re-identification with shared-specific feature transfer. In Proc. the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020, pp.13376-13386. DOI: 10.1109/CVPR42600.2020.01339.

[13] Zhu Y, Yang Z, Wang L, Zhao S, Hu X, Tao D. Hetero-center loss for cross-modality person re-identification. Neurocomputing, 2020, 386: 97-109. DOI: 10.1016/j.neucom.2019.12.100.

[14] Wu Q, Dai P, Chen J, Lin C W, Wu Y, Huang F, Zhong B, Ji R. Discover cross-modality nuances for visible-infrared person re-identification. In Proc. the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2021, pp.4330-4339. DOI: 10.1109/CVPR46437.2021.00431.

[15] Ding S, Lin L, Wang G, Chao H. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition, 2015, 48(10): 2993-3003. DOI: 10.1016/j.patcog.2015.04.005.

[16] Chen W, Chen X, Zhang J, Huang K. Beyond triplet loss: A deep quadruplet network for person re-identification. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.403-412. DOI: 10.1109/CVPR.2017.145.

[17] Hermans A, Beyer L, Leibe B. In defense of the triplet loss for person re-identification. arXiv:1703.07737, 2017. https://arxiv.org/pdf/1703.07737.pdf, Jan. 2022.

[18] Zheng L, Zhang H, Sun S, Chandraker M, Yang Y, Tian Q. Person re-identification in the wild. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.1367-1376. DOI: 10.1109/CVPR.2017.357.

[19] Qian X, Fu Y, Jiang Y G, Xiang T, Xue X. Multi-scale deep learning architectures for person re-identification. In Proc. the 2017 IEEE International Conference on Computer Vision, Oct. 2017, pp.5399-5408. DOI: 10.1109/ICCV.2017.577.

[20] Sun Y, Zheng L, Deng W, Wang S. SVDNet for pedestrian retrieval. In Proc. the 2017 IEEE International Conference on Computer Vision, Oct. 2017, pp.3800-3808. DOI: 10.1109/ICCV.2017.410.

[21] Guo J, Yuan Y, Huang L, Zhang C, Yao J G, Han K. Beyond human parts: Dual part-aligned representations for person re-identification. In Proc. the 2019 IEEE/CVF International Conference on Computer Vision, Oct. 27-Nov. 2, 2019, pp.3642-3651. DOI: 10.1109/ICCV.2019.00374.

[22] Zhao H, Tian M, Sun S, Shao J, Yan J, Yi S, Wang X, Tang X. Spindle net: Person re-identification with human body region guided feature decomposition and fusion. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.1077-1085. DOI: 10.1109/CVPR.2017.103.

[23] Gao S, Wang J, Lu H, Liu Z. Pose-guided visible part matching for occluded person ReID. In Proc. the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020, pp.11744-11752. DOI: 10.1109/CVPR42600.2020.01176.

[24] Ge Y, Zhu F, Chen D et al. Self-paced contrastive learning with hybrid memory for domain adaptive object Re-ID. In Proc. the Annual Conference on Neural Information Processing Systems, Dec. 2020.

[25] Ge Y, Chen D, Li H. Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification. arXiv:2001.01526, 2020. https://arxiv.org/pdf/2001.01526.pdf, Jan. 2022.

[26] Chen H, Wang Y, Lagadec B, Dantcheva A, Bremond F. Joint generative and contrastive learning for unsupervised person re-identification. In Proc. the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2021, pp.2004-2013. DOI: 10.1109/CVPR46437.2021.00204.

[27] Wang Z, Wang Z, Zheng Y, Chuang Y Y, Satoh S. Learning to reduce dual-level discrepancy for infrared-visible person re-identification. In Proc. the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2019, pp.618-626. DOI: 10.1109/CVPR.2019.00071.

[28] Ye M, Lan X, Wang Z, Yuen P C. Bi-directional center-constrained top-ranking for visible thermal person re-identification. IEEE Transactions on Information Forensics and Security, 2020, 15: 407-419. DOI: 10.1109/TIFS.2019.2921454.

[29] Hao Y, Wang N, Li J, Gao X. HSME: Hypersphere manifold embedding for visible thermal person re-identification. In Proc. the 33rd AAAI Conference on Artificial Intelligence, Jan. 27-Feb. 1, 2019, pp.8385-8392. DOI: 10.1609/aaai.v33i01.33018385.

[30] Ye M, Lan X, Leng Q, Shen J. Cross-modality person re-identification via modality-aware collaborative ensemble learning. IEEE Transactions on Image Processing, 2020, 29: 9387-9399. DOI: 10.1109/TIP.2020.2998275.

[31] Jia M, Zhai Y, Lu S, Ma S, Zhang J. A similarity inference metric for RGB-infrared cross-modality person re-identification. In Proc. the 29th International Joint Conference on Artificial Intelligence, Jan. 2021, pp.1026-1032. DOI: 10.24963/ijcai.2020/143.

[32] Ye M, Shen J, Crandall D J, Shao L, Luo J. Dynamic dual-attentive aggregation learning for visible-infrared person re-identification. In Proc. the 16th European Conference on Computer Vision, Aug. 2020, pp.229-247. DOI: 10.1007/978-3-030-58520-4.

[33] Li D, Wei X, Hong X, Gong Y. Infrared-visible cross-modal person re-identification with an X modality. In Proc. the 34th AAAI Conference on Artificial Intelligence, Feb. 2020, pp.4610-4617. DOI: 10.1609/aaai.v34i04.5891.

[34] Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In Proc. the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2018, pp.7132-7141. DOI: 10.1109/CVPR.2018.00745.

[35] Woo S, Park J, Lee J Y, Kweon I S. CBAM: Convolutional block attention module. In Proc. the 15th European Conference on Computer Vision, Sept. 2018, pp.3-19. DOI: 10.1007/978-3-030-01234-2.

[36] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I. Attention is all you need. In Proc. the 2017 Annual Conference on Neural Information Processing Systems, Dec. 2017, pp.5998-6008.

[37] Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, Lu H. Dual attention network for scene segmentation. In Proc. the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2019, pp.3146-3154. DOI: 10.1109/CVPR.2019.00326.

[38] Lin T Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.2117-2125. DOI: 10.1109/CVPR.2017.106.

[39] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp.770-778. DOI: 10.1109/CVPR.2016.90.

[40] Nguyen D T, Hong H G, Kim K W, Park K R. Person recognition system based on a combination of body images from visible light and thermal cameras. Sensors, 2017, 17(3): Article No. 605. DOI: 10.3390/s17030605.

[41] Kingma D P, Ba J. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014. https://arxiv.org/pdf/1412.6980.pdf, Jan. 2022.
