Journal of Computer Science and Technology ›› 2022, Vol. 37 ›› Issue (3): 626-640. DOI: 10.1007/s11390-022-2204-8

Special Topics: Artificial Intelligence and Pattern Recognition; Computer Graphics and Multimedia



CGTracker: Center Graph Network for One-Stage Multi-Pedestrian-Object Detection and Tracking

Xin Feng (冯欣), Senior Member, CCF, Member, IEEE, Hao-Ming Wu (吴浩铭), Yi-Hao Yin (殷一皓), and Li-Bin Lan (兰利彬), Member, CCF        

  1. College of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400054, China
  • Received:2022-02-03 Revised:2022-04-24 Accepted:2022-05-06 Online:2022-05-30 Published:2022-05-30
  • Contact: Xin Feng E-mail:xfeng@cqut.edu.cn
  • About author: Xin Feng received her B.S. degree in computer science and technology from Chongqing University, Chongqing, in 2004, and her Ph.D. degree in computer applications from Chongqing University, Chongqing, in 2011. She is currently an associate professor at Chongqing University of Technology, Chongqing. She was a postdoctoral researcher at New York University, New York, from 2014 to 2016. Her research interests include computer vision, and image and video processing.
  • Supported by:
    This work is partially supported by the Humanities and Social Sciences Planning Project of the Chinese Ministry of Education under Grant No. 17YJCZH043, the Key Project of Chongqing Technology Innovation and Application Development under Grant No. cstc2021jscx-dxwtBX0018, and the Scientific Research Foundation of Chongqing University of Technology under Grant No. 0103210650.

1. Context: Pedestrians are the most common and most important object category in real-world scenes, and are therefore of great value for tracking. Pedestrian detection and tracking is a key technology for many downstream applications, such as autonomous driving and video surveillance. Existing multi-object tracking (MOT) methods typically divide the task into three parts: object detection, feature extraction, and object association. These methods often simply apply generic techniques to each step without fully exploiting the characteristics of the target category, which incurs extra computational cost and degrades the efficiency of MOT.
2. Objective: CGTracker aims to provide an efficient one-stage method for joint object detection and multi-pedestrian tracking, enabling online tracking in real-time applications.
3. Method: Considering that pedestrians are the most common object category in real-world scenes and have particular characteristics in terms of object relationships and motion patterns, we propose a novel and efficient one-stage pedestrian detection and tracking method, named CGTracker. CGTracker detects each pedestrian as the center point of the object and directly extracts object features from the feature representation at the object center, which are used to predict an axis-aligned bounding box. Meanwhile, the detected pedestrians are organized into an object graph to facilitate the multi-object association process, in which the semantic features, displacement information, and relative position relationships of targets between two adjacent frames are used to perform reliable online tracking.
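The center-point decoding step described above can be sketched as follows. This is a minimal illustration under our own assumptions (the function name `extract_center_features`, the 3x3 local-maximum rule, and the 0.5 threshold are hypothetical), not the paper's implementation:

```python
import numpy as np

def extract_center_features(heatmap, feature_map, thresh=0.5):
    """Pick peak locations on a center heatmap and gather the feature
    vector stored at each peak (a CenterNet-style decoding step)."""
    H, W = heatmap.shape
    centers, feats = [], []
    for y in range(H):
        for x in range(W):
            v = heatmap[y, x]
            if v < thresh:
                continue
            # keep only local maxima within a 3x3 neighborhood
            y0, y1 = max(0, y - 1), min(H, y + 2)
            x0, x1 = max(0, x - 1), min(W, x + 2)
            if v >= heatmap[y0:y1, x0:x1].max():
                centers.append((x, y))              # (x, y) center coordinate
                feats.append(feature_map[:, y, x])  # C-dim feature at the center
    return centers, np.array(feats)
```

Each recovered center feature would then drive both the bounding-box regression head and the association step, which is what removes the need for a separate re-identification network.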
4. Results & Findings: We evaluated the proposed method on the popular MOT17 challenge, achieving 69.3% MOTA at 9 FPS. Extensive experimental results under widely used evaluation metrics show that, at the time of submission of this work, our method was among the best techniques on the MOT17 challenge leaderboard.
5. Conclusions: In this paper, we present a graph-based one-stage multi-pedestrian detection and tracking method, called the center graph network (CGTracker). The results show that the center features of pedestrian targets, together with the spatial relationships among them, have a significant impact on pedestrian tracking. Extensive experimental results demonstrate that CGTracker achieves state-of-the-art tracking accuracy on the MOT17 benchmark while remaining highly efficient at inference time. CGTracker is an end-to-end framework that jointly learns multi-pedestrian detection and tracking; its efficiency makes it applicable to real-time MOT applications such as autonomous driving. At present, CGTracker simply uses the distance between object center coordinates to represent object relations. In future work, we will explore better object-relation representations and information aggregation mechanisms to build more effective relational constraints. We will also investigate more useful object features to improve small-object detection and association in dense-crowd tracking scenarios.
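To make the center-distance relation mentioned above concrete, here is a small sketch of frame-to-frame association. It is our own simplification, not the paper's method: the `associate` helper, the mixing weight `alpha`, the `max_dist` gate, and the greedy matching are all assumptions, and the paper's graph-based association is richer than this.

```python
import numpy as np

def associate(feats_prev, centers_prev, feats_cur, centers_cur,
              alpha=0.5, max_dist=50.0):
    """Greedy frame-to-frame association: the affinity mixes appearance
    (cosine similarity of center features) with a normalized
    center-distance term, gated by max_dist."""
    n, m = len(feats_prev), len(feats_cur)
    if n == 0 or m == 0:
        return []
    aff = np.full((n, m), -np.inf)
    for i in range(n):
        for j in range(m):
            d = np.linalg.norm(np.subtract(centers_prev[i], centers_cur[j]))
            if d > max_dist:
                continue  # gate out implausible displacements
            fi = np.asarray(feats_prev[i], dtype=float)
            fj = np.asarray(feats_cur[j], dtype=float)
            cos = fi @ fj / (np.linalg.norm(fi) * np.linalg.norm(fj) + 1e-8)
            aff[i, j] = alpha * cos + (1 - alpha) * (1.0 - d / max_dist)
    matches, used = [], set()
    for i in np.argsort(-aff.max(axis=1)):  # most confident tracks first
        j = int(np.argmax(aff[i]))
        if np.isfinite(aff[i, j]) and j not in used:
            matches.append((int(i), j))
            used.add(j)
    return matches
```

The greedy matching here is only a stand-in; a global assignment (e.g., the Hungarian algorithm) could replace the final loop without changing the affinity design.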

Keywords: multi-object tracking, one-stage, object center, object graph

Abstract: Most current online multi-object tracking (MOT) methods include two steps: object detection and data association, where the data association step relies on both object feature extraction and affinity computation. This often incurs additional computation cost and degrades the efficiency of MOT methods. In this paper, we combine the object detection and data association modules in a unified framework, while getting rid of the extra feature extraction process, to achieve a better speed-accuracy trade-off for MOT. Considering that a pedestrian is the most common object category in real-world scenes and has particular characteristics in terms of object relationships and motion patterns, we present a novel yet efficient one-stage pedestrian detection and tracking method, named CGTracker. In particular, CGTracker detects the pedestrian target as the center point of the object, and directly extracts the object features from the feature representation of the object center point, which are used to predict the axis-aligned bounding box. Meanwhile, the detected pedestrians are constructed as an object graph to facilitate the multi-object association process, where the semantic features, displacement information and relative position relationships of the targets between two adjacent frames are used to perform reliable online tracking. CGTracker achieves a multiple object tracking accuracy (MOTA) of 69.3% and 65.3% at 9 FPS on MOT17 and MOT20, respectively. Extensive experimental results under widely-used evaluation metrics demonstrate that our method is one of the best techniques on the leaderboards for the MOT17 and MOT20 challenges at the time of submission of this work.

Key words: pedestrian detection and tracking, object center, object graph

