Journal of Computer Science and Technology ›› 2020, Vol. 35 ›› Issue (3): 538-550. DOI: 10.1007/s11390-020-0405-6

Special Topics: Artificial Intelligence and Pattern Recognition; Computer Graphics and Multimedia

• Special Section of CVM 2020 •


Two-Stream Temporal Convolutional Networks for Skeleton-Based Human Action Recognition

Jin-Gong Jia1, Yuan-Feng Zhou1,*, Senior Member, CCF, Xing-Wei Hao1, Feng Li1, Christian Desrosiers2, Cai-Ming Zhang1, Senior Member, CCF        

  1. School of Software, Shandong University, Jinan 250101, China;
    2. Department of Software and IT Engineering, University of Quebec, Montreal H3C 3P8, Canada
  • Received: 2020-02-29; Revised: 2020-04-05; Online: 2020-05-28; Published: 2020-05-28
  • Contact: Yuan-Feng Zhou, E-mail: yfzhou@sdu.edu.cn
  • About author: Jin-Gong Jia received his B.S. degree in software engineering from the School of Information and Electrical Engineering, Ludong University, Yantai, in 2018. He is currently pursuing his Master's degree in software engineering at the School of Software, Shandong University, Jinan. His research interests include computer vision, human action recognition, and human pose estimation.
  • Supported by:
    The work was supported by the National Natural Science Foundation of China (NSFC)-Zhejiang Joint Fund for the Integration of Industrialization and Informatization under Grant Nos. U1909210 and U1609218, the National Natural Science Foundation of China under Grant No. 61772312, and the Key Research and Development Project of Shandong Province of China under Grant No. 2017GGX10110.

With the growing popularity of somatosensory interaction devices, human action recognition is becoming attractive in many application scenarios. Skeleton-based action recognition is simple and efficient, because the skeleton can represent the positions and structure of key points of the human body. In this paper, we use spatiotemporal vectors between skeleton sequences as the input features of the network; compared with representations based on distance and angle features, they are more sensitive to changes of the human skeleton. In addition, we redesign residual blocks with different temporal spans at different depths of the network to improve the ability of the temporal convolutional network (TCN) to handle actions with long-term temporal dependencies. In this work, we propose a two-stream temporal convolutional network (TS-TCN) that takes full advantage of the inter-frame and intra-frame vector features of skeleton sequences in the spatiotemporal representation. The framework can integrate different feature representations of skeleton sequences, so that the two representations compensate for each other's shortcomings when recognizing actions. A fused loss function is used to supervise the training parameters of the two branch networks. Experiments on four large public datasets show that the proposed two-stream network achieves superior performance; in particular, it reaches a recognition accuracy of 90.2% on the test set of the most widely used NTU RGB+D dataset, which further demonstrates the feasibility of the designed network. Integrating inter-frame and intra-frame vector feature representations is currently one of the most effective ways to improve the accuracy of human action recognition, which gives the two-stream network practical value for industry. The "Failure Cases" section of the paper also presents several failure cases of skeleton-based action recognition; in the future, RGB images could be combined with skeleton data to compensate for the limitations of skeleton information, further improving recognition accuracy and advancing the industrial application of this research.
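
To make the two vector features above concrete, the following is a minimal NumPy sketch: intra-frame vectors connect each joint to its parent joint within one frame, and inter-frame vectors are per-joint displacements between consecutive frames. The (T, J, 3) layout, the 25-joint parent table, and the zero padding of the first frame are illustrative assumptions, not the authors' exact preprocessing.

```python
import numpy as np

# Hypothetical parent table for a 25-joint, NTU RGB+D-style skeleton:
# PARENT[j] is the joint that joint j hangs from (the root maps to itself).
PARENT = np.array([0, 0, 1, 2, 1, 4, 5, 6, 1, 8, 9, 10,
                   0, 12, 13, 14, 0, 16, 17, 18, 2, 7, 7, 11, 11])

def intra_frame_vectors(seq):
    """Spatial (intra-frame) vectors: each joint minus its parent joint,
    computed independently in every frame. seq has shape (T, J, 3)."""
    return seq - seq[:, PARENT, :]

def inter_frame_vectors(seq):
    """Temporal (inter-frame) vectors: displacement of every joint between
    consecutive frames, zero-padded so the output keeps all T frames."""
    disp = seq[1:] - seq[:-1]                        # (T-1, J, 3)
    return np.concatenate([np.zeros_like(seq[:1]), disp], axis=0)

T, J = 64, 25
seq = np.random.rand(T, J, 3).astype(np.float32)     # toy skeleton sequence
spatial = intra_frame_vectors(seq)                   # (64, 25, 3)
temporal = inter_frame_vectors(seq)                  # (64, 25, 3)
```

Unlike joint-pair distances or angles, these vectors keep both magnitude and direction, which is why small posture changes remain visible in the features.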


Abstract: With the growing popularity of somatosensory interaction devices, human action recognition is becoming attractive in many application scenarios. Skeleton-based action recognition is effective because the skeleton can represent the position and the structure of key points of the human body. In this paper, we leverage spatiotemporal vectors between skeleton sequences as the input feature representation of the network, which is more sensitive to changes of the human skeleton compared with representations based on distance and angle features. In addition, we redesign residual blocks that have different strides at different depths of the network to improve the ability of temporal convolutional networks (TCNs) to process actions with long-term temporal dependencies. In this work, we propose the two-stream temporal convolutional network (TS-TCN), which takes full advantage of the inter-frame vector feature and the intra-frame vector feature of skeleton sequences in the spatiotemporal representations. The framework can integrate different feature representations of skeleton sequences so that the two feature representations make up for each other's shortcomings. A fusion loss function is used to supervise the training parameters of the two branch networks. Experiments on public datasets show that our network achieves superior performance and attains an improvement of 1.2% over the recent GCN-based method BGC-LSTM [42] on the NTU RGB+D dataset.
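
The two architectural ideas in the abstract, residual blocks whose temporal span varies with network depth and a two-stream network trained with a fused loss, can be sketched as below in TensorFlow/Keras (the paper's experiments are built on TensorFlow [40]). The channel widths, kernel sizes, strides, and the simple averaging fusion with equal loss weights are assumptions for illustration, not the published TS-TCN configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, kernel_size, stride):
    """One temporal residual block: two 1D convolutions along the time axis,
    with a projected shortcut whenever the stride or channel count changes."""
    shortcut = x
    y = layers.Conv1D(filters, kernel_size, strides=stride, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv1D(filters, kernel_size, padding='same')(y)
    y = layers.BatchNormalization()(y)
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv1D(filters, 1, strides=stride,
                                 padding='same')(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))

def branch(inputs, num_classes):
    """A single TCN branch: stacked residual blocks whose temporal span grows
    with depth (larger stride = wider receptive field per block), followed by
    global pooling and a softmax classifier."""
    x = inputs
    for filters, kernel_size, stride in [(64, 9, 1), (128, 9, 2), (256, 9, 2)]:
        x = residual_block(x, filters, kernel_size, stride)
    x = layers.GlobalAveragePooling1D()(x)
    return layers.Dense(num_classes, activation='softmax')(x)

T, J, C, num_classes = 64, 25, 3, 60
inter = tf.keras.Input(shape=(T, J * C))    # inter-frame vector stream
intra = tf.keras.Input(shape=(T, J * C))    # intra-frame vector stream
p_inter, p_intra = branch(inter, num_classes), branch(intra, num_classes)
p_fused = layers.Average()([p_inter, p_intra])   # late fusion of the streams
model = tf.keras.Model([inter, intra], [p_inter, p_intra, p_fused])
# Fused objective: cross-entropy on each branch plus the fused prediction,
# so both streams are supervised jointly rather than trained in isolation.
model.compile(optimizer='adam',
              loss=['sparse_categorical_crossentropy'] * 3,
              loss_weights=[1.0, 1.0, 1.0])
```

Supervising the fused output alongside the per-stream outputs is one simple way to realize a fused loss: each branch still learns from its own feature stream, while the shared objective encourages the two streams to correct each other's errors.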

Key words: skeleton, action recognition, temporal convolutional network (TCN), vector feature representation, neural network

[1] Aggarwal J K, Xia L. Human activity recognition from 3D data: A review. Pattern Recognition Letters, 2014, 48:70-80.
[2] Weinland D, Ronfard R, Boyer E. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 2011, 115(2):224-241.
[3] Han F, Reily B, Hoff W, Zhang H. Space-time representation of people based on 3D skeletal data: A review. Computer Vision and Image Understanding, 2017, 158:85-105.
[4] Liu H, Liu B, Zhang H, Li L, Qin X, Zhang G. Crowd evacuation simulation approach based on navigation knowledge and two-layer control mechanism. Information Sciences, 2018, 436/437:247-267.
[5] Turaga P, Chellappa R, Subrahmanian V S. Machine recognition of human activities: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 2008, 18(11):1473-1488.
[6] Herath S, Harandi M, Porikli F. Going deeper into action recognition: A survey. Image and Vision Computing, 2017, 60:4-21.
[7] Hou J H, Chau L P, Thalmann N M, He Y. Compressing 3-D human motions via keyframe-based geometry videos. IEEE Transactions on Circuits and Systems for Video Technology, 2014, 25(1):51-62.
[8] Sermanet P, Lynch C, Hsu J, Levine S. Time-contrastive networks: Self-supervised learning from multi-view observation. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, July 2017, pp.486-487.
[9] Shotton J, Sharp T, Kipman A, Fitzgibbon A, Finocchio M, Blake A, Cook M, Moore R. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 2013, 56(1):116-124.
[10] Li S, Fang Z, Song W, Hao A, Qin H. Bidirectional optimization coupled lightweight networks for efficient and robust multi-person 2D pose estimation. Journal of Computer Science and Technology, 2019, 34(3):522-536.
[11] Shahroudy A, Liu J, Ng T T, Wang G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp.1010-1019.
[12] Zhu F, Shao L, Xie J, Fang Y. From handcrafted to learned representations for human action recognition:A survey. Image and Vision Computing, 2016, 55:42-52.
[13] Huang Z W, Wan C, Probst T, Van Gool L. Deep learning on Lie groups for skeleton-based action recognition. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.1243-1252.
[14] Ke Q, An S, Bennamoun M, Sohel F, Boussaid F. SkeletonNet: Mining deep part features for 3-D action recognition. IEEE Signal Processing Letters, 2017, 24(6):731-735.
[15] Weng J, Weng C, Yuan J, Liu Z. Discriminative spatiotemporal pattern discovery for 3D action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 29(4):1077-1089.
[16] Liu J, Shahroudy A, Xu D, Kot A C, Wang G. Skeleton-based action recognition using spatiotemporal LSTM network with trust gates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(12):3007-3021.
[17] Lee I, Kim D, Kang S, Lee S. Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks. In Proc. the 2017 IEEE International Conference on Computer Vision, October 2017, pp.1012-1020.
[18] Zhang P, Xue J, Lan C, Zeng W, Gao Z, Zheng N. Adding attentiveness to the neurons in recurrent neural networks. In Proc. the 15th European Conference on Computer Vision, September 2018, pp.136-152.
[19] Meng F, Liu H, Liang Y, Tu J, Liu M. Sample fusion network:An end-to-end data augmentation network for skeleton-based human action recognition. IEEE Transactions on Image Processing, 2019, 28(11):5281-5295.
[20] Yan S, Xiong Y, Lin D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proc. the 32nd AAAI Conference on Artificial Intelligence, February 2018, pp.7444-7452.
[21] Shi L, Zhang Y, Cheng J, Lu H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proc. the 2019 IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.12026-12035.
[22] Li M, Chen S, Chen X, Zhang Y, Wang Y, Tian Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proc. the 2019 IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.3595-3603.
[23] Si C, Chen W, Wang W, Wang L, Tan T. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In Proc. the 2019 IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.1227-1236.
[24] Lea C, Flynn M D, Vidal R, Reiter A, Hager G D. Temporal convolutional networks for action segmentation and detection. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.1003-1012.
[25] Kim T S, Reiter A. Interpretable 3D human action analysis with temporal convolutional networks. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, July 2017, pp.1623-1631.
[26] Liu J, Shahroudy A, Perez M, Wang G, Duan L Y, Kot A C. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. arXiv:1905.04757, 2019. https://arxiv.org/pdf/1905.04757.pdf, Jan. 2020.
[27] Jiang W, Nie X, Xia Y, Wu Y, Zhu S C. Cross-view action modeling, learning and recognition. In Proc. the 2014 IEEE Conference on Computer Vision and Pattern Recognition, June 2014, pp.2649-2656.
[28] Xia L, Chen C C, Aggarwal J K. View invariant human action recognition using histograms of 3D joints. In Proc. the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2012, pp.20-27.
[29] Liu Z, Zhang C, Tian Y. 3D-based deep convolutional neural network for action recognition with depth sequences. Image and Vision Computing, 2016, 55:93-100.
[30] Wang P, Li W, Wan J, Ogunbona P, Liu X. Cooperative training of deep aggregation networks for RGB-D action recognition. In Proc. the 32nd AAAI Conference on Artificial Intelligence, February 2018, pp.7404-7411.
[31] Jiang W, Liu Z, Wu Y, Yuan J. Learning actionlet ensemble for 3D human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(5):914-927.
[32] Zhang S, Liu X, Xiao J. On geometric features for skeletonbased action recognition using multilayer LSTM networks. In Proc. the 2017 IEEE Winter Conference on Applications of Computer Vision, March 2017, pp.148-157.
[33] Zhang P, Lan C, Xing J, Zeng W, Xue J, Zheng N. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proc. the 2017 IEEE International Conference on Computer Vision, October 2017, pp.2136-2145.
[34] Ke Q, Bennamoun M, An S, Sohel F, Boussaïd F. A new representation of skeleton sequences for 3D action recognition. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.4570-4579.
[35] Ghorbel E, Boonaert J, Boutteau R, Lecoeuche S, Savatier X. An extension of kernel learning methods using a modified Log-Euclidean distance for fast and accurate skeleton-based human action recognition. Computer Vision and Image Understanding, 2018, 175:32-43.
[36] Yuan J, Liu Z, Wu Y. Discriminative subvolume search for efficient action detection. In Proc. the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2009, pp.2442-2449.
[37] Liu M, Shi Y, Zheng L, Xu K, Huang H, Manocha D. Recurrent 3D attentional networks for end-to-end active object recognition. Computational Visual Media, 2019, 5(1):91-104.
[38] Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. the 32nd International Conference on Machine Learning, July 2015, pp.448-456.
[39] He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proc. the 2015 IEEE International Conference on Computer Vision, December 2015, pp.1026-1034.
[40] Abadi M, Agarwal A, Barham P et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016. https://arxiv.org/abs/1603.04467, Jan. 2020.
[41] Ofli F, Chaudhry R, Kurillo G, Vidal R, Bajcsy R. Sequence of the most informative joints (SMIJ): A new representation for human skeletal action recognition. In Proc. the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, June 2012, pp.8-13.
[42] Zhao R, Wang K, Su H, Ji Q. Bayesian graph convolution LSTM for skeleton based action recognition. In Proc. the 2019 IEEE International Conference on Computer Vision, October 2019, pp.6881-6891.
[43] Yu Z, Chen W, Guo G. Fusing spatiotemporal features and joints for 3D action recognition. In Proc. the 2013 IEEE Conference on Computer Vision and Pattern Recognition, June 2013, pp.486-491.