Jin-Gong Jia, Yuan-Feng Zhou, Xing-Wei Hao, Feng Li, Christian Desrosiers, Cai-Ming Zhang. Two-Stream Temporal Convolutional Networks for Skeleton-Based Human Action Recognition[J]. Journal of Computer Science and Technology, 2020, 35(3): 538-550. DOI: 10.1007/s11390-020-0405-6

Two-Stream Temporal Convolutional Networks for Skeleton-Based Human Action Recognition

Funds: The work was supported by the National Natural Science Foundation (NSFC)-Zhejiang Joint Fund of the Integration of Informatization and Industrialization of China under Grant Nos. U1909210 and U1609218, the National Natural Science Foundation of China under Grant No. 61772312, and the Key Research and Development Project of Shandong Province of China under Grant No. 2017GGX10110.
  • Author Bio:

    Jin-Gong Jia received his B.S. degree in software engineering from the School of Information and Electrical Engineering, Ludong University, Yantai, in 2018. He is currently pursuing his Master's degree in software engineering at the School of Software, Shandong University, Jinan. His research interests include computer vision, human action recognition, and human pose estimation.

  • Corresponding author:

    Yuan-Feng Zhou E-mail: yfzhou@sdu.edu.cn

  • Received Date: February 28, 2020
  • Revised Date: April 04, 2020
  • Published Date: May 27, 2020
  • With the growing popularity of somatosensory interaction devices, human action recognition is becoming attractive in many application scenarios. Skeleton-based action recognition is effective because the skeleton represents the positions and the structure of the key points of the human body. In this paper, we use spatiotemporal vectors between skeleton sequences as the input feature representation of the network; this representation is more sensitive to changes in the human skeleton than representations based on distance and angle features. In addition, we redesign residual blocks with different strides along the depth of the network to improve the ability of temporal convolutional networks (TCNs) to process actions with long-term temporal dependencies. We propose two-stream temporal convolutional networks (TSTCNs) that take full advantage of both the inter-frame vector feature and the intra-frame vector feature of skeleton sequences in the spatiotemporal representations. The framework integrates the two feature representations so that each can compensate for the other's shortcomings. A fusion loss function supervises the training of the parameters of the two branch networks. Experiments on public datasets show that our network achieves superior performance, with an improvement of 1.2% over the recent GCN-based method (BGC-LSTM) on the NTU RGB+D dataset.
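    The abstract does not define the two vector features precisely, so the following is only a minimal sketch under stated assumptions: a skeleton sequence is taken to be a list of frames, each holding one 3D position per joint; the inter-frame feature is assumed to be the displacement of each joint between consecutive frames, and the intra-frame feature is assumed to be the vector between selected joint pairs within one frame. The function names and the `pairs` argument are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Frame = Sequence[Vec3]  # one 3D position per joint (assumed layout)


def sub(b: Vec3, a: Vec3) -> Vec3:
    """Return the vector pointing from a to b."""
    return (b[0] - a[0], b[1] - a[1], b[2] - a[2])


def inter_frame_vectors(seq: Sequence[Frame]) -> List[List[Vec3]]:
    """Displacement of every joint between consecutive frames.

    Assumed definition: for T frames, returns T-1 lists of per-joint vectors.
    """
    return [
        [sub(nxt[j], cur[j]) for j in range(len(cur))]
        for cur, nxt in zip(seq, seq[1:])
    ]


def intra_frame_vectors(
    seq: Sequence[Frame], pairs: Sequence[Tuple[int, int]]
) -> List[List[Vec3]]:
    """Vectors between joint pairs (a, b) inside each frame.

    `pairs` is a hypothetical list of joint-index pairs, e.g. bone endpoints.
    """
    return [[sub(frame[b], frame[a]) for a, b in pairs] for frame in seq]


# Toy sequence: 2 frames, 2 joints.
toy_seq = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 1.0, 0.0), (1.0, 2.0, 0.0)],
]
toy_inter = inter_frame_vectors(toy_seq)
toy_intra = intra_frame_vectors(toy_seq, pairs=[(0, 1)])
```

    In a two-stream setup of the kind the abstract describes, each of these two feature sequences would feed its own temporal convolutional branch; how the branches are fused and trained is not specified here and is left out of the sketch.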