

Two-Stream Temporal Convolutional Networks for Skeleton-Based Human Action Recognition

  • Abstract: With the growing popularity of somatosensory interaction devices, human action recognition has become increasingly common in many application scenarios. Skeleton-based action recognition is simple and efficient because the skeleton represents the positions and structure of the key points of the human body. In this paper, we use spatiotemporal vector representations of skeleton sequences as the input features of the network; compared with representations based on distance and angle features, they are more sensitive to changes in the human skeleton. In addition, we redesign residual blocks with different temporal spans at different depths of the network to improve the ability of temporal convolutional networks (TCNs) to handle actions with long-term temporal dependencies. We propose a two-stream temporal convolutional network (TS-TCN) that takes full advantage of the inter-frame and intra-frame vector features of skeleton sequences in the spatiotemporal representation. The framework integrates different feature representations of skeleton sequences so that the two representations compensate for each other's shortcomings when recognizing actions. A fused loss function supervises the training of the two branch networks. Experiments on four large public datasets show that the proposed two-stream network achieves superior performance; in particular, it reaches 90.2% recognition accuracy on the test set of the most widely used NTU RGB+D dataset, further demonstrating the feasibility of the proposed design. Integrating inter-frame and intra-frame vector feature representations is currently one of the most effective ways to improve the accuracy of human action recognition, which gives the two-stream network practical value in industry. The "Failure Cases" section of this paper also presents failure cases of skeleton-based action recognition; in the future, RGB images could be combined with skeleton data to compensate for the limitations of skeleton information, further improving recognition accuracy and advancing the industrial application of this research.
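The abstract refers to inter-frame and intra-frame vector features without spelling out their construction. A minimal NumPy sketch of one plausible construction is given below; the (T, J, 3) sequence layout and the bone list are illustrative assumptions, not the authors' exact scheme.

```python
import numpy as np

def inter_frame_vectors(seq):
    """Temporal displacement of each joint between consecutive frames.

    seq: float array of shape (T, J, 3) -- T frames, J joints, xyz coordinates.
    Returns an array of shape (T - 1, J, 3).
    """
    return seq[1:] - seq[:-1]

def intra_frame_vectors(seq, bones):
    """Spatial vectors between connected joints within each frame.

    bones: list of (parent, child) joint-index pairs defining the skeleton.
    Returns an array of shape (T, len(bones), 3).
    """
    parents = np.array([p for p, _ in bones])
    children = np.array([c for _, c in bones])
    return seq[:, children] - seq[:, parents]

# Toy example: a random 64-frame sequence with 25 joints (the NTU RGB+D joint count)
seq = np.random.randn(64, 25, 3).astype(np.float32)
bones = [(0, 1), (1, 20), (20, 2), (2, 3)]   # illustrative subset, not the full skeleton
motion = inter_frame_vectors(seq)            # (63, 25, 3): sensitive to movement over time
structure = intra_frame_vectors(seq, bones)  # (64, 4, 3): encodes pose within each frame
```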

     

    Abstract: With the growing popularity of somatosensory interaction devices, human action recognition is becoming attractive in many application scenarios. Skeleton-based action recognition is effective because the skeleton can represent the position and the structure of the key points of the human body. In this paper, we leverage spatiotemporal vectors between skeleton sequences as the input feature representation of the network, which is more sensitive to changes in the human skeleton than representations based on distance and angle features. In addition, we redesign residual blocks with different temporal strides at different depths of the network to improve the ability of temporal convolutional networks (TCNs) to process actions with long-term temporal dependencies. In this work, we propose a two-stream temporal convolutional network (TS-TCN) that takes full advantage of the inter-frame vector feature and the intra-frame vector feature of skeleton sequences in the spatiotemporal representations. The framework can integrate different feature representations of skeleton sequences so that the two feature representations can make up for each other's shortcomings. A fused loss function is used to supervise the training parameters of the two branch networks. Experiments on public datasets show that our network achieves superior performance and attains an improvement of 1.2% over the recent GCN-based method (BGC-LSTM) on the NTU RGB+D dataset.
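To make the two-stream arrangement concrete, the following PyTorch sketch stacks residual blocks whose temporal stride grows with depth and fuses the two branches through a weighted cross-entropy loss. The channel widths, kernel size, strides, and the equal loss weighting are assumptions for illustration, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class ResidualTCNBlock(nn.Module):
    """1-D temporal convolution with a residual shortcut; stride sets the temporal span."""
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=9, stride=stride, padding=4),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(inplace=True),
        )
        # 1x1 shortcut so the residual sum matches in channels and temporal length
        self.skip = nn.Conv1d(in_ch, out_ch, kernel_size=1, stride=stride)

    def forward(self, x):
        return self.conv(x) + self.skip(x)

class StreamTCN(nn.Module):
    """One branch: residual blocks whose temporal stride increases with depth."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.blocks = nn.Sequential(
            ResidualTCNBlock(in_ch, 64, stride=1),
            ResidualTCNBlock(64, 128, stride=2),
            ResidualTCNBlock(128, 256, stride=2),
        )
        self.head = nn.Linear(256, num_classes)

    def forward(self, x):                  # x: (batch, channels, frames)
        h = self.blocks(x).mean(dim=2)     # global average pooling over time
        return self.head(h)

def fused_loss(logits_inter, logits_intra, labels, alpha=0.5):
    """Weighted sum of the two branches' cross-entropy losses (assumed form of the fusion)."""
    ce = nn.functional.cross_entropy
    return alpha * ce(logits_inter, labels) + (1 - alpha) * ce(logits_intra, labels)

# Toy forward/backward pass: 25 joints x 3 coords per frame, 60 classes as in NTU RGB+D
inter_stream = StreamTCN(in_ch=75, num_classes=60)   # inter-frame (motion) branch
intra_stream = StreamTCN(in_ch=72, num_classes=60)   # intra-frame (24 bones x 3) branch
labels = torch.randint(0, 60, (8,))
loss = fused_loss(inter_stream(torch.randn(8, 75, 63)),
                  intra_stream(torch.randn(8, 72, 64)), labels)
loss.backward()
```

At test time, the class scores of the two branches can simply be summed or averaged before taking the argmax, a common fusion choice for two-stream models.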

     
