

Multi-Scale Adaptive Large Kernel Graph Convolutional Network for Skeleton-Based Action Recognition

  • Abstract:
    Background Action recognition has become an important task in computer vision, aiming to identify human actions from video. Skeleton data represent the human body as a set of two-dimensional (2D) or three-dimensional (3D) keypoint coordinates, which simplifies computation and improves robustness to occlusion and background clutter; such data are therefore widely used in action recognition. Graph convolutional networks (GCNs) have become the mainstream approach to skeleton-based action recognition, modeling the skeleton as a spatio-temporal graph and stacking multiple graph-convolution layers to capture long-distance dependencies among nodes.
    Objective Conventional GCN methods typically capture long-distance dependencies among nodes by stacking many graph-convolution layers, which not only increases the computational burden but also risks over-smoothing, causing key local action features to be overlooked. To address this, this paper proposes a new multi-scale adaptive large kernel graph convolutional network that effectively aggregates local and global spatio-temporal correlations while maintaining computational efficiency, overcoming the limitations of conventional GCNs in skeleton-based action recognition.
    Methods As shown in Fig. 1, this paper proposes a multi-scale adaptive large kernel graph convolutional network, MSLK-GCN, which effectively aggregates local and global spatio-temporal correlations while maintaining computational efficiency. Its core components are a multi-scale large kernel graph convolution (LKGC) module, a multi-channel adaptive graph convolution (MAGC) module, and a multi-scale temporal self-attention (MSTC) module. The LKGC module uses a large convolution kernel and a gating mechanism to adaptively focus on motion regions, effectively capturing long-distance dependencies within skeleton sequences. Meanwhile, the MAGC module dynamically learns the relationships among joints by adjusting the connection weights between nodes. To further strengthen the model's ability to capture temporal dynamics, the MSTC module aggregates temporal information by combining efficient channel attention with multi-scale convolution. In addition, the paper adopts a multi-stream fusion strategy to make full use of skeleton data of different modalities, including bone, joint, bone motion, and joint motion.
    Results As shown in Tables 2 and 3, the authors conducted experiments on four benchmarks across three public datasets, NTU-RGB+D 60, NTU-RGB+D 120, and Northwestern-UCLA. The results show that the proposed MSLK-GCN further improves action-recognition performance and surpasses existing models. The authors also validated the classification performance of MSLK-GCN through visualization experiments, which show that it effectively aggregates local and global spatio-temporal correlations while avoiding the high computational burden and over-smoothing risk inherent in conventional GCNs for skeleton-based action recognition.
    Conclusion The proposed multi-scale adaptive large kernel graph convolutional network, MSLK-GCN, effectively aggregates local and global spatio-temporal correlations while maintaining computational efficiency. Visualization analysis further confirms that the model mitigates the risk of over-smoothing and classifies similar actions effectively. Future work will explore more efficient graph-convolution operations and lightweight model designs, and will further investigate the complex logical relationships in two-person interactions.
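The gating idea in the LKGC module described above can be illustrated with a minimal one-dimensional sketch: a large kernel gathers wide temporal context, and a sigmoid gate derived from that context re-weights each frame so that active motion regions are emphasized. The paper describes this module only at a high level, so the shapes, kernel size, and function names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def large_kernel_gate(x, kernel):
    """Gated large-kernel temporal filtering over one channel (illustrative).

    x      : (T,)  per-frame activation of one joint/channel
    kernel : (K,)  large 1-D kernel, K odd (e.g. K = 15)
    """
    ctx = np.convolve(x, kernel, mode="same")  # wide temporal context
    gate = sigmoid(ctx)                        # soft focus in (0, 1)
    return x * gate                            # suppress inactive frames

T, K = 32, 15
rng = np.random.default_rng(1)
x = rng.standard_normal(T)
kernel = np.ones(K) / K                        # simple averaging kernel
y = large_kernel_gate(x, kernel)
print(y.shape)  # (32,)
```

Because the gate lies in (0, 1), the output magnitude at every frame is bounded by the input magnitude; the large kernel only decides *where* attention concentrates.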


    Abstract: Graph convolutional networks (GCNs) have become a dominant approach to skeleton-based action recognition. Although GCNs have made significant progress in modeling skeletons as spatio-temporal graphs, they often require stacking multiple graph-convolution layers to effectively capture long-distance relationships among nodes. This stacking not only increases the computational burden but also raises the risk of over-smoothing, which can lead to the neglect of crucial local action features. To address this issue, we propose a novel multi-scale adaptive large kernel graph convolutional network (MSLK-GCN) that effectively aggregates local and global spatio-temporal correlations while maintaining computational efficiency. The core components of the network include two multi-scale large kernel graph convolution (LKGC) modules, a multi-channel adaptive graph convolution (MAGC) module, and a multi-scale temporal self-attention convolution (MSTC) module. The LKGC module adaptively focuses on active motion regions by utilizing a large convolution kernel and a gating mechanism, effectively capturing long-distance dependencies within the skeleton sequence. Meanwhile, the MAGC module dynamically learns relationships between different joints by adjusting connection weights between nodes. To further enhance the ability to capture temporal dynamics, the MSTC module effectively aggregates temporal information by integrating Efficient Channel Attention (ECA) with multi-scale convolution. In addition, we use a multi-stream fusion strategy to make full use of different skeleton modalities, including bone, joint, joint motion, and bone motion. Extensive experiments on three datasets of varying scale, i.e., NTU-60, NTU-120, and NW-UCLA, demonstrate that our MSLK-GCN achieves state-of-the-art performance with fewer parameters.
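The MAGC module's core idea, adjusting connection weights between nodes on top of the fixed skeleton graph, can be sketched as a single graph-convolution step with a learnable adjacency offset. The abstract does not specify shapes or parameterization, so the shapes, initialization, and function names below are illustrative assumptions.

```python
import numpy as np

def adaptive_gcn_layer(x, a_skeleton, a_adaptive, weight):
    """One adaptive graph-convolution step (illustrative sketch).

    x          : (V, C_in)     node features (V joints, C_in channels)
    a_skeleton : (V, V)        fixed skeleton adjacency with self-loops
    a_adaptive : (V, V)        learnable offsets on connection weights
    weight     : (C_in, C_out) feature transform
    """
    a = a_skeleton + a_adaptive          # data-driven edge re-weighting
    # Row-normalize so aggregation is a weighted average over neighbors.
    a = a / np.clip(a.sum(axis=1, keepdims=True), 1e-6, None)
    return a @ x @ weight                # aggregate neighbors, then transform

V, C_in, C_out = 5, 4, 8                 # a toy 5-joint chain
rng = np.random.default_rng(0)
x = rng.standard_normal((V, C_in))
a_skel = np.eye(V) + np.diag(np.ones(V - 1), 1) + np.diag(np.ones(V - 1), -1)
a_adapt = np.zeros((V, V))               # neutral at init; learned in training
w = rng.standard_normal((C_in, C_out))
out = adaptive_gcn_layer(x, a_skel, a_adapt, w)
print(out.shape)  # (5, 8)
```

In training, `a_adaptive` would be updated by backpropagation, letting the model strengthen or weaken edges beyond the physical skeleton.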
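The multi-stream fusion strategy mentioned above is commonly realized as late fusion: each modality stream (joint, bone, joint motion, bone motion) produces class scores, which are combined before the final prediction. The abstract does not state the exact fusion rule, so the weighted-sum scheme and names below are an assumption for illustration.

```python
import numpy as np

def fuse_streams(scores, weights=None):
    """Late fusion of per-stream class scores (illustrative sketch).

    scores  : list of (N, num_classes) arrays, one per modality stream
              (e.g. joint, bone, joint motion, bone motion)
    weights : optional per-stream fusion weights; defaults to equal weights
    """
    if weights is None:
        weights = [1.0] * len(scores)
    fused = sum(w * s for w, s in zip(weights, scores))  # weighted score sum
    return fused.argmax(axis=1)                          # final class labels

rng = np.random.default_rng(2)
num_samples, num_classes = 4, 10
streams = [rng.standard_normal((num_samples, num_classes)) for _ in range(4)]
pred = fuse_streams(streams)
print(pred.shape)  # (4,)
```

Equal weights are the simplest choice; in practice the per-stream weights are often tuned on a validation set.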

