
Vision-Based Sign Language Translation via a Skeleton-Aware Neural Network

  • Abstract:
    Background: Sign language is expressed through human body movements, such as those of the arms, hands, and fingers, and is the primary means of communication for hearing-impaired people. To build a communication bridge with hearing-impaired people, sign language recognition and sign language translation have attracted growing attention. Sign language recognition is divided into isolated and continuous recognition: isolated sign language recognition aims to recognize a single sign as its corresponding word, while continuous sign language recognition aims to recognize continuous signs as the corresponding sequence of sign words. Because the grammar of sign language differs from that of natural language, the output of continuous sign language recognition does not follow natural-language grammar. Sign language translation, in contrast, aims to translate continuous signs into natural-language text that satisfies natural-language grammar. Existing sign language recognition and translation methods usually focus on extracting local-region or full-frame features from video frames while ignoring skeleton features; yet skeleton information reflects human posture and motion, and can therefore provide important cues for distinguishing sign actions, so skeleton features can be used to improve the performance of sign language models.
    Objective: The goal of this work is to use skeleton features to enhance visual feature extraction in sign language recognition and translation models, thereby improving their performance.
    Methods: We propose a skeleton-aware network for sign language recognition and translation. First, to obtain skeleton information, we design a self-contained network branch for skeleton extraction. To let skeleton features effectively guide the model toward sign-relevant features, we concatenate the skeleton channel and the RGB channels of each frame for feature extraction (a minimal sketch of this channel concatenation follows the abstract). To distinguish the importance of different video clips, we construct a skeleton-based graph convolutional network that assigns a different weight to each clip. Finally, we provide an end-to-end sign language translation model and a two-stage sign language translation model. In addition, we provide a sign language translation solution on smartphones, deploying the proposed model on a smartphone to facilitate communication between hearing-impaired people and normal people.
    Results: We conducted experiments on three public datasets. The results show that the proposed model further improves sign language translation performance and outperforms existing sign language models. We also carried out case studies in real-world scenarios, and the results show that our model is highly robust in real scenarios.
    Conclusions: The proposed model improves sign language recognition and translation performance by exploiting the skeleton modality. Moreover, visualization analysis shows that the skeleton-aware model focuses more effectively on sign-relevant features, and that the skeleton-based graph convolutional network effectively distinguishes the importance of each video clip, thereby attending to the dynamic characteristics of gestures.
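The following is a minimal, illustrative PyTorch sketch of the channel concatenation described in the Methods item above: a per-frame skeleton heatmap is stacked with the RGB channels before spatial feature extraction. The module name, backbone layout, and tensor shapes are assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class SkeletonRGBFusion(nn.Module):
    """Concatenate a per-frame skeleton heatmap with the RGB channels
    before spatial feature extraction (illustrative sketch only)."""

    def __init__(self, feat_dim=512):
        super().__init__()
        # First conv takes 4 input channels: R, G, B + 1 skeleton heatmap channel.
        self.backbone = nn.Sequential(
            nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, rgb, skeleton):
        # rgb:      (T, 3, H, W) video frames
        # skeleton: (T, 1, H, W) skeleton heatmaps from the skeleton branch
        x = torch.cat([rgb, skeleton], dim=1)  # (T, 4, H, W)
        return self.backbone(x)                # (T, feat_dim) per-frame features
```

In practice the backbone would be a full CNN (for example, a ResNet whose first convolution is widened to four input channels); the single convolution here only marks where the extra skeleton channel enters the network.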

     

    Abstract: Sign languages are mainly expressed by human actions, such as arm, hand, and finger motions. Thus a skeleton, which reflects human pose information, can provide an important cue for distinguishing signs (i.e., human actions) and can be used for sign language translation (SLT), which aims to translate sign language into spoken language. However, recent neural networks typically focus on extracting local-area or full-frame features while ignoring informative skeleton features. Therefore, this paper proposes a novel skeleton-aware neural network, SANet, for vision-based SLT. Specifically, to introduce the skeleton modality, we design a self-contained branch for skeleton extraction. To efficiently guide feature extraction from videos with skeletons, we concatenate the skeleton channel and RGB channels of each frame for feature extraction. To distinguish the importance of clips (i.e., segmented short videos), we construct a skeleton-based graph convolutional network (GCN) for feature scaling, i.e., giving an importance weight to each clip. Finally, to generate spoken language from features, we provide an end-to-end method and a two-stage method for SLT. In addition, based on SANet, we provide an SLT solution on the smartphone to benefit communication between hearing-impaired people and normal people. Extensive experiments on three public datasets and case studies in real scenarios demonstrate the effectiveness of our method, which outperforms existing methods.
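Below is a minimal sketch, under assumed shapes and names, of how a skeleton-based GCN could assign an importance weight to each clip and scale its visual feature; the actual SANet design may differ (e.g., in the number of graph layers and the joint adjacency used).

```python
import torch
import torch.nn as nn

class ClipImportanceGCN(nn.Module):
    """Score each clip from its skeleton joints with one graph-convolution
    layer, then scale the clip's visual feature by that score (sketch)."""

    def __init__(self, num_joints=25, in_dim=2, hidden=64):
        super().__init__()
        self.theta = nn.Linear(in_dim, hidden)  # per-joint transform
        self.score = nn.Linear(hidden, 1)       # pooled graph -> scalar weight

    def forward(self, joints, adj, clip_feat):
        # joints:    (N, V, C)  N clips, V joints, C coordinates (e.g. x, y)
        # adj:       (V, V)     normalized joint adjacency matrix
        # clip_feat: (N, D)     visual feature of each clip
        h = torch.relu(self.theta(adj @ joints))      # graph convolution over joints
        w = torch.sigmoid(self.score(h.mean(dim=1)))  # (N, 1) clip importance
        return clip_feat * w                          # scaled clip features
```

The sigmoid keeps each weight in (0, 1), so clips whose skeleton dynamics look uninformative are attenuated rather than discarded.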

     
