Self-Supervised Music Motion Synchronization Learning for Music-Driven Conducting Motion Generation

Fan Liu, De-Long Chen, Rui-Zhi Zhou, Sai Yang, Feng Xu

Citation: Fan Liu, De-Long Chen, Rui-Zhi Zhou, Sai Yang, Feng Xu. Self-Supervised Music Motion Synchronization Learning for Music-Driven Conducting Motion Generation[J]. Journal of Computer Science and Technology, 2022, 37(3): 539-558. DOI: 10.1007/s11390-022-2030-z. CSTR: 32374.14.s11390-022-2030-z


Funds: This work was partially funded by the Natural Science Foundation of Jiangsu Province of China under Grant No. BK20191298 and the National Natural Science Foundation of China under Grant No. 61902110.
More Information
    Author Bio:

    De-Long Chen received his B.S. degree in computer science from Hohai University, Nanjing, in 2021. He is currently a research assistant at Hohai University, Nanjing, and a research intern at MEGVII Technology, Beijing. His research interests include computer vision, music information retrieval, multimodal learning, unsupervised learning, and self-supervised learning.

    Corresponding author:

    De-Long Chen E-mail: chendelong@hhu.edu.cn

  • Structured Abstract: 1. Context: The intrinsic correlation between music and human body motion has long been widely studied. Recently, many researchers have successfully used deep learning models to generate dance motion or instrument-playing motion, but little attention has been paid to the motion of orchestral conductors. Conducting motion must simultaneously convey beat, articulation, musical emotion, and other information, whereas most existing methods rely on hand-crafted rules; the resulting motion is unnatural and can express only very simple semantics.
    2. Objective: This work aims to exploit the learning capability of deep neural networks for music-driven conducting motion generation, i.e., using music as the conditional control signal to generate conducting motion that is rhythmically synchronized with the music, semantically relevant, natural, and graceful.
    3. Method: A two-stage learning framework is proposed. In the first stage, a Music Motion Synchronization Network (M2S-Net), consisting of a music encoder and a motion encoder, is trained with self-supervised contrastive learning. In the second stage, the trained motion encoder compares the semantic similarity between generated and real motion to compute a synchronization loss, while a discriminator judges the realism of the generated motion to compute an adversarial loss; the two losses jointly train a Music Motion Synchronized Generative Adversarial Network (M2S-GAN). In addition, a large-scale conducting motion dataset, ConductorMotion100, is collected from online video platforms with object detection and pose estimation algorithms to provide reliable data support. (A minimal code sketch of the first training stage appears after this abstract.)
    4. Result & Findings: On ConductorMotion100, the proposed method surpasses several comparison methods in accuracy, diversity, and realism, both in quantitative comparisons over multiple evaluation metrics and in qualitative analyses based on motion visualization. The experiments also show that negative sampling strategies that perform well on audio-visual self-supervised tasks introduce many false negatives and are therefore unsuitable for music-conducting-motion data.
    5. Conclusions: This work presents the first deep-learning-based conducting motion generation algorithm and is the first to apply multimodal self-supervised contrastive learning to music-motion data. The proposed method generates accurate, diverse, and visually pleasing conducting motion. The two-stage learning process can potentially be generalized into a cross-modal conditional generation framework, and ConductorMotion100 can also be extended into a large-scale pre-training dataset for music information retrieval tasks such as beat detection.
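
To make the first stage concrete, the following is a minimal PyTorch sketch of self-supervised music-motion contrastive training. It is not the authors' implementation: a music encoder and a motion encoder map synchronized clips to a shared embedding space, and a symmetric InfoNCE-style loss pulls matching music/motion pairs together while treating the other pairs in the batch as negatives. The encoder architecture, feature dimensions, and temperature are illustrative assumptions.

# Minimal sketch of Stage 1 (M2S-Net-style contrastive pre-training); architectures,
# dimensions, and the temperature below are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClipEncoder(nn.Module):
    """Encodes a (batch, time, feature) clip into one L2-normalized embedding."""
    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, emb_dim, kernel_size=5, padding=2),
            nn.AdaptiveAvgPool1d(1),  # average over time
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.net(x.transpose(1, 2)).squeeze(-1)  # (batch, emb_dim)
        return F.normalize(z, dim=-1)


def sync_contrastive_loss(music_z: torch.Tensor, motion_z: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE: the i-th music clip should match the i-th motion clip;
    all other clips in the batch serve as negatives."""
    logits = music_z @ motion_z.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    music_encoder = ClipEncoder(in_dim=128)   # e.g. mel-spectrogram frames (assumed input)
    motion_encoder = ClipEncoder(in_dim=26)   # e.g. 13 upper-body keypoints x 2 coords (assumed input)
    music = torch.randn(8, 300, 128)          # 8 synchronized music/motion clip pairs
    motion = torch.randn(8, 300, 26)
    loss = sync_contrastive_loss(music_encoder(music), motion_encoder(motion))
    loss.backward()
    print(f"contrastive loss: {loss.item():.4f}")

The false-negative issue noted in the Result & Findings item above arises exactly in this kind of batch-wise negative sampling: clips drawn from elsewhere in the same performance may still be musically compatible with the anchor motion, so they are not safe negatives for conducting data.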
    Abstract: The correlation between music and human motion has attracted widespread research attention. Although recent studies have successfully generated motion for singers, dancers, and musicians, few have explored motion generation for orchestral conductors. The generation of music-driven conducting motion should consider not only the basic music beats, but also mid-level music structures, high-level music semantic expressions, and hints for different parts of the orchestra (strings, woodwinds, etc.). However, most existing conducting motion generation methods rely heavily on human-designed rules, which significantly limits the quality of the generated motion. Therefore, we propose a novel Music Motion Synchronized Generative Adversarial Network (M2S-GAN), which generates motion according to automatically learned music representations. More specifically, M2S-GAN is a cross-modal generative network comprising four components: 1) a music encoder that encodes the music signal; 2) a generator that generates conducting motion from the music codes; 3) a motion encoder that encodes the motion; and 4) a discriminator that differentiates real from generated motion. These four components respectively imitate four key aspects of human conductors: understanding music, interpreting music, precision, and elegance. The music and motion encoders are first jointly trained with a self-supervised contrastive loss, and thus help to enforce music-motion synchronization during the subsequent adversarial learning process. To verify the effectiveness of our method, we construct a large-scale dataset, named ConductorMotion100, which consists of an unprecedented 100 hours of conducting motion data. Extensive experiments on ConductorMotion100 demonstrate the effectiveness of M2S-GAN: our approach outperforms various comparison methods both quantitatively and qualitatively, and visualizations show that it generates plausible, diverse, and music-synchronized conducting motion.
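
The second-stage generator objective described in the abstract can likewise be sketched as follows. This is an assumed formulation, not the authors' code: a generator decodes music features into motion, a discriminator scores realism for the adversarial term, and the frozen Stage-1 motion encoder supplies a synchronization term by comparing embeddings of generated and ground-truth motion. The GRU generator, convolutional critic, and loss weight are stand-ins.

# Minimal sketch of the Stage-2 (M2S-GAN-style) generator objective; the GRU generator,
# convolutional critic, and loss weight are assumed stand-ins, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionGenerator(nn.Module):
    """Decodes per-frame music features into motion keypoints (assumed design)."""
    def __init__(self, music_dim: int = 128, motion_dim: int = 26, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(music_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, motion_dim)

    def forward(self, music: torch.Tensor) -> torch.Tensor:    # (batch, time, music_dim)
        h, _ = self.rnn(music)
        return self.head(h)                                     # (batch, time, motion_dim)


class MotionDiscriminator(nn.Module):
    """Temporal-convolution critic that scores motion realism (assumed design)."""
    def __init__(self, motion_dim: int = 26):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(motion_dim, 128, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, kernel_size=5, padding=2),
        )

    def forward(self, motion: torch.Tensor) -> torch.Tensor:    # (batch, time, motion_dim)
        return self.net(motion.transpose(1, 2)).mean(dim=(1, 2))


def generator_step(generator, discriminator, motion_encoder, music, real_motion,
                   sync_weight: float = 1.0):
    """Adversarial term + synchronization term from the frozen Stage-1 motion encoder."""
    fake_motion = generator(music)
    adv_loss = -discriminator(fake_motion).mean()                # critic-style realism score
    with torch.no_grad():                                        # frozen target embedding
        target_emb = motion_encoder(real_motion)
    sync_loss = F.mse_loss(motion_encoder(fake_motion), target_emb)
    return adv_loss + sync_weight * sync_loss, fake_motion

Here motion_encoder would be the frozen encoder from the Stage-1 sketch above. In a full training loop the discriminator is updated alternately on real and generated motion while the generator minimizes the combined objective; the exact adversarial formulation and loss weighting used in the paper are not reproduced here.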

Publication History
  • Received: 2021-11-18
  • Revised: 2022-02-27
  • Accepted: 2022-03-09
  • Published: 2022-05-29
