FürElise: Capturing and Physically Synthesizing Hand Motions of Piano Performance
October 8, 2024
Authors: Ruocheng Wang, Pei Xu, Haochen Shi, Elizabeth Schumann, C. Karen Liu
cs.AI
Abstract
Piano playing requires agile, precise, and coordinated hand control that
stretches the limits of dexterity. Hand motion models with the sophistication
to accurately recreate piano playing have a wide range of applications in
character animation, embodied AI, biomechanics, and VR/AR. In this paper, we
construct a first-of-its-kind large-scale dataset that contains approximately
10 hours of 3D hand motion and audio from 15 elite-level pianists playing 153
pieces of classical music. To capture natural performances, we designed a
markerless setup in which motions are reconstructed from multi-view videos
using state-of-the-art pose estimation models. The motion data is further
refined via inverse kinematics using the high-resolution MIDI key-pressing data
obtained from sensors in a specialized Yamaha Disklavier piano. Leveraging the
collected dataset, we developed a pipeline that can synthesize
physically-plausible hand motions for musical scores outside of the dataset.
Our approach employs a combination of imitation learning and reinforcement
learning to obtain policies for physics-based bimanual control involving the
interaction between hands and piano keys. To solve the sampling efficiency
problem with the large motion dataset, we use a diffusion model to generate
natural reference motions, which provide high-level trajectory and fingering
(finger order and placement) information. However, the generated reference
motion alone does not provide sufficient accuracy for piano performance
modeling. We therefore further augment the data, using musical similarity to
retrieve similar motions from the captured dataset and boost the precision of
the RL policy. With the proposed method, our model generates natural, dexterous
motions that generalize to music from outside the training dataset.
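The retrieval-by-musical-similarity step described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's method: the function names (`edit_distance`, `retrieve_similar`) and the choice of Levenshtein distance over MIDI pitch sequences are illustrative stand-ins for whatever similarity metric the authors actually use.

```python
# Toy sketch (assumed, not from the paper): retrieve captured motion clips
# whose note sequence is most similar to a query passage, using Levenshtein
# edit distance over MIDI pitch numbers as the musical-similarity measure.

def edit_distance(a, b):
    """Levenshtein distance between two note sequences, rolling-array DP."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))          # dp[j] = distance(a[:0], b[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i       # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,               # delete a[i-1]
                        dp[j - 1] + 1,           # insert b[j-1]
                        prev + (a[i - 1] != b[j - 1]))  # substitute/match
            prev = cur
    return dp[n]

def retrieve_similar(query_notes, dataset, k=3):
    """dataset: list of (note_sequence, motion_clip_id) pairs.
    Returns the k clip ids whose music is closest to the query."""
    scored = sorted(dataset, key=lambda item: edit_distance(query_notes, item[0]))
    return [clip for _, clip in scored[:k]]
```

In a pipeline like the one the abstract describes, the retrieved clips would then serve as additional reference motions when training the RL tracking policy for the query score.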