FürElise: Capturing and Physically Synthesizing Hand Motions of Piano Performance
October 8, 2024
Authors: Ruocheng Wang, Pei Xu, Haochen Shi, Elizabeth Schumann, C. Karen Liu
cs.AI
Abstract
Piano playing requires agile, precise, and coordinated hand control that
stretches the limits of dexterity. Hand motion models with the sophistication
to accurately recreate piano playing have a wide range of applications in
character animation, embodied AI, biomechanics, and VR/AR. In this paper, we
construct a first-of-its-kind large-scale dataset that contains approximately
10 hours of 3D hand motion and audio from 15 elite-level pianists playing 153
pieces of classical music. To capture natural performances, we designed a
markerless setup in which motions are reconstructed from multi-view videos
using state-of-the-art pose estimation models. The motion data is further
refined via inverse kinematics using the high-resolution MIDI key-pressing data
obtained from sensors in a specialized Yamaha Disklavier piano. Leveraging the
collected dataset, we developed a pipeline that can synthesize
physically-plausible hand motions for musical scores outside of the dataset.
Our approach employs a combination of imitation learning and reinforcement
learning to obtain policies for physics-based bimanual control involving the
interaction between hands and piano keys. To solve the sampling efficiency
problem with the large motion dataset, we use a diffusion model to generate
natural reference motions, which provide high-level trajectory and fingering
(finger order and placement) information. However, the generated reference
motion alone does not provide sufficient accuracy for piano performance
modeling. We therefore augment the reference data by using musical similarity to
retrieve matching motions from the captured dataset, boosting the precision of
the RL policy. With the proposed method, our model generates natural, dexterous
motions that generalize to music from outside the training dataset.
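The retrieval step described above can be illustrated with a minimal sketch. The function names, the toy dataset, and the choice of a sequence-matching ratio over MIDI pitch sequences are all assumptions for illustration; the paper's actual musical-similarity measure and motion representation are not specified here.

```python
from difflib import SequenceMatcher

def musical_similarity(query, candidate):
    """Similarity in [0, 1] between two MIDI pitch sequences,
    using difflib's sequence-matching ratio as a stand-in metric."""
    return SequenceMatcher(None, query, candidate).ratio()

def retrieve_similar(query, dataset, top_k=1):
    """Rank dataset clips by similarity to the query and return
    the top_k (score, clip_id) pairs."""
    scored = sorted(
        ((musical_similarity(query, notes), cid)
         for cid, notes in dataset.items()),
        reverse=True,
    )
    return scored[:top_k]

# Hypothetical toy dataset: clip id -> MIDI pitch sequence
dataset = {
    "clip_a": [60, 62, 64, 65, 67],  # ascending C-major fragment
    "clip_b": [72, 71, 69, 67, 65],  # descending fragment
}
best = retrieve_similar([60, 62, 64, 67], dataset)
print(best[0][1])  # → clip_a
```

In the paper's pipeline, the retrieved motion clips would then serve as additional references for the RL policy, alongside the diffusion-generated motion.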