Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots

Abstract

We study whether novel manipulation skills can be learned from human actions and transferred to a bimanual robot with parallel grippers. Human action data is cheap, abundant, and diverse, making it one of the most promising resources for scaling up robot learning. Yet transferring skills from humans to robots remains hard. Most prior work treats humans as just another bimanual 6DoF embodiment, even though hand-pose estimates are noisy and the contact patterns of human fingers differ fundamentally from those of a parallel gripper. We argue that learning rotation-inclusive action signals from human data is therefore sub-optimal, and instead propose a bridging action representation: the relative wrist translation within the initial head-camera frame, an action space shared by humans and robots. To handle the potential absence of certain action components in different embodiments, we build a π_0-like vision-language-action model with interleaved action tokens and attention masking. On a suite of novel bimanual manipulation tasks, our bridging action transfers human manipulation knowledge to robots far more effectively than noisy 6DoF human actions, and it scales with the amount of human data.
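As a rough illustration of the bridging action, the sketch below expresses a wrist trajectory in the initial head-camera frame and returns translation-only actions. It is a minimal sketch under stated assumptions, not the implementation used in this work: the function name `relative_wrist_translation`, the argument names, the world-frame inputs, and the choice to measure displacement from the first timestep (rather than per-step deltas) are all hypothetical.

```python
import numpy as np


def relative_wrist_translation(wrist_world, R_world_cam0, t_world_cam0):
    """Translation-only actions in the initial head-camera frame (illustrative).

    Assumed inputs (hypothetical conventions, not taken from the paper):
      wrist_world:  (T, 3) wrist positions in a fixed world frame.
      R_world_cam0: (3, 3) rotation mapping initial-camera-frame vectors to world.
      t_world_cam0: (3,)   position of the initial camera origin in world.
    Returns a (T, 3) array of wrist displacements from the first timestep,
    expressed in the initial head-camera frame.
    """
    wrist_world = np.asarray(wrist_world, dtype=float)
    R = np.asarray(R_world_cam0, dtype=float)
    t = np.asarray(t_world_cam0, dtype=float)

    # p_cam = R^T (p_world - t); for row vectors this is (p_world - t) @ R.
    wrist_cam0 = (wrist_world - t) @ R

    # Assumption: "relative" means displacement from the first wrist position.
    return wrist_cam0 - wrist_cam0[0]
```

Because only translations are kept, the same function applies unchanged to an estimated human wrist track and to a robot end-effector track, which is the sense in which the action space is shared across embodiments.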