VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting

Abstract

Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm: they cannot see, hear, speak, and act concurrently, nor can they handle real-time user interruptions dynamically. This hinders seamless embodied collaboration and results in an inflexible, unresponsive user experience. To address these limitations, we introduce VITA-E, a novel embodied interaction framework designed for both behavioral concurrency and near-real-time interruption. The core of our approach is a dual-model architecture in which two parallel VLA instances operate as an "Active Model" and a "Standby Model", allowing the embodied agent to observe its environment, listen to user speech, provide verbal responses, and execute actions, all concurrently and interruptibly, mimicking human-like multitasking. We further propose a "model-as-controller" paradigm, in which we fine-tune the VLM to generate special tokens that serve as direct system-level commands, coupling the model's reasoning with the system's behavior. Experiments conducted on a physical humanoid platform demonstrate that VITA-E can reliably handle complex interactive scenarios. Our framework is compatible with various dual-system VLA models, achieving an extremely high success rate on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable embodied assistants.
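To make the "model-as-controller" idea concrete, the following is a minimal sketch of how special tokens emitted by a model might be mapped to system-level commands in a two-instance setup. It is an illustration under stated assumptions, not the VITA-E implementation: the token strings (`<|speak|>`, `<|act|>`, `<|stop|>`), the `DualModelController` class, and the rule that an action request swaps the Active and Standby roles are all hypothetical and do not come from the abstract.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Command(Enum):
    # All token strings below are hypothetical placeholders.
    SPEAK = "<|speak|>"  # reply verbally without changing the current action
    ACT = "<|act|>"      # start executing a new action
    STOP = "<|stop|>"    # emergency stop


def parse_command(generated: str) -> tuple[Command, str]:
    """Split generated text into its leading control token and the payload."""
    for cmd in Command:
        if generated.startswith(cmd.value):
            return cmd, generated[len(cmd.value):].strip()
    # Assumption: output without a control token is treated as plain speech.
    return Command.SPEAK, generated.strip()


@dataclass
class DualModelController:
    """Toy dispatcher for two model instances labelled "A" and "B"."""

    active: str = "A"
    standby: str = "B"
    acting: bool = False
    log: list[str] = field(default_factory=list)

    def on_standby_output(self, generated: str) -> None:
        """Handle text the standby instance produced for a new user utterance."""
        cmd, payload = parse_command(generated)
        if cmd is Command.STOP:
            # Halt whatever the active instance is doing.
            self.acting = False
            self.log.append("halt")
        elif cmd is Command.ACT:
            # Assumed policy: the standby instance takes over as the active one.
            self.active, self.standby = self.standby, self.active
            self.acting = True
            self.log.append(f"act:{payload}")
        else:
            # Speech is issued without touching the ongoing action.
            self.log.append(f"say:{payload}")


ctrl = DualModelController()
ctrl.on_standby_output("<|act|> pick up the cup")
ctrl.on_standby_output("<|speak|> Sure, on it.")
ctrl.on_standby_output("<|stop|>")
```

In this sketch the speech branch leaves `acting` untouched, which is one simple way to represent concurrent speech and action, while the stop branch takes effect regardless of which instance is currently active.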

