A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Abstract

The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of action tokens that progressively encode more grounded and actionable information, ultimately generating executable actions. We further determine that the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning. However, a comprehensive understanding of action tokens is still lacking, which significantly impedes effective VLA development and obscures future directions. Therefore, this survey aims to categorize and interpret existing VLA research through the lens of action tokenization, distill the strengths and limitations of each token type, and identify areas for improvement. Through this systematic review and analysis, we offer a synthesized outlook on the broader evolution of VLA models, highlight underexplored yet promising directions, and contribute guidance for future research, hoping to bring the field closer to general-purpose intelligence.
