IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance

Abstract

Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) on simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3% to 97.1%). All code and models will be released publicly. Visualizations are available at jongwoopark7978.github.io/IVRA.
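
The abstract does not spell out how the affinity hints are computed or injected, so the following is only a minimal sketch of the general idea under stated assumptions: affinity is taken to be cosine similarity between vision-encoder patch features, and "injection" is taken to be blending each visual token's hidden state with the mean of its high-affinity neighbours. The function names, `threshold`, and `alpha` are hypothetical and are not taken from the paper. No parameters are learned or updated, which mirrors the training-free setting.

```python
import math


def cosine_affinity(patch_feats):
    """Pairwise cosine similarity between patch feature vectors (assumed affinity)."""
    # A zero vector gets norm 1.0 so the division below stays defined.
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in patch_feats]
    n = len(patch_feats)
    return [
        [
            sum(a * b for a, b in zip(patch_feats[i], patch_feats[j]))
            / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def inject_affinity(hidden, affinity, threshold=0.5, alpha=0.5):
    """Blend each visual token with the mean of its high-affinity neighbours.

    hidden:    per-token hidden vectors from one (hypothetical) language-model layer
    affinity:  square matrix from cosine_affinity
    threshold: illustrative cutoff for "same instance" (assumption)
    alpha:     illustrative blending weight (assumption)
    """
    mixed = []
    for i, h in enumerate(hidden):
        nbrs = [j for j, a in enumerate(affinity[i]) if a >= threshold]
        if not nbrs:
            # Fall back to the token itself if nothing passes the cutoff.
            nbrs = [i]
        pooled = [
            sum(hidden[j][d] for j in nbrs) / len(nbrs) for d in range(len(h))
        ]
        mixed.append([(1 - alpha) * x + alpha * p for x, p in zip(h, pooled)])
    return mixed


# Toy example: patches 0 and 1 look alike, patch 2 is different.
patch_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
hidden = [[1.0, 0.0], [3.0, 0.0], [0.0, 5.0]]
mixed = inject_affinity(hidden, cosine_affinity(patch_feats))
```

In this toy case tokens 0 and 1 are pulled toward each other while token 2 is left unchanged, which is the kind of instance-level grouping the abstract describes. The actual layer selection and injection rule in IVRA may differ.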
