

OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction

March 5, 2025
Authors: Huang Huang, Fangchen Liu, Letian Fu, Tingfan Wu, Mustafa Mukadam, Jitendra Malik, Ken Goldberg, Pieter Abbeel
cs.AI

Abstract

Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) because visual and language features are independently fed into downstream policies, degrading the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts and passes only task-relevant visual features that are semantically aligned with the language instruction to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen. OTTER thereby preserves and utilizes the rich semantic understanding learned from large-scale pre-training, enabling strong zero-shot generalization capabilities. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments. Video, code, checkpoints, and dataset: https://ottervla.github.io/.
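
The abstract describes the mechanism only at a high level, so below is a minimal sketch of how text-aware visual feature extraction could sit in front of a policy transformer, assuming frozen CLIP-style encoders, cross-attention pooling with text tokens as queries, and a 7-DoF action head. The module names, dimensions, and pooling choice are illustrative assumptions, not the authors' implementation; see the released code at https://ottervla.github.io/ for the actual architecture.

```python
# Sketch of text-aware visual feature extraction feeding a policy transformer.
# Frozen vision/language encoders are stood in by random tensors; all sizes,
# the cross-attention pooling, and the action head are assumptions.
import torch
import torch.nn as nn


class TextAwareVisualExtractor(nn.Module):
    """Pools visual patch tokens with text tokens as queries (assumed design)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, T, D) from a frozen language encoder
        # visual_tokens: (B, P, D) patch features from a frozen vision encoder
        # Text queries attend over visual patches, so only features aligned
        # with the instruction are passed downstream.
        pooled, _ = self.cross_attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        return pooled  # (B, T, D) task-relevant visual features


class PolicyTransformer(nn.Module):
    """Small transformer head mapping pooled features to an action (assumed)."""

    def __init__(self, dim: int = 512, action_dim: int = 7, num_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        x = self.encoder(features)          # (B, T, D)
        return self.action_head(x.mean(1))  # (B, action_dim), e.g. end-effector deltas


if __name__ == "__main__":
    B, T, P, D = 2, 16, 196, 512
    text_tokens = torch.randn(B, T, D)    # stand-in for frozen text features
    visual_tokens = torch.randn(B, P, D)  # stand-in for frozen patch features

    extractor = TextAwareVisualExtractor(dim=D)
    policy = PolicyTransformer(dim=D)
    actions = policy(extractor(text_tokens, visual_tokens))
    print(actions.shape)  # torch.Size([2, 7])
```

Because the encoders stay frozen, only the extractor and policy head would be trained on robot data in this sketch, which is what lets the pre-trained vision-language alignment carry over to novel objects and environments.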
