
VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing

October 6, 2025
Authors: Yixiao Wang, Mingxiao Huo, Zhixuan Liang, Yushi Du, Lingfeng Sun, Haotian Lin, Jinghuan Shang, Chensheng Peng, Mohit Bansal, Mingyu Ding, Masayoshi Tomizuka
cs.AI

Abstract

Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified policy representation can mitigate this limitation but often yields inflexible task-specific feature selection and requires costly full re-training to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. It then fine-tunes only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both the flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient fine-tuning for scalable expert utilization and adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Visualizations and code can be found at https://yixiaowang7.github.io/ver_page/.
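The abstract describes the routing mechanism only at a high level. As a rough illustration, below is a minimal PyTorch-style sketch of patchwise top-K expert routing with a curriculum schedule on K, assuming a gating layer over features from a frozen expert library. All names, shapes, and the annealing direction are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchwiseExpertRouter(nn.Module):
    """Hypothetical sketch: per-patch top-K routing over a frozen expert library."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        # Lightweight routing network: one linear gate applied to each patch token.
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, patch_tokens: torch.Tensor, expert_feats: torch.Tensor, k: int):
        # patch_tokens: (B, N, D) patch embeddings
        # expert_feats: (B, N, E, D) features from E frozen vision experts
        logits = self.gate(patch_tokens)                       # (B, N, E)
        topk_val, topk_idx = logits.topk(k, dim=-1)            # per-patch top-K experts
        weights = F.softmax(topk_val, dim=-1)                  # (B, N, k)
        # Gather the selected experts' features and mix them with the routing weights.
        idx = topk_idx.unsqueeze(-1).expand(-1, -1, -1, expert_feats.size(-1))
        selected = expert_feats.gather(2, idx)                 # (B, N, k, D)
        return (weights.unsqueeze(-1) * selected).sum(dim=2)   # (B, N, D)

def curriculum_k(step: int, total_steps: int, k_start: int, k_end: int) -> int:
    # Assumed curriculum: linearly anneal K from k_start toward k_end over training.
    frac = min(step / max(total_steps, 1), 1.0)
    return max(k_end, round(k_start - frac * (k_start - k_end)))
```

Under these assumptions, only the gating layer would be trained for a downstream task, while the distilled expert features stay frozen, which is consistent with the "fewer than 0.4% of parameters" claim in the abstract.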