VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing
October 6, 2025
Authors: Yixiao Wang, Mingxiao Huo, Zhixuan Liang, Yushi Du, Lingfeng Sun, Haotian Lin, Jinghuan Shang, Chensheng Peng, Mohit Bansal, Mingyu Ding, Masayoshi Tomizuka
cs.AI
Abstract
Pretrained vision foundation models (VFMs) advance robotic learning via rich
visual representations, yet individual VFMs typically excel only in specific
domains, limiting generality across tasks. Distilling multiple VFMs into a
unified policy representation can mitigate this limitation but often yields
inflexible task-specific feature selection and requires costly full re-training
to incorporate robot-domain knowledge. We propose VER, a Vision Expert
transformer for Robot learning. During pretraining, VER distills multiple VFMs
into a vision expert library. It then fine-tunes only a lightweight routing
network (fewer than 0.4% of parameters) to dynamically select task-relevant
experts from the pretrained library for downstream robot tasks. We further
introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve
both flexibility and precision of dynamic expert selection. Moreover, VER
supports parameter-efficient fine-tuning for scalable expert utilization and
adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks
and multiple policy heads, VER achieves state-of-the-art performance. We find
that VER reduces large-norm outliers in task-irrelevant regions (e.g.,
background) and concentrates on task-critical regions. Visualizations and code
are available at https://yixiaowang7.github.io/ver_page/.
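To make the routing mechanism described in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch of patchwise Top-K expert routing with a curriculum annealing schedule. The class name `PatchwiseRouter`, the parameters `k_start`, `k_end`, and `anneal_steps`, and the linear annealing schedule are illustrative assumptions, not the paper's actual implementation or API.

```python
# Hypothetical sketch of patchwise Top-K expert routing with curriculum annealing.
# Names and the linear schedule are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchwiseRouter(nn.Module):
    """Routes each patch token to its Top-K vision experts.

    In this sketch, only the small gating layer is trainable for a downstream
    task; the vision expert library itself stays frozen.
    """

    def __init__(self, dim: int, num_experts: int,
                 k_start: int, k_end: int, anneal_steps: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)  # lightweight routing network
        self.k_start, self.k_end, self.anneal_steps = k_start, k_end, anneal_steps

    def current_k(self, step: int) -> int:
        # Curriculum Top-K annealing (assumed linear): start with many experts
        # per patch, then shrink K so routing becomes sparser and more precise.
        frac = min(step / max(self.anneal_steps, 1), 1.0)
        return max(round(self.k_start + frac * (self.k_end - self.k_start)), self.k_end)

    def forward(self, patch_tokens: torch.Tensor, step: int) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim)
        logits = self.gate(patch_tokens)              # (B, P, E) per-patch expert scores
        k = self.current_k(step)
        topk_vals, topk_idx = logits.topk(k, dim=-1)  # per-patch expert choice
        weights = F.softmax(topk_vals, dim=-1)        # renormalize over selected experts
        return torch.zeros_like(logits).scatter(-1, topk_idx, weights)  # sparse mixing weights


# Usage (shapes only): combine frozen expert features with the routing weights.
B, P, D, E = 2, 196, 768, 4
experts_out = torch.randn(B, P, E, D)          # stacked outputs of the frozen expert library
router = PatchwiseRouter(dim=D, num_experts=E, k_start=4, k_end=1, anneal_steps=10_000)
w = router(torch.randn(B, P, D), step=2_500)   # (B, P, E)
fused = (w.unsqueeze(-1) * experts_out).sum(dim=2)  # (B, P, D) task-adaptive representation
```

In this sketch only the gating layer would be updated for a downstream task, which is consistent with the abstract's claim that fewer than 0.4% of parameters are fine-tuned; the frozen expert library supplies the stacked expert features.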