VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing
October 6, 2025
Authors: Yixiao Wang, Mingxiao Huo, Zhixuan Liang, Yushi Du, Lingfeng Sun, Haotian Lin, Jinghuan Shang, Chensheng Peng, Mohit Bansal, Mingyu Ding, Masayoshi Tomizuka
cs.AI
Abstract
Pretrained vision foundation models (VFMs) advance robotic learning via rich
visual representations, yet individual VFMs typically excel only in specific
domains, limiting generality across tasks. Distilling multiple VFMs into a
unified policy representation can mitigate this limitation but often yields
inflexible task-specific feature selection and requires costly full re-training
to incorporate robot-domain knowledge. We propose VER, a Vision Expert
transformer for Robot learning. During pretraining, VER distills multiple VFMs
into a vision expert library. It then fine-tunes only a lightweight routing
network (fewer than 0.4% of parameters) to dynamically select task-relevant
experts from the pretrained library for downstream robot tasks. We further
introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve
both flexibility and precision of dynamic expert selection. Moreover, VER
supports parameter-efficient fine-tuning for scalable expert utilization and
adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks
and multiple policy heads, VER achieves state-of-the-art performance. We find
that VER reduces large-norm outliers in task-irrelevant regions (e.g.,
background) and concentrates on task-critical regions. Visualizations and code
are available at https://yixiaowang7.github.io/ver_page/.
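To make the routing mechanism described in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch of patchwise Top-K expert routing with a curriculum annealing schedule. The class name `PatchwiseRouter`, the parameters `k_start`, `k_end`, and `anneal_steps`, and the linear annealing schedule are illustrative assumptions, not the paper's actual implementation or API.

```python
# Hypothetical sketch of patchwise Top-K expert routing with curriculum annealing.
# Names and the linear schedule are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchwiseRouter(nn.Module):
    """Routes each patch token to its Top-K vision experts.

    In this sketch, only the small gating layer is trainable for a downstream
    task; the vision expert library itself stays frozen.
    """

    def __init__(self, dim: int, num_experts: int,
                 k_start: int, k_end: int, anneal_steps: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)  # lightweight routing network
        self.k_start, self.k_end, self.anneal_steps = k_start, k_end, anneal_steps

    def current_k(self, step: int) -> int:
        # Curriculum Top-K annealing (assumed linear): start with many experts
        # per patch, then shrink K so routing becomes sparser and more precise.
        frac = min(step / max(self.anneal_steps, 1), 1.0)
        return max(round(self.k_start + frac * (self.k_end - self.k_start)), self.k_end)

    def forward(self, patch_tokens: torch.Tensor, step: int) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim)
        logits = self.gate(patch_tokens)              # (B, P, E) per-patch expert scores
        k = self.current_k(step)
        topk_vals, topk_idx = logits.topk(k, dim=-1)  # per-patch expert choice
        weights = F.softmax(topk_vals, dim=-1)        # renormalize over selected experts
        return torch.zeros_like(logits).scatter(-1, topk_idx, weights)  # sparse mixing weights


# Usage (shapes only): combine frozen expert features with the routing weights.
B, P, D, E = 2, 196, 768, 4
experts_out = torch.randn(B, P, E, D)          # stacked outputs of the frozen expert library
router = PatchwiseRouter(dim=D, num_experts=E, k_start=4, k_end=1, anneal_steps=10_000)
w = router(torch.randn(B, P, D), step=2_500)   # (B, P, E)
fused = (w.unsqueeze(-1) * experts_out).sum(dim=2)  # (B, P, D) task-adaptive representation
```

In this sketch only the gating layer would be updated for a downstream task, which is consistent with the abstract's claim that fewer than 0.4% of parameters are fine-tuned; the frozen expert library supplies the stacked expert features.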