

VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing

October 6, 2025
Authors: Yixiao Wang, Mingxiao Huo, Zhixuan Liang, Yushi Du, Lingfeng Sun, Haotian Lin, Jinghuan Shang, Chensheng Peng, Mohit Bansal, Mingyu Ding, Masayoshi Tomizuka
cs.AI

Abstract

Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation for policy learning can mitigate this limitation, but it often yields inflexible task-specific feature selection and requires costly full re-training to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. It then fine-tunes only a lightweight routing network (fewer than 0.4% of the parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both the flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient fine-tuning for scalable expert utilization and adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Visualizations and code can be found at https://yixiaowang7.github.io/ver_page/.
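
The core mechanism described in the abstract, per-patch selection of top-K vision experts with a curriculum that anneals K over training, can be sketched as follows. This is a minimal illustration assuming a PyTorch setting; the names (PatchwiseRouter, anneal_k) and all shapes and hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (assumed PyTorch): patchwise top-K expert routing
# with curriculum annealing of K. Not the official VER code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchwiseRouter(nn.Module):
    """Selects top-K experts independently for every patch token."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        # Lightweight routing head: a single linear gate per patch token.
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, patch_tokens: torch.Tensor, k: int) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim)
        logits = self.gate(patch_tokens)                 # (B, P, E)
        topk_vals, topk_idx = logits.topk(k, dim=-1)     # per-patch top-K experts
        weights = F.softmax(topk_vals, dim=-1)           # renormalize over the selected K
        routing = torch.zeros_like(logits)               # zeros for unselected experts
        routing.scatter_(-1, topk_idx, weights)
        return routing                                   # (B, P, E) sparse routing weights


def anneal_k(step: int, total_steps: int, k_start: int, k_end: int) -> int:
    """Curriculum Top-K annealing: begin with many experts, end with few."""
    frac = min(step / max(total_steps, 1), 1.0)
    return max(k_end, round(k_start - frac * (k_start - k_end)))


if __name__ == "__main__":
    # Usage sketch: fuse a library of per-patch expert features with the routing weights.
    B, P, D, E = 2, 196, 768, 8
    router = PatchwiseRouter(D, E)
    patch_tokens = torch.randn(B, P, D)
    expert_feats = torch.randn(B, P, E, D)    # stand-in for the frozen expert library outputs
    k = anneal_k(step=500, total_steps=1000, k_start=E, k_end=2)
    routing = router(patch_tokens, k)         # (B, P, E)
    fused = torch.einsum("bpe,bped->bpd", routing, expert_feats)
    print(fused.shape)                        # torch.Size([2, 196, 768])
```

Because only the routing gate is trained for a downstream task while the distilled expert library stays frozen, the trainable fraction can remain a small share of the full model, consistent with the sub-0.4% figure reported in the abstract.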