
Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge

December 7, 2025
Authors: Ilia Larchenko, Gleb Zarin, Akash Karnatak
cs.AI

Abstract

We present a vision-action policy that won 1st place in the 2025 BEHAVIOR Challenge, a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation, requiring bimanual manipulation, navigation, and context-aware decision making. Building on the Pi0.5 architecture, we introduce several innovations. Our primary contribution is correlated noise for flow matching, which improves training efficiency and enables correlation-aware inpainting for smooth action sequences. We also apply learnable mixed-layer attention and System 2 stage tracking for ambiguity resolution. Training employs multi-sample flow matching to reduce variance, while inference uses action compression and challenge-specific correction rules. Our approach achieves a 26% q-score across all 50 tasks on both the public and private leaderboards.
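The "correlated noise for flow matching" idea from the abstract can be sketched as follows. This is a minimal illustration, assuming an AR(1)-style covariance (rho to the power |i - j|) across the action horizon; the function name, shapes, and the specific correlation structure are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def correlated_noise(horizon, action_dim, rho=0.9, rng=None):
    """Gaussian noise with AR(1)-style correlation across timesteps.

    Cov[i, j] = rho ** |i - j|, so nearby action steps share noise,
    in contrast to the i.i.d. noise of standard flow matching.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    idx = np.arange(horizon)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])
    chol = np.linalg.cholesky(cov)           # correlate along the time axis
    white = rng.standard_normal((horizon, action_dim))
    return chol @ white

# Flow-matching interpolant: blend noise toward a target action chunk.
horizon, action_dim = 16, 7
actions = np.zeros((horizon, action_dim))    # placeholder ground-truth chunk
eps = correlated_noise(horizon, action_dim)
t = 0.5
x_t = (1.0 - t) * eps + t * actions          # training input at flow time t
```

Because the noise is smooth in time, a partially generated (inpainted) action sequence stays consistent with the overwritten steps, which is what makes correlation-aware inpainting produce smooth trajectories.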