
Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge

December 7, 2025
Authors: Ilia Larchenko, Gleb Zarin, Akash Karnatak
cs.AI

Abstract

We present a vision-action policy that won 1st place in the 2025 BEHAVIOR Challenge, a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation, requiring bimanual manipulation, navigation, and context-aware decision making. Building on the Pi0.5 architecture, we introduce several innovations. Our primary contribution is correlated noise for flow matching, which improves training efficiency and enables correlation-aware inpainting for smooth action sequences. We also apply learnable mixed-layer attention and System 2 stage tracking for ambiguity resolution. Training employs multi-sample flow matching to reduce variance, while inference uses action compression and challenge-specific correction rules. Our approach achieves a 26% q-score across all 50 tasks on both public and private leaderboards.
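To make the core idea concrete: in standard flow matching for action policies, the source noise is i.i.d. across timesteps of the action chunk. "Correlated noise" replaces this with noise that is coupled across time. The abstract does not specify the correlation scheme, so the sketch below is a hypothetical illustration using an AR(1) process (with `rho` as an assumed correlation parameter) to produce temporally correlated noise with unit marginal variance, paired with the usual rectified-flow interpolation target.

```python
import numpy as np

def correlated_noise(horizon, action_dim, rho=0.9, rng=None):
    """Sample temporally correlated Gaussian noise for an action chunk.

    Hypothetical sketch: the paper does not state its correlation scheme.
    An AR(1) process is one simple choice that couples noise across
    timesteps while keeping Var(noise[t]) = 1 at every step.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((horizon, action_dim))
    noise = np.empty_like(eps)
    noise[0] = eps[0]
    scale = np.sqrt(1.0 - rho ** 2)  # preserves unit marginal variance
    for t in range(1, horizon):
        noise[t] = rho * noise[t - 1] + scale * eps[t]
    return noise

def flow_matching_pair(actions, t, rho=0.9, rng=None):
    """Build one flow-matching training pair with correlated source noise.

    Uses the standard rectified-flow interpolation
    x_t = (1 - t) * noise + t * actions, with velocity target
    v = actions - noise as the regression target for the policy.
    """
    noise = correlated_noise(*actions.shape, rho=rho, rng=rng)
    x_t = (1.0 - t) * noise + t * actions
    target_velocity = actions - noise
    return x_t, target_velocity
```

Because consecutive noise steps are strongly coupled (for `rho` near 1), adjacent actions in a sampled chunk start from similar noise values, which is what makes correlation-aware inpainting of smooth action sequences possible.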