Flash-WAM: Modality-Aware Distillation for World Action Models

June 3, 2026
Authors: Arman Akbari, Ci Zhang, Arash Akbari, Lin Zhao, Yixiao Chen, Weiwei Chen, Xuan Zhang, Geng Yuan, Yanzhi Wang
cs.AI

Abstract

World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion. They achieve strong performance on manipulation benchmarks but require tens of denoising steps, a cost that precludes real-time control. Step distillation is the natural remedy, but off-the-shelf methods break down in the joint video-action setting: the video and action streams use different SNR-shifted noise schedules and see substantially different marginal noise distributions during training, an asymmetry that single-modality distillation methods cannot accommodate. We introduce Flash-WAM, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream's low-noise regime, paired with a variance-preserving parametrization for the video stream's high-noise regime. This choice is grounded in a structural analysis of the consistency-function family that characterizes the gradient scaling achievable under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from 8.1 seconds to 348 ms on an NVIDIA L40S, a 23× speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks (85.5% on RoboTwin 2.0, 95.7% on LIBERO) and substantially recovers real-world performance (60% average on a Unitree G1 humanoid robot), while naive consistency distillation drops to 24% at the same step budget.
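To make the noise-schedule asymmetry concrete, the sketch below shows how two streams that sample the same uniform timesteps can end up training at very different average noise levels once each applies its own shift. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the schedule formula or the shift values, so the shift map `s*t / (1 + (s - 1)*t)` (a form commonly used in flow-matching models), the convention that `t = 1` is pure noise, and the constants `VIDEO_SHIFT` and `ACTION_SHIFT` are all hypothetical.

```python
import random


def shift_time(t: float, shift: float) -> float:
    """Map a uniform time t in [0, 1] to a shifted time.

    Assumed convention: t = 0 is clean data, t = 1 is pure noise.
    shift > 1 pushes samples toward the high-noise end; shift = 1 is the identity.
    """
    return shift * t / (1.0 + (shift - 1.0) * t)


def mean_noise_level(shift: float, n: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the average shifted time under uniform sampling."""
    rng = random.Random(seed)
    return sum(shift_time(rng.random(), shift) for _ in range(n)) / n


# Hypothetical shift values, chosen only to illustrate the asymmetry.
VIDEO_SHIFT = 5.0   # video stream skewed toward high noise
ACTION_SHIFT = 1.0  # action stream left unshifted

video_mean = mean_noise_level(VIDEO_SHIFT)
action_mean = mean_noise_level(ACTION_SHIFT)

print(f"mean noise level, video stream:  {video_mean:.3f}")
print(f"mean noise level, action stream: {action_mean:.3f}")
```

With these assumed values, the video stream spends most of training at high noise while the action stream is centered at mid-range noise. This is the kind of mismatch that the abstract says a single distillation parametrization cannot accommodate for both streams.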