Flash-WAM: Modality-Aware Distillation for World-Action Models

Abstract

World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion. They achieve strong performance on manipulation benchmarks but require tens of denoising steps, a cost that precludes real-time control. Step distillation is the natural remedy, but off-the-shelf methods break down in the joint video-action setting: the video and action streams use different SNR-shifted noise schedules and are trained under substantially different marginal noise distributions, an asymmetry that single-modality distillation methods cannot accommodate. We introduce Flash-WAM, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream's low-noise regime, paired with a variance-preserving parametrization for the video stream's high-noise regime. This choice is grounded in a structural analysis of the consistency-function family that characterizes the gradient scaling achievable under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from 8.1 seconds to 348 ms on an NVIDIA L40S, a 23× speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks (85.5% on RoboTwin 2.0, 95.7% on LIBERO) and substantially recovers real-world performance (60% average on a Unitree G1 humanoid robot), whereas naive consistency distillation drops to 24% at the same step budget.
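
To make the noise-regime asymmetry concrete, the following is a minimal sketch of how two different shift factors change the distribution of noise levels seen during training. It assumes the time-shift form commonly used in flow-matching models, t' = s·t / (1 + (s − 1)·t); the abstract does not state the schedule's functional form or the shift values, so `shift_noise_level`, `VIDEO_SHIFT`, and `ACTION_SHIFT` are hypothetical names and numbers chosen only for illustration.

```python
import random


def shift_noise_level(t: float, shift: float) -> float:
    """Map a noise level t in [0, 1] to a shifted level.

    Assumed form (not taken from the paper): shift > 1 moves probability
    mass toward high noise, while shift == 1 leaves t unchanged.
    """
    return shift * t / (1.0 + (shift - 1.0) * t)


def mean_noise_level(shift: float, n: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the mean shifted noise level for t ~ U(0, 1)."""
    rng = random.Random(seed)
    return sum(shift_noise_level(rng.random(), shift) for _ in range(n)) / n


# Hypothetical shift factors, for illustration only.
VIDEO_SHIFT = 5.0   # larger shift: training concentrates on high noise
ACTION_SHIFT = 1.0  # no shift: noise levels stay uniform

video_mean = mean_noise_level(VIDEO_SHIFT)
action_mean = mean_noise_level(ACTION_SHIFT)

print(f"mean noise level, video stream:  {video_mean:.3f}")
print(f"mean noise level, action stream: {action_mean:.3f}")
```

Under these assumed values the video stream spends most of training at markedly higher noise levels than the action stream, which is the kind of gap a single shared consistency parametrization would have to cover.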