Flash-WAM: Modality-Aware Distillation for World Action Models

Abstract

World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion. They achieve strong performance on manipulation benchmarks but require tens of denoising steps, a cost that precludes real-time control. Step distillation is the natural remedy, but off-the-shelf methods break down in the joint video-action setting: the video and action streams use different SNR-shifted noise schedules and encounter substantially different marginal noise distributions during training, an asymmetry that single-modality distillation methods cannot accommodate. We introduce Flash-WAM, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream's low-noise regime, paired with a variance-preserving parametrization for the video stream's high-noise regime. This choice is grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from 8.1 seconds to 348 ms on an NVIDIA L40S, a 23× speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks (85.5% on RoboTwin 2.0, 95.7% on LIBERO) and substantially recovers real-world performance (60% average on a Unitree G1 humanoid robot), whereas naive consistency distillation drops to 24% at the same step budget.
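
For readers unfamiliar with the consistency boundary condition mentioned above, the following minimal sketch shows what it requires of a consistency function: at the smallest noise level, the function must return its input unchanged. The skip/output coefficients are the generic ones from the consistency-models literature, not necessarily either of the two parametrizations Flash-WAM pairs with the action and video streams. The constants `SIGMA_DATA` and `EPS`, the helper names, and `toy_network` are assumptions made for illustration only.

```python
import math

# Assumed illustrative constants; the abstract does not state any values.
SIGMA_DATA = 0.5   # hypothetical data standard deviation
EPS = 0.002        # hypothetical smallest noise level


def c_skip(t: float) -> float:
    """Weight on the noisy input; equals 1 at t = EPS."""
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)


def c_out(t: float) -> float:
    """Weight on the network output; equals 0 at t = EPS."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA**2 + t**2)


def consistency_fn(x: float, t: float, network) -> float:
    """Generic consistency function f(x, t) = c_skip(t) * x + c_out(t) * F(x, t)."""
    return c_skip(t) * x + c_out(t) * network(x, t)


def toy_network(x: float, t: float) -> float:
    """Placeholder for a learned network; any function works for this check."""
    return -x


if __name__ == "__main__":
    x = 1.25
    # Boundary condition: at the smallest noise level the input passes through.
    print(consistency_fn(x, EPS, toy_network) == x)
    # At high noise the skip weight is nearly zero, so the network dominates.
    print(c_skip(80.0) < 1e-3)
```

Because the boundary condition is enforced by the coefficients and not by training, it holds for any network. What differs between parametrizations is how the two weights scale away from the boundary, which is the degree of freedom the abstract says Flash-WAM chooses per modality.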