Flash-WAM: Modality-Aware Distillation for World Action Models

Abstract

World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion. They achieve strong performance on manipulation benchmarks but require tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting: the video and action streams use different SNR-shifted noise schedules and therefore see substantially different marginal noise distributions during training, an asymmetry that single-modality distillation methods cannot accommodate. We introduce Flash-WAM, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream's low-noise regime, paired with a variance-preserving parametrization for the video stream's high-noise regime. This choice is grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from 8.1 seconds to 348 ms on an NVIDIA L40S GPU, a 23× speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks (85.5% on RoboTwin 2.0, 95.7% on LIBERO) and substantially recovers real-world performance (60% average on a Unitree G1 humanoid robot), while naive consistency distillation drops to 24% at the same step budget.
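
To make the consistency boundary condition concrete, the following minimal sketch checks that two candidate consistency functions reduce to the identity at zero noise, f(x, 0) = x. The abstract does not state the exact formulas, so both forms are assumptions borrowed from the general consistency-model literature: a linear form x - t * net(x, t) and a trigonometric, variance-preserving form cos(t) * x - sin(t) * net(x, t). The names linear_consistency, trig_consistency, and toy_net are hypothetical, and toy_net is only a stand-in for a trained network.

```python
import math


def linear_consistency(x, t, net):
    """Assumed linear form: f(x, t) = x - t * net(x, t)."""
    out = net(x, t)
    return [xi - t * oi for xi, oi in zip(x, out)]


def trig_consistency(x, t, net):
    """Assumed variance-preserving form: f(x, t) = cos(t) * x - sin(t) * net(x, t).

    The squared coefficients sum to one for every t (cos^2 + sin^2 = 1).
    """
    out = net(x, t)
    return [math.cos(t) * xi - math.sin(t) * oi for xi, oi in zip(x, out)]


def toy_net(x, t):
    # Hypothetical stand-in for a trained network; any function works
    # for checking the boundary condition.
    return [math.tanh(xi) + t for xi in x]


x = [0.5, -1.25, 2.0]

# Boundary condition: at t = 0 both functions must return x unchanged.
at_zero_linear = linear_consistency(x, 0.0, toy_net)
at_zero_trig = trig_consistency(x, 0.0, toy_net)

# Away from t = 0 the network output contributes to the prediction.
at_half_linear = linear_consistency(x, 0.5, toy_net)
at_half_trig = trig_consistency(x, 0.5, toy_net)
```

In both forms the weight on the network output vanishes as t approaches 0, which is what enforces the boundary condition by construction. How that weight scales with t is one way to read the "gradient scaling" the abstract refers to, although the paper's precise definition may differ.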