VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference

Abstract

Vision-Language-Action models (VLAs) are becoming increasingly capable across diverse robotic tasks. However, their real-world deployment remains slow and inefficient: demonstration videos are often sped up by 5-10x to appear smooth, and they still show noticeable action stalls and delayed reactions to environmental changes. Asynchronous inference offers a promising route to continuous, low-latency control by letting the robot execute actions while inference runs in parallel. However, because the robot and its environment continue to evolve during inference, a temporal misalignment arises between the prediction interval and the execution interval. This misalignment causes significant action instability, and existing methods mitigate it only by degrading accuracy or introducing runtime overhead. We propose VLASH, a general asynchronous inference framework for VLAs that delivers smooth, accurate, and fast-reacting control without additional overhead or architectural changes. VLASH estimates the future execution-time state by rolling the robot state forward with the previously generated action chunk, thereby bridging the gap between prediction and execution. Experiments show that VLASH achieves up to 2.03x speedup and reduces reaction latency by up to 17.4x compared to synchronous inference, while fully preserving the original accuracy. Moreover, it enables VLAs to handle fast-reaction, high-precision tasks such as playing ping-pong and whack-a-mole, where traditional synchronous inference fails. Code is available at https://github.com/mit-han-lab/vlash.
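
The roll-forward idea can be illustrated with a minimal sketch. This is not the code from the VLASH repository: the function name roll_forward_state, its parameters, and the assumption that each action is a per-joint delta added to the state are all hypothetical choices made for illustration. The sketch estimates the state the robot will be in once the actions still pending from the previous chunk have been executed during the inference delay.

```python
from typing import List, Sequence


def roll_forward_state(
    current_state: Sequence[float],
    previous_chunk: Sequence[Sequence[float]],
    next_index: int,
    delay_steps: int,
) -> List[float]:
    """Estimate the robot state at the time the next chunk starts executing.

    Assumption (illustrative, not from the source): every action is a
    per-joint delta that is applied additively to the state.
    """
    state = list(current_state)
    # Actions from the previous chunk that will run while inference is in flight.
    pending = previous_chunk[next_index:next_index + delay_steps]
    for action in pending:
        state = [s + a for s, a in zip(state, action)]
    return state


# Hypothetical example: two joints, inference takes two control steps,
# and the robot is about to execute action 1 of the previous chunk.
chunk = [[0.5, 0.0], [0.5, -0.25], [0.5, -0.25], [0.0, 0.0]]
future_state = roll_forward_state([0.0, 1.0], chunk, next_index=1, delay_steps=2)
```

Under these assumptions the policy would be conditioned on future_state rather than on the state observed when inference starts, so the new chunk is predicted for the moment it actually begins executing.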
