Learning Native Continuation for Action Chunking Flow Policies

Abstract

Action chunking enables Vision-Language-Action (VLA) models to run in real time, but naive chunked execution often exhibits discontinuities at chunk boundaries. Real-Time Chunking (RTC) alleviates this issue but is external to the policy, leading to spurious multimodal switching and trajectories that are not intrinsically smooth. We propose Legato, a training-time continuation method for action-chunked flow-based VLA policies. Specifically, Legato initializes denoising from a schedule-shaped mixture of known actions and noise, exposing the model to partial action information. Moreover, Legato reshapes the learned flow dynamics to ensure that the denoising process remains consistent between training and inference under per-step guidance. Legato further uses randomized schedule conditioning during training to support varying inference delays and achieve controllable smoothness. Empirically, Legato produces smoother trajectories and reduces spurious multimodal switching during execution, leading to less hesitation and shorter task completion time. Extensive real-world experiments show that Legato consistently outperforms RTC across five manipulation tasks, achieving improvements of approximately 10% in both trajectory smoothness and task completion time.
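
The abstract does not give the exact form of the schedule-shaped initialization, so the following is only a minimal sketch of one plausible reading. It assumes a per-timestep weight that is 1 for actions already committed during the inference delay, decays linearly over a blend window, and is 0 (pure noise) afterwards. The function names `schedule_weights` and `mixed_init`, the linear decay, and the parameters `delay` and `blend` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def schedule_weights(horizon, delay, blend):
    """Hypothetical schedule (an assumption, not taken from the paper).

    Weight 1.0 for the first `delay` steps, a linear decay over the next
    `blend` steps, and 0.0 for the rest of the chunk.
    """
    t = np.arange(horizon)
    return np.clip(1.0 - (t - delay + 1) / (blend + 1), 0.0, 1.0)

def mixed_init(known_actions, weights, rng):
    """Start denoising from a per-timestep mix of known actions and noise.

    known_actions: array of shape (horizon, action_dim), e.g. the overlap
    carried over from the previous chunk (assumed layout).
    weights: array of shape (horizon,), one mixing weight per timestep.
    """
    noise = rng.standard_normal(known_actions.shape)
    w = weights[:, None]  # broadcast the weight over the action dimension
    return w * known_actions + (1.0 - w) * noise

# Example: 8-step chunk, 2 committed steps, 3-step blend window.
rng = np.random.default_rng(0)
w = schedule_weights(horizon=8, delay=2, blend=3)
x0 = mixed_init(np.ones((8, 4)), w, rng)
```

Under this reading, sampling `delay` and `blend` at random during training would be one way to realize the randomized schedule conditioning the abstract mentions, with a wider blend window favoring smoother transitions between chunks.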
