OxyGen: Unified KV Cache Management for Vision-Language-Action Models under Multi-Task Parallelism

Abstract

Embodied AI agents increasingly need to execute multiple tasks in parallel, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action models (VLAs) support such heterogeneous outputs architecturally, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment because of redundant computation and resource contention. We identify isolated KV cache management as the root cause. To address this, we propose unified KV cache management, an inference paradigm that treats the KV cache as a first-class resource shared across tasks and over time. This abstraction enables two key optimizations: cross-task KV sharing eliminates redundant prefill of shared observations, while cross-frame continuous batching decouples variable-length language decoding from fixed-rate action generation across control cycles. We implement this paradigm for π_{0.5}, the most popular MoT VLA, and evaluate it under representative robotic configurations. OxyGen achieves up to a 3.7× speedup over isolated execution, simultaneously delivering over 200 tokens/s of language throughput and a 70 Hz action frequency without degrading action quality.
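To make the cross-task KV sharing idea concrete, here is a minimal sketch in plain Python. It is not the OxyGen implementation: `KVCache`, `SharedKVManager`, `run_isolated`, and `run_shared` are hypothetical names introduced for illustration, and the "cache" is a toy record rather than real key/value tensors. The sketch only counts how many times the shared observation is prefilled when each task works in isolation versus when all tasks read one cached prefix.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Toy stand-in for the key/value tensors of an observation prefix."""
    frame_id: int
    num_tokens: int


@dataclass
class SharedKVManager:
    """Hypothetical manager: one prefill per frame, reused by every task."""
    prefill_calls: int = 0
    _cache: dict = field(default_factory=dict)

    def get_or_prefill(self, frame_id, obs_tokens):
        # Prefill only if no task has processed this frame yet.
        if frame_id not in self._cache:
            self.prefill_calls += 1
            self._cache[frame_id] = KVCache(frame_id, len(obs_tokens))
        return self._cache[frame_id]


def run_isolated(frame_id, obs_tokens, tasks):
    """Baseline: every task prefills the same observation on its own."""
    calls = 0
    for _ in tasks:
        KVCache(frame_id, len(obs_tokens))  # redundant per-task prefill
        calls += 1
    return calls


def run_shared(manager, frame_id, obs_tokens, tasks):
    """Shared: all tasks read the same cached observation prefix."""
    return [(task, manager.get_or_prefill(frame_id, obs_tokens)) for task in tasks]


# Assumed setup: two tasks (action and language) consuming one frame.
tasks = ["action", "language"]
obs = list(range(8))  # placeholder observation tokens
manager = SharedKVManager()
results = run_shared(manager, 0, obs, tasks)
print(run_isolated(0, obs, tasks), manager.prefill_calls)  # prints "2 1"
```

In this toy setting the isolated baseline prefills once per task, while the shared manager prefills once per frame regardless of how many tasks consume it. A real system would also have to manage tensor memory and cache lifetime across control cycles.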
