OxyGen: Unified KV Cache Management for Vision-Language-Action Models under Multi-Task Parallelism
March 15, 2026
Authors: Xiangyu Li, Huaizhi Tang, Xin Ding, Weijun Wang, Ting Cao, Yunxin Liu
cs.AI
Abstract
Embodied AI agents increasingly require parallel execution of multiple tasks, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action Models (VLAs) architecturally support such heterogeneous outputs, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment due to redundant computation and resource contention. We identify isolated KV cache management as the root cause. To address this, we propose unified KV cache management, an inference paradigm that treats the KV cache as a first-class shared resource across tasks and over time. This abstraction enables two key optimizations: cross-task KV sharing eliminates redundant prefill of shared observations, while cross-frame continuous batching decouples variable-length language decoding from fixed-rate action generation across control cycles. We implement this paradigm for π_{0.5}, the most popular MoT VLA, and evaluate it under representative robotic configurations. OxyGen achieves up to a 3.7× speedup over isolated execution, delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously without action quality degradation.
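To make the abstract's two optimizations concrete, here is a minimal Python sketch: a pool that prefills the shared observation's KV once per frame (cross-task KV sharing), and a control loop that serves fixed-rate action generation every cycle while a variable-length language reply consumes the leftover budget and resumes on later frames (cross-frame continuous batching). Every name here (`UnifiedKVPool`, `control_loop`, the stubbed KV entries) is a hypothetical illustration of the idea, not OxyGen's actual implementation or API.

```python
# Minimal sketch of unified KV cache management as described above.
# All names are hypothetical illustrations, not OxyGen's actual API;
# a real system would hold per-layer key/value tensors from the model.

import time
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """KV entries for a token prefix (stubbed as strings for illustration)."""
    entries: list = field(default_factory=list)


class UnifiedKVPool:
    """Treats the KV cache as a first-class resource shared across tasks.

    The shared observation is prefilled once per frame; every task
    (action generation, language decoding, memory construction) then
    decodes against the same prefix instead of re-running its own prefill.
    """

    def __init__(self):
        self.shared_prefix = None  # KV of the current frame's observation

    def prefill_observation(self, observation_tokens):
        # Cross-task KV sharing: one prefill, amortized over all tasks.
        self.shared_prefix = KVCache([f"kv({t})" for t in observation_tokens])
        return self.shared_prefix


def control_loop(frames, action_hz=70, reply_tokens=40):
    """Cross-frame continuous batching, sketched: the fixed-rate action
    task is served every control cycle, while a variable-length language
    reply consumes whatever budget remains and resumes on later frames."""
    pool = UnifiedKVPool()
    pending = reply_tokens  # language tokens still to decode
    cycle = 1.0 / action_hz

    for i, frame in enumerate(frames):
        deadline = time.monotonic() + cycle
        prefix = pool.prefill_observation(frame)  # shared prefill, once
        # Leftover cycle time goes to language decode steps.
        while pending > 0 and time.monotonic() < deadline:
            time.sleep(0.002)  # simulated per-token decode cost
            pending -= 1
        yield (f"frame {i}: action chunk over {len(prefix.entries)} KV "
               f"entries, {pending} language tokens still pending")


if __name__ == "__main__":
    frames = [["img"] * 8 + ["state"] for _ in range(4)]
    for line in control_loop(frames):
        print(line)
```

The budget-based inner loop is the decoupling the abstract describes: action generation is never delayed by a long reply, because language decoding is simply suspended at the cycle deadline and continued against the next frame's shared prefix.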