VoxServe: Streaming-Centric Serving System for Speech Language Models
January 30, 2026
Authors: Keisuke Kamahori, Wei-Tzu Lee, Atindra Jha, Rohan Kadekodi, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci
cs.AI
Abstract
Deploying modern Speech Language Models (SpeechLMs) in streaming settings requires systems that provide low latency, high throughput, and strong guarantees of streamability. Existing systems fall short of supporting diverse models flexibly and efficiently. We present VoxServe, a unified serving system for SpeechLMs that optimizes streaming performance. VoxServe introduces a model-execution abstraction that decouples model architecture from system-level optimizations, thereby enabling support for diverse SpeechLM architectures within a single framework. Building on this abstraction, VoxServe implements streaming-aware scheduling and an asynchronous inference pipeline to improve end-to-end efficiency. Evaluations across multiple modern SpeechLMs show that VoxServe achieves 10-20x higher throughput than existing implementations at comparable latency while maintaining high streaming viability. The code of VoxServe is available at https://github.com/vox-serve/vox-serve.
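The "streaming viability" the abstract refers to can be made concrete with a small sketch: a generated audio stream stays viable only if each chunk is ready before the client's playback cursor reaches it, and the first chunk meets the latency budget. This is an illustrative model only (the function name and parameters are assumptions, not VoxServe's actual API):

```python
def is_viable(gen_times, chunk_sec, ttfa_budget):
    """Check whether a streamed audio generation avoids playback stalls.

    gen_times    -- gen_times[i] is the wall-clock time (s) at which
                    audio chunk i finished generating
    chunk_sec    -- playback duration (s) of each chunk
    ttfa_budget  -- latency budget (s) for time-to-first-audio
    """
    # Time-to-first-audio must meet the latency budget.
    if gen_times[0] > ttfa_budget:
        return False
    # Playback begins when the first chunk arrives; chunk i is then
    # consumed at gen_times[0] + i * chunk_sec. Any later arrival
    # means the decoder fell behind real time and the stream stalls.
    return all(t <= gen_times[0] + i * chunk_sec
               for i, t in enumerate(gen_times))
```

Under this model, a serving system can trade throughput against viability: batching more requests raises per-chunk generation time, and a streaming-aware scheduler must keep each request's chunks ahead of its playback deadlines.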