VoxServe: Streaming-Centric Serving System for Speech Language Models

Abstract

Deploying modern Speech Language Models (SpeechLMs) in streaming settings requires systems that provide low latency, high throughput, and strong streamability guarantees. Existing systems fall short of supporting diverse models flexibly and efficiently. We present VoxServe, a unified serving system for SpeechLMs that optimizes streaming performance. VoxServe introduces a model-execution abstraction that decouples model architecture from system-level optimizations, enabling support for diverse SpeechLM architectures within a single framework. Building on this abstraction, VoxServe implements streaming-aware scheduling and an asynchronous inference pipeline to improve end-to-end efficiency. Evaluations across multiple modern SpeechLMs show that VoxServe achieves 10-20x higher throughput than existing implementations at comparable latency while maintaining high streaming viability. The VoxServe code is available at https://github.com/vox-serve/vox-serve.
