Beyond Unified Models: A Service-Oriented Approach to Low Latency, Context Aware Phonemization for Real Time TTS
December 8, 2025
Authors: Mahta Fetrat, Donya Navabi, Zahra Dehghanian, Morteza Abolghasemi, Hamid R. Rabiee
cs.AI
Abstract
Lightweight, real-time text-to-speech systems are crucial for accessibility. However, the most efficient TTS models often rely on lightweight phonemizers that struggle with context-dependent challenges. In contrast, more advanced phonemizers with a deeper linguistic understanding typically incur high computational costs, which prevents real-time performance.
This paper examines the trade-off between phonemization quality and inference speed in G2P-aided TTS systems and introduces a practical framework to bridge this gap. We propose lightweight strategies for context-aware phonemization and a service-oriented TTS architecture that executes these modules as independent services. This design decouples heavy context-aware components from the core TTS engine, effectively breaking the latency barrier and enabling real-time use of high-quality phonemization models. Experimental results confirm that the proposed system improves pronunciation soundness and linguistic accuracy while maintaining real-time responsiveness, making it well suited for offline and end-device TTS applications.
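
The decoupling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class names `ContextAwarePhonemizer` and `PhonemizerService`, the `synthesize` function, and the tiny lexicon are all hypothetical, and a background thread with a request queue stands in for whatever service mechanism the paper actually uses. The only point shown is that the heavy, context-aware model is loaded once inside a long-lived service, so the TTS engine pays a request round trip per utterance instead of the model start-up cost.

```python
import queue
import threading
from concurrent.futures import Future


class ContextAwarePhonemizer:
    """Hypothetical stand-in for a heavy context-aware G2P model."""

    def __init__(self):
        # Expensive model loading would happen here, once per service.
        # Assumed toy data: (word, previous word) pairs override the default.
        self.context_lexicon = {("read", "have"): "R EH D"}
        self.default_lexicon = {"read": "R IY D", "i": "AY", "have": "HH AE V"}

    def phonemize(self, text):
        words = text.lower().split()
        out = []
        for i, word in enumerate(words):
            prev = words[i - 1] if i > 0 else ""
            phones = self.context_lexicon.get((word, prev))
            if phones is None:
                phones = self.default_lexicon.get(word, word.upper())
            out.append(phones)
        return " | ".join(out)


class PhonemizerService:
    """Runs the phonemizer in its own worker thread behind a request queue."""

    def __init__(self):
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._serve, daemon=True)
        self._worker.start()

    def _serve(self):
        model = ContextAwarePhonemizer()  # loaded once, reused for every request
        while True:
            item = self._requests.get()
            if item is None:  # shutdown signal
                break
            text, future = item
            try:
                future.set_result(model.phonemize(text))
            except Exception as exc:
                future.set_exception(exc)

    def submit(self, text):
        future = Future()
        self._requests.put((text, future))
        return future

    def close(self):
        self._requests.put(None)
        self._worker.join()


def synthesize(text, service, timeout=1.0):
    """Toy TTS front end: fetches phonemes from the service, returns a placeholder."""
    phonemes = service.submit(text).result(timeout=timeout)
    return f"<audio for: {phonemes}>"


service = PhonemizerService()
result_present = synthesize("I read", service)
result_past = synthesize("I have read", service)
service.close()
```

In this toy setup the word "read" receives a different phoneme string depending on the preceding word, which mimics the kind of context-dependent decision a lightweight phonemizer would miss. In a real deployment the queue would likely be replaced by an inter-process or network interface, which is an assumption beyond what the abstract states.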