Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap
September 30, 2025
Authors: Yueqian Lin, Zhengmian Hu, Qinsi Wang, Yudong Liu, Hengfan Zhang, Jayakumar Subramanian, Nikos Vlassis, Hai Helen Li, Yiran Chen
cs.AI
Abstract
We present Voice Evaluation of Reasoning Ability (VERA), a benchmark for
evaluating reasoning ability in voice-interactive systems under real-time
conversational constraints. VERA comprises 2,931 voice-native episodes derived
from established text benchmarks and organized into five tracks (Math, Web,
Science, Long-Context, Factual). Each item is adapted for speech interaction
while preserving reasoning difficulty. VERA enables direct text-voice
comparison within model families and supports analysis of how architectural
choices affect reliability. We assess 12 contemporary voice systems alongside
strong text baselines and observe large, consistent modality gaps: on
competition mathematics a leading text model attains 74.8% accuracy while its
voice counterpart reaches 6.1%; macro-averaged across tracks the best text
models achieve 54.0% versus 11.3% for voice. Latency-accuracy analyses reveal a
low-latency plateau, where fast voice systems cluster around ~10% accuracy,
while approaching text performance requires sacrificing real-time interaction.
Diagnostic experiments indicate that common mitigations are insufficient.
Increasing "thinking time" yields negligible gains; a decoupled cascade that
separates reasoning from narration improves accuracy but still falls well short
of text and introduces characteristic grounding/consistency errors. Failure
analyses further show distinct error signatures across native streaming,
end-to-end, and cascade designs. VERA provides a reproducible testbed and
targeted diagnostics for architectures that decouple thinking from speaking,
offering a principled way to measure progress toward real-time voice assistants
that are both fluent and reliably reasoned.
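Note on the macro-averaged figures: the abstract does not spell out the aggregation, but under the standard definition of macro-averaging they would be the unweighted mean of the per-track accuracies over the five tracks. A minimal statement of that assumed definition:

\[
\mathrm{Acc}_{\text{macro}} \;=\; \frac{1}{|T|}\sum_{t \in T} \mathrm{Acc}_t,
\qquad T = \{\text{Math},\ \text{Web},\ \text{Science},\ \text{Long-Context},\ \text{Factual}\},\ |T| = 5,
\]

so each track contributes equally regardless of how many of the 2,931 episodes it contains; the reported 54.0% (text) versus 11.3% (voice) would then be means of five per-track accuracies rather than episode-weighted averages.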