
Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap

September 30, 2025
Authors: Yueqian Lin, Zhengmian Hu, Qinsi Wang, Yudong Liu, Hengfan Zhang, Jayakumar Subramanian, Nikos Vlassis, Hai Helen Li, Yiran Chen
cs.AI

Abstract

We present Voice Evaluation of Reasoning Ability (VERA), a benchmark for evaluating reasoning ability in voice-interactive systems under real-time conversational constraints. VERA comprises 2,931 voice-native episodes derived from established text benchmarks and organized into five tracks (Math, Web, Science, Long-Context, Factual). Each item is adapted for speech interaction while preserving reasoning difficulty. VERA enables direct text-voice comparison within model families and supports analysis of how architectural choices affect reliability. We assess 12 contemporary voice systems alongside strong text baselines and observe large, consistent modality gaps: on competition mathematics a leading text model attains 74.8% accuracy while its voice counterpart reaches 6.1%; macro-averaged across tracks the best text models achieve 54.0% versus 11.3% for voice. Latency-accuracy analyses reveal a low-latency plateau, where fast voice systems cluster around ~10% accuracy, while approaching text performance requires sacrificing real-time interaction. Diagnostic experiments indicate that common mitigations are insufficient. Increasing "thinking time" yields negligible gains; a decoupled cascade that separates reasoning from narration improves accuracy but still falls well short of text and introduces characteristic grounding/consistency errors. Failure analyses further show distinct error signatures across native streaming, end-to-end, and cascade designs. VERA provides a reproducible testbed and targeted diagnostics for architectures that decouple thinking from speaking, offering a principled way to measure progress toward real-time voice assistants that are both fluent and reliably reasoned.
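To make two of the abstract's ideas concrete, here is a minimal sketch of a decoupled cascade (reasoning in text, narration as a separate step) and of the macro-averaged accuracy used to compare systems across tracks. This is illustrative only, not the authors' released harness: `reason_with_text_model` and `narrate_with_tts` are hypothetical stubs standing in for a real text LLM and TTS engine, and the exact-match scoring is a simplifying assumption.

```python
"""Sketch of a decoupled cascade and macro-averaged accuracy (illustrative only)."""
from dataclasses import dataclass
from statistics import mean


@dataclass
class Episode:
    track: str      # one of the five VERA tracks, e.g. "Math", "Web", "Science"
    question: str
    answer: str     # gold answer


def reason_with_text_model(question: str) -> str:
    """Hypothetical stub: a strong text model reasons silently and returns an answer."""
    return "42"  # placeholder


def narrate_with_tts(answer_text: str) -> bytes:
    """Hypothetical stub: render only the final answer as speech."""
    return answer_text.encode()  # placeholder for a synthesized waveform


def decoupled_cascade(question: str) -> tuple[str, bytes]:
    # Reasoning and narration are decoupled: the spoken output merely wraps the
    # text answer. The abstract's grounding/consistency errors arise when the
    # narration drifts from the answer it is supposed to voice.
    answer = reason_with_text_model(question)
    return answer, narrate_with_tts(answer)


def macro_average_accuracy(episodes: list[Episode], predictions: list[str]) -> float:
    """Mean of per-track accuracies, so each track counts equally regardless of size."""
    per_track: dict[str, list[bool]] = {}
    for ep, pred in zip(episodes, predictions):
        per_track.setdefault(ep.track, []).append(pred.strip() == ep.answer.strip())
    return mean(sum(hits) / len(hits) for hits in per_track.values())


if __name__ == "__main__":
    eps = [Episode("Math", "2+2?", "4"), Episode("Factual", "Capital of France?", "Paris")]
    preds = [decoupled_cascade(ep.question)[0] for ep in eps]
    print(f"macro-averaged accuracy: {macro_average_accuracy(eps, preds):.3f}")
```

Macro-averaging over tracks (rather than pooling all items) is what makes the headline 54.0% vs. 11.3% comparison insensitive to differing track sizes.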