Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap

Abstract

We present Voice Evaluation of Reasoning Ability (VERA), a benchmark for evaluating reasoning ability in voice-interactive systems under real-time conversational constraints. VERA comprises 2,931 voice-native episodes derived from established text benchmarks and organized into five tracks (Math, Web, Science, Long-Context, Factual). Each item is adapted for speech interaction while preserving reasoning difficulty. VERA enables direct text-voice comparison within model families and supports analysis of how architectural choices affect reliability. We assess 12 contemporary voice systems alongside strong text baselines and observe large, consistent modality gaps: on competition mathematics, a leading text model attains 74.8% accuracy while its voice counterpart reaches 6.1%; macro-averaged across tracks, the best text models achieve 54.0% versus 11.3% for voice. Latency-accuracy analyses reveal a low-latency plateau, where fast voice systems cluster around 10% accuracy, while approaching text performance requires sacrificing real-time interaction. Diagnostic experiments indicate that common mitigations are insufficient. Increasing "thinking time" yields negligible gains; a decoupled cascade that separates reasoning from narration improves accuracy but still falls well short of text and introduces characteristic grounding/consistency errors. Failure analyses further show distinct error signatures across native streaming, end-to-end, and cascade designs. VERA provides a reproducible testbed and targeted diagnostics for architectures that decouple thinking from speaking, offering a principled way to measure progress toward real-time voice assistants that are both fluent and reliably reasoned.

