Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap

Abstract

We present Voice Evaluation of Reasoning Ability (VERA), a benchmark for evaluating reasoning ability in voice-interactive systems under real-time conversational constraints. VERA comprises 2,931 voice-native episodes derived from established text benchmarks and organized into five tracks (Math, Web, Science, Long-Context, Factual). Each item is adapted for speech interaction while preserving its reasoning difficulty. VERA enables direct text-voice comparison within model families and supports analysis of how architectural choices affect reliability. We assess 12 contemporary voice systems alongside strong text baselines and observe large, consistent modality gaps: on competition mathematics, a leading text model attains 74.8% accuracy while its voice counterpart reaches 6.1%; macro-averaged across tracks, the best text models achieve 54.0% versus 11.3% for voice. Latency-accuracy analyses reveal a low-latency plateau: fast voice systems cluster around 10% accuracy, and approaching text performance requires sacrificing real-time interaction. Diagnostic experiments indicate that common mitigations are insufficient. Increasing "thinking time" yields negligible gains, and a decoupled cascade that separates reasoning from narration improves accuracy but still falls well short of text and introduces characteristic grounding and consistency errors. Failure analyses further show distinct error signatures across native streaming, end-to-end, and cascade designs. VERA provides a reproducible testbed and targeted diagnostics for architectures that decouple thinking from speaking, offering a principled way to measure progress toward real-time voice assistants that are both fluent and reliably reasoned.
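The two summary statistics quoted in the abstract, the macro-average across tracks and the text-voice gap, can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function names `macro_average` and `modality_gap` are hypothetical, the macro-average is assumed to be the unweighted mean of the five per-track accuracies, and the per-track numbers in `toy_text` and `toy_voice` are placeholders rather than results from the paper. Only the 74.8% and 6.1% figures come from the abstract.

```python
from statistics import mean

# The five VERA tracks named in the abstract.
TRACKS = ("Math", "Web", "Science", "Long-Context", "Factual")


def macro_average(per_track_accuracy):
    """Unweighted mean of per-track accuracies, so each track counts equally.

    Assumption: this is what "macro-averaged across tracks" means here.
    """
    missing = [t for t in TRACKS if t not in per_track_accuracy]
    if missing:
        raise ValueError(f"missing tracks: {missing}")
    return mean(per_track_accuracy[t] for t in TRACKS)


def modality_gap(text_acc, voice_acc):
    """Text accuracy minus voice accuracy, in percentage points."""
    return text_acc - voice_acc


# Competition-math figures quoted in the abstract (text 74.8%, voice 6.1%).
math_gap = modality_gap(74.8, 6.1)

# Placeholder per-track accuracies for illustration only; NOT paper results.
toy_text = {"Math": 50.0, "Web": 10.0, "Science": 40.0,
            "Long-Context": 60.0, "Factual": 40.0}
toy_voice = {"Math": 10.0, "Web": 5.0, "Science": 15.0,
             "Long-Context": 20.0, "Factual": 10.0}

toy_gap = modality_gap(macro_average(toy_text), macro_average(toy_voice))
print(f"math gap: {math_gap:.1f} points, toy macro gap: {toy_gap:.1f} points")
```

With the abstract's figures, the competition-math gap is 68.7 percentage points and the macro-averaged gap (54.0% versus 11.3%) is 42.7 points. Under a macro-average, a small track influences the headline number as much as a large one.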
