Toward Human-Like Interactive Speech Recognition: Combining Agentic Correction and Semantic Evaluation

Abstract

Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes meaning-critical errors difficult to correct once they occur. Token-level metrics such as word error rate (WER) and character error rate (CER) also fail to capture such errors adequately. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate (S²ER), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S²ER than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at https://interactiveasr.github.io/ and a live demo is available at https://i-asr.sjtuxlance.com/