Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation
May 28, 2026
Authors: Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen
cs.AI
Abstract
Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes meaning-critical errors difficult to correct once they occur, and token-level metrics such as word error rate (WER) or character error rate (CER) do not adequately reflect such errors. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate (S²ER), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S²ER than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at https://interactiveasr.github.io/ and a live demo is available at https://i-asr.sjtuxlance.com/
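To make the multi-turn refinement idea and a sentence-level error rate concrete, the following is a minimal sketch and not the authors' implementation. It assumes that S²ER can be read as the fraction of sentences a judge marks as semantically wrong. Every name in it (`sentence_semantic_error_rate`, `interactive_refine`, `toy_judge`, `toy_user`, `toy_editor`) is hypothetical. The toy judge is a case-insensitive word match standing in for the LLM-based semantic judge, and the toy user and editor stand in for the simulated user and the reasoning-based editing described in the abstract.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces, assumed for illustration only.
Judge = Callable[[str, str], bool]                         # (reference, hypothesis) -> meaning preserved?
User = Callable[[str, str], Optional[Tuple[str, str]]]     # -> (wrong_word, right_word) or None
Editor = Callable[[str, Tuple[str, str]], str]             # applies one piece of feedback


def sentence_semantic_error_rate(pairs: List[Tuple[str, str]], judge: Judge) -> float:
    """Fraction of (reference, hypothesis) pairs the judge marks as semantically wrong."""
    if not pairs:
        raise ValueError("need at least one sentence pair")
    errors = sum(1 for ref, hyp in pairs if not judge(ref, hyp))
    return errors / len(pairs)


def interactive_refine(ref: str, hyp: str, judge: Judge, user: User,
                       editor: Editor, max_turns: int = 3) -> Tuple[str, int]:
    """Closed loop: ask for feedback and edit until the judge accepts or turns run out."""
    turns = 0
    while turns < max_turns and not judge(ref, hyp):
        feedback = user(ref, hyp)
        if feedback is None:  # the simulated user has nothing more to say
            break
        hyp = editor(hyp, feedback)
        turns += 1
    return hyp, turns


# Toy stand-ins (assumptions): a real system would use an LLM judge,
# a simulated user, and a reasoning-based editor instead.
def toy_judge(ref: str, hyp: str) -> bool:
    return ref.lower().split() == hyp.lower().split()


def toy_user(ref: str, hyp: str) -> Optional[Tuple[str, str]]:
    # Report the first word that differs, as a (wrong, right) correction.
    for r, h in zip(ref.split(), hyp.split()):
        if r.lower() != h.lower():
            return (h, r)
    return None


def toy_editor(hyp: str, feedback: Tuple[str, str]) -> str:
    wrong, right = feedback
    return hyp.replace(wrong, right, 1)


pairs = [("call Xie Chen now", "call She Chan now"), ("play jazz", "play jazz")]
before = sentence_semantic_error_rate(pairs, toy_judge)
refined = [(ref, interactive_refine(ref, hyp, toy_judge, toy_user, toy_editor)[0])
           for ref, hyp in pairs]
after = sentence_semantic_error_rate(refined, toy_judge)
print(before, after)
```

In this toy run, the misrecognized name takes two correction turns to fix, which illustrates why a sentence-level semantic measure can change sharply across turns even when only a word or two differs.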