Human-Like Interactive Speech Recognition through Agentic Correction and Semantic Evaluation

Abstract

Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes meaning-critical errors difficult to correct once they occur, and token-level metrics such as WER and CER do not adequately capture them. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate (S²ER), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S²ER than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code (https://interactiveasr.github.io/) and a live demo (https://i-asr.sjtuxlance.com/) are available online.
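
To make the contrast between token-level and sentence-level semantic scoring concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not give the S²ER formula, so the sketch assumes it is the fraction of utterances that a judge marks as not preserving the reference meaning. The function names are hypothetical, and `toy_judge` is a rule-based stand-in for the LLM-based judge.

```python
from typing import Callable, List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference word
                curr[j - 1] + 1,         # insert a hypothesis word
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)


def sentence_semantic_error_rate(
    pairs: List[Tuple[str, str]],
    judge: Callable[[str, str], bool],
) -> float:
    """Assumed definition: share of pairs the judge rejects as meaning-changing."""
    if not pairs:
        return 0.0
    errors = sum(1 for ref, hyp in pairs if not judge(ref, hyp))
    return errors / len(pairs)


def toy_judge(reference: str, hypothesis: str) -> bool:
    """Hypothetical stand-in for an LLM judge: ignores case and filler words."""
    fillers = {"um", "uh", "please"}

    def normalize(text: str) -> List[str]:
        return [w for w in text.lower().split() if w not in fillers]

    return normalize(reference) == normalize(hypothesis)


pairs = [
    # One wrong word, but it is the name, so the meaning is lost.
    ("call doctor lee at nine", "call doctor li at nine"),
    # Two dropped filler words, but the request is intact.
    ("um please turn on the light", "turn on the light"),
]

for ref, hyp in pairs:
    print(f"WER={word_error_rate(ref, hyp):.2f}  meaning kept={toy_judge(ref, hyp)}")
print(f"S2ER (assumed definition)={sentence_semantic_error_rate(pairs, toy_judge):.2f}")
```

In this toy data the utterance with the lower WER is the one whose meaning is broken, which is the kind of mismatch a sentence-level semantic metric is meant to expose. A real judge would replace `toy_judge` with a call to an LLM.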