Toward Human-Like Interactive Speech Recognition with Agentic Correction and Semantic Evaluation

Abstract

Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Moreover, token-level metrics such as word error rate (WER) and character error rate (CER) do not adequately capture such errors. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate (S^2ER), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S^2ER than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code (https://interactiveasr.github.io/) and a live demo (https://i-asr.sjtuxlance.com/) are available online.
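
To make the closed-loop formulation concrete, the following is a minimal sketch of a multi-turn refinement loop and a sentence-level semantic error rate. It is not the authors' implementation. All names (`refine`, `toy_route_intent`, `toy_apply_edit`, `toy_judge`, `sentence_semantic_error_rate`) are hypothetical, the intent router and editor are string-matching stand-ins for LLM components, and the metric is assumed here to be the fraction of sentences that a judge marks as semantically wrong, which the abstract does not state explicitly.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Session:
    hypothesis: str  # current transcript shown to the user
    turns: int = 0   # number of correction turns applied so far


def refine(
    first_pass: str,
    get_feedback: Callable[[str], str],
    route_intent: Callable[[str], str],
    apply_edit: Callable[[str, str], str],
    max_turns: int = 3,
) -> Session:
    """Closed loop: show the hypothesis, read feedback, route it, edit, repeat."""
    session = Session(hypothesis=first_pass)
    while session.turns < max_turns:
        feedback = get_feedback(session.hypothesis)
        if route_intent(feedback) != "correct":
            break  # the user confirmed (or said something that is not a correction)
        session.hypothesis = apply_edit(session.hypothesis, feedback)
        session.turns += 1
    return session


def toy_route_intent(feedback: str) -> str:
    # Stand-in for intent routing: only "replace X with Y" counts as a correction.
    return "correct" if feedback.startswith("replace ") else "confirm"


def toy_apply_edit(hypothesis: str, feedback: str) -> str:
    # Stand-in for reasoning-based editing: parse "replace X with Y" literally.
    old, new = feedback[len("replace "):].split(" with ", 1)
    return hypothesis.replace(old, new)


def toy_judge(reference: str, hypothesis: str) -> bool:
    # Stand-in for an LLM semantic judge: case- and whitespace-insensitive match.
    return reference.casefold().split() == hypothesis.casefold().split()


def sentence_semantic_error_rate(
    references: Sequence[str],
    hypotheses: Sequence[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Assumed definition: share of sentences the judge marks as semantically wrong."""
    if len(references) != len(hypotheses) or not references:
        raise ValueError("need equally sized, non-empty reference and hypothesis lists")
    errors = sum(not judge(r, h) for r, h in zip(references, hypotheses))
    return errors / len(references)


# Simulated user who knows the reference and corrects one misrecognized name.
reference = "call Dr. Nguyen at nine"
first_pass = "call Dr. Win at nine"
session = refine(
    first_pass,
    get_feedback=lambda hyp: "looks good" if hyp == reference else "replace Win with Nguyen",
    route_intent=toy_route_intent,
    apply_edit=toy_apply_edit,
)
before = sentence_semantic_error_rate([reference], [first_pass], toy_judge)
after = sentence_semantic_error_rate([reference], [session.hypothesis], toy_judge)
```

In this toy run a single misrecognized name makes the whole sentence count as an error under the sentence-level metric, even though only one token differs, and one correction turn resolves it. This is the kind of gap between token-level and sentence-level semantic scoring that the abstract refers to.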