Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR

Abstract

Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while explicitly retaining the corresponding transcripts. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.
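The footprint argument can be illustrated with a small token-count sketch. This is not the authors' implementation; it only compares how many context tokens prior turns would occupy under raw-context conditioning versus a compressed representation. The audio token rate (25 tokens per second), the latent budget (16 tokens), the choice to apply that budget per prior turn, and the names `Turn`, `raw_context_tokens`, and `compressed_context_tokens` are all assumptions made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    audio_seconds: float    # duration of the turn's audio
    transcript_tokens: int  # number of text tokens in the turn's transcript


def raw_context_tokens(prior_turns, audio_tokens_per_second):
    """Raw context: each prior turn keeps its full audio token sequence
    plus its transcript, so cost grows with total audio duration."""
    return sum(
        round(t.audio_seconds * audio_tokens_per_second) + t.transcript_tokens
        for t in prior_turns
    )


def compressed_context_tokens(prior_turns, latents_per_turn):
    """Compressed context: each prior turn's audio is replaced by a fixed
    number of latent tokens (assumed per turn here); the transcript is kept."""
    return sum(latents_per_turn + t.transcript_tokens for t in prior_turns)


# Hypothetical three-turn history; all numbers are illustrative assumptions.
turns = [Turn(4.0, 12), Turn(6.0, 20), Turn(3.0, 9)]
print(raw_context_tokens(turns, audio_tokens_per_second=25))   # 366
print(compressed_context_tokens(turns, latents_per_turn=16))   # 89
```

Under these assumed numbers, the audio part of the history shrinks from 325 tokens to 48, while the 41 transcript tokens are unchanged. The compressed audio cost depends only on the number of prior turns, not on how long each turn's audio is.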
