SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

Abstract

Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user's turn and can lead to high response latency while the user waits for the model to think. Consequently, thinking only after receiving the full input is ill-suited to speech-to-speech interaction, where real-time, low-latency exchange is important. We address this by noting that humans naturally "think while listening." In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and to make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user-SLM interaction in two scenarios: (1) when the user is presenting a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in a tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends. Animated illustrations of SHANKS can be found at https://d223302.github.io/SHANKS/.
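
The chunked think-while-listening loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `fixed_chunks`, `think_while_listening`, and `toy_reasoner` are hypothetical, text tokens stand in for audio, and a rule-based checker stands in for the SLM. The sketch also runs sequentially, whereas the framework generates reasoning while the user keeps speaking.

```python
from typing import Callable, Iterator, List, Optional, Sequence, Tuple

# Hypothetical signature for the model call (an assumption, not the paper's API):
# (all chunks heard so far, all prior unspoken thoughts) -> (new thought, interrupt?)
Reasoner = Callable[[List[Sequence[str]], List[str]], Tuple[str, bool]]


def fixed_chunks(stream: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(stream), size):
        yield stream[start:start + size]


def think_while_listening(
    stream: Sequence[str], size: int, reason: Reasoner
) -> Tuple[List[str], Optional[int]]:
    """Reason after every chunk; stop early if the reasoner decides to interrupt.

    Returns the unspoken thoughts and the 1-based index of the chunk at which
    the interruption happened (None if the user was never interrupted).
    """
    heard: List[Sequence[str]] = []
    thoughts: List[str] = []
    for chunk in fixed_chunks(stream, size):
        heard.append(chunk)
        # The reasoner receives all previous speech and all previous reasoning.
        thought, interrupt = reason(heard, thoughts)
        thoughts.append(thought)
        if interrupt:
            return thoughts, len(heard)
    return thoughts, None


def toy_reasoner(heard: List[Sequence[str]], thoughts: List[str]) -> Tuple[str, bool]:
    """Stand-in for an SLM: flags a wrong 'a+b=c' claim in the newest chunk.

    A real model would condition on the full history; this toy only inspects
    the latest chunk to keep the example short.
    """
    for token in heard[-1]:
        left, _, right = token.partition("=")
        a, _, b = left.partition("+")
        if a.isdigit() and b.isdigit() and right.isdigit():
            if int(a) + int(b) != int(right):
                return f"'{token}' is wrong", True
    return "no error so far", False


if __name__ == "__main__":
    steps = ["1+1=2", "2+2=4", "3+3=7", "4+4=8"]
    print(think_while_listening(steps, size=2, reason=toy_reasoner))
```

In this toy run, the user's third step is incorrect, so the loop interrupts after the second chunk instead of waiting for the turn to end. Swapping `toy_reasoner` for a real model call, and returning tool calls alongside the interrupt flag, would be the natural extensions.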
