SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

Abstract

Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user's turn and can lead to high response latency while the user waits for the model to think. Consequently, thinking only after receiving the full input is not suitable for speech-to-speech interaction, where real-time, low-latency exchange is important. We address this by noting that humans naturally "think while listening." In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and to make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user-SLM interaction in two scenarios: (1) when the user is presenting a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in a tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends. Animated illustrations of SHANKS can be found at https://d223302.github.io/SHANKS/.
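The chunked listen-and-reason loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: text strings stand in for fixed-duration audio chunks, a plain Python callable stands in for the SLM, and the names `Step`, `think_while_listening`, and `toy_reason` are hypothetical, introduced here only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

# Interleaved record of what was heard and what was thought so far.
# Each entry is ("speech", chunk) or ("thought", reasoning).
History = List[Tuple[str, str]]


@dataclass
class Step:
    """Result of one thinking step (hypothetical structure, not from the paper)."""
    thought: str                     # unspoken reasoning, never played to the user
    interrupt: bool = False          # True if the model decides to cut in now
    tool_call: Optional[str] = None  # tool request to issue now, if any


def think_while_listening(
    chunks: Iterable[str],
    reason: Callable[[History], Step],
) -> Tuple[History, List[str], Optional[int]]:
    """Process chunks as they arrive, reasoning after each one.

    Returns the full history, the tool calls issued, and the index of the
    chunk at which the model interrupted (None if it never interrupted).
    """
    history: History = []
    tool_calls: List[str] = []
    for index, chunk in enumerate(chunks):
        history.append(("speech", chunk))
        # Reason over all previous speech and all previous reasoning.
        step = reason(history)
        history.append(("thought", step.thought))
        if step.tool_call is not None:
            tool_calls.append(step.tool_call)
        if step.interrupt:
            return history, tool_calls, index
    return history, tool_calls, None


def toy_reason(history: History) -> Step:
    """Stand-in for the SLM: flags one hard-coded arithmetic mistake."""
    latest_speech = history[-1][1]
    if "2 + 2 = 5" in latest_speech:
        return Step(thought="The user's arithmetic is wrong.", interrupt=True)
    return Step(thought="No problem so far.")


chunks = ["First, 3 times 4 is 12.", "Then 2 + 2 = 5.", "So the answer is 17."]
history, tool_calls, interrupted_at = think_while_listening(chunks, toy_reason)
print(interrupted_at)  # 1
```

In this toy run the loop stops at the second chunk, before the user's final sentence is ever processed. A real system would replace `toy_reason` with a model call and would run the reasoning concurrently with audio capture so that thinking about one chunk overlaps with the user speaking the next.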
