

SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

October 8, 2025
作者: Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang
cs.AI

Abstract

Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user's turn and can lead to high response latency while it waits to think. Consequently, thinking after receiving the full input is not suitable for speech-to-speech interaction, where real-time, low-latency exchange is important. We address this by noting that humans naturally "think while listening." In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and to make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user-SLM interaction in two scenarios: (1) when the user is presenting a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in a tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends. Animated illustrations of SHANKS can be found at https://d223302.github.io/SHANKS/.
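The abstract describes an inference loop: stream the input in fixed-duration chunks, generate unspoken reasoning after each chunk conditioned on all prior speech and reasoning, and use that reasoning to decide whether to interrupt or act. A minimal sketch of that loop is below; note that every name here (`DialogueState`, `fake_reasoner`, `should_interrupt`, the chunk duration) is an illustrative stand-in, not the paper's actual API or model.

```python
# Hedged sketch of the SHANKS-style "think while listening" loop from the
# abstract. The real system runs an SLM over speech; here text strings
# stand in for audio chunks and the reasoning step.

from dataclasses import dataclass, field

CHUNK_SECONDS = 2.0  # fixed chunk duration; value is illustrative only

@dataclass
class DialogueState:
    speech_chunks: list = field(default_factory=list)
    unspoken_thoughts: list = field(default_factory=list)

def fake_reasoner(state: DialogueState) -> str:
    """Stand-in for the SLM's unspoken chain-of-thought step."""
    return f"analysis of: {state.speech_chunks[-1]}"

def should_interrupt(thought: str) -> bool:
    """Stand-in for the interrupt decision made from unspoken reasoning."""
    return "mistake" in thought

def shanks_loop(speech_stream, reasoner, interrupt_policy):
    state = DialogueState()
    for chunk in speech_stream:              # user keeps speaking
        state.speech_chunks.append(chunk)
        # Think while listening: reason over all prior speech + reasoning.
        thought = reasoner(state)
        state.unspoken_thoughts.append(thought)
        if interrupt_policy(thought):        # e.g. the user made a mistake
            return state, f"interrupt after chunk {len(state.speech_chunks)}"
    return state, "respond after turn ends"  # fall back to turn-taking

state, action = shanks_loop(
    ["step 1", "step 2 (mistake)"], fake_reasoner, should_interrupt
)
print(action)  # → interrupt after chunk 2
```

The key design point the abstract emphasizes is that reasoning accumulates per chunk rather than starting after the turn ends, so the interrupt or tool-call decision can fire mid-utterance.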