

STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

July 21, 2025
Authors: Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang
cs.AI

Abstract

Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time to generate the tokens in a chunk of spoken response, we use the remaining free time to generate the unspoken reasoning tokens. When a chunk of audio is played to the user, the model continues to generate the next unspoken reasoning chunk, achieving simultaneous thinking and talking. Remarkably, Stitch matches the latency of baselines that cannot generate unspoken CoT by design while outperforming those baselines by 15% on math reasoning datasets; Stitch also performs equally well on non-reasoning datasets as those baseline models. Some animations and demonstrations are on the project page: https://d223302.github.io/STITCH.
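
The sketch below illustrates the alternating "think, then talk a chunk" loop the abstract describes: generate an unspoken reasoning chunk, generate a short spoken chunk, and start playing its audio while the next reasoning chunk is produced. It is a minimal, hypothetical sketch, not the authors' implementation; `slm` and its methods (`encode`, `generate_reasoning_chunk`, `generate_speech_chunk`, `synthesize_audio`, `is_finished`) are assumed interfaces for an SLM.

```python
# Hypothetical sketch of STITCH-style simultaneous thinking and talking.
# `slm` and all of its methods are placeholder interfaces, not real APIs.
import threading
import time


def play_audio(audio_chunk, seconds):
    """Placeholder playback: a real system would stream audio to the user."""
    time.sleep(seconds)


def stitch_generate(slm, user_speech, max_chunks=8):
    context = slm.encode(user_speech)
    playback = None  # thread playing the current audio chunk, if any

    for _ in range(max_chunks):
        # 1. Unspoken reasoning chunk: appended to the context, never vocalized.
        reasoning = slm.generate_reasoning_chunk(context)
        context = context + reasoning

        # 2. Spoken response chunk: kept short so its audio duration exceeds
        #    the time needed to generate the next reasoning chunk.
        speech_tokens = slm.generate_speech_chunk(context)
        context = context + speech_tokens
        audio, duration = slm.synthesize_audio(speech_tokens)

        # 3. Queue playback and immediately continue the loop, so the next
        #    reasoning chunk is generated while this audio is still playing.
        if playback is not None:
            playback.join()  # keep spoken chunks in order
        playback = threading.Thread(target=play_audio, args=(audio, duration))
        playback.start()

        if slm.is_finished(context):
            break

    if playback is not None:
        playback.join()
```

Because only the first spoken chunk must wait for a single reasoning chunk, the user-perceived latency stays close to that of a model with no unspoken reasoning at all.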