

STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

July 21, 2025
作者: Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang
cs.AI

Abstract

Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time to generate the tokens in a chunk of spoken response, we use the remaining free time to generate the unspoken reasoning tokens. When a chunk of audio is played to the user, the model continues to generate the next unspoken reasoning chunk, achieving simultaneous thinking and talking. Remarkably, Stitch matches the latency of baselines that cannot generate unspoken CoT by design while outperforming those baselines by 15% on math reasoning datasets; Stitch also performs equally well on non-reasoning datasets as those baseline models. Some animations and demonstrations are on the project page: https://d223302.github.io/STITCH.
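The latency argument in the abstract can be made concrete with a small back-of-the-envelope model. The sketch below is an illustration only, not the authors' implementation: the function names, chunk counts, and timings are hypothetical. It compares a baseline that generates the entire unspoken CoT before the first spoken chunk with a STITCH-style schedule that produces the first spoken chunk immediately and defers each reasoning chunk into the idle time while earlier audio plays back.

```python
# Hypothetical latency model for chunked thinking-while-talking.
# All names and numbers here are illustrative assumptions, not from the paper.

def cot_first_latency(n_reason_chunks: int, gen_time: float) -> float:
    """Time until the first spoken chunk is ready when the whole
    unspoken CoT must be generated before any speech starts."""
    return (n_reason_chunks + 1) * gen_time


def stitch_latency(gen_time: float) -> float:
    """With interleaved generation, the first spoken chunk is produced
    right away; reasoning chunks are pushed into later playback windows,
    so they no longer contribute to first-audio latency."""
    return gen_time


def reasoning_hides(audio_time: float, gen_time: float) -> bool:
    """Interleaving fully hides reasoning only if playing one audio chunk
    takes at least as long as generating the tokens of the next speech
    chunk plus one reasoning chunk -- the slack the design exploits."""
    return audio_time >= 2 * gen_time
```

Under these assumptions, a baseline with four reasoning chunks at 0.5 s of generation each waits 2.5 s before speaking, while the interleaved schedule starts after 0.5 s, matching the abstract's claim that STITCH has the same first-chunk latency as baselines that generate no unspoken CoT at all.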