Group Think: Multiple Concurrent Reasoning Agents Collaborating at Token Level Granularity
May 16, 2025
Authors: Chan-Jan Hsu, Davide Buffelli, Jamie McGowan, Feng-Ting Liao, Yi-Chang Chen, Sattar Vakili, Da-shan Shiu
cs.AI
Abstract
Recent advances in large language models (LLMs) have demonstrated the power
of reasoning through self-generated chains of thought. Multiple reasoning
agents can collaborate to raise joint reasoning quality above individual
outcomes. However, such agents typically interact in a turn-based manner,
trading increased latency for improved quality. In this paper, we propose Group
Think--a single LLM that acts as multiple concurrent reasoning agents, or
thinkers. With shared visibility into each other's partial generation progress,
Group Think introduces a new concurrent-reasoning paradigm in which multiple
reasoning trajectories adapt dynamically to one another at the token level. For
example, a reasoning thread may shift its generation mid-sentence upon
detecting that another thread is better positioned to continue. This
fine-grained, token-level collaboration enables Group Think to reduce redundant
reasoning and improve quality while achieving significantly lower latency.
Moreover, its concurrent nature allows for efficient utilization of idle
computational resources, making it especially suitable for edge inference,
where very small batch size often underutilizes local GPUs. We give a simple
and generalizable modification that enables any existing LLM to perform Group
Think on a local GPU. We also present an evaluation strategy to benchmark
reasoning latency and empirically demonstrate latency improvements using
open-source LLMs that were not explicitly trained for Group Think. We hope this
work paves the way for future LLMs to exhibit more sophisticated and more
efficient collaborative behavior for higher quality generation.Summary
AI-Generated Summary
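The core mechanism described above can be sketched as a decoding loop in which several "thinker" trajectories advance one token per step, each conditioning on a shared context that includes every thread's partial output. This is a minimal illustrative sketch, not the paper's implementation: `next_token` is a hypothetical stub standing in for an LLM forward pass, and the prompt layout with `<thinker i>` markers is an assumption.

```python
# Sketch of Group Think-style token-level concurrent decoding.
# The interleaving loop is the point: at every step, each thinker
# conditions on ALL threads' partial trajectories (shared visibility),
# not only its own, so a thread can adapt mid-generation.

def next_token(context: str, thread_id: int, step: int) -> str:
    # Hypothetical stub: a real system would run an LLM forward pass on
    # `context` and sample this thread's next token. Here we return a
    # placeholder token so the loop is runnable.
    return f"t{thread_id}s{step}"

def group_think_decode(prompt: str, num_threads: int, steps: int):
    """Advance `num_threads` reasoning trajectories one token at a time."""
    threads = [[] for _ in range(num_threads)]
    for step in range(steps):
        # Shared context: the prompt plus every thread's partial output,
        # serialized with assumed <thinker i> markers.
        shared = prompt + " " + " || ".join(
            f"<thinker {i}> " + " ".join(toks)
            for i, toks in enumerate(threads)
        )
        # In practice all threads' next tokens come from one batched
        # forward pass (which is what keeps latency low); we loop here
        # only for clarity.
        new_tokens = [next_token(shared, i, step) for i in range(num_threads)]
        for i, tok in enumerate(new_tokens):
            threads[i].append(tok)
    return threads
```

Because every thread's token at step `t` depends on all threads' tokens up to step `t-1`, redundant work can be detected and abandoned mid-trajectory, which is the behavior the abstract attributes to token-level collaboration.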