Group Think: Multiple Concurrent Reasoning Agents Collaborating at Token Level Granularity
May 16, 2025
Authors: Chan-Jan Hsu, Davide Buffelli, Jamie McGowan, Feng-Ting Liao, Yi-Chang Chen, Sattar Vakili, Da-shan Shiu
cs.AI
Abstract
Recent advances in large language models (LLMs) have demonstrated the power
of reasoning through self-generated chains of thought. Multiple reasoning
agents can collaborate to raise joint reasoning quality above individual
outcomes. However, such agents typically interact in a turn-based manner,
trading increased latency for improved quality. In this paper, we propose Group
Think--a single LLM that acts as multiple concurrent reasoning agents, or
thinkers. With shared visibility into each other's partial generation progress,
Group Think introduces a new concurrent-reasoning paradigm in which multiple
reasoning trajectories adapt dynamically to one another at the token level. For
example, a reasoning thread may shift its generation mid-sentence upon
detecting that another thread is better positioned to continue. This
fine-grained, token-level collaboration enables Group Think to reduce redundant
reasoning and improve quality while achieving significantly lower latency.
Moreover, its concurrent nature allows for efficient utilization of idle
computational resources, making it especially suitable for edge inference,
where very small batch sizes often underutilize local GPUs. We give a simple
and generalizable modification that enables any existing LLM to perform Group
Think on a local GPU. We also present an evaluation strategy to benchmark
reasoning latency and empirically demonstrate latency improvements using
open-source LLMs that were not explicitly trained for Group Think. We hope this
work paves the way for future LLMs to exhibit more sophisticated and more
efficient collaborative behavior for higher quality generation.
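The decoding scheme the abstract describes — multiple reasoning threads generated by one model, each conditioned at every step on the partial output of all the others — can be illustrated with a short sketch. The code below is a minimal simulation, not the paper's implementation: `next_token` is a hypothetical stand-in for a single LLM decode step, and the round-robin scheduling, `<done>` stop token, and function names are all illustrative assumptions.

```python
from typing import Callable, List

def group_think_decode(
    next_token: Callable[[int, List[List[str]]], str],
    n_thinkers: int,
    max_tokens: int,
    stop: str = "<done>",
) -> List[List[str]]:
    """Round-robin, token-level concurrent decoding.

    At each step, thinker i produces its next token conditioned on the
    partial sequences of *all* thinkers (shared visibility), so a thread
    can adapt mid-sentence to what the others have already generated.
    """
    threads: List[List[str]] = [[] for _ in range(n_thinkers)]
    done = [False] * n_thinkers
    for _ in range(max_tokens):
        for i in range(n_thinkers):
            if done[i]:
                continue
            tok = next_token(i, threads)  # sees every thread's progress
            if tok == stop:
                done[i] = True
            else:
                threads[i].append(tok)
        if all(done):
            break
    return threads
```

In a real system, `next_token(i, threads)` would call the LLM with a prompt that interleaves all partial trajectories; here any callable with that signature works, e.g. a toy policy where each thinker emits three tokens and then stops:

```python
def toy_policy(i, threads):
    return "<done>" if len(threads[i]) >= 3 else f"thinker{i}-tok{len(threads[i])}"

out = group_think_decode(toy_policy, n_thinkers=2, max_tokens=10)
# each of the two threads ends up with exactly 3 tokens
```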