Group Think: Multiple Concurrent Reasoning Agents Collaborating at Token Level Granularity

Abstract

Recent advances in large language models (LLMs) have demonstrated the power of reasoning through self-generated chains of thought. Multiple reasoning agents can collaborate to raise joint reasoning quality above individual outcomes. However, such agents typically interact in a turn-based manner, trading increased latency for improved quality. In this paper, we propose Group Think: a single LLM that acts as multiple concurrent reasoning agents, or thinkers. With shared visibility into each other's partial generation progress, Group Think introduces a new concurrent-reasoning paradigm in which multiple reasoning trajectories adapt dynamically to one another at the token level. For example, a reasoning thread may shift its generation mid-sentence upon detecting that another thread is better positioned to continue. This fine-grained, token-level collaboration enables Group Think to reduce redundant reasoning and improve quality while achieving significantly lower latency. Moreover, its concurrent nature allows for efficient utilization of idle computational resources, making it especially suitable for edge inference, where very small batch sizes often underutilize local GPUs. We give a simple and generalizable modification that enables any existing LLM to perform Group Think on a local GPU. We also present an evaluation strategy to benchmark reasoning latency and empirically demonstrate latency improvements using open-source LLMs that were not explicitly trained for Group Think. We hope this work paves the way for future LLMs to exhibit more sophisticated and more efficient collaborative behavior for higher-quality generation.
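The scheduling idea described above, several thinkers advancing one token at a time while seeing each other's partial output, can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify the model modification, and no LLM is involved here. The names `group_think`, `NextToken`, and `toy_next_token` are hypothetical, and the "model" is a stand-in policy that simply claims prompt words no other thinker has emitted yet.

```python
from typing import Callable, List

# Hypothetical stand-in for one decoding step of an LLM: given the prompt,
# the index of the thinker being advanced, and every thinker's partial
# output so far, return that thinker's next token.
NextToken = Callable[[str, int, List[List[str]]], str]


def group_think(prompt: str, n_thinkers: int, max_steps: int,
                next_token: NextToken) -> List[List[str]]:
    """Advance all thinkers one token per step with shared visibility."""
    threads: List[List[str]] = [[] for _ in range(n_thinkers)]
    for _ in range(max_steps):
        # Snapshot so every thinker in this step sees the same shared state,
        # mimicking one batched step rather than turn-taking.
        snapshot = [list(t) for t in threads]
        for i in range(n_thinkers):
            if threads[i] and threads[i][-1] == "<eos>":
                continue  # this thinker has already finished
            threads[i].append(next_token(prompt, i, snapshot))
        if all(t and t[-1] == "<eos>" for t in threads):
            break
    return threads


def toy_next_token(prompt: str, i: int, snapshot: List[List[str]]) -> str:
    """Toy policy (assumption): claim a prompt word no thinker has emitted."""
    seen = {tok for thread in snapshot for tok in thread}
    remaining = [w for w in prompt.split() if w not in seen]
    # Thinker i takes the i-th unclaimed word, so thinkers within the same
    # step do not duplicate work; with nothing left to claim, it stops.
    return remaining[i] if i < len(remaining) else "<eos>"


threads = group_think("a b c d e", n_thinkers=2, max_steps=10,
                      next_token=toy_next_token)
single = group_think("a b c d e", n_thinkers=1, max_steps=10,
                     next_token=toy_next_token)
```

In this toy setting, two thinkers cover the five work items in four steps, whereas a single thinker needs six. The saving comes only from the thinkers avoiding each other's work at every token step, which is the kind of redundancy reduction the abstract attributes to token-level collaboration.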
