Group Think: Multiple Concurrent Reasoning Agents Collaborating at Token Level Granularity

Abstract

Recent advances in large language models (LLMs) have demonstrated the power of reasoning through self-generated chains of thought. Multiple reasoning agents can collaborate to raise joint reasoning quality above what any individual agent achieves. However, such agents typically interact in a turn-based manner, trading increased latency for improved quality. In this paper, we propose Group Think, a single LLM that acts as multiple concurrent reasoning agents, or thinkers. With shared visibility into each other's partial generation progress, Group Think introduces a new concurrent-reasoning paradigm in which multiple reasoning trajectories adapt dynamically to one another at the token level. For example, a reasoning thread may shift its generation mid-sentence upon detecting that another thread is better positioned to continue. This fine-grained, token-level collaboration enables Group Think to reduce redundant reasoning and improve quality while achieving significantly lower latency. Moreover, its concurrent nature allows for efficient utilization of idle computational resources, making it especially suitable for edge inference, where very small batch sizes often underutilize local GPUs. We give a simple and generalizable modification that enables any existing LLM to perform Group Think on a local GPU. We also present an evaluation strategy to benchmark reasoning latency and empirically demonstrate latency improvements using open-source LLMs that were not explicitly trained for Group Think. We hope this work paves the way for future LLMs to exhibit more sophisticated and more efficient collaborative behavior for higher-quality generation.
