SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

Abstract

Frontier scientific reasoning remains a major challenge for large language models (LLMs): even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial complementarity that single-model evaluation hides. Different frontier models excel on different question types, and no single model captures the full picture. We present SciOrch, a framework that trains a lightweight 8B model to orchestrate frontier LLMs for scientific reasoning. The orchestrator decomposes each question, delegates sub-problems to selected commercial models through API calls, and synthesizes a final answer. Training such an orchestrator is fundamentally harder than conventional agentic RL: each action triggers an API call that is expensive in both dollar cost and latency, which makes standard online rollouts infeasible. We address this with an MCTS-based approach that produces diverse orchestration trajectories, extracts per-node single-turn samples from them, and optimizes the orchestrator with GRPO-style training. On a 240-question test set spanning SGI-Reasoning and Scientists' First Exam, SciOrch reaches 56.66% average accuracy, outperforming the strongest single commercial model by 3.74% and the strongest multi-agent baseline by 3.33%. It also attains the best accuracy on both SGI and SFE while using less than half the API cost of typical multi-agent methods.
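To make the training recipe more concrete, the following is a minimal sketch of how per-node single-turn samples with group-relative advantages could be extracted from a search tree. It is an illustration under stated assumptions, not the SciOrch implementation: the `Node` structure, the helper names, the `min_children` threshold, the use of mean backed-up reward as the per-action score, and the placeholder model names `model_a` and `model_b` are all hypothetical. The only standard ingredient is the GRPO-style normalization, which centers and scales rewards within a group.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Node:
    """One orchestration state in a search tree (hypothetical structure)."""
    context: str                  # question plus the transcript so far
    action: str = ""              # action that led to this node
    visits: int = 0               # number of search simulations through this node
    total_reward: float = 0.0     # sum of terminal rewards backed up through it
    children: list["Node"] = field(default_factory=list)

    @property
    def value(self) -> float:
        """Mean backed-up reward; 0.0 for an unvisited node."""
        return self.total_reward / self.visits if self.visits else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def extract_single_turn_samples(root, min_children=2):
    """Turn every node with enough explored children into one training group.

    Assumption: sibling actions under the same node play the role of the
    sampled completions in a GRPO group, scored by their mean backed-up reward.
    """
    samples = []
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        explored = [c for c in node.children if c.visits > 0]
        if len(explored) < min_children:
            continue  # a group needs at least two alternatives to compare
        samples.append({
            "prompt": node.context,
            "completions": [c.action for c in explored],
            "advantages": group_relative_advantages([c.value for c in explored]),
        })
    return samples


# Toy tree: two alternative delegations explored under the root state.
root = Node(
    context="toy question",
    visits=4,
    total_reward=2.0,
    children=[
        Node(context="toy question + answer from model_a",
             action="delegate(model_a, sub-question)", visits=2, total_reward=2.0),
        Node(context="toy question + answer from model_b",
             action="delegate(model_b, sub-question)", visits=2, total_reward=0.0),
    ],
)
samples = extract_single_turn_samples(root)
```

Because the samples are built from an already explored tree, a policy update on them needs no further API calls, which matches the stated motivation for avoiding standard online rollouts. How the tree itself is searched and how rewards are defined are left open here.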