SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

June 14, 2026
Authors: Jingru Guo, Xiangyuan Xue, Lian Zhang, Wanghan Xu, Siki Chen, Philip Torr, Wanli Ouyang, Lei Bai, Zhenfei Yin
cs.AI

Abstract

Frontier scientific reasoning remains a major challenge for large language models (LLMs): even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial complementarity that single-model evaluation hides: different frontier models excel on different question types, and no single model captures the full picture. We present SciOrch, a framework that trains a lightweight 8B model to orchestrate frontier LLMs for scientific reasoning. The orchestrator decomposes each question, delegates sub-problems to selected commercial models through API calls, and synthesizes a final answer. Training such an orchestrator is fundamentally harder than conventional agentic RL: each action triggers an API call that is expensive in both dollar cost and latency, making standard online rollouts infeasible. We address this with an approach based on Monte Carlo tree search (MCTS) that produces diverse orchestration trajectories, extracts per-node single-turn samples, and optimizes the orchestrator with GRPO-style training. On a 240-question test set spanning SGI-Reasoning and Scientists' First Exam, SciOrch reaches 56.66% average accuracy, outperforming the strongest single commercial model by 3.74% and the strongest multi-agent baseline by 3.33%. It also attains the best accuracy on both SGI and SFE with less than half the API cost of typical multi-agent methods.
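The abstract does not spell out the training objective, so the following is only a minimal sketch of the group-relative advantage step that GRPO-style methods are generally built on: rewards for several candidates sampled from the same context are normalized against their group mean and standard deviation. Treating the candidates at one MCTS node as the group is an assumption made here for illustration, and the function name, the `eps` constant, and the reward values are all hypothetical rather than taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - group mean) / (group std + eps).

    Generic GRPO-style normalization; not claimed to be SciOrch's exact objective.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate orchestration actions
# sampled at the same tree node (1.0 = final answer judged correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Candidates that beat their siblings receive a positive advantage and the rest a negative one. When every candidate in a group earns the same reward, all advantages are zero, so that group contributes no learning signal.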