SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

June 14, 2026
Authors: Jingru Guo, Xiangyuan Xue, Lian Zhang, Wanghan Xu, Siki Chen, Philip Torr, Wanli Ouyang, Lei Bai, Zhenfei Yin
cs.AI

Abstract

Frontier scientific reasoning remains a major challenge for large language models (LLMs): even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial complementarity that single-model evaluation hides: different frontier models excel on different question types, and no single model captures the full picture. We present SciOrch, a framework that trains a lightweight 8B model to orchestrate frontier LLMs for scientific reasoning. The orchestrator decomposes each question, delegates sub-problems to selected commercial models through API calls, and synthesizes a final answer. Training such an orchestrator is fundamentally harder than conventional agentic RL: each action triggers an API call that is expensive in both dollar cost and latency, making standard online rollouts infeasible. We address this with an MCTS-based approach that produces diverse orchestration trajectories, extracts per-node single-turn samples, and optimizes the orchestrator with GRPO-style training. On a 240-question test set spanning SGI-Reasoning and Scientists' First Exam, SciOrch reaches 56.66% average accuracy, outperforming the strongest single commercial model by 3.74% and the strongest multi-agent baseline by 3.33%. It also attains the best accuracy on both SGI and SFE, at less than half the API cost of typical multi-agent methods.
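
To make the "GRPO-style training" on per-node samples more concrete, here is a minimal sketch of the group-relative advantage that GRPO-style methods generally use: each reward is normalized against the mean and standard deviation of its own group. The abstract does not say how SciOrch forms groups or assigns rewards, so everything specific here is an assumption for illustration: the function name `group_relative_advantages`, the choice of one search-tree node as the group, the 0/1 correctness reward, and the use of the population standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    Hypothetical helper, not taken from the paper. `eps` avoids division
    by zero when every sample in the group has the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed setup: four candidate orchestrator actions sampled at the same
# search-tree node, each scored 1.0 if its trajectory reached a correct
# final answer and 0.0 otherwise.
node_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(node_rewards)

# A group whose samples all share one reward carries no learning signal.
flat_advantages = group_relative_advantages([1.0, 1.0, 1.0])
```

Under these assumptions, actions that led to correct answers get positive advantages and the others get negative ones, while a group with identical rewards yields all-zero advantages and contributes no gradient.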