SciOrch: Learning to Orchestrate Expert Large Language Models for Frontier Multimodal Scientific Reasoning Tasks

Abstract

Frontier scientific reasoning remains a major challenge for large language models (LLMs): even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial complementarity that single-model evaluation hides. Different frontier models excel on different question types, and no single model captures the full picture. We present SciOrch, a framework that trains a lightweight 8B model to orchestrate frontier LLMs for scientific reasoning. The orchestrator decomposes each question, delegates sub-problems to selected commercial models through API calls, and synthesizes a final answer. Training such an orchestrator is fundamentally harder than conventional agentic reinforcement learning (RL): each action triggers an API call that is expensive in both dollar cost and latency, making standard online rollouts infeasible. We address this with an approach based on Monte Carlo tree search (MCTS) that produces diverse orchestration trajectories, extracts per-node single-turn samples, and optimizes the orchestrator with GRPO-style (Group Relative Policy Optimization) training. On a 240-question test set spanning SGI-Reasoning (SGI) and Scientists' First Exam (SFE), SciOrch reaches 56.66% average accuracy, outperforming the strongest single commercial model by 3.74% and the strongest multi-agent baseline by 3.33%. It also attains the best accuracy on both SGI and SFE, at less than half the API cost of typical multi-agent methods.
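
To make the "GRPO-style" step concrete, the sketch below shows the group-relative advantage that GRPO is generally built on: each sampled action's reward is normalized against the mean and standard deviation of its own group. The abstract does not specify the reward or how groups are formed, so the grouping (candidate actions sampled at one MCTS node), the binary reward, and the names `group_relative_advantages` and `node_rewards` are assumptions made purely for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std.

    This is the standard group-relative baseline used by GRPO-style
    methods; eps guards against division by zero when all rewards tie.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four candidate orchestrator actions sampled at one
# tree node, each scored 1.0 if its trajectory ended in a correct final
# answer and 0.0 otherwise (an assumed reward, not taken from the paper).
node_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(node_rewards)
```

Under these assumptions, actions that did better than their siblings receive positive advantages and the rest receive negative ones, so no separate learned value model is needed. When every action in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.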