SciOrch: Learning to Orchestrate Expert LLMs to Solve Frontier Multimodal Scientific Reasoning Tasks

Abstract

Frontier scientific reasoning remains a major challenge for large language models (LLMs): even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial complementarity that single-model evaluation hides: different frontier models excel on different question types, and no single model captures the full picture. We present SciOrch, a framework that trains a lightweight 8B model to orchestrate frontier LLMs for scientific reasoning. The orchestrator decomposes each question, delegates sub-problems to selected commercial models through API calls, and synthesizes a final answer. Training such an orchestrator is fundamentally harder than conventional agentic RL: each action triggers an API call that is expensive in both dollar cost and latency, making standard online rollouts infeasible. We address this with an MCTS-based approach that produces diverse orchestration trajectories, extracts per-node single-turn samples, and optimizes the orchestrator with GRPO-style training. On a 240-question test set spanning SGI-Reasoning and Scientists' First Exam, SciOrch reaches 56.66% average accuracy, outperforming the strongest single commercial model by 3.74% and the strongest multi-agent baseline by 3.33%. It also attains the best accuracy on both SGI and SFE at less than half the API cost of typical multi-agent methods.
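
To make the GRPO-style step concrete, here is a minimal sketch of the group-relative advantage that this family of methods typically computes: rewards for several candidate actions sampled from the same state are normalized by the group mean and standard deviation. The abstract does not specify the reward definition, the group size, or how samples are grouped per tree node, so the function name `group_relative_advantages`, the binary correctness reward, and the four-sample group below are illustrative assumptions rather than the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    `rewards` holds one scalar per candidate action sampled from the same
    state. `eps` avoids division by zero when every reward is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four orchestration actions sampled at one tree node,
# scored 1.0 if the trajectory through that action ended in a correct answer
# (an assumed reward; the abstract does not define one).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, actions that led to correct answers receive positive advantages and the others negative ones, while a group in which every action scores the same yields all-zero advantages and therefore no learning signal.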