ACC: Long-Context Learning by Compiling Agent Trajectories

Abstract

Recent progress in agents has renewed demand for long-context reasoning in LLMs. However, training LLMs for this capability requires costly long-document curation or heuristic context synthesis. We observe that agents produce massive trajectories when solving problems, invoking tools and receiving environment observations across many turns. The evidence needed to answer the original question is thus scattered across these turns, and using it requires integrating distant context segments. Nevertheless, standard agent supervised fine-tuning (SFT) masks tool responses and trains only turn-level tool selection, creating a supervision blind spot in which these scattered signals go unused. We propose Agent Context Compilation (ACC), which converts trajectories from search, software engineering, and database querying agents into long-context QA pairs. Each pair combines the original question with the tool responses and environment observations gathered across multiple turns, and the model is trained to answer directly without tool use. This makes the dependencies between the question and the evidence explicit, enabling direct supervision of long-context reasoning over distant segments without additional annotation. ACC is a simple but effective approach that can be combined with any existing long-context extension or training method, and it provides scalable supervised fine-tuning data. We validate ACC on long-range dependency modeling using MRCR and GraphWalks, challenging benchmarks that require cross-turn coreference resolution and graph traversal over extended contexts. Training Qwen3-30B-A3B with ACC yields 68.3 on MRCR (+18.1) and 77.5 on GraphWalks (+7.6), results comparable to Qwen3-235B-A22B, while preserving general capabilities on GPQA, MMLU-Pro, AIME, and IFEval. Further mechanism analysis reveals that the ACC-trained model exhibits task-adaptive attention restructuring and expert specialization.
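
As a rough illustration of the compilation step described above, the following sketch turns one multi-turn trajectory into a single tool-free QA pair. It is only a minimal sketch under stated assumptions, not the authors' implementation: the `Turn` and `QAPair` types, the role names, the prompt template, the placement of the question after the evidence, and the toy trajectory are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str      # "user", "assistant", or "tool" (hypothetical role names)
    content: str


@dataclass
class QAPair:
    prompt: str    # compiled evidence plus the original question, no tool calls
    answer: str    # supervised target


def compile_trajectory(turns: List[Turn]) -> QAPair:
    """Compile one agent trajectory into a long-context QA pair (illustrative)."""
    # The first user turn is assumed to carry the original question.
    question = next(t.content for t in turns if t.role == "user")
    # Keep every tool response / environment observation, in original order.
    observations = [t.content for t in turns if t.role == "tool"]
    # Assumption: the last assistant turn holds the final answer.
    answer = [t.content for t in turns if t.role == "assistant"][-1]
    evidence = "\n\n".join(
        f"[Observation {i}]\n{obs}" for i, obs in enumerate(observations, start=1)
    )
    # Intermediate tool calls are dropped; only evidence and question remain.
    prompt = f"{evidence}\n\nQuestion: {question}\nAnswer directly without using tools."
    return QAPair(prompt=prompt, answer=answer)


# Toy trajectory, invented purely for illustration.
trajectory = [
    Turn("user", "Which file defines parse_config?"),
    Turn("assistant", "search('parse_config')"),
    Turn("tool", "Match found in settings.py"),
    Turn("assistant", "open('settings.py')"),
    Turn("tool", "settings.py: def parse_config(path): ..."),
    Turn("assistant", "parse_config is defined in settings.py."),
]
pair = compile_trajectory(trajectory)
```

In this sketch the answer depends on observations from separate turns that now sit in one context, so the loss on `pair.answer` supervises reading across them rather than choosing the next tool call.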