ACC: Compiling Agent Trajectories for Long-Context Training

Abstract

Recent progress in agents has renewed demand for long-context reasoning in LLMs. However, training LLMs for this capability requires costly long-document curation or heuristic context synthesis. We observe that agents produce massive trajectories when solving problems, invoking tools and receiving environment observations across many turns. The evidence needed to answer the original question is therefore scattered across these turns, and answering requires integrating distant context segments. Standard agent SFT, however, masks tool responses and trains only turn-level tool selection, creating a supervision blind spot in which these scattered signals go unused. We propose Agent Context Compilation (ACC), which converts trajectories from search, software engineering, and database querying agents into long-context QA pairs. Each pair combines the original question with the tool responses and environment observations gathered across multiple turns, and the model is trained to answer directly without tool use. This makes the dependencies between the question and the evidence explicit, enabling direct supervision of long-context reasoning over distant segments without additional annotation. ACC is a simple but effective approach that can be combined with any existing long-context extension or training method, and it provides scalable supervised fine-tuning data. We validate ACC on long-range dependency modeling with MRCR and GraphWalks, challenging benchmarks that require cross-turn coreference resolution and graph traversal over extended contexts. Qwen3-30B-A3B trained with ACC achieves 68.3 on MRCR (+18.1) and 77.5 on GraphWalks (+7.6), results comparable to Qwen3-235B-A22B, while preserving general capabilities on GPQA, MMLU-Pro, AIME, and IFEval. Further mechanism analysis reveals that the ACC-trained model exhibits task-adaptive attention restructuring and expert specialization.
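
The compilation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Turn` schema, the `compile_trajectory` function, the observation labels, and the prompt template are all hypothetical, and the sketch assumes the trajectory's final answer is available to serve as the training target. It only shows the general idea of keeping the tool responses that standard agent SFT masks, dropping the tool calls, and pairing the collected observations with the original question.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    """One turn of an agent trajectory (hypothetical schema)."""
    role: str      # "assistant" for tool calls, "tool" for tool responses
    content: str


def compile_trajectory(question: str, turns: List[Turn], final_answer: str) -> Dict[str, object]:
    """Turn a multi-turn trajectory into one long-context QA example.

    Only tool responses / environment observations are kept; the
    assistant's tool calls are dropped, since the compiled example is
    meant to be answered directly without tool use.
    """
    observations = [t.content for t in turns if t.role == "tool"]

    # Concatenate observations in their original order so that evidence
    # from early and late turns ends up far apart in a single context.
    context = "\n\n".join(
        f"[Observation {i}]\n{obs}" for i, obs in enumerate(observations, start=1)
    )

    # Illustrative template only; the real prompt format is an assumption.
    prompt = f"{context}\n\nQuestion: {question}\nAnswer directly without calling tools."
    return {
        "prompt": prompt,
        "target": final_answer,
        "num_observations": len(observations),
    }


# Toy usage example with made-up data.
example = compile_trajectory(
    question="Which table stores the order totals?",
    turns=[
        Turn("assistant", "CALL list_tables()"),
        Turn("tool", "tables: customers, orders, invoices"),
        Turn("assistant", "CALL describe('orders')"),
        Turn("tool", "orders(id, customer_id, total)"),
    ],
    final_answer="orders",
)
```

In this toy case the compiled prompt holds both observations followed by the question, and the target is the trajectory's final answer, so the loss falls on an answer that depends on evidence from multiple turns.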