ACC: Compiling Agent Trajectories for Long-Context Training

Abstract

Recent progress in agents has renewed demand for long-context reasoning in LLMs. However, training LLMs for this capability requires either costly curation of long documents or heuristic context synthesis. We observe that agents produce long trajectories when solving problems, invoking tools and receiving environment observations across many turns. The evidence needed to answer the original question is therefore scattered across these turns, and answering requires integrating distant context segments. Standard agent SFT, however, masks tool responses and trains only turn-level tool selection, creating a supervision blind spot in which these scattered signals go unused. We propose Agent Context Compilation (ACC), which converts trajectories from search, software engineering, and database querying agents into long-context QA pairs. Each pair combines the original question with the tool responses and environment observations gathered across multiple turns, and the model is trained to answer directly without tool use. This makes the dependencies between the question and the evidence explicit, enabling direct supervision of long-context reasoning over distant segments without additional annotation. ACC is a simple but effective approach that can be combined with any existing long-context extension or training method and provides scalable supervised fine-tuning data. We validate ACC on long-range dependency modeling with MRCR and GraphWalks, two challenging benchmarks that require cross-turn coreference resolution and graph traversal over extended contexts. Qwen3-30B-A3B trained with ACC achieves 68.3 on MRCR (+18.1) and 77.5 on GraphWalks (+7.6), results comparable to Qwen3-235B-A22B, while preserving general capabilities on GPQA, MMLU-Pro, AIME, and IFEval. Further mechanistic analysis reveals that the ACC-trained model exhibits task-adaptive attention restructuring and expert specialization.
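
To make the compilation step concrete, the following is a minimal sketch of how one trajectory might be turned into a long-context QA pair. It is an illustration under stated assumptions, not the authors' implementation: the `Turn` type, the role names, the `compile_trajectory` helper, the prompt template, and the toy trajectory are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    role: str     # hypothetical role names: "user", "assistant", "tool"
    content: str


def compile_trajectory(turns: List[Turn], final_answer: str) -> Dict[str, str]:
    """Compile one agent trajectory into a single long-context QA pair.

    Assumption: the first user turn holds the original question, and every
    "tool" turn is a tool response or environment observation.
    """
    question = next(t.content for t in turns if t.role == "user")
    observations = [t.content for t in turns if t.role == "tool"]

    # Concatenate observations in the order they were gathered; the
    # assistant's tool calls are dropped, so the answer must be derived
    # from the evidence alone.
    context = "\n\n".join(
        f"[Observation {i}]\n{obs}" for i, obs in enumerate(observations, start=1)
    )
    prompt = f"{context}\n\nQuestion: {question}\nAnswer directly without calling tools."
    return {"prompt": prompt, "target": final_answer}


# Toy trajectory (made up for illustration): the answer depends on
# evidence from two separate turns.
trajectory = [
    Turn("user", "Which team owns the service that failed its health check?"),
    Turn("assistant", "call: get_alerts()"),
    Turn("tool", "Alert: service 'billing-api' failed its health check."),
    Turn("assistant", "call: lookup_owner('billing-api')"),
    Turn("tool", "Ownership table: billing-api is owned by the payments team."),
]
pair = compile_trajectory(trajectory, "payments team")
```

In this sketch the loss would be applied to `target` given `prompt`, so the supervision signal covers the tool responses that standard agent SFT masks out. Real trajectories would contribute far more observations, which is where the long context comes from.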