ACC: Compiling Agent Trajectories for Long-Context Training

Abstract

Recent progress in agents has renewed demand for long-context reasoning in LLMs. However, training LLMs for this capability requires costly long-document curation or heuristic context synthesis. We observe that agents produce massive trajectories when solving problems, invoking tools and receiving environment observations across many turns. The evidence needed to answer the original question is therefore scattered across these turns, and answering requires integrating distant context segments. Nevertheless, standard agent SFT masks tool responses and trains only turn-level tool selection, creating a supervision blind spot in which these scattered signals go unused. We propose Agent Context Compilation (ACC), which converts trajectories from search, software engineering, and database querying agents into long-context QA pairs. Each pair combines the original question with the tool responses and environment observations gathered across multiple turns, and the model is trained to answer directly without tool use. This makes the dependencies between the question and the evidence explicit, enabling direct supervision of long-context reasoning over distant segments without additional annotation. ACC is a simple but effective approach that can be combined with any existing long-context extension or training method and provides scalable supervised fine-tuning data. We validate ACC on long-range dependency modeling with MRCR and GraphWalks, challenging benchmarks that require cross-turn coreference resolution and graph traversal over extended contexts. Training Qwen3-30B-A3B with ACC achieves 68.3 on MRCR (+18.1) and 77.5 on GraphWalks (+7.6), results comparable to Qwen3-235B-A22B, while preserving general capabilities on GPQA, MMLU-Pro, AIME, and IFEval. Further mechanism analysis reveals that the ACC-trained model exhibits task-adaptive attention restructuring and expert specialization.
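
To make the compilation step concrete, the following is a minimal sketch of how one agent trajectory might be turned into a single long-context QA example. It is an illustration under stated assumptions, not the authors' implementation: the `Turn` type, the role names, the `compile_trajectory` function, and the prompt layout are all hypothetical, and the abstract does not specify how observations are ordered, filtered, or formatted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str      # hypothetical role names: "user", "assistant", or "tool"
    content: str


def compile_trajectory(turns: List[Turn], final_answer: str) -> dict:
    """Turn a multi-turn agent trajectory into one long-context QA example."""
    # Assumption: the original question is the first user turn.
    question = next(t.content for t in turns if t.role == "user")
    # Keep only tool responses / environment observations, in turn order.
    # The assistant's tool calls are dropped: the compiled example asks the
    # model to answer from the evidence alone.
    observations = [t.content for t in turns if t.role == "tool"]
    context = "\n\n".join(
        f"[Observation {i}]\n{obs}" for i, obs in enumerate(observations, start=1)
    )
    prompt = f"{context}\n\nQuestion: {question}\nAnswer directly without using tools."
    return {"prompt": prompt, "target": final_answer}


# Toy trajectory with made-up content, for illustration only.
trajectory = [
    Turn("user", "Which city hosted the conference where the library was announced?"),
    Turn("assistant", "search('library announcement')"),
    Turn("tool", "The library was announced at ExampleConf 2021."),
    Turn("assistant", "search('ExampleConf 2021 location')"),
    Turn("tool", "ExampleConf 2021 took place in Lisbon."),
]
example = compile_trajectory(trajectory, final_answer="Lisbon")
```

In this toy case the answer depends on both observations, which came from different turns. Standard agent SFT would mask those tool responses from the loss, whereas the compiled example supervises the answer conditioned on all of them at once.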