A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression

Abstract

As model capabilities advance, research has increasingly shifted toward long-horizon, multi-turn, terminal-centric agentic tasks, in which raw environment feedback is often kept in the interaction history to support later decisions. Retaining this feedback at every step introduces substantial redundancy and makes the cumulative token cost grow quadratically with the number of steps, which hinders long-horizon reasoning. Observation compression can mitigate the problem, but the heterogeneity of terminal environments makes heuristic-based or fixed-prompt methods hard to generalize. We propose TACO, a plug-and-play, self-evolving Terminal Agent Compression framework that automatically discovers and refines compression rules from interaction trajectories for existing terminal agents. Experiments on TerminalBench (TB 1.0 and TB 2.0) and four additional terminal-related benchmarks (SWE-Bench Lite, CompileBench, DevEval, and CRUST-Bench) show that TACO consistently improves performance across mainstream agent frameworks and strong backbone models. With MiniMax-2.5, it improves performance on most benchmarks while reducing token overhead by around 10%. On TerminalBench, it yields consistent gains of 1%-4% across strong agentic models and improves accuracy by a further 2%-3% or so under the same token budget. These results demonstrate the effectiveness and generalization ability of self-evolving, task-aware compression for terminal agents.
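The quadratic growth mentioned above can be illustrated with a small sketch. It assumes, purely for illustration, that every step produces an observation of the same size and that the full history is re-sent as input at each step; the function name and the `keep_ratio` parameter are hypothetical and are not part of TACO. Under these assumptions the total input cost over T steps is c(0 + 1 + ... + (T-1)) = cT(T-1)/2, so doubling the number of steps roughly quadruples the cost.

```python
def cumulative_prompt_tokens(obs_tokens_per_step: int, steps: int,
                             keep_ratio: float = 1.0) -> int:
    """Total input tokens over an episode when each step re-sends the full history.

    keep_ratio is a hypothetical stand-in for compression: the fraction of each
    observation that is kept in the history (1.0 means no compression).
    """
    history = 0  # observation tokens currently stored in the history
    total = 0    # input tokens summed over all steps
    for _ in range(steps):
        total += history  # the prompt at this step carries all earlier observations
        history += int(obs_tokens_per_step * keep_ratio)
    return total


if __name__ == "__main__":
    for steps in (100, 200):
        print(steps, cumulative_prompt_tokens(500, steps))
    print("half kept:", cumulative_prompt_tokens(500, 100, keep_ratio=0.5))
```

In this toy model, a fixed keep ratio scales the total cost linearly but leaves the growth quadratic in the number of steps.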
