Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents

Abstract

Coding agents repeatedly consume long tool observations even though only a small fraction of each observation matters for the next step. We study task-conditioned tool-output pruning: given a focused query and one tool output, return the smallest verbatim evidence block the agent should inspect next. We introduce a benchmark of 11,477 examples built from SWE-bench repository interactions and synthetic multi-ecosystem tool outputs, with a manually curated 618-example test set. We fine-tune Qwen 3.5 2B with LoRA and compare it against larger zero-shot models and heuristic pruning baselines. Our model reaches 0.86 recall and 0.80 F1 while removing 92% of input tokens, outperforming zero-shot Qwen 3.5 35B A3B by 11 recall points and all heuristic baselines by a wide margin.
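
To make the task and its metrics concrete, the following is a minimal sketch of how a pruner's output could be scored against gold evidence. It rests on assumptions that are not taken from the paper: scoring is done at line granularity, tokens are approximated by whitespace splitting, and `keyword_prune` is a hypothetical keyword-overlap heuristic used only as a stand-in, not one of the paper's baselines or the fine-tuned model. The example tool output and gold labels are invented for illustration.

```python
def keyword_prune(query: str, tool_output: str) -> list[int]:
    """Hypothetical heuristic: keep lines containing any query word (length > 2)."""
    terms = {w.lower() for w in query.split() if len(w) > 2}
    kept = []
    for i, line in enumerate(tool_output.splitlines()):
        lowered = line.lower()
        if any(term in lowered for term in terms):
            kept.append(i)
    return kept


def line_scores(predicted: set[int], gold: set[int]) -> tuple[float, float, float]:
    """Line-level precision, recall, and F1 (assumed granularity)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def fraction_removed(tool_output: str, kept: list[int]) -> float:
    """Share of whitespace-delimited tokens dropped by the pruner."""
    lines = tool_output.splitlines()
    total = sum(len(line.split()) for line in lines)
    kept_tokens = sum(len(lines[i].split()) for i in kept)
    return 1 - kept_tokens / total if total else 0.0


# Invented example: a short test-runner log and a focused query.
tool_output = "\n".join([
    "collected 3 items",
    "tests/test_io.py::test_read PASSED",
    "tests/test_io.py::test_write FAILED",
    "E   AssertionError: expected 3, got 2",
    "1 failed, 2 passed in 0.12s",
])
query = "why does test_write fail"
gold_lines = {2, 3}  # assumed gold evidence: the failing test and its error

kept_lines = keyword_prune(query, tool_output)
precision, recall, f1 = line_scores(set(kept_lines), gold_lines)
removed = fraction_removed(tool_output, kept_lines)
```

In this toy case the heuristic keeps the failing test line and the summary line but misses the assertion message, which shares no word with the query. That is the kind of evidence a task-conditioned model is meant to recover where surface matching falls short.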
