Getting Better at Working with You: Compiling User Corrections into Runtime Enforcement for Coding Agents

Abstract

Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipeline for coding-agent runtimes that mines user corrections, rewrites them as atomic rules, and compiles them into runtime checks that must pass before an agent completes future tasks. Unlike runtime checks written ahead of time by developers, TRACE skills come from the user's own chat corrections. We evaluate TRACE with simulated user-in-the-loop experiments on ClawArena coding-agent tasks and MemoryArena-derived memory-intensive tasks. On ClawArena, TRACE reduces held-out preference violation from 100.0% to 37.6% on in-distribution tasks and from 100.0% to 2.0% on out-of-distribution tasks. On MemoryArena-derived tasks, TRACE reduces in-distribution violation from 100.0% to 60.5% while matching or exceeding the strongest memory baseline on task pass. These results suggest that compiling corrections into runtime enforcement can address a repeated-friction failure mode that memory alone does not reliably solve, reducing the need for users to restate the same correction across future sessions. Experiment code is available at https://github.com/YujunZhou/TRACE_exp, and the deployable skill is available at https://github.com/YujunZhou/tellonce.
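
To make the idea of "runtime checks that must pass before an agent completes future tasks" concrete, the following is a minimal sketch under stated assumptions, not the TRACE implementation. The `Rule` schema and the functions `compile_rules`, `violations`, and `complete_task` are hypothetical names introduced here for illustration. The mining and rewriting stages are replaced by two hand-written rules, and the agent is modeled as a plain callable that receives feedback about failed rules.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Rule:
    """One atomic rule distilled from a user correction (hypothetical schema)."""
    name: str
    description: str
    check: Callable[[str], bool]  # returns True when the output complies


def compile_rules() -> List[Rule]:
    """Stand-in for the compile step: two hand-written example rules.

    A real pipeline would derive these from chat corrections; here they
    are hard-coded so the gating logic can be shown on its own.
    """
    return [
        Rule(
            name="no_print_debugging",
            description="Use logging instead of print().",
            check=lambda out: re.search(r"\bprint\(", out) is None,
        ),
        Rule(
            name="no_wildcard_imports",
            description="Do not use 'from module import *'.",
            check=lambda out: re.search(
                r"^from \S+ import \*", out, re.MULTILINE
            ) is None,
        ),
    ]


def violations(output: str, rules: List[Rule]) -> List[str]:
    """Return the names of all rules that the output fails."""
    return [rule.name for rule in rules if not rule.check(output)]


def complete_task(
    generate: Callable[[List[str]], str],
    rules: List[Rule],
    max_attempts: int = 3,
) -> str:
    """Gate completion on the checks: retry with feedback until all pass."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        output = generate(feedback)
        failed = violations(output, rules)
        if not failed:
            return output
        # Tell the agent which rules it broke before the next attempt.
        feedback = [r.description for r in rules if r.name in failed]
    raise RuntimeError(f"rules still violated after {max_attempts} attempts: {failed}")


def fake_agent(feedback: List[str]) -> str:
    """Toy agent: violates a rule first, complies once it gets feedback."""
    if feedback:
        return "import logging\nlogging.info('done')\n"
    return "print('done')\n"
```

Calling `complete_task(fake_agent, compile_rules())` returns the second, compliant output, while an agent that never complies raises `RuntimeError` instead of completing. The design point this illustrates is that the rule is enforced at completion time rather than merely retrieved into context, which is the distinction the abstract draws between preference access and preference compliance.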