Improving Collaboration with You: Compiling User Corrections into Runtime Enforcement for Coding Agents

Abstract

Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipeline for coding-agent runtimes that mines user corrections, rewrites them as atomic rules, and compiles them into runtime checks that must pass before an agent completes future tasks. Unlike runtime checks written ahead of time by developers, TRACE skills come from the user's own chat corrections. We evaluate TRACE with simulated user-in-the-loop experiments on ClawArena coding-agent tasks and MemoryArena-derived memory-intensive tasks. On ClawArena, TRACE reduces the held-out preference violation rate from 100.0% to 37.6% on in-distribution tasks and from 100.0% to 2.0% on out-of-distribution tasks. On MemoryArena-derived tasks, TRACE reduces the in-distribution violation rate from 100.0% to 60.5% while matching or exceeding the strongest memory baseline on task pass rate. These results suggest that compiling corrections into runtime enforcement can address a repeated-friction failure mode that memory alone does not reliably solve, reducing the need for users to restate the same correction across future sessions. Experiment code is available at https://github.com/YujunZhou/TRACE_exp, and the deployable skill is available at https://github.com/YujunZhou/tellonce.
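To make the "compile corrections into runtime checks" idea concrete, the following is a minimal sketch and not the TRACE implementation. It assumes each atomic rule can be expressed as a forbidden regular-expression pattern over the agent's output; the names `Rule`, `compile_rule`, `violations`, and `complete_task`, as well as the two example rules, are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Rule:
    """One atomic rule and the check compiled from it (hypothetical structure)."""
    text: str
    check: Callable[[str], bool]  # returns True when the output complies


def compile_rule(text: str, forbidden_pattern: str) -> Rule:
    """Compile a 'never emit X' rule into a regex-based check (an assumption)."""
    pattern = re.compile(forbidden_pattern)
    return Rule(text=text, check=lambda output: pattern.search(output) is None)


def violations(output: str, rules: List[Rule]) -> List[str]:
    """Return the text of every rule whose check fails on this output."""
    return [rule.text for rule in rules if not rule.check(output)]


def complete_task(output: str, rules: List[Rule]) -> str:
    """Gate completion: refuse to finish while any compiled check fails."""
    failed = violations(output, rules)
    if failed:
        raise ValueError("blocked by rules: " + "; ".join(failed))
    return output


# Illustrative rules, as if distilled from earlier user corrections.
rules = [
    compile_rule("Do not use print for logging", r"\bprint\("),
    compile_rule("Do not leave TODO comments", r"\bTODO\b"),
]
```

In this sketch the gate is what separates enforcement from memory: a rule that is merely recalled can still be ignored, whereas `complete_task` cannot return until every compiled check passes.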