Harness Updating Is Not Harness Benefit: Unpacking the Evolution Capabilities of Self-Evolving LLM Agents

Abstract

LLM agents are increasingly deployed as systems built around editable external harnesses (prompts, skills, memories, and tools) that shape task execution without changing model parameters. Harness self-evolution adapts such agents by updating these harnesses from execution evidence. Yet it remains unclear whether a model's base task-solving capability predicts its capabilities in harness self-evolution: which models produce useful harness updates, and which actually benefit from them? We analyze two harness self-evolution capabilities: (i) harness-updating, the capability to produce useful, persistent harness updates from execution evidence; and (ii) harness-benefit, the capability to benefit from updated harnesses during task solving. Our analysis reveals two findings. First, harness-updating is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains, and even Qwen3.5-9B's updates yield gains comparable to those of Claude Opus 4.6. Second, harness-benefit is non-monotonic in base capability: weak-tier models benefit little from updated harnesses, mid-tier models benefit most, and strong-tier models benefit less than mid-tier models. We trace the low gains at the weak tier to two failure modes: weak-tier models may fail to activate relevant harness artifacts, or may activate them but fail to follow them faithfully. These findings suggest investing capability budget in the task-solving agent rather than the evolver, and targeting harness invocation and long-horizon instruction following in agent training. Our source code is publicly available at https://github.com/A-EVO-Lab/a-evolve/tree/release/harness-evolution.