LLM Agents Can Reference Code Repositories

Abstract

Coding agents powered by large language models have demonstrated strong performance on software engineering tasks. Yet most agents consume repositories almost entirely as text, which differs from how human developers use visual structure such as folder hierarchies and dependency relationships to orient themselves in large codebases. With multimodal large language models (MLLMs), it is an open question whether agents can effectively benefit from visual representations of repositories. This paper presents the first systematic empirical study of visual repository representations for LLM-based agents on repository-level issue resolution. We evaluate four recent multimodal models. Our results show that a strictly vision-only setup degrades accuracy and increases token cost, because agents lack sufficient symbolic detail and compensate with repeated visual queries. In contrast, integrating visual graphs of repository structure as a supplementary modality alongside standard text interfaces helps agents understand structure more efficiently: input token consumption decreases by up to 26% while issue-resolution accuracy is maintained or improved. Visualization is most useful during fault localization and when the agent autonomously controls exploration depth. These findings point to a practical hybrid text-and-vision design for next-generation coding agents.
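
To make the idea of a "visual graph of repository structure" concrete, the following is a minimal sketch that turns a folder hierarchy into Graphviz DOT text, which could then be rendered to an image and passed to a multimodal model alongside the usual text interface. It is an illustration only: the abstract does not specify how the study builds its visualizations, and the function name `repo_tree_to_dot`, the `max_depth` parameter, and the choice of DOT as the output format are all assumptions made here.

```python
from pathlib import Path


def repo_tree_to_dot(root: str, max_depth: int = 2) -> str:
    """Return Graphviz DOT text for the folder hierarchy under `root`.

    Hypothetical helper, not the study's method. `max_depth` limits how
    many levels below the root are drawn, a stand-in for the "exploration
    depth" mentioned in the abstract.
    """
    root_path = Path(root).resolve()
    lines = ["digraph repo {", "  rankdir=LR;", "  node [shape=box];"]
    # The root node uses "." as its identifier and the folder name as label.
    lines.append(f'  "." [label="{root_path.name}"];')

    def visit(directory: Path, depth: int) -> None:
        if depth >= max_depth:
            return
        parent_id = directory.relative_to(root_path).as_posix()
        for entry in sorted(directory.iterdir(), key=lambda p: p.name):
            if entry.name.startswith("."):
                continue  # skip hidden entries such as .git
            # Node identifiers are repo-relative paths, so they are unique.
            # Names containing double quotes would need escaping; omitted here.
            node_id = entry.relative_to(root_path).as_posix()
            shape = "folder" if entry.is_dir() else "note"
            lines.append(f'  "{node_id}" [label="{entry.name}", shape={shape}];')
            lines.append(f'  "{parent_id}" -> "{node_id}";')
            if entry.is_dir():
                visit(entry, depth + 1)

    visit(root_path, 0)
    lines.append("}")
    return "\n".join(lines)
```

Assuming Graphviz is installed, the returned text can be written to a file such as `repo.dot` and rendered with `dot -Tpng repo.dot -o repo.png`. A dependency graph (for example, import edges between modules) would need a language-specific parser and is left out of this sketch.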