LLM Agents Can Understand Code Repositories

Abstract

Coding agents powered by large language models have demonstrated strong performance on software engineering tasks. Yet most agents consume repositories almost entirely as text, whereas human developers rely on visual structure, such as folder hierarchies and dependency relationships, to orient themselves in large codebases. As multimodal large language models (MLLMs) mature, it remains an open question whether agents can benefit from visual representations of repositories. This paper presents the first systematic empirical study of visual repository representations for LLM-based agents on repository-level issue resolution. We evaluate four recent multimodal models. Our results show that a strictly vision-only setup both degrades accuracy and increases token cost, because agents lack sufficient symbolic detail and compensate with repeated visual queries. In contrast, integrating visual graphs of repository structure as a supplementary modality alongside the standard text interface helps agents understand structure more efficiently: input token consumption decreases by up to 26% while issue-resolution accuracy is maintained or improved. Visualization is most useful during fault localization and when the agent autonomously controls exploration depth. These findings point to a practical hybrid text-and-vision design for next-generation coding agents.
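To make "visual graphs of repository structure" concrete, the sketch below shows one way a folder hierarchy could be turned into Graphviz DOT text, which a renderer could then convert into an image for a multimodal model. It is an illustrative assumption, not the pipeline used in the study: the function name `folder_hierarchy_to_dot` and the `max_depth` parameter are hypothetical, and dependency edges are omitted.

```python
from pathlib import Path


def folder_hierarchy_to_dot(root: str, max_depth: int = 2) -> str:
    """Return Graphviz DOT text for the directory tree under `root`.

    Hypothetical helper for illustration only; not taken from the paper.
    `max_depth` limits how many directory levels below the root are expanded.
    """
    root_path = Path(root).resolve()

    def node_id(path: Path) -> str:
        # Node name is the path relative to the root ("." for the root itself).
        rel = path.relative_to(root_path).as_posix()
        return '"' + rel.replace('"', '\\"') + '"'

    root_label = root_path.name.replace('"', '\\"')
    lines = [
        "digraph repo {",
        "  rankdir=LR;",
        "  node [shape=box];",
        f'  "." [label="{root_label}"];',
    ]

    def walk(directory: Path, depth: int) -> None:
        if depth >= max_depth:
            return
        for child in sorted(directory.iterdir()):
            if child.name.startswith("."):
                continue  # skip hidden entries such as .git
            lines.append(f"  {node_id(directory)} -> {node_id(child)};")
            if child.is_dir():
                walk(child, depth + 1)

    walk(root_path, 0)
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Print the DOT text for the current directory, two levels deep.
    print(folder_hierarchy_to_dot(".", max_depth=2))
```

Raising `max_depth` loosely mirrors the idea of an agent controlling exploration depth: a shallow graph gives a cheap overview, and deeper graphs can be requested only where needed.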