LLM Agents Can Perceive Code Repositories

Abstract

Coding agents powered by large language models have demonstrated strong performance on software engineering tasks. Yet most agents consume repositories almost entirely as text, which differs from how human developers use visual structure such as folder hierarchies and dependency relationships to orient themselves in large codebases. With the advent of multimodal large language models (MLLMs), whether agents can effectively benefit from visual representations of repositories remains an open question. This paper presents the first systematic empirical study of visual repository representations for LLM-based agents on repository-level issue resolution. We evaluate four recent multimodal models. Our results show that a strictly vision-only setup degrades accuracy and increases token cost, because agents lack sufficient symbolic detail and compensate with repeated visual queries. In contrast, integrating visual graphs of repository structure as a supplementary modality alongside standard text interfaces helps agents understand structure more efficiently: input token consumption decreases by up to 26% while issue-resolution accuracy is maintained or improved. Visualization is most useful during fault localization and when the agent autonomously controls exploration depth. These findings point to a practical hybrid text-and-vision design for next-generation coding agents.