Beyond NL2Code: A Systematic Survey of Multimodal Code Intelligence

Abstract

While Large Language Models (LLMs) have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, vector drawings, videos, and interactive states. These tasks require models to connect visual perception to executable programs, because correctness depends not only on syntax but also on layout, data semantics, interaction behavior, and domain-specific constraints that apply after execution. This survey examines Multimodal Code Intelligence, covering systems that generate, edit, refine, or reason with code under visually grounded inputs and outputs. We first formulate the field by the role that code plays in each task, distinguishing code as a rendered artifact, an editable symbolic structure, a scientific representation, an intermediate reasoning trace, or an executable policy or tool interface. We then organize benchmarks and methods into four domains: Graphical User Interface, Scientific Visualization, Structured Graphics, and Frontier Tasks and Frameworks. This taxonomy connects mature artifact-generation problems to emerging agentic and unified settings and allows us to compare how different tasks treat evidence of correctness. Looking ahead, we argue that future research may benefit from four verification-centered directions. Multi-signal validation can combine complementary evidence of correctness, multi-state verification can test behavior across execution trajectories, cross-task transfer testing can probe reusable visual-code skills, and verifiable agent traces can reveal whether agent actions are grounded in visual evidence. Together, these directions may move the field from single-output imitation toward evidence-grounded executable systems. An ongoing project and related resources are available on GitHub at https://github.com/xjywhu/Awesome-Multimodal-LLM-for-Code.