CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
February 2, 2026
Authors: Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, Xiaodong Gu
cs.AI
Abstract
Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Current models rely on a text-based paradigm that treats source code as a linear sequence of tokens, so context length and the associated computational cost grow linearly with code size. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to improve efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suited to compression: by adjusting resolution, an image can be scaled to a fraction of its original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study of the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) code-understanding tasks such as clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and the current limitations of MLLMs in code understanding, pointing to image-modality code representation as a pathway to more efficient inference.
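To make the render-and-downscale idea concrete, the sketch below (not the authors' pipeline) renders a code snippet to a syntax-highlighted image and shrinks it. It assumes the `pygments` and `Pillow` packages, and it assumes a vision encoder's token count scales roughly with pixel area, so an Nx token compression corresponds to scaling each side by 1/sqrt(N).

```python
import io
import math

from PIL import Image
from pygments import highlight
from pygments.formatters import ImageFormatter
from pygments.lexers import PythonLexer


def render_code_image(source: str, token_compression: float = 4.0) -> Image.Image:
    """Render `source` as a syntax-highlighted image, downscaled so that the
    pixel area (a rough proxy for vision-token count) shrinks by
    `token_compression`. A hypothetical helper, not the paper's pipeline."""
    # Pygments renders highlighted code straight to PNG bytes; ImageFormatter
    # needs a TrueType font available on the system.
    png_bytes = highlight(
        source,
        PythonLexer(),
        ImageFormatter(font_size=14, line_numbers=False),
    )
    img = Image.open(io.BytesIO(png_bytes))
    # Assumption: token count ~ pixel area, so scale each side by 1/sqrt(ratio).
    scale = 1.0 / math.sqrt(token_compression)
    new_size = (max(1, int(img.width * scale)), max(1, int(img.height * scale)))
    return img.resize(new_size, Image.Resampling.LANCZOS)


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    render_code_image(snippet, token_compression=4.0).save("code_4x.png")
    # The saved image would be passed to an MLLM in place of the raw text.
```

Note that syntax highlighting comes for free in this rendering step, which is the visual cue the paper finds helpful for code completion under compression.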