

CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

February 2, 2026
Authors: Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, Xiaodong Gu
cs.AI

Abstract

Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, so context length and the associated computational cost grow linearly with code size. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression: by adjusting resolution, an image can be scaled to a fraction of its original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and current limitations of MLLMs in code understanding, and point to image-modality code representation as a pathway to more efficient inference.
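The compression argument above rests on a simple accounting difference: text cost grows linearly with character count, while a vision encoder's cost depends only on the rendered image's pixel dimensions, so downscaling resolution shrinks token cost quadratically. The sketch below illustrates this arithmetic with assumed constants (roughly 4 characters per text token and one visual token per 28x28 pixel patch); the actual tokenizers and patch sizes vary by model and are not specified in the abstract.

```python
import math

def text_tokens(code: str, chars_per_token: float = 4.0) -> int:
    """Rough text-token count, assuming ~4 characters per token."""
    return max(1, round(len(code) / chars_per_token))

def image_tokens(width_px: int, height_px: int, patch: int = 28) -> int:
    """Assumed vision-encoder cost: one token per patch x patch pixel tile."""
    return math.ceil(width_px / patch) * math.ceil(height_px / patch)

def compression_ratio(code: str, width_px: int, height_px: int,
                      scale: float = 1.0) -> float:
    """Text-token cost divided by image-token cost at a given resolution
    scale. Halving `scale` quarters the image token count, so the
    compression ratio roughly quadruples."""
    w, h = round(width_px * scale), round(height_px * scale)
    return text_tokens(code) / image_tokens(w, h)

# A toy snippet repeated to simulate a longer source file.
code = "def add(a, b):\n    return a + b\n" * 100

full = compression_ratio(code, 896, 896, scale=1.0)   # full resolution
half = compression_ratio(code, 896, 896, scale=0.5)   # half resolution
print(f"full-res ratio: {full:.2f}, half-res ratio: {half:.2f}")
```

Under these assumed constants, rendering at half resolution yields roughly 4x the compression of the full-resolution rendering, which is why moderate downscaling can reach the multi-fold reductions (up to 8x) studied in the paper, bounded in practice by how small the text can get before the model can no longer read it.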