Coding Triangle: How Does Large Language Model Understand Code?
July 8, 2025
Authors: Taolin Zhang, Zihan Ma, Maosong Cao, Junnan Liu, Songyang Zhang, Kai Chen
cs.AI
Abstract
Large language models (LLMs) have achieved remarkable progress in code
generation, yet their true programming competence remains underexplored. We
introduce the Code Triangle framework, which systematically evaluates LLMs
across three fundamental dimensions: editorial analysis, code implementation,
and test case generation. Through extensive experiments on competitive
programming benchmarks, we reveal that while LLMs can form a self-consistent
system across these dimensions, their solutions often lack the diversity and
robustness of human programmers. We identify a significant distribution shift
between model cognition and human expertise, with model errors tending to
cluster due to training data biases and limited reasoning transfer. Our study
demonstrates that incorporating human-generated editorials, solutions, and
diverse test cases, as well as leveraging model mixtures, can substantially
enhance both the performance and robustness of LLMs. Furthermore, we reveal
both the consistency and inconsistency in the cognition of LLMs that may
facilitate self-reflection and self-improvement, providing a potential
direction for developing more powerful coding models.