Coding Triangle: How Does Large Language Model Understand Code?

Abstract

Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three fundamental dimensions: editorial analysis, code implementation, and test case generation. Through extensive experiments on competitive programming benchmarks, we reveal that while LLMs can form a self-consistent system across these dimensions, their solutions often lack the diversity and robustness of human programmers. We identify a significant distribution shift between model cognition and human expertise, with model errors tending to cluster due to training data biases and limited reasoning transfer. Our study demonstrates that incorporating human-generated editorials, solutions, and diverse test cases, as well as leveraging model mixtures, can substantially enhance both the performance and robustness of LLMs. Furthermore, we reveal both the consistency and inconsistency in the cognition of LLMs that may facilitate self-reflection and self-improvement, providing a potential direction for developing more powerful coding models.
