UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models
December 19, 2025
Authors: Jiajun Wu, Jian Yang, Wei Zhang, Lin Jing, Yuqing Ma, Ensheng Shi, Yuchi Ma, Zhoujun Li, Xianglong Liu
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, their effectiveness relies heavily on supervised training with extensive labeled data (e.g., question-answering pairs) or unlabeled data (e.g., code snippets), which are often expensive and difficult to obtain at scale. To address this limitation, this paper introduces IPC, an unsupervised framework that leverages Internal Probing of LLMs for Code generation without any external corpus, not even unlabeled code snippets. We introduce problem-space probing, test-understanding probing, solution-space probing, and knowledge consolidation and reinforcement to elicit the internal knowledge and confidence patterns already present in LLMs. IPC then identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation, and uses them to train UCoder (a coder trained with unsupervised learning). We validate the proposed approach across multiple code benchmarks, demonstrating that the unsupervised method achieves performance competitive with supervised approaches while significantly reducing the dependency on labeled data and computational resources. Analytic experiments reveal that internal model states carry rich signals about code quality and correctness, and that properly harnessing these signals enables effective unsupervised learning for code generation, opening new directions for training code LLMs in resource-constrained scenarios.
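The abstract does not spell out the probing or filtering algorithms. As a rough illustration of the self-consistency filtering step mentioned above, the following Python sketch selects a reliable pseudo-label from several sampled candidates by scoring them against model-generated tests and preferring the most frequently recurring top-scoring solution. The `generate` callable, the `tests` list, and the tie-breaking rule are assumptions made for illustration, not the authors' implementation of IPC.

```python
# Minimal sketch (assumed interfaces, not the IPC implementation):
# pick a reliable pseudo-label for unsupervised training via self-consistency.
from collections import Counter
from typing import Callable, List, Optional, Sequence


def run_tests(program: str, tests: Sequence[str]) -> int:
    """Count how many generated test snippets execute without error
    when appended to the candidate program (sandboxing omitted for brevity)."""
    passed = 0
    for test in tests:
        try:
            exec(program + "\n" + test, {})
            passed += 1
        except Exception:
            pass
    return passed


def select_pseudo_label(
    prompt: str,
    generate: Callable[[str, int], List[str]],   # assumed LLM sampling helper
    tests: Sequence[str],                        # assumed model-generated tests
    n_samples: int = 8,
) -> Optional[str]:
    """Keep the candidate that passes the most tests; among ties, prefer the
    candidate text that recurs most often across samples (self-consistency)."""
    candidates = generate(prompt, n_samples)
    scored = [(run_tests(c, tests), c) for c in candidates]
    best_score = max(score for score, _ in scored)
    if best_score == 0:
        return None  # no reliable candidate; skip this problem
    top = [c.strip() for score, c in scored if score == best_score]
    most_common, _ = Counter(top).most_common(1)[0]
    return most_common
```

A candidate kept by such a filter would serve as a pseudo-label for fine-tuning; the representation-based quality estimation described in the abstract would add a further check based on the model's internal states, which is not shown here.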