UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, their effectiveness relies heavily on supervised training with extensive labeled datasets (e.g., question-answer pairs) or unlabeled datasets (e.g., code snippets), which are often expensive and difficult to obtain at scale. To address this limitation, this paper introduces IPC, an unsupervised framework that leverages Internal Probing of LLMs for Code generation without any external corpus, not even unlabeled code snippets. We introduce problem space probing, test understanding probing, solution space probing, and knowledge consolidation and reinforcement to probe the internal knowledge and confidence patterns in LLMs. IPC then identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation, and uses them to train UCoder (a coder trained with unsupervised learning). We validate the approach on multiple code benchmarks, showing that unsupervised methods can achieve performance competitive with supervised approaches while significantly reducing the dependency on labeled data and computational resources. Analytic experiments reveal that internal model states contain rich signals about code quality and correctness, and that properly harnessing these signals enables effective unsupervised learning for code generation, opening new directions for training code LLMs in resource-constrained scenarios.
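
To make the idea of self-consistency over code candidates concrete, the following is a minimal sketch and not the paper's actual procedure. It assumes that candidates are plain Python callables and that agreement is measured by identical outputs on a handful of probe inputs; the function name `select_by_self_consistency`, the toy candidates, and the probe inputs are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Tuple


def select_by_self_consistency(
    candidates: Sequence[Callable[[int], Hashable]],
    probe_inputs: Sequence[int],
) -> Tuple[List[int], float]:
    """Cluster candidates by behaviour on probe inputs and keep the largest cluster.

    Returns the indices of the majority cluster and the fraction of all
    candidates that fall into it (a crude agreement score).
    Assumes `candidates` is non-empty.
    """
    clusters: Dict[tuple, List[int]] = defaultdict(list)
    for idx, fn in enumerate(candidates):
        signature = []
        for x in probe_inputs:
            try:
                signature.append(("ok", fn(x)))
            except Exception as exc:
                # A crash is treated as part of the candidate's behaviour.
                signature.append(("error", type(exc).__name__))
        clusters[tuple(signature)].append(idx)
    majority = max(clusters.values(), key=len)
    return majority, len(majority) / len(candidates)


# Hypothetical candidates for the task "return the square of x".
candidates = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: x + x,            # deliberately wrong
    lambda x: abs(x) * abs(x),
]
selected, agreement = select_by_self_consistency(candidates, [-3, 0, 2, 5])
```

In this toy setting the three behaviourally identical candidates form the majority cluster, and the agreement score could serve as a rough confidence signal when deciding which samples to keep for training. The representation-based quality estimation mentioned in the abstract is not covered by this sketch.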
