Understanding by Reconstruction: Reversing the Software Development Process for LLM Pretraining

Abstract

While Large Language Models (LLMs) have achieved remarkable success in code generation, they often struggle with the deep, long-horizon reasoning required for complex software engineering. We attribute this limitation to the nature of standard pre-training data: static software repositories represent only the terminal state of an intricate intellectual process, abstracting away the intermediate planning, debugging, and iterative refinement. To bridge this gap, we propose a novel paradigm: understanding via reconstruction. We hypothesize that reverse-engineering the latent agentic trajectories (the planning, reasoning, and debugging steps) behind static repositories provides a far richer supervision signal than raw code alone. To operationalize this, we introduce a framework that synthesizes these trajectories using a multi-agent simulation. This process is grounded in the structural realities of the source repositories (e.g., dependency graphs and file hierarchies) to ensure fidelity. Furthermore, to ensure the logical rigor of the synthetic data, we employ a search-based optimization technique that iteratively refines the Chain-of-Thought (CoT) reasoning to maximize the likelihood of the ground-truth code. Empirical results demonstrate that continuous pre-training on these reconstructed trajectories significantly enhances Llama-3-8B's performance across diverse benchmarks, including long-context understanding, coding proficiency, and agentic capabilities.

