Skip a Layer or Loop It? Learning Program-of-Layers in LLMs

Abstract

Large language models (LLMs) perform inference by executing every layer exactly once, in a fixed order and at a fixed depth, with no recurrence. We show that training-free, flexible, dynamic programs of layers (PoLar) exist widely: pretrained layers can be packed into modules and then skipped or looped to form a customized program for each input. For most inputs, substantially shorter programs achieve the same or better accuracy, and incorrect predictions of the original LLM can be corrected by alternative programs that use fewer layers. These observations indicate that inference admits multiple valid latent computations beyond the standard forward pass. To realize PoLar efficiently in practice, we propose a lightweight PoLar prediction network that learns to generate, for each input, an execution program that dynamically skips or repeats pretrained layers. Experiments on mathematical reasoning benchmarks show that PoLar consistently improves accuracy over standard inference and prior dynamic-depth methods, often while executing fewer layers, and that these gains persist under out-of-distribution evaluation. Our results suggest that fixed-depth execution captures only a narrow subset of an LLM's latent reasoning capacity.
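To make the idea of a layer program concrete, the following is a minimal sketch, not the paper's implementation. It assumes a program can be written as one repeat count per module (0 skips the module, 1 runs it once as in the standard forward pass, 2 or more loops it); the abstract does not specify the actual program representation. The names `make_toy_layer`, `pack_modules`, and `run_program` are hypothetical, and the toy layers stand in for pretrained transformer layers. The prediction network that would choose a program for each input is not shown.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_toy_layer(scale: float, shift: float) -> Layer:
    """Hypothetical stand-in for a pretrained layer: a residual-style update."""
    def layer(h: Vector) -> Vector:
        return [x + scale * x + shift for x in h]
    return layer


def pack_modules(layers: Sequence[Layer], module_size: int) -> List[List[Layer]]:
    """Group consecutive layers into modules of `module_size` layers each."""
    if module_size < 1:
        raise ValueError("module_size must be at least 1")
    return [list(layers[i:i + module_size])
            for i in range(0, len(layers), module_size)]


def run_program(modules: Sequence[Sequence[Layer]],
                program: Sequence[int],
                h: Vector) -> Tuple[Vector, int]:
    """Run the modules according to `program`.

    Assumption: program[i] is how many times module i runs in a row
    (0 = skip, 1 = standard execution, >1 = loop).
    Returns the final hidden state and the number of layer calls made.
    """
    if len(program) != len(modules):
        raise ValueError("program needs one repeat count per module")
    if any(r < 0 for r in program):
        raise ValueError("repeat counts must be non-negative")
    executed = 0
    for module, repeats in zip(modules, program):
        for _ in range(repeats):
            for layer in module:
                h = layer(h)
                executed += 1
    return h, executed


if __name__ == "__main__":
    layers = [make_toy_layer(0.0, float(k)) for k in (1, 2, 3, 4)]
    modules = pack_modules(layers, module_size=2)
    print(run_program(modules, [1, 1], [0.0]))  # standard forward pass
    print(run_program(modules, [1, 0], [0.0]))  # skip the second module
    print(run_program(modules, [2, 0], [0.0]))  # loop the first, skip the second
```

Under this representation the standard forward pass is the all-ones program, so any program whose counts sum to less than the number of modules executes fewer layers than the original model.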