Guiding LLM Post-Training Data Engineering with Model Internals from Sparse Autoencoders

Abstract

Model internals encode rich information about how a large language model (LLM) processes its training data, yet post-training data engineering relies largely on external signals and ignores the intrinsic signals available in those internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). SAERL models three intrinsic data properties (diversity, difficulty, and quality) using model internals extracted with a Sparse Autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that SAEs transfer effectively across model families and scales, so a trained SAE can serve as a lightweight, reusable data engineering tool. These results demonstrate that model internals are a powerful and practical source of signals for post-training data engineering.
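The three operations named in the abstract can be illustrated with a minimal sketch. It is not the SAERL implementation: the abstract does not specify the mixing rule, the difficulty proxy, or the probe, so everything here is an assumption made for illustration. The names `Example`, `filter_and_order`, and `build_batches`, the `min_quality` threshold, and the `mix` fraction are hypothetical, and the cluster ids, difficulty scores, and quality scores are assumed to have been computed beforehand from SAE features.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    uid: int
    cluster: int       # assumed: cluster id from clustering in SAE feature space
    difficulty: float  # assumed: scalar difficulty proxy, higher means harder
    quality: float     # assumed: quality-probe score in [0, 1]


def filter_and_order(examples, min_quality=0.5):
    """Drop low-quality examples, then sort the rest easy-to-hard."""
    kept = [e for e in examples if e.quality >= min_quality]
    return sorted(kept, key=lambda e: e.difficulty)


def build_batches(ordered, batch_size, mix=0.25):
    """Build batches dominated by one cluster, with a `mix` share from others.

    Hypothetical mixing rule: each batch takes most of its examples from the
    cluster whose next example is easiest, then fills the remaining slots
    with the easiest remaining examples from other clusters.
    """
    if batch_size < 1 or not 0.0 <= mix < 1.0:
        raise ValueError("need batch_size >= 1 and 0 <= mix < 1")
    # Per-cluster queues preserve the easy-to-hard order inside each cluster.
    queues = defaultdict(deque)
    for e in ordered:
        queues[e.cluster].append(e)
    n_primary = max(1, batch_size - int(round(mix * batch_size)))
    batches = []
    while any(queues.values()):
        nonempty = [c for c, q in queues.items() if q]
        primary = min(nonempty, key=lambda c: queues[c][0].difficulty)
        take = min(n_primary, len(queues[primary]))
        batch = [queues[primary].popleft() for _ in range(take)]
        # Fill the remaining slots from the other clusters, easiest first.
        while len(batch) < batch_size:
            others = [c for c, q in queues.items() if q and c != primary]
            if not others:
                break
            nxt = min(others, key=lambda c: queues[c][0].difficulty)
            batch.append(queues[nxt].popleft())
        batches.append(batch)
    return batches


# Toy data: two clusters of four examples; uid 3 has a low quality score.
data = [
    Example(0, 0, 0.10, 0.9), Example(1, 0, 0.20, 0.9),
    Example(2, 0, 0.30, 0.9), Example(3, 0, 0.40, 0.1),
    Example(4, 1, 0.15, 0.9), Example(5, 1, 0.25, 0.9),
    Example(6, 1, 0.35, 0.9), Example(7, 1, 0.45, 0.9),
]
batches = build_batches(filter_and_order(data), batch_size=4, mix=0.25)
batch_uids = [[e.uid for e in b] for b in batches]
```

In this toy run the low-quality example is filtered out, the first batch is drawn mostly from the cluster holding the easiest example, and one slot is filled from the other cluster. A real pipeline would have to decide how the mixing share and the curriculum interact; the rule above is only one plausible choice.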