Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

Abstract

Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). On complex tasks, however, GRPO frequently suffers from the "zero-advantage problem": when all sampled rollouts for a query fail, the relative advantage collapses to zero. The model then receives no effective training signal for these questions, wasting both training data and computational budget. Simply increasing the sampling budget for these questions is a common remedy, but a static sampling policy inherently constrains reasoning exploration and limits the success rate. In this paper, we propose Lorem Perturbation for Exploration (LoPE), a simple yet effective training framework that breaks this exploration bottleneck. We posit that task-irrelevant prompt-space perturbations can shift the model's output distribution enough to unlock orthogonal reasoning pathways for hard questions. Specifically, LoPE prepends sequences stochastically assembled from Lorem Ipsum vocabulary (a pseudo-Latin placeholder text) to the prompts before resampling. Experiments across 1.7B, 4B, and 7B models demonstrate that LoPE significantly outperforms resampling with the original prompts. Further analysis reveals that other Latin-based random sequences with low perplexity are also effective perturbations. Our results establish LoPE as a strong baseline for broadening exploration in LLM reinforcement learning.
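The two mechanisms named in the abstract can be illustrated with a minimal Python sketch; it is not the authors' implementation. It assumes the common GRPO formulation in which each rollout's advantage is its reward standardized within the group, and it assumes a small, hypothetical Lorem Ipsum word pool. The function names, the prefix length `n_words`, and the newline separator between prefix and prompt are all illustrative choices that the abstract does not specify.

```python
import random
import statistics

# Hypothetical word pool; the abstract does not give the actual vocabulary.
LOREM_VOCAB = (
    "lorem ipsum dolor sit amet consectetur adipiscing elit sed do "
    "eiusmod tempor incididunt ut labore et dolore magna aliqua"
).split()


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one query's rollout group.

    Assumes the usual GRPO form (r - mean) / (std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def lorem_prefix(rng, n_words=12):
    """Assemble a random pseudo-Latin sequence from the word pool."""
    return " ".join(rng.choice(LOREM_VOCAB) for _ in range(n_words))


def perturb_prompt(prompt, rng, n_words=12):
    """Prepend a task-irrelevant Lorem Ipsum sequence to the prompt."""
    return f"{lorem_prefix(rng, n_words)}\n\n{prompt}"


def prompt_for_resampling(prompt, rewards, rng):
    """Perturb the prompt only when every rollout in the group failed.

    Assumes binary verifiable rewards where 0 means failure.
    """
    if any(r > 0 for r in rewards):
        return prompt
    return perturb_prompt(prompt, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # All rollouts fail: every advantage is zero, so there is no signal.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
    # One rollout succeeds: advantages become non-zero.
    print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
    print(prompt_for_resampling("What is 17 * 23?", [0.0] * 4, rng))
```

When every reward in the group is identical, the numerator is zero for each rollout, which is the collapse the abstract describes. The perturbed prompt leaves the original question intact and only changes the context the model conditions on before resampling.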
