Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following

Abstract

Advances in the reasoning abilities of LLMs have significantly improved their performance on mathematical problems, coding tasks, and general puzzles, but their adherence to instructions remains inconsistent, particularly for more complex directives. Our investigation identifies lazy reasoning during the thinking stage as the primary factor contributing to poor instruction adherence. To mitigate this issue, we propose a comprehensive framework designed to enable rigorous reasoning processes involving preview and self-checking, which are essential for satisfying strict instruction constraints. Specifically, we first generate instructions with complex constraints and apply a filtering process to obtain valid prompts, resulting in three distinct prompt datasets categorized as hard, easy, and pass. We then apply rejection sampling to the pass prompts to curate a small yet high-quality dataset, which enables a cold-start initialization of the model and facilitates its adaptation to effective reasoning patterns. Subsequently, we combine an entropy-preserving supervised fine-tuning (Entropy-SFT) strategy with token-wise entropy-adaptive reinforcement learning (TEA-RL) guided by rule-based dense rewards. This approach encourages the model to transform its reasoning mechanism, ultimately fostering generalizable reasoning abilities that encompass preview and self-checking. Extensive experiments on instruction-following benchmarks demonstrate remarkable performance improvements across various model scales. Notably, our Light-IF-32B model surpasses both larger open-source models such as DeepSeek-R1 and closed-source models such as Doubao-1.6.
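To make the idea of a rule-based dense reward more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not define the reward, so this example assumes it is the fraction of programmatically checkable constraints that a response satisfies. The helper names (`max_words`, `must_include`, `ends_with`, `dense_reward`) and the sample constraints are hypothetical and chosen only for illustration.

```python
from typing import Callable, List

# A constraint is modeled as a rule that inspects the response text.
# This representation is an assumption made for illustration only.
Constraint = Callable[[str], bool]


def max_words(limit: int) -> Constraint:
    """Hypothetical rule: the response has at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit


def must_include(keyword: str) -> Constraint:
    """Hypothetical rule: the response mentions `keyword` (case-insensitive)."""
    return lambda text: keyword.lower() in text.lower()


def ends_with(suffix: str) -> Constraint:
    """Hypothetical rule: the response ends with `suffix`, ignoring trailing whitespace."""
    return lambda text: text.rstrip().endswith(suffix)


def dense_reward(response: str, constraints: List[Constraint]) -> float:
    """Return the fraction of constraints satisfied, a value in [0, 1].

    Assumed reward shape: partial credit per satisfied constraint,
    rather than a single all-or-nothing pass/fail signal.
    """
    if not constraints:
        return 0.0
    satisfied = sum(1 for check in constraints if check(response))
    return satisfied / len(constraints)


constraints = [max_words(10), must_include("entropy"), ends_with("!")]
reward = dense_reward("Entropy stays high during fine-tuning.", constraints)
```

Under this assumption, a response that meets two of three constraints still receives partial credit, which gives a training signal on instructions where satisfying every constraint at once is rare.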
