Optimizing Length Compression in Large Reasoning Models

Abstract

Large Reasoning Models (LRMs) have achieved remarkable success, yet they often produce unnecessarily verbose reasoning chains. We identify a core aspect of this issue as "invalid thinking": models tend to repeatedly double-check their work after they have already derived the correct answer. To address this specific inefficiency, we move beyond the general principles of Efficacy and Efficiency and propose two new, fine-grained principles: Brevity, which advocates eliminating redundancy, and Sufficiency, which ensures that critical reasoning steps are preserved. Guided by these principles, we introduce LC-R1, a post-training method based on Group Relative Policy Optimization (GRPO). LC-R1 employs a novel combination of a Length Reward for overall conciseness and a Compress Reward that is specifically designed to remove the invalid portion of the thinking process. Extensive experiments on multiple reasoning benchmarks demonstrate that LC-R1 reduces sequence length significantly (~50%) with only a marginal (~2%) drop in accuracy, reaching a favorable trade-off point on the Pareto frontier that prioritizes high compression. Our analysis further validates the robustness of LC-R1 and provides valuable insights for developing more powerful yet computationally efficient LRMs. Our code is released at https://github.com/zxiangx/LC-R1.
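
To make the quantities named in the abstract more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes, for illustration only, that the valid part of a reasoning trace ends at the first token where the correct answer appears, so everything after that point counts as "invalid thinking". The functions `invalid_fraction`, `length_reward`, and `score_group`, the weights `alpha` and `beta`, and the way the rewards are combined are all hypothetical; the abstract does not specify the actual reward definitions. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the commonly described GRPO formulation.

```python
from statistics import mean, pstdev


def invalid_fraction(thinking_tokens, answer):
    """Fraction of the trace that comes after the answer first appears.

    Assumption (not from the paper): the first occurrence of `answer`
    marks the end of valid thinking. Returns 0.0 if it never appears.
    """
    if answer not in thinking_tokens:
        return 0.0
    valid_len = thinking_tokens.index(answer) + 1
    return (len(thinking_tokens) - valid_len) / len(thinking_tokens)


def length_reward(length, max_length):
    """Hypothetical length reward: 0.0 for the longest trace in the group,
    approaching 1.0 as the trace gets shorter."""
    if max_length == 0:
        return 0.0
    return 1.0 - length / max_length


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization of rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def score_group(traces, answer, alpha=0.5, beta=0.5):
    """Combine a correctness gate with illustrative length and compress terms.

    `alpha` and `beta` are hypothetical weights chosen for this sketch.
    """
    max_len = max(len(t) for t in traces)
    rewards = []
    for t in traces:
        correct = 1.0 if answer in t else 0.0
        r_len = length_reward(len(t), max_len)
        r_comp = 1.0 - invalid_fraction(t, answer)
        # Incorrect traces get zero reward regardless of how short they are.
        rewards.append(correct * (1.0 + alpha * r_len + beta * r_comp))
    return group_relative_advantages(rewards)


# Three sampled traces for the same prompt (toy data).
concise = ["step1", "step2", "42"]
verbose = ["step1", "step2", "42", "check", "recheck", "42"]
wrong = ["step1", "wrong"]
advantages = score_group([concise, verbose, wrong], "42")
```

In this toy group, the concise correct trace receives the highest advantage, the correct trace that keeps re-checking after reaching the answer receives a lower one, and the incorrect trace receives the lowest. That ordering is the behavior the Brevity and Sufficiency principles describe, under the assumptions stated above.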
