Rubric-based On-policy Distillation

Abstract

On-policy distillation (OPD) is a powerful paradigm for model alignment, yet its reliance on teacher logits restricts its application to white-box scenarios. We contend that structured semantic rubrics can serve as a scalable alternative to teacher logits, enabling OPD using only teacher-generated responses. To demonstrate this, we introduce ROPD, a simple yet foundational framework for rubric-based OPD. Specifically, ROPD induces prompt-specific rubrics from teacher-student contrasts and then uses these rubrics to score student rollouts for on-policy optimization. Empirically, ROPD outperforms advanced logit-based OPD methods in most scenarios and achieves up to a 10x gain in sample efficiency. These results position rubric-based OPD as a flexible, black-box-compatible alternative to the prevailing logit-based OPD, offering a simple yet strong baseline for scalable distillation across proprietary and open-source LLMs. Code is available at https://github.com/Peregrine123/ROPD_official.
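The scoring step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Criterion` class, the keyword-based `check` functions (standing in for an LLM judge), the weighted-fraction score, and the group-normalized advantage are all assumptions chosen for illustration, and the abstract does not specify which on-policy optimizer ROPD uses.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric item. `check` is a hypothetical stand-in for an LLM judge."""
    description: str
    weight: float
    check: Callable[[str], bool]


def rubric_score(response: str, rubric: List[Criterion]) -> float:
    """Weighted fraction of rubric criteria that the response satisfies."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(response))
    return earned / total if total > 0 else 0.0


def group_advantages(scores: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale scores within one prompt's group of rollouts.

    Assumption: a group-relative baseline; the source does not name the optimizer.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


# Toy rubric for the prompt "What is 3 * 4?". In the setting the abstract
# describes, the rubric would be induced from teacher-student contrasts.
rubric = [
    Criterion("states the final answer 12", 2.0, lambda r: "12" in r),
    Criterion("shows the intermediate product", 1.0, lambda r: "3 * 4" in r),
]

# Hypothetical student rollouts sampled for the same prompt.
rollouts = [
    "3 * 4 = 12",         # satisfies both criteria
    "The answer is 12.",  # satisfies only the first criterion
    "I am not sure.",     # satisfies neither criterion
]

scores = [rubric_score(r, rubric) for r in rollouts]
advantages = group_advantages(scores)
```

In this sketch only teacher-generated text would be needed to build the rubric, which is what makes the approach compatible with black-box teachers; the resulting `advantages` could then weight a policy-gradient update on the student.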
