Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

Abstract

Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which depend critically on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma. Discriminative RMs regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning, which makes them prone to shortcut learning and heavily reliant on massive data scaling for generalization. In contrast, generative RMs with Chain-of-Thought (CoT) reasoning exhibit superior interpretability and generalization potential, because they leverage fine-grained semantic supervision to internalize the rationales behind human preferences. They nevertheless suffer from inherent optimization bottlenecks, because reasoning and scoring are coupled within a single autoregressive inference chain. To harness the generalization benefits of CoT reasoning while mitigating the training instability of coupled reasoning and scoring, we introduce DeScore, a training-efficient and generalizable video reward model. DeScore employs a decoupled "think-then-score" paradigm: an MLLM first generates an explicit CoT, which is followed by a dedicated discriminative scoring module consisting of a learnable query token and a regression head that predicts the final reward. DeScore is optimized via a two-stage framework: (1) a discriminative cold start that incorporates a random mask mechanism to ensure robust scoring capabilities, and (2) a dual-objective reinforcement learning stage that independently refines CoT reasoning quality and calibrates the final reward, ensuring that higher-quality reasoning translates directly into superior model performance.
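As a rough illustration of the decoupled scoring module, the sketch below pools a sequence of hidden states with a single query vector and maps the pooled vector to a scalar reward through a linear head. It is a minimal sketch under stated assumptions, not the DeScore implementation: the attention pooling stands in for the MLLM forward pass over the learnable query token, and the names `attention_pool`, `random_keep_mask`, and `score`, as well as the per-position masking scheme, are hypothetical. The abstract does not specify how the random mask is applied.

```python
import math
import random


def attention_pool(query, hidden_states, keep_mask):
    """Pool hidden states with one query vector, skipping masked positions.

    Assumption: scaled dot-product attention stands in for the MLLM's
    processing of the learnable query token.
    """
    scale = math.sqrt(len(query))
    kept = [state for state, keep in zip(hidden_states, keep_mask) if keep]
    logits = [sum(q * h for q, h in zip(query, state)) / scale for state in kept]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [
        sum(w * state[d] for w, state in zip(weights, kept)) / total
        for d in range(len(query))
    ]


def random_keep_mask(length, mask_prob, rng):
    """Hypothetical random mask: drop each position with probability mask_prob."""
    mask = [rng.random() >= mask_prob for _ in range(length)]
    if not any(mask):
        # Always keep at least one position so pooling stays well defined.
        mask[rng.randrange(length)] = True
    return mask


def score(query, head_weights, head_bias, hidden_states, keep_mask=None):
    """Regression head: linear map from the pooled vector to a scalar reward."""
    if keep_mask is None:
        keep_mask = [True] * len(hidden_states)
    pooled = attention_pool(query, hidden_states, keep_mask)
    return sum(w * p for w, p in zip(head_weights, pooled)) + head_bias
```

In this sketch, a cold-start training step would call `random_keep_mask` on the CoT positions before `score`, so that the scorer does not depend on any single reasoning token. At inference, `keep_mask` is left as `None` and every position is used.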

