**Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring**
May 1, 2026
Authors: Indraneil Paul, Goran Glavaš, Iryna Gurevych
cs.AI
Abstract
Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.
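The abstract does not spell out the training objective, but reward models trained on preference pairs are typically fit with a Bradley-Terry pairwise loss, and flexible multi-criteria scoring can be realized by conditioning the model on the criterion being judged. The sketch below illustrates that recipe under stated assumptions: `TinyRewardModel`, the `encode` helper, and the `[criterion]` tag format are hypothetical stand-ins for illustration, not the paper's actual architecture or data format.

```python
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Toy scalar reward head over pooled token embeddings
    (a stand-in for a full LM backbone)."""
    def __init__(self, vocab_size: int = 256, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Mean-pool token embeddings, then project to a scalar reward.
        h = self.embed(token_ids).mean(dim=1)
        return self.head(h).squeeze(-1)

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected): pushes the chosen
    # completion's reward above the rejected one's.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

def encode(criterion: str, code: str, length: int = 32) -> torch.Tensor:
    # Hypothetical encoding: prepend the scoring criterion as a tag so a
    # single model can score the same code under different rubrics.
    ids = [ord(c) % 256 for c in f"[{criterion}] {code}"][:length]
    ids += [0] * (length - len(ids))
    return torch.tensor(ids)

model = TinyRewardModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

# One synthetic preference pair judged under an "efficiency" criterion.
chosen = encode("efficiency", "return sum(xs)")
rejected = encode("efficiency", "t = 0\nfor x in xs: t += x\nreturn t")
batch = torch.stack([chosen, rejected])

for _ in range(100):
    rewards = model(batch)
    loss = bradley_terry_loss(rewards[0:1], rewards[1:2])
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Conditioning on the criterion string, rather than training one head per rubric, is one plausible way a single suite of models could cover all five preference dimensions the benchmark evaluates.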