
RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation

January 13, 2026
Authors: Sunzhu Li, Jiale Zhao, Miteto Wei, Huimin Ren, Yang Zhou, Jingwen Yang, Shunyu Liu, Kaike Zhang, Wei Chen
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has driven substantial progress in reasoning-intensive domains like mathematics. However, optimizing open-ended generation remains challenging due to the lack of ground truth. While rubric-based evaluation offers a structured proxy for verification, existing methods suffer from scalability bottlenecks and coarse criteria, resulting in a supervision ceiling effect. To address this, we propose an automated Coarse-to-Fine Rubric Generation framework. By synergizing principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces comprehensive and highly discriminative criteria capable of capturing subtle nuances in generated responses. Based on this framework, we introduce RubricHub, a large-scale (~110k) and multi-domain dataset. We validate its utility through a two-stage post-training pipeline comprising Rubric-based Rejection Sampling Fine-Tuning (RuFT) and Reinforcement Learning (RuRL). Experimental results demonstrate that RubricHub unlocks significant performance gains: our post-trained Qwen3-14B achieves state-of-the-art (SOTA) results on HealthBench (69.3), surpassing proprietary frontier models such as GPT-5. The code and data will be released soon.
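To make the pipeline more concrete, below is a minimal sketch of how rubric-based rejection sampling (the RuFT stage) could work. This is not the paper's implementation: the helper names `generate` and `judge_criterion`, the weighted-fraction scoring scheme, and the acceptance threshold are all illustrative assumptions.

```python
# Illustrative sketch of rubric-based rejection sampling fine-tuning (RuFT).
# Assumptions (not from the paper): `generate` samples a response from the
# policy model, `judge_criterion` asks a judge model whether one criterion
# is satisfied, and a rubric score is the weighted fraction of satisfied
# criteria.
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str  # a fine-grained, discriminative check
    weight: float     # relative importance within the rubric


def generate(prompt: str) -> str:
    """Placeholder: sample one candidate response from the policy model."""
    raise NotImplementedError


def judge_criterion(response: str, criterion: str) -> bool:
    """Placeholder: ask a judge model whether `response` satisfies `criterion`."""
    raise NotImplementedError


def rubric_score(response: str, rubric: list[Criterion]) -> float:
    """Weighted fraction of rubric criteria the response satisfies, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in rubric
                 if judge_criterion(response, c.description))
    return earned / total


def ruft_select(prompt: str, rubric: list[Criterion],
                num_samples: int = 8,
                threshold: float = 0.8) -> list[tuple[str, str]]:
    """Keep only high-scoring (prompt, response) pairs as SFT training data."""
    candidates = [generate(prompt) for _ in range(num_samples)]
    scored = [(rubric_score(r, rubric), r) for r in candidates]
    return [(prompt, r) for s, r in scored if s >= threshold]
```

In the subsequent RuRL stage, the same `rubric_score` could plausibly serve as the scalar reward for policy optimization; this is also where fine-grained, discriminative criteria matter most, since coarse rubrics saturate early and stop separating good responses from better ones.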