RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation
January 13, 2026
Authors: Sunzhu Li, Jiale Zhao, Miteto Wei, Huimin Ren, Yang Zhou, Jingwen Yang, Shunyu Liu, Kaike Zhang, Wei Chen
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has driven substantial progress in reasoning-intensive domains like mathematics. However, optimizing open-ended generation remains challenging due to the lack of ground truth. While rubric-based evaluation offers a structured proxy for verification, existing methods suffer from scalability bottlenecks and coarse criteria, resulting in a supervision ceiling effect. To address this, we propose an automated Coarse-to-Fine Rubric Generation framework. By synergizing principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces comprehensive and highly discriminative criteria capable of capturing subtle nuances in generated content. Based on this framework, we introduce RubricHub, a large-scale (~110k), multi-domain dataset. We validate its utility through a two-stage post-training pipeline comprising Rubric-based Rejection Sampling Fine-Tuning (RuFT) and Reinforcement Learning (RuRL). Experimental results demonstrate that RubricHub unlocks significant performance gains: our post-trained Qwen3-14B achieves state-of-the-art (SOTA) results on HealthBench (69.3), surpassing proprietary frontier models such as GPT-5. The code and data will be released soon.
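The abstract describes the RuFT stage only at a high level (score candidate responses against a rubric, keep the best ones for fine-tuning). The sketch below is a minimal illustration of how such rubric-based rejection sampling might look; all names (`Criterion`, `RubricItem`, `rubric_score`, `ruft_select`), the weighted-scoring scheme, and the acceptance threshold are hypothetical assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Criterion:
    description: str   # a fine-grained, checkable requirement
    weight: float      # relative importance of this criterion

@dataclass
class RubricItem:
    prompt: str
    criteria: List[Criterion]  # coarse-to-fine rubric for one prompt

def rubric_score(response: str, item: RubricItem,
                 judge: Callable[[str, str], float]) -> float:
    """Weighted rubric score in [0, 1]; `judge` returns a
    per-criterion satisfaction score in [0, 1]."""
    total = sum(c.weight for c in item.criteria)
    return sum(c.weight * judge(response, c.description)
               for c in item.criteria) / total

def ruft_select(item: RubricItem,
                sample: Callable[[str], str],
                judge: Callable[[str, str], float],
                n_samples: int = 8,
                threshold: float = 0.8) -> Optional[str]:
    """Rejection sampling: draw n candidates, keep the best-scoring
    one only if it clears the rubric threshold, else drop the prompt."""
    candidates = [sample(item.prompt) for _ in range(n_samples)]
    scored = [(rubric_score(c, item, judge), c) for c in candidates]
    best_score, best = max(scored, key=lambda t: t[0])
    return best if best_score >= threshold else None
```

In this reading, `sample` wraps the policy model and `judge` wraps an LLM grader; accepted responses would form the SFT corpus for RuFT, while the same rubric scores could serve as reward signals in the subsequent RuRL stage.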