InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training
October 17, 2025
Authors: Pengkai Wang, Qi Zuo, Pengwei Liu, Zhijie Sang, Congkai Xie, Hongxia Yang
cs.AI
Abstract
Large Language Models (LLMs) have shown substantial advances through reinforcement learning (RL), particularly in domains where rewards can be programmatically verified, such as mathematics and code. In these areas, models benefit from well-defined training signals grounded in explicit, rule-based objectives. However, this progress reveals a significant limitation: in open-ended domains where rewards are ambiguous, subjective, or context-dependent, such as creative writing, scientific reasoning, and notably medical consultation, robust reward functions are lacking, making these areas challenging for current RL strategies. To bridge this gap, we introduce ORBIT, an open-ended rubric-based incremental training framework specifically designed for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with the dynamic creation of rubrics, employing these rubrics to direct an incremental RL process. In particular, this approach does not depend on external medical knowledge or manual rules, relying instead on rubric-guided feedback to shape learning. When applied to the Qwen3-4B-Instruct model, our method raises its score on the HealthBench-Hard benchmark from 7.0 to 27.2 using only 2k samples, achieving state-of-the-art results for models of this scale. Our analysis confirms that rubric-driven RL fosters consistent performance gains across diverse consultation scenarios, going beyond simple numerical improvements. These findings underscore rubric-based feedback as a scalable strategy for advancing LLMs in intricate, open-ended tasks.
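The abstract presents rubric-guided feedback as the reward signal for incremental RL. As a rough illustration of that idea only (not the authors' implementation, which is not detailed here), the sketch below shows one plausible way per-criterion rubric judgments could be collapsed into a scalar reward for a policy update; the `Criterion` dataclass, the example rubric, and the hard-coded judgments are all hypothetical.

```python
# Minimal sketch, assuming a rubric is a weighted list of criteria and an
# external judge decides which criteria a model response satisfies.
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str  # what the response should (or should not) do
    weight: float     # positive = desirable behavior, negative = penalized behavior


def rubric_reward(met: list[bool], rubric: list[Criterion]) -> float:
    """Map per-criterion judgments onto a scalar reward in [0, 1].

    `met[i]` is True when the judge deems criterion i satisfied. Normalizing
    by the total positive weight keeps rewards comparable across prompts
    whose rubrics contain different numbers of criteria.
    """
    max_score = sum(c.weight for c in rubric if c.weight > 0)
    score = sum(c.weight for c, ok in zip(rubric, met) if ok)
    return max(0.0, score / max_score) if max_score > 0 else 0.0


# Toy rubric for a single consultation prompt (illustrative content only).
rubric = [
    Criterion("Asks about symptom onset and duration", 2.0),
    Criterion("Advises emergency care when red-flag symptoms are present", 3.0),
    Criterion("Gives a definitive diagnosis without sufficient information", -3.0),
]

# Judgments would come from a grader model in practice; hard-coded here.
print(rubric_reward([True, True, False], rubric))  # 1.0
print(rubric_reward([True, False, True], rubric))  # 0.0 (negative score clipped)
```

The clipped, normalized score can then stand in for the programmatically verified rewards used in math or code RL, which is the substitution the abstract argues makes open-ended domains like medical consultation tractable for RL.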