

InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training

October 17, 2025
Authors: Pengkai Wang, Qi Zuo, Pengwei Liu, Zhijie Sang, Congkai Xie, Hongxia Yang
cs.AI

Abstract

Large Language Models (LLMs) have shown substantial advances through reinforcement learning (RL), particularly in domains where rewards can be programmatically verified, such as mathematics and code. In these areas, models benefit from a well-defined operational base guided by explicit rule-based objectives. However, this progress reveals a significant limitation: in open-ended domains where rewards are ambiguous, subjective, or context-dependent, such as creative writing, scientific reasoning, and notably medical consultation, robust reward functions are lacking, making these areas challenging for current RL strategies. To bridge this gap, we introduce ORBIT, an open-ended rubric-based incremental training framework specifically designed for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with the dynamic creation of rubrics, employing these rubrics to direct an incremental RL process. In particular, this approach does not depend on external medical knowledge or manual rules, instead utilizing rubric-guided feedback to shape learning. When implemented on the Qwen3-4B-Instruct model, our method greatly enhances its performance on the HealthBench-Hard benchmark, from 7.0 to 27.2, using only 2k samples, thus achieving state-of-the-art results for models of this scale. Our analysis confirms that rubric-driven RL fosters consistent performance gains across diverse consultation scenarios, going beyond simple numerical improvements. These findings underscore rubric-based feedback as a scalable strategy for advancing LLMs in intricate, open-ended tasks.
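
The abstract does not give implementation details, but the core idea of rubric-guided feedback can be illustrated with a minimal sketch: a rubric is a set of weighted criteria, each judged against a model response (for example by an LLM grader), and the verdicts are aggregated into a scalar reward suitable for an RL update. The names below (RubricItem, rubric_reward, judge) are hypothetical and not from the paper.

```python
# Minimal sketch of rubric-based reward scoring (hypothetical names; not the
# authors' released code). Assumes each rubric item carries a weight and that
# an external judge decides whether the response satisfies the criterion.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    criterion: str   # e.g., "Asks about symptom duration before giving advice"
    weight: float    # positive for desired behavior, negative for penalties


def rubric_reward(
    response: str,
    rubric: List[RubricItem],
    judge: Callable[[str, str], bool],
) -> float:
    """Aggregate judge verdicts over rubric items into a scalar reward.

    The weight-normalized sum keeps the reward in [-1, 1] regardless of
    rubric size, so it can plug into a standard policy-gradient-style update.
    """
    total_weight = sum(abs(item.weight) for item in rubric)
    if total_weight == 0:
        return 0.0
    earned = sum(
        item.weight for item in rubric if judge(response, item.criterion)
    )
    return earned / total_weight
```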