RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling
June 10, 2025
Authors: Yang Liu, Jiaqi Li, Zilong Zheng
cs.AI
Abstract
Rule-based reasoning has been acknowledged as one of the fundamental problems
in reasoning, while deviations in rule formats, types, and complexity in
real-world applications pose severe challenges. Recent studies have shown that
large reasoning models (LRMs) have remarkable reasoning capabilities, and their
performance is substantially enhanced by reinforcement learning (RL). However,
it remains an open question whether small reasoning models (SRMs) can learn
rule-based reasoning effectively with robust generalization across diverse
tasks and domains. To address this, we introduce Reinforced Rule-based
Reasoning, a.k.a. RuleReasoner, a simple yet effective method to conduct
rule-based reasoning via a wide collection of curated tasks and a novel
domain-aware dynamic sampling approach. Specifically, RuleReasoner resamples
each training batch by updating the sampling weights of different domains based
on historical rewards. This facilitates domain augmentation and flexible online
learning schedules for RL, obviating the need for pre-hoc human-engineered
mix-training recipes used in existing methods. Empirical evaluations on
in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that
RuleReasoner outperforms frontier LRMs by a significant margin (Δ4.1%
average points on eight ID tasks and Δ10.4% average points on three OOD
tasks over OpenAI-o1). Notably, our approach also exhibits higher computational
efficiency compared to prior dynamic sampling methods for RL.
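To make the sampling mechanism described above concrete, here is a minimal, hypothetical Python sketch of domain-aware dynamic sampling: per-domain sampling weights are recomputed from historical rewards so that lower-reward (harder) domains are drawn more often when forming each training batch. The class and method names (DomainAwareSampler, record_reward, sample_batch), the reward range assumption ([0, 1]), and the temperature parameter are illustrative assumptions, not the authors' implementation.

import random
from collections import deque

class DomainAwareSampler:
    """Illustrative sketch only: weight domains inversely to their recent
    average reward and resample each training batch accordingly."""

    def __init__(self, domains, history_len=50, temperature=1.0):
        self.domains = list(domains)
        # Keep a sliding window of recent rewards per domain (assumed in [0, 1]).
        self.history = {d: deque(maxlen=history_len) for d in self.domains}
        self.temperature = temperature

    def record_reward(self, domain, reward):
        # Log the reward observed for a rollout from this domain.
        self.history[domain].append(reward)

    def _weights(self):
        # Lower average reward -> larger sampling weight (focus on weak domains).
        avg = {d: (sum(h) / len(h)) if h else 0.0 for d, h in self.history.items()}
        raw = {d: (1.0 - avg[d]) ** (1.0 / self.temperature) for d in self.domains}
        total = sum(raw.values()) or 1.0
        return {d: w / total for d, w in raw.items()}

    def sample_batch(self, pool_by_domain, batch_size):
        # pool_by_domain: dict mapping each domain to its list of training examples.
        weights = self._weights()
        chosen = random.choices(
            self.domains, weights=[weights[d] for d in self.domains], k=batch_size
        )
        return [random.choice(pool_by_domain[d]) for d in chosen]

In use, the RL loop would call record_reward after scoring each rollout and sample_batch before every update step, replacing a fixed, hand-tuned mixture of domains with an online schedule driven by the model's own performance.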