

RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling

June 10, 2025
Authors: Yang Liu, Jiaqi Li, Zilong Zheng
cs.AI

Abstract

Rule-based reasoning has been acknowledged as one of the fundamental problems in reasoning, while deviations in rule formats, types, and complexity in real-world applications pose severe challenges. Recent studies have shown that large reasoning models (LRMs) have remarkable reasoning capabilities, and their performance is substantially enhanced by reinforcement learning (RL). However, it remains an open question whether small reasoning models (SRMs) can learn rule-based reasoning effectively with robust generalization across diverse tasks and domains. To address this, we introduce Reinforced Rule-based Reasoning, a.k.a. RuleReasoner, a simple yet effective method to conduct rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach. Specifically, RuleReasoner resamples each training batch by updating the sampling weights of different domains based on historical rewards. This facilitates domain augmentation and flexible online learning schedules for RL, obviating the need for pre-hoc human-engineered mix-training recipes used in existing methods. Empirical evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a significant margin (Δ4.1% average points on eight ID tasks and Δ10.4% average points on three OOD tasks over OpenAI-o1). Notably, our approach also exhibits higher computational efficiency compared to prior dynamic sampling methods for RL.
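To make the resampling step concrete, below is a minimal Python sketch of domain-aware dynamic sampling as the abstract describes it: per-domain rewards are tracked online, and each training batch is redrawn with weight shifted toward domains where recent rewards are low. The class name, the sliding reward window, and the softmax-style weighting are illustrative assumptions, not the paper's exact formulation.

```python
import math
import random
from collections import deque

class DomainAwareSampler:
    """Illustrative sketch (not the paper's exact method): resample each
    training batch by upweighting domains with low recent average reward."""

    def __init__(self, domains, window=64, temperature=1.0):
        self.domains = list(domains)
        # Sliding window of recent rewards per domain (window size assumed).
        self.history = {d: deque(maxlen=window) for d in self.domains}
        self.temperature = temperature

    def update(self, domain, reward):
        # Record the reward observed for a rollout from this domain.
        self.history[domain].append(reward)

    def weights(self):
        # Softmax over (1 - average reward): harder domains get more mass.
        avg = {d: (sum(h) / len(h)) if h else 0.0
               for d, h in self.history.items()}
        scores = {d: math.exp((1.0 - avg[d]) / self.temperature)
                  for d in self.domains}
        z = sum(scores.values())
        return {d: s / z for d, s in scores.items()}

    def sample_batch(self, pools, batch_size):
        # Redraw the next batch across domains under the current weights.
        w = self.weights()
        chosen = random.choices(self.domains,
                                weights=[w[d] for d in self.domains],
                                k=batch_size)
        return [random.choice(pools[d]) for d in chosen]
```

Because the weights are recomputed from rewards observed during training, the batch mixture adapts online, which is what lets this kind of scheme replace a pre-hoc human-engineered mix-training recipe.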