LSRIF: Logic-Structured Reinforcement Learning for Instruction Following

Abstract

Instruction following is critical for large language models, but real-world instructions often contain logical structures such as sequential dependencies and conditional branching. Existing methods typically construct datasets with parallel constraints and optimize average rewards, which ignores logical dependencies and yields noisy training signals. We propose LSRIF, a logic-structured training framework that explicitly models instruction logic. We first construct LSRInstruct, a dataset whose instructions carry parallel, sequential, and conditional constraint structures. We then design a structure-aware rewarding method, also called LSRIF, that applies average aggregation to parallel structures, failure-penalty propagation to sequential structures, and selective rewards to conditional branches. Experiments show that LSRIF brings significant improvements in instruction following, both in-domain and out-of-domain, and in general reasoning. Analysis reveals that learning with explicit logic structures induces parameter updates in attention layers and sharpens token-level attention to constraints and logical operators.
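
The three reward rules named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the node classes, the `decay` factor, and the `threshold` for what counts as a failure are all hypothetical choices made for illustration. The sketch assumes each leaf constraint already has a satisfaction score in [0, 1] from some verifier.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    score: float  # assumed verifier score in [0, 1] for one constraint


@dataclass
class Parallel:
    children: List["Node"]  # independent constraints


@dataclass
class Sequential:
    children: List["Node"]  # ordered steps; later steps depend on earlier ones


@dataclass
class Conditional:
    condition_holds: bool  # whether the "if" condition applies to this input
    then_branch: "Node"
    else_branch: "Node"


Node = Union[Leaf, Parallel, Sequential, Conditional]


def reward(node: Node, decay: float = 0.5, threshold: float = 1.0) -> float:
    """Aggregate leaf scores according to the logical structure (illustrative)."""
    if isinstance(node, Leaf):
        return node.score
    if isinstance(node, Parallel):
        if not node.children:
            raise ValueError("Parallel node needs at least one child")
        # Parallel: plain average over independent constraints.
        scores = [reward(c, decay, threshold) for c in node.children]
        return sum(scores) / len(scores)
    if isinstance(node, Sequential):
        if not node.children:
            raise ValueError("Sequential node needs at least one child")
        # Sequential: a failed step discounts the reward of every later step.
        # The multiplicative decay is an assumption, not the paper's formula.
        weight, total = 1.0, 0.0
        for child in node.children:
            r = reward(child, decay, threshold)
            total += weight * r
            if r < threshold:
                weight *= decay
        return total / len(node.children)
    if isinstance(node, Conditional):
        # Conditional: only the branch that actually applies is rewarded.
        active = node.then_branch if node.condition_holds else node.else_branch
        return reward(active, decay, threshold)
    raise TypeError(f"Unknown node type: {type(node).__name__}")


early_failure = Sequential([Leaf(0.0), Leaf(1.0)])
late_failure = Sequential([Leaf(1.0), Leaf(0.0)])
print(reward(early_failure), reward(late_failure))
```

Under these assumptions, a failure at the first step of a sequence lowers the total more than a failure at the last step, whereas a plain average would score both cases identically. That difference is the kind of dependency information the abstract says average-reward training ignores.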
