

Beyond Token-level Supervision: Unlocking the Potential of Decoding-based Regression via Reinforcement Learning

December 6, 2025
作者: Ming Chen, Sheng Tang, Rong-Xi Tan, Ziniu Li, Jiacheng Chen, Ke Xue, Chao Qian
cs.AI

Abstract

Decoding-based regression, which reformulates regression as a sequence generation task, has emerged as a promising paradigm for applying large language models to numerical prediction. However, its progress is hindered by the misalignment between discrete token-level objectives (e.g., cross-entropy) and continuous numerical values. Existing approaches that rely on token-level constraints often fail to capture the global magnitude of the target value, limiting their precision and generalization. In this paper, we propose to unlock the potential of decoding-based regression via Reinforcement Learning (RL). We formulate the generation process as a Markov Decision Process and use sequence-level rewards to enforce global numerical coherence. Extensive experiments on tabular regression and code metric regression demonstrate that our method (in particular with the ReMax and GRPO optimizers) consistently outperforms both state-of-the-art token-level baselines and traditional regression heads, highlighting the benefit of introducing sequence-level signals. Our analysis further reveals that RL significantly enhances sampling efficiency and predictive precision, establishing decoding-based regression as a robust and accurate paradigm for general-purpose numerical prediction.
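
To make the idea concrete, below is a minimal, illustrative Python sketch (not the authors' implementation) of the two ingredients the abstract describes: a sequence-level reward that scores the whole decoded number against the ground-truth target, and a GRPO-style group-relative advantage computed over several sampled sequences for the same input. The digit-token decoding scheme, the penalty for malformed sequences, and the `scale` parameter are assumptions made purely for illustration.

```python
import math

def decode_tokens_to_value(tokens):
    """Parse a generated token sequence, e.g. ['-', '3', '.', '1', '4'],
    back into a float. Returns None for malformed numbers."""
    try:
        return float("".join(tokens))
    except ValueError:
        return None

def sequence_level_reward(tokens, target, scale=1.0):
    """Sequence-level reward: score the entire decoded value against the
    target instead of supervising each digit token with cross-entropy.
    The -1.0 penalty for invalid sequences and the tanh squashing are
    illustrative choices, not taken from the paper."""
    value = decode_tokens_to_value(tokens)
    if value is None:
        return -1.0
    return -math.tanh(abs(value - target) / scale)

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: normalize each sampled
    sequence's reward by the mean and standard deviation of its group."""
    mean = sum(rewards) / len(rewards)
    std = max((sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5, 1e-8)
    return [(r - mean) / std for r in rewards]

# Example: four sampled decodings for a target value of 3.14.
samples = [["3", ".", "1"], ["3", ".", "1", "4"], ["2", ".", "9"], ["x"]]
rewards = [sequence_level_reward(s, target=3.14) for s in samples]
advantages = grpo_advantages(rewards)
```

In this sketch the reward depends only on the final decoded value, so sequences that get the global magnitude right are preferred as a whole, which is the kind of sequence-level signal a token-level cross-entropy objective cannot provide.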