

Beyond Token-level Supervision: Unlocking the Potential of Decoding-based Regression via Reinforcement Learning

December 6, 2025
作者: Ming Chen, Sheng Tang, Rong-Xi Tan, Ziniu Li, Jiacheng Chen, Ke Xue, Chao Qian
cs.AI

Abstract

Decoding-based regression, which reformulates regression as a sequence generation task, has emerged as a promising paradigm for applying large language models to numerical prediction. However, its progress is hindered by the misalignment between discrete token-level objectives (e.g., cross-entropy) and continuous numerical values. Existing approaches that rely on token-level constraints often fail to capture the global magnitude of the target value, limiting their precision and generalization. In this paper, we propose to unlock the potential of decoding-based regression via Reinforcement Learning (RL). We formulate the generation process as a Markov Decision Process and use sequence-level rewards to enforce global numerical coherence. Extensive experiments on tabular regression and code metric regression demonstrate that our method (in particular with the ReMax and GRPO optimizers) consistently outperforms both state-of-the-art token-level baselines and traditional regression heads, highlighting the benefit of introducing sequence-level signals. Our analysis further reveals that RL significantly enhances sampling efficiency and predictive precision, establishing decoding-based regression as a robust and accurate paradigm for general-purpose numerical prediction.
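
To make the idea concrete, below is a minimal, illustrative Python sketch (not the authors' implementation) of the two ingredients the abstract describes: a sequence-level reward that scores the whole decoded number against the ground-truth target, and a GRPO-style group-relative advantage computed over several sampled sequences for the same input. The digit-token decoding scheme, the penalty for malformed sequences, and the `scale` parameter are assumptions made purely for illustration.

```python
import math

def decode_tokens_to_value(tokens):
    """Parse a generated token sequence, e.g. ['-', '3', '.', '1', '4'],
    back into a float. Returns None for malformed numbers."""
    try:
        return float("".join(tokens))
    except ValueError:
        return None

def sequence_level_reward(tokens, target, scale=1.0):
    """Sequence-level reward: score the entire decoded value against the
    target instead of supervising each digit token with cross-entropy.
    The -1.0 penalty for invalid sequences and the tanh squashing are
    illustrative choices, not taken from the paper."""
    value = decode_tokens_to_value(tokens)
    if value is None:
        return -1.0
    return -math.tanh(abs(value - target) / scale)

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: normalize each sampled
    sequence's reward by the mean and standard deviation of its group."""
    mean = sum(rewards) / len(rewards)
    std = max((sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5, 1e-8)
    return [(r - mean) / std for r in rewards]

# Example: four sampled decodings for a target value of 3.14.
samples = [["3", ".", "1"], ["3", ".", "1", "4"], ["2", ".", "9"], ["x"]]
rewards = [sequence_level_reward(s, target=3.14) for s in samples]
advantages = grpo_advantages(rewards)
```

In this sketch the reward depends only on the final decoded value, so sequences that get the global magnitude right are preferred as a whole, which is the kind of sequence-level signal a token-level cross-entropy objective cannot provide.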