Beyond Token-level Supervision: Unlocking the Potential of Decoding-based Regression via Reinforcement Learning
December 6, 2025
Authors: Ming Chen, Sheng Tang, Rong-Xi Tan, Ziniu Li, Jiacheng Chen, Ke Xue, Chao Qian
cs.AI
Abstract
Decoding-based regression, which reformulates regression as a sequence generation task, has emerged as a promising paradigm for applying large language models to numerical prediction. However, its progress is hindered by the misalignment between discrete token-level objectives (e.g., cross-entropy) and continuous numerical values. Existing approaches that rely on token-level constraints often fail to capture the global magnitude of the target value, limiting their precision and generalization. In this paper, we propose to unlock the potential of decoding-based regression via Reinforcement Learning (RL). We formulate the generation process as a Markov Decision Process and use sequence-level rewards to enforce global numerical coherence. Extensive experiments on tabular regression and code metric regression demonstrate that our method (particularly with ReMax and GRPO) consistently outperforms both state-of-the-art token-level baselines and traditional regression heads, showing the advantage of introducing sequence-level signals. Our analysis further reveals that RL significantly enhances sampling efficiency and predictive precision, establishing decoding-based regression as a robust and accurate paradigm for general-purpose numerical prediction.
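The abstract describes scoring an entire decoded number with one sequence-level reward rather than supervising individual tokens. Below is a minimal, illustrative sketch of that idea, assuming a digit-level tokenization, a negative-absolute-error reward, and GRPO-style group-relative advantages; the paper's actual reward shaping, tokenizer, and policy-gradient details may differ, and the function names here (`sequence_reward`, `grpo_style_advantages`) are hypothetical.

```python
import math


def decode_number(tokens):
    """Decode a digit-token sequence such as ['3', '.', '1', '4'] into a float.

    Returns None if the sequence is not a valid number, so the caller can
    assign a penalty reward instead of crashing.
    """
    try:
        return float("".join(tokens))
    except ValueError:
        return None


def sequence_reward(tokens, target, invalid_penalty=-10.0):
    """Sequence-level reward: negative absolute error between the decoded
    prediction and the ground-truth target (a stand-in for the paper's
    reward; the exact formulation may differ)."""
    pred = decode_number(tokens)
    if pred is None:
        return invalid_penalty
    return -abs(pred - target)


def grpo_style_advantages(rewards):
    """GRPO-style group-relative advantages: standardize each sampled
    sequence's reward against the group's mean and standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    target = 3.14
    # Pretend the policy sampled a group of candidate digit sequences
    # for the same regression input.
    group = [list("3.10"), list("3.14"), list("2.99"), list("3.2")]
    rewards = [sequence_reward(seq, target) for seq in group]
    advantages = grpo_style_advantages(rewards)
    for seq, r, a in zip(group, rewards, advantages):
        print("".join(seq), f"reward={r:.3f}", f"advantage={a:+.3f}")
```

The group standardization step is what characterizes a GRPO-style update; a ReMax-style variant would instead subtract the reward of a single greedy rollout as the baseline. Either way, the signal reflects the decoded number as a whole, rather than per-token cross-entropy.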