Dynamic Latent Routing

Abstract

We investigate the temporal concatenation of sub-policies in Markov decision processes (MDPs) with time-varying reward functions. We introduce General Dijkstra Search (GDS) and prove that globally optimal goal-reaching policies can be recovered through the temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning (SFT) across four datasets and six models, achieving a mean gain of +6.6 percentage points, whereas prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.
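
For orientation, the "search, select, update" pattern can be illustrated with the classical Dijkstra shortest-path algorithm, from which GDS takes its name. The sketch below is not GDS itself and not the DLR training procedure: it assumes a static graph with non-negative edge costs, whereas the setting above involves time-varying rewards. The names `dijkstra`, `path_to`, and `toy_graph` are hypothetical and chosen only for this illustration.

```python
import heapq


def dijkstra(graph, source):
    """Classical Dijkstra on a dict-of-dicts graph with non-negative edge costs."""
    dist = {source: 0.0}      # best known cost to each reached node
    parent = {source: None}   # predecessor on the best known path
    frontier = [(0.0, source)]
    settled = set()
    while frontier:
        # Select: pop the cheapest node that has not been settled yet.
        cost, node = heapq.heappop(frontier)
        if node in settled:
            continue
        settled.add(node)
        # Search: examine every outgoing edge of the selected node.
        for neighbor, weight in graph.get(node, {}).items():
            candidate = cost + weight
            # Update: keep the extension only if it improves the best known cost.
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                parent[neighbor] = node
                heapq.heappush(frontier, (candidate, neighbor))
    return dist, parent


def path_to(parent, goal):
    """Rebuild the source-to-goal path by following predecessor links."""
    path = []
    while goal is not None:
        path.append(goal)
        goal = parent[goal]
    return path[::-1]


# Hypothetical toy graph: the cheapest route from "s" to "g" chains three segments.
toy_graph = {
    "s": {"a": 1.0, "b": 4.0},
    "a": {"b": 2.0, "g": 6.0},
    "b": {"g": 1.0},
    "g": {},
}
dist, parent = dijkstra(toy_graph, "s")
```

In this toy graph, the cheapest route to the goal is assembled from segments that are each optimal for their own intermediate node, which is the intuition behind composing intermediate optimal sub-policies in time.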