Dynamic Latent Routing

Abstract

We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDPs) with time-varying reward functions. We introduce General Dijkstra Search (GDS) and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning (SFT) across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.