From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

Abstract

Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome-level rewards, yet determining which actions within a long trajectory caused the outcome remains difficult. This credit assignment (CA) problem manifests in two regimes: reasoning RL, where credit must be distributed across tokens and steps within a single chain-of-thought generation (500 to 30K+ tokens); and agentic RL, where multi-turn environment interaction introduces stochastic transitions, partial observability, and horizons of 100+ turns (100K to 1M tokens), making episode-level credit increasingly uninformative. We survey 47 CA methods (41 core, 6 adjacent enablers) published between 2024 and early 2026, organizing them in a two-dimensional taxonomy by assignment granularity (token, segment, step, turn, multi-agent) and methodology (Monte Carlo, temporal difference, model-based, game-theoretic, information-theoretic). Beyond the survey itself, we contribute three reusable resources: (1) a structured, machine-readable paper inventory with taxonomy labels, baseline families, and evidence levels; (2) a reporting checklist for future CA papers, validated against the reviewed literature to identify systematic methodological gaps; and (3) a benchmark protocol specification with task families, metadata requirements, and controlled bifurcation tasks, accompanied by a method selection decision tree. Our synthesis suggests that the shift from reasoning to agentic RL complicates and reshapes the credit assignment landscape: reasoning CA is maturing around process reward models and critic-free group comparison, while agentic CA is driving genuinely new approaches (hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations) that have no direct precedent in reasoning RL.
