When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning

Abstract

In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a silence tax: additional deliberation postpones the first task-relevant content, while naive early streaming risks premature commitments that bias subsequent generation. We introduce Side-by-Side (SxS) Interleaved Reasoning, which makes disclosure timing a controllable decision within standard autoregressive generation. SxS interleaves partial disclosures with continued private reasoning in the same context, but releases content only when it is supported by the reasoning so far. To learn such pacing without incentivizing filler, we construct entailment-aligned interleaved trajectories by matching answer prefixes to the reasoning prefixes that support them. We then train with supervised fine-tuning (SFT) to acquire the dual-action semantics and with reinforcement learning (RL) to recover reasoning performance under the new format. Across two Qwen3 architectures and scales (mixture-of-experts Qwen3-30B-A3B and dense Qwen3-4B) and on both in-domain (AIME25) and out-of-domain (GPQA-Diamond) benchmarks, SxS improves the Pareto trade-off between accuracy and content latency under token-level proxies such as inter-update waiting.
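As a rough illustration of what a token-level latency proxy could look like, the sketch below counts tokens before the first disclosed token and the longest private run between two disclosed tokens. It is not the paper's metric definition: the per-token labels `think` and `speak` and the function `latency_proxies` are hypothetical names introduced here, and the abstract does not specify how tokens are tagged or how inter-update waiting is computed.

```python
from typing import List, Tuple

# Hypothetical per-token labels (assumption; not taken from the paper).
PRIVATE, PUBLIC = "think", "speak"


def latency_proxies(stream: List[str]) -> Tuple[int, int]:
    """Return (tokens before the first disclosed token,
    longest run of private tokens between two disclosed tokens)."""
    first_public = None
    longest_gap = 0
    current_gap = 0
    for i, label in enumerate(stream):
        if label == PUBLIC:
            if first_public is None:
                # First disclosure: record how long the reader waited.
                first_public = i
            else:
                # Later disclosure: close the current waiting interval.
                longest_gap = max(longest_gap, current_gap)
            current_gap = 0
        else:
            current_gap += 1
    if first_public is None:
        # Nothing was ever disclosed; the whole stream counts as waiting.
        first_public = len(stream)
    return first_public, longest_gap


# Think-first baseline: all reasoning precedes any disclosed content.
think_first = [PRIVATE] * 6 + [PUBLIC] * 3
# Interleaved stream: disclosures are spread between private reasoning.
interleaved = [PRIVATE, PRIVATE, PUBLIC, PRIVATE, PRIVATE, PRIVATE,
               PUBLIC, PRIVATE, PUBLIC]

print(latency_proxies(think_first))
print(latency_proxies(interleaved))
```

Under these assumed labels, the interleaved stream discloses its first content earlier than the think-first stream, at the cost of waiting intervals between later updates. Private tokens after the last disclosure are deliberately not counted as a waiting interval in this sketch.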
