Generalized Parallel Scaling with Interdependent Generations

Abstract

Parallel LLM inference scaling involves sampling a set of N>1 responses for a single input prompt. However, these N parallel responses tend to be generated independently of one another, partitioning compute resources and leaving potentially useful information in one generation untapped by the others. This is in contrast to response length scaling, where past computation is used in all future steps. For higher quality responses and response sets, we propose Bridge, which generates interdependent responses in parallel by treating batched LLM hidden states as holistic tensors rather than independent slices. With only a small number (2.8%-5.1%) of new parameters, Bridge improves the relative mean accuracy gains from reinforcement learning with verifiable rewards by up to 50% and boosts the consistency of correct responses. Trained once, Bridge scales to any generation width, outperforming independent generation at every width, and unlocks a more general mode of parallel scaling that effectively leverages information between sequences and is compatible with any post-generation aggregation technique.
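
The distinction between "independent slices" and a "holistic tensor" can be illustrated with a toy sketch. The abstract does not specify Bridge's architecture, so nothing below should be read as the authors' method: the `hidden[n][t][d]` layout, the function names, the `gate` parameter, and the choice of mean-mixing across the N axis are all assumptions made for illustration only. The sketch shows just one property: under per-slice processing, changing one response leaves the others untouched, whereas any operation that reads across the N axis makes the responses interdependent.

```python
from statistics import fmean
from typing import List

# Assumed layout for this sketch: hidden[n][t][d]
#   n = index of the parallel response (N responses for one prompt)
#   t = token position, d = feature dimension
Hidden = List[List[List[float]]]


def independent_step(hidden: Hidden, scale: float = 2.0) -> Hidden:
    """Per-slice processing: response n is computed from its own values only."""
    return [[[scale * x for x in pos] for pos in seq] for seq in hidden]


def cross_sequence_mix(hidden: Hidden, gate: float = 0.5) -> Hidden:
    """Toy holistic step: add the gated mean over the N axis at each (t, d).

    Mean-mixing is a hypothetical placeholder, not the paper's Bridge block.
    A mean is used because it is defined for any number of responses N.
    """
    n_seq, n_pos, n_dim = len(hidden), len(hidden[0]), len(hidden[0][0])
    mean = [
        [fmean(hidden[n][t][d] for n in range(n_seq)) for d in range(n_dim)]
        for t in range(n_pos)
    ]
    return [
        [[hidden[n][t][d] + gate * mean[t][d] for d in range(n_dim)]
         for t in range(n_pos)]
        for n in range(n_seq)
    ]


# N=3 responses, T=1 position, D=2 features.
base = [[[1.0, 0.0]], [[0.0, 1.0]], [[2.0, 2.0]]]
# Same batch, but only response 0 is changed.
perturbed = [[[9.0, 0.0]], base[1], base[2]]

# Independent slices: response 1 does not see the change to response 0.
ind_same = independent_step(base)[1] == independent_step(perturbed)[1]
# Holistic tensor: response 1 now depends on response 0.
mix_same = cross_sequence_mix(base)[1] == cross_sequence_mix(perturbed)[1]

print("independent, response 1 unchanged:", ind_same)
print("mixed, response 1 unchanged:", mix_same)
```

In this toy setting the mixing step has no dependence on a fixed N, which loosely mirrors the claim that a model trained once can be run at any generation width. How Bridge actually achieves this is not stated in the abstract.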