Generalized Parallel Scaling with Interdependent Generations

Abstract

Parallel LLM inference scaling involves sampling a set of N>1 responses for a single input prompt. However, these N parallel responses tend to be generated independently from each other, partitioning compute resources and leaving potentially useful information in one generation untapped by others. This is in contrast to response length scaling where past computation is used in all future steps. For higher quality responses and response sets, we propose Bridge to generate interdependent responses in parallel by rethinking batched LLM hidden states as holistic tensors rather than independent slices. With only a small amount (2.8%-5.1%) of new parameters, Bridge improves the relative mean accuracy gains from reinforcement learning with verifiable rewards by up to 50% and boosts consistency of correct responses. Trained once, Bridge scales to any generation width, all with greater performance than independent generations, unlocking a more general mode of parallel scaling that effectively leverages information between sequences, compatible with any post-generation aggregation technique.
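To make the contrast between "independent slices" and a "holistic tensor" concrete, the following is a minimal sketch under stated assumptions, not the Bridge architecture, which the abstract does not specify. It assumes the hidden states of N parallel responses form a nested list of shape [N][T][D] with equal sequence lengths, and it uses a parameter-free softmax attention across the N responses at each token position as an illustrative stand-in for cross-sequence information sharing. The function name `mix_across_responses` and the `scale` parameter are hypothetical.

```python
import math


def mix_across_responses(hidden, scale=1.0):
    """Illustrative cross-response mixing (hypothetical, not the paper's method).

    hidden: N responses, each a list of T positions, each a list of D floats.
    At every position t, each response's vector attends over the vectors that
    all N responses hold at that same position, and the attended average is
    added back as a residual. Returns a new nested list of the same shape.
    """
    n = len(hidden)
    t_len = len(hidden[0])
    d = len(hidden[0][0])
    out = [[None] * t_len for _ in range(n)]
    for t in range(t_len):
        # The "holistic" view: one column across the batch, not N separate slices.
        column = [hidden[i][t] for i in range(n)]
        for i in range(n):
            # Scaled dot-product scores between response i and every response j.
            scores = [
                sum(a * b for a, b in zip(column[i], column[j])) / math.sqrt(d)
                for j in range(n)
            ]
            # Numerically stable softmax over the N responses.
            m = max(scores)
            weights = [math.exp(s - m) for s in scores]
            z = sum(weights)
            mixed = [
                sum(weights[j] / z * column[j][k] for j in range(n))
                for k in range(d)
            ]
            # Residual update; scale=0.0 recovers fully independent generation.
            out[i][t] = [column[i][k] + scale * mixed[k] for k in range(d)]
    return out


# Two responses, one position, two hidden dimensions.
states = [[[1.0, 0.0]], [[0.0, 1.0]]]
shared = mix_across_responses(states)
```

Because the mixing operates over the response axis without any width-specific parameters, the same function applies to any N, which is in the spirit of the abstract's claim that a model trained once can be run at any generation width.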

