Generalized Parallel Scaling with Interdependent Generations
October 1, 2025
Authors: Harry Dong, David Brandfonbrener, Eryk Helenowski, Yun He, Mrinal Kumar, Han Fang, Yuejie Chi, Karthik Abinav Sankararaman
cs.AI
Abstract
Parallel LLM inference scaling involves sampling a set of N>1 responses for a single input prompt. However, these N parallel responses tend to be generated independently of each other, partitioning compute resources and leaving potentially useful information in one generation untapped by the others. This stands in contrast to response length scaling, where past computation is used in all future steps. To obtain higher-quality responses and response sets, we propose Bridge, which generates interdependent responses in parallel by rethinking batched LLM hidden states as holistic tensors rather than independent slices. With only a small amount (2.8%-5.1%) of new parameters, Bridge improves the relative mean accuracy gains from reinforcement learning with verifiable rewards by up to 50% and boosts the consistency of correct responses. Trained once, Bridge scales to any generation width, always outperforming independent generation, and unlocks a more general mode of parallel scaling that effectively leverages information between sequences and is compatible with any post-generation aggregation technique.
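
To make the core idea concrete, the sketch below illustrates what it could mean to treat the hidden states of N parallel responses as one holistic (N, seq, d) tensor rather than N independent (seq, d) slices: a small learned layer exchanges information along the N axis inside the residual stream. This is a minimal, hypothetical sketch, not the authors' Bridge implementation; the module name `CrossResponseMixer`, the bottleneck design, and mean-pooling across responses are all illustrative assumptions. The only constraints taken from the abstract are that the added parameter count is small relative to the base model and that the layer works for any generation width N.

```python
import torch
import torch.nn as nn

class CrossResponseMixer(nn.Module):
    """Hypothetical sketch of cross-response mixing (not the paper's code).

    Standard batched decoding treats hidden states as N independent
    (seq, d) slices. Here the same activations are viewed as a single
    (N, seq, d) tensor, and a small bottleneck layer shares information
    across the N parallel responses at each position.
    """
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        # Bottleneck keeps the added parameter count small relative to
        # the base model (the abstract reports 2.8%-5.1% new parameters).
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (N, seq, d) -- hidden states of N parallel responses to one prompt.
        z = self.down(self.norm(h))              # (N, seq, bottleneck)
        # Mean-pool over the response axis: the summary is shared by all
        # N sequences, so the layer is agnostic to the generation width.
        pooled = z.mean(dim=0, keepdim=True)     # (1, seq, bottleneck)
        return h + self.up(pooled.expand_as(z))  # residual cross-response update

# Usage: a layer like this could be interleaved with decoder blocks
# so that each response's next-token computation sees the others.
N, seq, d = 8, 16, 512
mixer = CrossResponseMixer(d)
out = mixer(torch.randn(N, seq, d))
assert out.shape == (N, seq, d)
```

Note the design consequence of pooling over the response axis: because the aggregation is a mean, nothing in the layer depends on N, which is consistent with the abstract's claim that the method is trained once yet scales to any generation width.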