

Generalized Parallel Scaling with Interdependent Generations

October 1, 2025
Authors: Harry Dong, David Brandfonbrener, Eryk Helenowski, Yun He, Mrinal Kumar, Han Fang, Yuejie Chi, Karthik Abinav Sankararaman
cs.AI

Abstract

Parallel LLM inference scaling involves sampling a set of N>1 responses for a single input prompt. However, these N parallel responses tend to be generated independently from each other, partitioning compute resources and leaving potentially useful information in one generation untapped by others. This is in contrast to response length scaling where past computation is used in all future steps. For higher quality responses and response sets, we propose Bridge to generate interdependent responses in parallel by rethinking batched LLM hidden states as holistic tensors rather than independent slices. With only a small number (2.8%-5.1%) of new parameters, Bridge improves the relative mean accuracy gains from reinforcement learning with verifiable rewards by up to 50% and boosts consistency of correct responses. Trained once, Bridge scales to any generation width, all with greater performance than independent generations, unlocking a more general mode of parallel scaling that effectively leverages information between sequences, compatible with any post-generation aggregation technique.
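
To make the core idea concrete, here is a minimal PyTorch sketch, not the authors' released implementation, of how a small learned module could treat the hidden states of the N parallel responses as one holistic tensor and exchange information across them. The class name CrossResponseMixer, the mean-pooled bottleneck mixer, and all shapes are illustrative assumptions; Bridge's actual architecture may differ.

```python
# Minimal sketch (assumptions, not the paper's code) of cross-response mixing:
# a small add-on module lets each of the N parallel generations for one prompt
# condition on a shared summary of the others' hidden states.
import torch
import torch.nn as nn


class CrossResponseMixer(nn.Module):
    """Hypothetical add-on layer that mixes hidden states across the N parallel
    responses to a single prompt, so generations become interdependent."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        # A small bottleneck keeps the added parameter count low (the abstract
        # reports only 2.8%-5.1% new parameters; the size here is a guess).
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.gate = nn.Parameter(torch.zeros(1))  # zero init: starts as identity

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (N, T, d_model) — the N responses viewed as one holistic tensor.
        pooled = hidden.mean(dim=0, keepdim=True)        # (1, T, d_model) shared summary
        shared = self.up(torch.relu(self.down(pooled)))  # small projection of the summary
        return hidden + torch.tanh(self.gate) * shared   # each response sees the others


if __name__ == "__main__":
    N, T, d = 4, 16, 512          # 4 parallel responses, 16 tokens, hidden size 512
    h = torch.randn(N, T, d)
    out = CrossResponseMixer(d)(h)
    print(out.shape)              # torch.Size([4, 16, 512])
```

The zero-initialized gate makes the module an identity map at the start of training, so cross-response dependence can be learned gradually, and the bottleneck keeps the parameter overhead small, in the spirit of the 2.8%-5.1% figure quoted above.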