HARP: Hesitation-Aware Reframing in Transformer Inference Pass

Abstract

This paper aims to improve the performance of large language models by addressing the variable computational demands of inference steps, in which some tokens require more computation than others. We present HARP, a simple modification to the forward pass of an off-the-shelf Transformer. Drawing on hesitation and the framing effect in decision-making, HARP selectively applies additional computation when the model encounters uncertainty during token generation. Our method mimics human cognitive processes by pausing at difficult decision points and reframing the input to view it from a different perspective. Unlike other approaches, HARP is model-agnostic, training-free, and easy to implement. We evaluate our method across a range of downstream tasks and model sizes and observe performance improvements of up to 5.16%. Notably, HARP achieves these gains while keeping inference twice as fast as beam search. Simple yet effective, HARP offers a practical way to enhance the performance of Transformer-based language models with minimal computational overhead.
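
The selective extra computation described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how uncertainty is measured, how the input is reframed, or how the two passes are combined, so everything below is an assumption made for illustration: Shannon entropy of the next-token distribution as the uncertainty signal, a dropout-style perturbation of the input embeddings as the reframing, and a plain average of the two sets of logits. The names `hesitation_aware_step`, `reframe`, `threshold`, and `drop_prob` are hypothetical, and `forward` stands in for any model's forward pass.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; used here as an assumed uncertainty signal."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reframe(embeddings, drop_prob, rng):
    """Assumed reframing: zero out a random subset of embedding values."""
    return [[0.0 if rng.random() < drop_prob else v for v in vec]
            for vec in embeddings]


def hesitation_aware_step(forward, embeddings, threshold=1.0,
                          drop_prob=0.2, rng=None):
    """Run one decoding step, adding a second pass only when uncertain.

    `forward` maps a list of embedding vectors to next-token logits.
    Returns (logits, hesitated), where `hesitated` reports whether the
    extra pass was triggered.
    """
    if rng is None:
        rng = random.Random(0)
    logits = forward(embeddings)
    if entropy(softmax(logits)) <= threshold:
        # Confident prediction: keep the single standard forward pass.
        return logits, False
    # Uncertain prediction: run a second pass on the reframed input and
    # average the two sets of logits (the combination rule is an assumption).
    reframed_logits = forward(reframe(embeddings, drop_prob, rng))
    combined = [(a + b) / 2.0 for a, b in zip(logits, reframed_logits)]
    return combined, True
```

Under these assumptions, a step whose distribution is sharply peaked costs one forward pass, while a step with a near-uniform distribution costs two. That is the sense in which the extra computation is applied selectively rather than at every token.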
