SIFT: Grounding LLM Reasoning in Contexts via Stickers

Abstract

This paper identifies that misinterpretation of context can be a significant issue during the reasoning process of large language models, from smaller models like Llama3.2-3B-Instruct to cutting-edge ones like DeepSeek-R1. For example, in the phrase "10 dollars per kilo," an LLM might not recognize that "per" means "for each," leading to calculation errors. To tackle this, we introduce a novel post-training approach called **Stick to the Facts (SIFT)**. SIFT leverages increased inference-time compute to ground LLM reasoning in its context. At the core of SIFT lies the *Sticker*, which the model itself generates to explicitly emphasize the key information within the context. Given the curated Sticker, SIFT generates two predictions: one from the original query and one from the query augmented with the Sticker. If they differ, the Sticker is sequentially refined via *forward* optimization (to better align the extracted facts with the query) and *inverse* generation (to conform with the model's inherent tendencies), yielding more faithful reasoning outcomes. Studies across diverse models (from 3B to 100B+ parameters) and benchmarks (e.g., GSM8K, MATH-500) reveal consistent performance improvements. Notably, SIFT improves the pass@1 accuracy of DeepSeek-R1 on AIME2024 from 78.33% to **85.67%**, establishing a new state of the art in the open-source community. The code is available at https://github.com/zhijie-group/SIFT.
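The control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Model` bundle, its four callables, the `augment` prompt layout, the exact-match agreement test, and the fallback when no agreement is reached are all assumptions made here, and the "model" is a toy rule standing in for real LLM calls. In particular, having inverse generation rebuild the Sticker from the model's own prediction is one reading of "conform with the model's inherent tendencies."

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Model:
    """Hypothetical bundle of LLM calls; all four names are illustrative."""
    generate_sticker: Callable[[str], str]       # query -> Sticker
    predict: Callable[[str], str]                # prompt -> answer
    forward_optimize: Callable[[str, str], str]  # (query, Sticker) -> refined Sticker
    inverse_generate: Callable[[str], str]       # prediction -> Sticker


def augment(query: str, sticker: str) -> str:
    # Assumed prompt layout; the abstract does not specify one.
    return f"{query}\nSticker: {sticker}"


def sift(query: str, model: Model) -> Tuple[str, bool]:
    """Return (answer, agreed); agreed is True if the two predictions matched."""
    baseline = model.predict(query)  # prediction from the original query
    sticker = model.generate_sticker(query)
    answer = model.predict(augment(query, sticker))
    if answer == baseline:
        return answer, True

    # Forward optimization: align the extracted facts with the query.
    sticker = model.forward_optimize(query, sticker)
    answer = model.predict(augment(query, sticker))
    if answer == baseline:
        return answer, True

    # Inverse generation (assumed form): rebuild the Sticker from the prediction.
    sticker = model.inverse_generate(answer)
    answer = model.predict(augment(query, sticker))
    # Assumed fallback: keep the last Sticker-augmented prediction.
    return answer, answer == baseline


QUERY = "Apples cost 10 dollars per kilo. How much do 3 kilos cost?"


def toy_predict(prompt: str) -> str:
    # Toy rule: a Sticker that drops the per-kilo rate misleads the "model".
    if "Sticker:" in prompt and "for each kilo" not in prompt:
        return "10"
    return "30"


toy = Model(
    generate_sticker=lambda q: "price: 10 dollars; amount: 3 kilos",
    predict=toy_predict,
    forward_optimize=lambda q, s: "price: 10 dollars for each kilo; amount: 3 kilos",
    inverse_generate=lambda pred: f"total cost: {pred} dollars",
)

result = sift(QUERY, toy)
print(result)
```

In this toy run the initial Sticker loses the "per kilo" rate, so the two predictions disagree; forward optimization restores the rate and the predictions then match. With a real LLM, each callable would be a prompted generation, and comparing answers would likely need normalization rather than exact string equality.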
