Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning

Abstract



Large language models (LLMs) such as GPT-4 have exhibited remarkable performance on a variety of tasks, but this strong performance often comes at the high cost of paid API services. In this paper, we study how to build an LLM cascade that reduces the cost of using LLMs, particularly for reasoning (e.g., mathematical, causal) tasks. Our cascade pipeline follows the intuition that simpler questions can be addressed by a weaker but more affordable LLM, whereas only the challenging questions necessitate the stronger and more expensive LLM. To realize this decision-making, we treat the "answer consistency" of the weaker LLM as a signal of question difficulty and propose several methods for answer sampling and consistency checking, including one that leverages a mixture of two thought representations (i.e., Chain-of-Thought and Program-of-Thought). Through experiments on six reasoning benchmark datasets, with GPT-3.5-turbo and GPT-4 as the weaker and stronger LLMs, respectively, we demonstrate that our proposed LLM cascades achieve performance comparable to using the stronger LLM alone while requiring only 40% of its cost.
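
The routing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a simple majority-vote agreement check, and the names `weak_llm`, `strong_llm`, `samples_per_repr`, and `threshold` are hypothetical placeholders, with the threshold value chosen only for illustration. The weaker model is sampled under both thought representations, and the question is escalated to the stronger model only when the sampled answers disagree too much.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def vote_consistency(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers)


def cascade_answer(
    question: str,
    weak_llm: Callable[[str, str], str],   # hypothetical: (question, representation) -> answer
    strong_llm: Callable[[str], str],      # hypothetical: question -> answer
    samples_per_repr: int = 2,             # assumed sampling budget per representation
    threshold: float = 0.75,               # assumed agreement threshold, not from the paper
) -> Tuple[str, str]:
    """Answer with the weak model when its samples agree, else defer to the strong model."""
    answers = []
    # Mix two thought representations: Chain-of-Thought and Program-of-Thought.
    for representation in ("cot", "pot"):
        for _ in range(samples_per_repr):
            answers.append(weak_llm(question, representation))

    answer, agreement = vote_consistency(answers)
    if agreement >= threshold:
        return answer, "weak"              # consistent: treat the question as easy
    return strong_llm(question), "strong"  # inconsistent: treat the question as hard
```

Because the weaker model is queried several times per question, any saving under this sketch depends on the price gap between the two models and on how often questions are escalated.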
