Invoke Interfaces Only When Needed: Adaptive Invocation for Large Language Models in Question Answering

Abstract

The collaborative paradigm of large and small language models (LMs) effectively balances performance and cost, yet its pivotal challenge lies in pinpointing the moment at which the large LM should be invoked because the small LM is hallucinating. Previous optimization efforts focused primarily on post-processing techniques, which were separate from the reasoning process of LMs, resulting in high computational costs and limited effectiveness. In this paper, we propose a practical invocation evaluation metric called AttenHScore, which calculates the accumulation and propagation of hallucinations during the generation process of small LMs, continuously amplifying potential reasoning errors. By dynamically adjusting the detection threshold, we achieve more accurate real-time invocation of large LMs. Additionally, considering the limited reasoning capacity of small LMs, we leverage uncertainty-aware knowledge reorganization to help them better capture critical information from different text chunks. Extensive experiments reveal that AttenHScore outperforms most baselines in real-time hallucination detection across multiple QA datasets, especially when addressing complex queries. Moreover, our strategies require no additional model training and adapt flexibly to various transformer-based LMs.
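The invocation logic described above (score the small LM's generation, compare against a dynamically adjusted threshold, and call the large LM only when the score is too high) can be sketched roughly as follows. The abstract does not give the AttenHScore formula, so this sketch swaps in a simple stand-in score, the mean token surprisal, and a running-quantile threshold. The names `stand_in_score`, `AdaptiveRouter`, `window`, `quantile`, and `min_history` are hypothetical and are not taken from the paper.

```python
import math
from collections import deque


def stand_in_score(token_probs):
    """Hypothetical stand-in for a hallucination score (NOT AttenHScore).

    Returns the mean surprisal (-log p) of the generated tokens, so that
    low-confidence tokens accumulate into a higher score.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    total = 0.0
    for p in token_probs:
        # Clamp to avoid log(0) for tokens with zero probability.
        total += -math.log(max(p, 1e-12))
    return total / len(token_probs)


class AdaptiveRouter:
    """Decides whether to escalate a query from the small LM to the large LM.

    Assumption: the threshold tracks a quantile of recently observed scores.
    This is one simple way to adjust a threshold dynamically, not the
    paper's procedure.
    """

    def __init__(self, initial_threshold=1.0, window=50, quantile=0.8, min_history=5):
        self.threshold = initial_threshold
        self.quantile = quantile
        self.min_history = min_history
        self.history = deque(maxlen=window)  # most recent scores only

    def should_invoke_large(self, token_probs):
        score = stand_in_score(token_probs)
        invoke = score > self.threshold
        self.history.append(score)
        # Re-estimate the threshold once enough scores have been seen.
        if len(self.history) >= self.min_history:
            ordered = sorted(self.history)
            idx = min(len(ordered) - 1, int(self.quantile * len(ordered)))
            self.threshold = ordered[idx]
        return invoke


router = AdaptiveRouter(initial_threshold=1.0)
confident = router.should_invoke_large([0.9, 0.95, 0.9])  # small LM answer kept
uncertain = router.should_invoke_large([0.2, 0.1, 0.3])   # escalated to large LM
```

In practice the per-token probabilities would come from the small LM's output distribution during decoding. A faithful implementation would replace `stand_in_score` with the metric defined in the paper.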
