Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation

Abstract

Multi-hop Question Answering (MHQA) adds layers of complexity to question answering, making it more challenging. When Language Models (LMs) are prompted with multiple search results, they must not only retrieve relevant information but also perform multi-hop reasoning across the information sources. Although LMs perform well on traditional question-answering tasks, the causal mask can hinder their capacity to reason across complex contexts. In this paper, we explore how LMs respond to multi-hop questions by permuting search results (retrieved documents) under various configurations. Our study reveals the following findings: 1) encoder-decoder models, such as those in the Flan-T5 family, generally outperform causal decoder-only LMs on MHQA tasks, despite being significantly smaller; 2) altering the order of gold documents reveals distinct trends in both Flan-T5 models and fine-tuned decoder-only models, with optimal performance observed when the document order aligns with the reasoning chain order; 3) enhancing causal decoder-only models with bi-directional attention by modifying the causal mask can effectively boost their end performance. In addition, we conduct a thorough investigation of the distribution of LM attention weights in the context of MHQA. Our experiments reveal that attention weights tend to peak at higher values when the resulting answer is correct. We leverage this finding to heuristically improve LMs' performance on this task. Our code is publicly available at https://github.com/hwy9855/MultiHopQA-Reasoning.
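To make the two manipulations named in the abstract concrete, the following is a minimal sketch and not the released implementation. It assumes the common "prefix LM" pattern for adding bi-directional attention (context tokens attend to each other fully, generated tokens stay causal); the abstract does not state the exact mask modification, so this choice is an assumption. The function names `causal_mask`, `prefix_bidirectional_mask`, and `document_orderings` are hypothetical and introduced here only for illustration.

```python
from itertools import permutations


def causal_mask(n):
    """Standard causal mask: position i may attend to position j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_bidirectional_mask(n_total, n_prefix):
    """Causal mask relaxed so the first n_prefix positions see each other fully.

    Assumption: this is the generic prefix-LM pattern, chosen for illustration;
    it is not confirmed to be the mask modification used in the paper.
    """
    if not 0 <= n_prefix <= n_total:
        raise ValueError("n_prefix must lie between 0 and n_total")
    return [
        [j <= i or (i < n_prefix and j < n_prefix) for j in range(n_total)]
        for i in range(n_total)
    ]


def document_orderings(documents):
    """Yield every ordering of the retrieved documents (n! orderings, so keep n small)."""
    for order in permutations(documents):
        yield list(order)


if __name__ == "__main__":
    # Rows are query positions, columns are key positions; 1 means "may attend".
    for row in prefix_bidirectional_mask(n_total=5, n_prefix=3):
        print("".join("1" if allowed else "0" for allowed in row))
    # Placeholder document names; real inputs would be retrieved passages.
    for order in document_orderings(["doc_A", "doc_B"]):
        print(order)
```

With `n_prefix=0` the relaxed mask reduces to the ordinary causal mask, which makes it easy to compare the two settings under an otherwise identical setup.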
