Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation
May 16, 2025
Authors: Wenyu Huang, Pavlos Vougiouklis, Mirella Lapata, Jeff Z. Pan
cs.AI
Abstract
Multi-hop Question Answering (MHQA) adds layers of complexity to question
answering, making it more challenging. When Language Models (LMs) are prompted
with multiple search results, they are tasked not only with retrieving relevant
information but also with employing multi-hop reasoning across the information
sources. Although LMs perform well on traditional question-answering tasks, the
causal mask can hinder their capacity to reason across complex contexts. In
this paper, we explore how LMs respond to multi-hop questions by permuting
search results (retrieved documents) under various configurations. Our study
reveals several notable findings: 1) encoder-decoder models, such as those
in the Flan-T5 family, generally outperform causal decoder-only LMs in
MHQA tasks, despite being significantly smaller in size; 2) altering the order
of gold documents reveals distinct trends in both Flan-T5 models and fine-tuned
decoder-only models, with optimal performance observed when the document order
aligns with the reasoning chain order; 3) enhancing causal decoder-only models
with bi-directional attention by modifying the causal mask can effectively
boost their end performance. In addition to the above, we conduct a thorough
investigation of the distribution of LM attention weights in the context of
MHQA. Our experiments reveal that attention weights tend to peak at higher
values when the resulting answer is correct. We leverage this finding to
heuristically improve LMs' performance on this task. Our code is publicly
available at https://github.com/hwy9855/MultiHopQA-Reasoning.
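
Finding 3 concerns relaxing the strictly causal mask so that a decoder-only model can attend bidirectionally over the retrieved documents. The snippet below is a minimal sketch of that idea, not the authors' implementation; `build_hybrid_mask`, `seq_len`, and `context_len` are hypothetical names, and the exact span the paper unmasks may differ.

```python
# Minimal sketch (NOT the paper's exact implementation) of finding 3:
# allow bidirectional attention within the retrieved-document span while
# keeping standard causal masking everywhere else. All names here
# (build_hybrid_mask, seq_len, context_len) are hypothetical.
import torch

def build_hybrid_mask(seq_len: int, context_len: int) -> torch.Tensor:
    """Boolean attention mask of shape (seq_len, seq_len).

    True means "may attend". Positions [0, context_len) hold the
    retrieved documents and attend to each other bidirectionally;
    all other positions follow the usual causal (lower-triangular) rule.
    """
    # Start from the standard causal mask.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Lift the causal restriction inside the document span only.
    mask[:context_len, :context_len] = True
    return mask

# Toy example: a 10-token sequence whose first 6 tokens are documents.
print(build_hybrid_mask(seq_len=10, context_len=6).int())
```

A mask like this can typically be passed to a transformer layer in place of the default causal mask; question and answer tokens still see only their prefix, so generation remains autoregressive.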
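
The abstract does not spell out how the peak-attention observation is turned into a heuristic, so the following is a purely speculative sketch of one way it could be operationalized: generate answers under several document permutations and keep the one whose context attention shows the sharpest peak. `pick_by_attention_peak` and the `candidates` structure are invented for illustration.

```python
# Hypothetical sketch of an attention-peak selection heuristic; the
# abstract does not specify the authors' actual procedure. `candidates`
# pairs each generated answer with the attention weights the model
# placed on the context during that run.
import numpy as np

def pick_by_attention_peak(candidates: list[tuple[str, np.ndarray]]) -> str:
    """Return the answer from the run whose maximum context-attention
    weight is highest, following the finding that correct answers tend
    to coincide with sharper attention peaks."""
    best_answer, _ = max(candidates, key=lambda pair: float(pair[1].max()))
    return best_answer

# Toy usage with made-up numbers: two permutations of the same documents.
runs = [
    ("Paris", np.array([0.10, 0.72, 0.18])),   # sharp attention peak
    ("Lyon",  np.array([0.35, 0.33, 0.32])),   # flat attention distribution
]
print(pick_by_attention_peak(runs))  # -> "Paris"
```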