Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation
May 16, 2025
Authors: Wenyu Huang, Pavlos Vougiouklis, Mirella Lapata, Jeff Z. Pan
cs.AI
Abstract
Multi-hop Question Answering (MHQA) adds layers of complexity to question
answering, making it more challenging. When Language Models (LMs) are prompted
with multiple search results, they are tasked not only with retrieving relevant
information but also employing multi-hop reasoning across the information
sources. Although LMs perform well on traditional question-answering tasks, the
causal mask can hinder their capacity to reason across complex contexts. In
this paper, we explore how LMs respond to multi-hop questions by permuting
search results (retrieved documents) under various configurations. Our study
reveals interesting findings as follows: 1) Encoder-decoder models, such as the
ones in the Flan-T5 family, generally outperform causal decoder-only LMs in
MHQA tasks, despite being significantly smaller in size; 2) altering the order
of gold documents reveals distinct trends in both Flan-T5 models and fine-tuned
decoder-only models, with optimal performance observed when the document order
aligns with the reasoning chain order; 3) enhancing causal decoder-only models
with bi-directional attention by modifying the causal mask can effectively
boost their end performance. In addition to the above, we conduct a thorough
investigation of the distribution of LM attention weights in the context of
MHQA. Our experiments reveal that attention weights tend to peak at higher
values when the resulting answer is correct. We leverage this finding to
heuristically improve LMs' performance on this task. Our code is publicly
available at https://github.com/hwy9855/MultiHopQA-Reasoning.
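The third finding above concerns relaxing the causal mask so that tokens within the retrieved documents can attend to one another bidirectionally while the rest of the prompt remains causal. The following is a minimal PyTorch sketch of that idea, not the authors' released implementation; the function name and the context-span boundaries (`ctx_start`, `ctx_end`) are assumptions introduced for illustration.

```python
# Illustrative sketch: an attention mask that is causal overall but allows
# bidirectional attention within the span occupied by the retrieved documents.
import torch

def build_hybrid_mask(seq_len: int, ctx_start: int, ctx_end: int) -> torch.Tensor:
    """Return a boolean mask where True marks positions a token may attend to.

    All tokens follow standard causal masking, except that tokens inside the
    retrieved-context span [ctx_start, ctx_end) may also attend to later
    tokens within that same span (bidirectional attention over the context).
    """
    # Standard lower-triangular causal mask.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Allow full attention among the context-document tokens.
    mask[ctx_start:ctx_end, ctx_start:ctx_end] = True
    return mask

if __name__ == "__main__":
    # Example: a 10-token prompt whose retrieved documents occupy positions 2-7.
    m = build_hybrid_mask(seq_len=10, ctx_start=2, ctx_end=8)
    print(m.int())
```

Such a mask would be passed to the model's attention layers in place of the default causal mask; how it is wired in depends on the specific decoder implementation.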