Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation

Abstract

Multi-hop Question Answering (MHQA) adds layers of complexity to question answering, making it more challenging. When Language Models (LMs) are prompted with multiple search results, they must not only retrieve the relevant information but also perform multi-hop reasoning across the information sources. Although LMs perform well on traditional question-answering tasks, the causal mask can hinder their capacity to reason across complex contexts. In this paper, we explore how LMs respond to multi-hop questions by permuting search results (retrieved documents) under various configurations. Our study reveals the following findings: 1) encoder-decoder models, such as those in the Flan-T5 family, generally outperform causal decoder-only LMs on MHQA tasks despite being significantly smaller; 2) altering the order of gold documents reveals distinct trends in both Flan-T5 models and fine-tuned decoder-only models, with optimal performance observed when the document order aligns with the reasoning chain order; 3) enhancing causal decoder-only models with bi-directional attention by modifying the causal mask can effectively boost their end performance. In addition, we conduct a thorough investigation of the distribution of LM attention weights in the context of MHQA. Our experiments reveal that attention weights tend to peak at higher values when the resulting answer is correct. We leverage this finding to heuristically improve LMs' performance on this task. Our code is publicly available at https://github.com/hwy9855/MultiHopQA-Reasoning.
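
The two manipulations named in the abstract, permuting the retrieved documents and relaxing the causal mask, can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). The function names `permute_documents` and `prefix_bidirectional_mask` are hypothetical, and the mask layout is an assumption: the context documents are treated as a prefix whose tokens attend to each other in both directions, while every later token stays causal. The abstract does not specify how the mask is actually modified.

```python
from itertools import permutations


def permute_documents(docs):
    """Yield (order, reordered_docs) for every ordering of the documents.

    Hypothetical helper: the abstract only says that document order is
    permuted, not how the orderings are enumerated.
    """
    for order in permutations(range(len(docs))):
        yield order, [docs[i] for i in order]


def prefix_bidirectional_mask(seq_len, prefix_len):
    """Build a seq_len x seq_len 0/1 attention mask as nested lists.

    mask[i][j] == 1 means position i may attend to position j.
    Assumption: the first prefix_len positions hold the context documents
    and attend to each other bidirectionally; all other positions keep
    the standard causal pattern (attend only to j <= i).
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            both_in_prefix = i < prefix_len and j < prefix_len
            row.append(1 if causal or both_in_prefix else 0)
        mask.append(row)
    return mask


if __name__ == "__main__":
    docs = ["doc A", "doc B", "doc C"]
    for order, reordered in permute_documents(docs):
        print(order, reordered)

    for row in prefix_bidirectional_mask(seq_len=4, prefix_len=2):
        print(row)
```

With `prefix_len=0` the mask reduces to the ordinary lower-triangular causal mask, so the same function covers both the baseline and the modified setting in this sketch.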
