Can Large Language Models Understand Context?
February 1, 2024
Authors: Yilun Zhu, Joel Ruben Antony Moniz, Shruti Bhargava, Jiarui Lu, Dhivya Piraviperumal, Site Li, Yuan Zhang, Hong Yu, Bo-Hsiang Tseng
cs.AI
Abstract
Understanding context is key to understanding human language, an ability
which Large Language Models (LLMs) have been increasingly seen to demonstrate
to an impressive extent. However, though the evaluation of LLMs encompasses
various domains within the realm of Natural Language Processing, limited
attention has been paid to probing their linguistic capability of understanding
contextual features. This paper introduces a context understanding benchmark by
adapting existing datasets to suit the evaluation of generative models. This
benchmark comprises four distinct tasks and nine datasets, all featuring
prompts designed to assess the models' ability to understand context. First, we
evaluate the performance of LLMs under the in-context learning pretraining
scenario. Experimental results indicate that pre-trained dense models struggle
with understanding more nuanced contextual features when compared to
state-of-the-art fine-tuned models. Second, as LLM compression holds growing
significance in both research and real-world applications, we assess the
context understanding of quantized models under in-context learning settings.
We find that 3-bit post-training quantization leads to varying degrees of
performance reduction on our benchmark. We conduct an extensive analysis of
these scenarios to substantiate our experimental results.
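To make the in-context learning setup concrete, below is a minimal sketch, assuming the Hugging Face transformers API, of how a few-shot context-understanding probe might be run against a pre-trained causal LM. The model name, prompt wording, and coreference-style task framing are illustrative assumptions, not the benchmark's actual prompts or models.

```python
# A minimal sketch (not the authors' code) of a few-shot, in-context probe
# of context understanding with a pre-trained causal language model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper evaluates larger pre-trained dense models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical coreference-style prompt: one demonstration plus a query.
prompt = (
    "Resolve the pronoun in each sentence.\n"
    "Sentence: The cup fell off the table because it was wobbly. "
    "Question: What does 'it' refer to? Answer: the table\n"
    "Sentence: Anna thanked Maria because she had helped her move. "
    "Question: What does 'she' refer to? Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8, do_sample=False)
# Decode only the newly generated tokens, then compare against the gold answer.
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion.strip())
```

Scoring the benchmark would then reduce to checking such completions against gold labels across the four tasks and nine datasets.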
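The 3-bit post-training quantization finding can likewise be illustrated with a simple round-to-nearest weight quantizer. The following PyTorch sketch is a generic per-channel baseline under stated assumptions, not the specific PTQ method evaluated in the paper; the function name and quantization scheme are illustrative.

```python
import torch

def quantize_rtn(weight: torch.Tensor, bits: int = 3) -> torch.Tensor:
    # Per-channel symmetric round-to-nearest quantization: a simple baseline,
    # not the paper's PTQ method. At 3 bits, signed codes span [-4, 3].
    qmax = 2 ** (bits - 1) - 1              # e.g. 3 for 3-bit signed integers
    scale = weight.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q * scale                        # dequantized weights for simulated inference

w = torch.randn(8, 16)
w_q = quantize_rtn(w, bits=3)
print((w - w_q).abs().mean())  # mean quantization error introduced at 3 bits
```

The coarser the bit width, the larger this rounding error, which is consistent with the varying performance reductions the abstract reports at 3 bits.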