Stronger Models are NOT Stronger Teachers for Instruction Tuning
November 11, 2024
Authors: Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
cs.AI
Abstract
Instruction tuning has been widely adopted to ensure large language models
(LLMs) follow user instructions effectively. The resulting
instruction-following capabilities of LLMs heavily rely on the instruction
datasets used for tuning. Recently, synthetic instruction datasets have emerged
as an economically viable solution to provide LLMs with diverse and high-quality
instructions. However, existing approaches typically assume that larger or
stronger models are stronger teachers for instruction tuning, and hence simply
adopt these models as response generators to the synthetic instructions. In
this paper, we challenge this commonly-adopted assumption. Our extensive
experiments across five base models and twenty response generators reveal that
larger and stronger models are not necessarily stronger teachers of smaller
models. We refer to this phenomenon as the Larger Models' Paradox. We observe
that existing metrics cannot precisely predict the effectiveness of response
generators since they ignore the compatibility between teachers and base models
being fine-tuned. We thus develop a novel metric, named
Compatibility-Adjusted Reward (CAR), to measure the effectiveness of response
generators. Our experiments across five base models demonstrate that CAR
outperforms almost all baselines.
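
To make the idea of a compatibility-adjusted metric concrete, below is a minimal sketch of how one might combine a teacher's average response reward with the base model's loss on those responses. This is an illustration of the general idea only, not the paper's exact CAR formulation: the function name, the example scores, and the `reward / (1 + loss)` combination are all assumptions introduced here.

```python
# Illustrative sketch only: NOT the exact CAR formula from the paper.
# Assumes we already have, for each candidate response generator (teacher):
#   - reward scores for its responses (e.g., from an external reward model)
#   - the base model's average language-modeling loss on those responses,
#     used here as a rough proxy for teacher/base-model compatibility.
from statistics import mean

def compatibility_adjusted_reward(rewards, base_model_losses):
    """Score a teacher by average reward, discounted by how 'surprising'
    its responses are to the base model being fine-tuned (higher loss =
    lower compatibility). The 1 + loss denominator is an assumption."""
    avg_reward = mean(rewards)
    avg_loss = mean(base_model_losses)
    return avg_reward / (1.0 + avg_loss)

# Hypothetical example: a smaller teacher whose responses the base model
# fits easily can outscore a stronger but less compatible teacher.
teachers = {
    "strong_teacher": ([0.92, 0.88, 0.90], [2.4, 2.6, 2.5]),
    "smaller_teacher": ([0.85, 0.83, 0.86], [1.1, 1.2, 1.0]),
}
for name, (rewards, losses) in teachers.items():
    print(name, round(compatibility_adjusted_reward(rewards, losses), 3))
```

Under this toy scoring, the smaller but more compatible teacher ranks higher, which mirrors the abstract's claim that raw reward or model strength alone does not predict which response generator is the better teacher.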