The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models

Abstract

Current large language models (LLMs) have demonstrated emergent capabilities in social intelligence tasks, including implicature resolution (Sravanthi et al., 2024) and theory-of-mind reasoning (Shapira et al., 2024), both of which require substantial pragmatic understanding. However, how LLMs acquire this competence over the course of training remains poorly understood. In this work, we introduce ALTPRAG, a dataset grounded in the pragmatic concept of alternatives and designed to evaluate whether LLMs at different training stages can accurately infer nuanced speaker intentions. Each instance pairs two contextually appropriate but pragmatically distinct continuations, enabling fine-grained assessment of both pragmatic interpretation and contrastive reasoning. To examine how pragmatic competence develops, we systematically evaluate 22 LLMs across three key training stages: pre-training, supervised fine-tuning (SFT), and preference optimization. Our results show that even base models exhibit notable sensitivity to pragmatic cues, and that this sensitivity improves consistently as model and data scale increase. SFT and RLHF contribute further gains, particularly in cognitive-pragmatic reasoning. These findings highlight pragmatic competence as an emergent and compositional property of LLM training and offer new insights for aligning models with human communicative norms.