

Evaluating Language Models for Mathematics through Interactions

June 2, 2023
Authors: Katherine M. Collins, Albert Q. Jiang, Simon Frieder, Lionel Wong, Miri Zilka, Umang Bhatt, Thomas Lukasiewicz, Yuhuai Wu, Joshua B. Tenenbaum, William Hart, Timothy Gowers, Wenda Li, Adrian Weller, Mateja Jamnik
cs.AI

Abstract

The standard methodology of evaluating large language models (LLMs) based on static pairs of inputs and outputs is insufficient for developing assistants: this kind of assessment fails to take into account the essential interactive element in their deployment, and therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants ranging from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analysing MathConverse, we derive a preliminary taxonomy of human behaviours and uncover that, despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, amongst other findings. Further, we identify useful scenarios and existing issues of GPT-4 in mathematical reasoning through a series of case studies contributed by expert mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models which communicate uncertainty, respond well to user corrections, and are more interpretable and concise may constitute better assistants; interactive evaluation is a promising way to continually navigate the capability of these models; humans should be aware of language models' algebraic fallibility and, for that reason, discern where they should be used.
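The abstract reports a generally positive, but imperfect, correlation between the correctness of LLM generations and their perceived helpfulness in per-interaction ratings. As a minimal illustration (not the authors' analysis code), the sketch below computes such a correlation and flags divergent cases from a few hypothetical rating pairs; the record layout and the 0-6 rating scale are assumptions for the example.

```python
# Minimal sketch, assuming per-interaction ratings of (correctness, helpfulness)
# on a 0-6 scale, as one might derive from a dataset like MathConverse.
from statistics import correlation  # Pearson correlation (Python 3.10+)

ratings = [
    # (correctness, perceived helpfulness), both judged by the human user
    (6, 6), (5, 6), (6, 4), (2, 5),  # (2, 5) is a divergent case: wrong but rated helpful
    (0, 1), (4, 5), (6, 5), (1, 4),
]

correct = [c for c, _ in ratings]
helpful = [h for _, h in ratings]

r = correlation(correct, helpful)
print(f"Pearson r between correctness and perceived helpfulness: {r:.2f}")

# Flag interactions where the two judgements diverge sharply, e.g. a generation
# rated as largely incorrect but still perceived as quite helpful.
divergent = [(c, h) for c, h in ratings if h - c >= 3]
print("Divergent cases (low correctness, high helpfulness):", divergent)
```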