
Evaluating Language Models for Mathematics through Interactions

June 2, 2023
Authors: Katherine M. Collins, Albert Q. Jiang, Simon Frieder, Lionel Wong, Miri Zilka, Umang Bhatt, Thomas Lukasiewicz, Yuhuai Wu, Joshua B. Tenenbaum, William Hart, Timothy Gowers, Wenda Li, Adrian Weller, Mateja Jamnik
cs.AI

Abstract

The standard methodology of evaluating large language models (LLMs) based on static pairs of inputs and outputs is insufficient for developing assistants: this kind of assessment fails to take into account the essential interactive element of their deployment, and therefore limits our understanding of language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants ranging from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analysing MathConverse, we derive a preliminary taxonomy of human behaviours and uncover that, despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, amongst other findings. Further, we identify useful scenarios and existing issues of GPT-4 in mathematical reasoning through a series of case studies contributed by expert mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and are more interpretable and concise may constitute better assistants; interactive evaluation is a promising way to continually track the capabilities of these models; and humans should be aware of language models' algebraic fallibility, and for that reason discern where they should be used.