Evaluating Language Models for Mathematics through Interactions

Abstract

The standard methodology of evaluating large language models (LLMs) based on static pairs of inputs and outputs is insufficient for developing assistants: this kind of assessment fails to take into account the essential interactive element in their deployment, and therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants ranging from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analysing MathConverse, we derive a preliminary taxonomy of human behaviours and uncover that, despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, amongst other findings. Further, we identify useful scenarios and existing issues of GPT-4 in mathematical reasoning through a series of case studies contributed by expert mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and are more interpretable and concise may constitute better assistants; interactive evaluation is a promising way to continually navigate the capability of these models; humans should be aware of language models' algebraic fallibility, and for that reason discern where they should be used.
