Evaluating Language Models for Mathematics through Interactions

Abstract

The standard methodology of evaluating large language models (LLMs) based on static pairs of inputs and outputs is insufficient for developing assistants: this kind of assessment fails to take into account the essential interactive element in their deployment, and therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants ranging from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analysing MathConverse, we derive a preliminary taxonomy of human behaviours and uncover that, despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, amongst other findings. Further, we identify useful scenarios and existing issues of GPT-4 in mathematical reasoning through a series of case studies contributed by expert mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models which communicate uncertainty, respond well to user corrections, and are more interpretable and concise may constitute better assistants; interactive evaluation is a promising way to continually navigate the capability of these models; humans should be aware of language models' algebraic fallibility, and for that reason discern where they should be used.
