AAAR-1.0: Assessing AI's Potential to Assist Research

Abstract

Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance on four fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; (iii) PaperWeakness, identifying weaknesses in paper submissions; and (iv) REVIEWCRITIQUE, identifying whether each segment in a human review is deficient. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals their potential as well as their limitations in conducting sophisticated research tasks. We plan to keep iterating on AAAR-1.0 and release new versions.