IntentGrasp: A Comprehensive Benchmark for Intent Understanding

Abstract

Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding capability of LLMs. Derived from 49 high-quality, open-licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source dataset curation, intent label contextualization, and task format unification. IntentGrasp contains a large-scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations of 20 LLMs across 7 families (including frontier models such as GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7) reveal unsatisfactory performance, with scores below 60% on All Set and below 25% on Gem Set. Notably, 17 of the 20 tested models perform worse than a random-guess baseline (15.2%) on Gem Set, while the estimated human performance is approximately 81.1%, leaving substantial room for improvement. To enhance this ability, this paper proposes Intentional Fine-Tuning (IFT), which fine-tunes models on the IntentGrasp training set and yields significant gains of 30+ F1 points on All Set and 20+ F1 points on Gem Set. Furthermore, leave-one-domain-out (Lodo) experiments demonstrate the strong cross-domain generalizability of IFT, indicating that it is a promising approach to substantially enhancing the intent understanding of LLMs. Overall, by benchmarking and boosting intent understanding ability, this study sheds light on a promising path towards more intentional, capable, and safe AI assistants for human benefit and social good.
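The leave-one-domain-out (Lodo) protocol mentioned above can be illustrated with a minimal sketch. The abstract does not specify the data schema or the exact splitting procedure, so everything below is an assumption made for illustration: the `domain`, `text`, and `intent` field names, the toy domain names, and the function `leave_one_domain_out` are hypothetical and are not taken from the IntentGrasp release. The idea shown is only the general one: for each domain, train on all other domains and evaluate on the held-out one.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# Hypothetical record layout (an assumption, not the benchmark's schema):
# each instance is a dict carrying at least a "domain" field.
Instance = Dict[str, str]


def leave_one_domain_out(
    instances: List[Instance],
) -> Iterator[Tuple[str, List[Instance], List[Instance]]]:
    """Yield (held_out_domain, train_split, eval_split) once per domain."""
    # Group the instances by their domain label.
    by_domain: Dict[str, List[Instance]] = defaultdict(list)
    for inst in instances:
        by_domain[inst["domain"]].append(inst)

    # Iterate over domains in sorted order so the folds are deterministic.
    for held_out in sorted(by_domain):
        # Training data: every instance whose domain is not the held-out one.
        train = [
            inst
            for domain, items in by_domain.items()
            if domain != held_out
            for inst in items
        ]
        yield held_out, train, by_domain[held_out]


# Toy data with made-up domains and intents, for illustration only.
toy = [
    {"domain": "banking", "text": "I lost my card", "intent": "report_lost_card"},
    {"domain": "banking", "text": "What is my balance?", "intent": "check_balance"},
    {"domain": "travel", "text": "Book a flight to Oslo", "intent": "book_flight"},
    {"domain": "weather", "text": "Will it rain today?", "intent": "get_forecast"},
]

folds = list(leave_one_domain_out(toy))
for held_out, train, evaluation in folds:
    print(held_out, len(train), len(evaluation))
```

In an actual Lodo run, each fold would fine-tune a model on `train` and report F1 on `evaluation`; with 12 domains this would give 12 folds, each measuring performance on a domain unseen during fine-tuning.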