IntentGrasp: A Comprehensive Benchmark for Intent Understanding
May 7, 2026
Authors: Yuwei Yin, Chuyuan Li, Giuseppe Carenini
cs.AI
Abstract
Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent-understanding capability of LLMs. Derived from 49 high-quality, open-licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source-dataset curation, intent-label contextualization, and task-format unification. IntentGrasp contains a large-scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations of 20 LLMs across 7 families (including frontier models such as GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7) reveal unsatisfactory performance, with scores below 60% on the All Set and below 25% on the Gem Set. Notably, 17 of the 20 tested models perform worse than a random-guess baseline (15.2%) on the Gem Set, while estimated human performance is ~81.1%, indicating substantial room for improvement. To enhance this ability, this paper proposes Intentional Fine-Tuning (IFT), which fine-tunes models on the IntentGrasp training set, yielding significant gains of 30+ F1 points on the All Set and 20+ points on the Gem Set. Moreover, leave-one-domain-out (Lodo) experiments demonstrate the strong cross-domain generalizability of IFT, verifying that it is a promising approach to substantially enhancing the intent understanding of LLMs. Overall, by benchmarking and boosting intent-understanding ability, this study sheds light on a promising path toward more intentional, capable, and safe AI assistants for human benefit and social good.
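The abstract does not specify how the 15.2% random-guess baseline on the Gem Set is derived. As a minimal sketch, assuming each test case is a single-label classification over a fixed intent inventory, such a baseline can be estimated by Monte Carlo simulation; the function name, label inventory, and label distribution below are hypothetical, not IntentGrasp's actual ones.

```python
import random

def random_guess_accuracy(labels, num_trials=10_000, seed=0):
    """Estimate the expected accuracy of uniform random guessing
    over a fixed label inventory, via Monte Carlo simulation."""
    rng = random.Random(seed)
    inventory = sorted(set(labels))
    hits = 0
    for _ in range(num_trials):
        gold = rng.choice(labels)      # sample a test case from the set
        guess = rng.choice(inventory)  # guess an intent uniformly at random
        hits += (guess == gold)
    return hits / num_trials

# Hypothetical, imbalanced intent distribution for illustration only:
labels = ["request"] * 50 + ["complain"] * 30 + ["inform"] * 15 + ["greet"] * 5
print(f"random-guess accuracy ~ {random_guess_accuracy(labels):.3f}")
```

Under uniform guessing over K intents, the expected accuracy is 1/K regardless of how imbalanced the gold distribution is (here K = 4, so ~0.25); a 15.2% baseline would thus be consistent with an effective inventory of roughly 6-7 intents per case, though the paper's actual setup may differ.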