IntentGrasp: A Comprehensive Benchmark for Intent Understanding
May 7, 2026
Authors: Yuwei Yin, Chuyuan Li, Giuseppe Carenini
cs.AI
Abstract
Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent-understanding capability of LLMs. Derived from 49 high-quality, open-licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source-dataset curation, intent-label contextualization, and task-format unification. IntentGrasp contains a large-scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations of 20 LLMs across 7 families (including frontier models such as GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7) reveal unsatisfactory performance, with scores below 60% on the All Set and below 25% on the Gem Set. Notably, 17 of the 20 tested models perform worse than a random-guess baseline (15.2%) on the Gem Set, while estimated human performance is ~81.1%, indicating substantial room for improvement. To enhance this ability, this paper proposes Intentional Fine-Tuning (IFT), which fine-tunes models on the IntentGrasp training set, yielding significant gains of 30+ F1 points on the All Set and 20+ points on the Gem Set. Moreover, leave-one-domain-out (Lodo) experiments demonstrate the strong cross-domain generalizability of IFT, verifying that it is a promising approach to substantially enhancing the intent understanding of LLMs. Overall, by benchmarking and boosting intent-understanding ability, this study sheds light on a promising path toward more intentional, capable, and safe AI assistants for human benefit and social good.
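The abstract does not specify how the 15.2% random-guess baseline on the Gem Set is derived. As a minimal sketch, assuming each test case is a single-label classification over a fixed intent inventory, such a baseline can be estimated by Monte Carlo simulation; the function name, label inventory, and label distribution below are hypothetical, not IntentGrasp's actual ones.

```python
import random

def random_guess_accuracy(labels, num_trials=10_000, seed=0):
    """Estimate the expected accuracy of uniform random guessing
    over a fixed label inventory, via Monte Carlo simulation."""
    rng = random.Random(seed)
    inventory = sorted(set(labels))
    hits = 0
    for _ in range(num_trials):
        gold = rng.choice(labels)      # sample a test case from the set
        guess = rng.choice(inventory)  # guess an intent uniformly at random
        hits += (guess == gold)
    return hits / num_trials

# Hypothetical, imbalanced intent distribution for illustration only:
labels = ["request"] * 50 + ["complain"] * 30 + ["inform"] * 15 + ["greet"] * 5
print(f"random-guess accuracy ~ {random_guess_accuracy(labels):.3f}")
```

Under uniform guessing over K intents, the expected accuracy is 1/K regardless of how imbalanced the gold distribution is (here K = 4, so ~0.25); a 15.2% baseline would thus be consistent with an effective inventory of roughly 6-7 intents per case, though the paper's actual setup may differ.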