What Makes a Good Query? Measuring the Impact of Human-Confusing Linguistic Features on LLM Performance

Abstract

Large Language Model (LLM) hallucinations are usually treated as defects of the model or its decoding strategy. Drawing on classical linguistics, we argue that a query's form can also shape a listener's (and a model's) response. We operationalize this insight by constructing a 22-dimensional query feature vector covering clause complexity, lexical rarity, anaphora, negation, answerability, and intention grounding, all of which are known to affect human comprehension. Using 369,837 real-world queries, we ask: are there certain types of queries that make hallucination more likely? A large-scale analysis reveals a consistent "risk landscape": features such as deep clause nesting and underspecification align with higher hallucination propensity, whereas clear intention grounding and answerability align with lower hallucination rates. Other features, including domain specificity, show mixed, dataset- and model-dependent effects. These findings establish an empirically observable query-feature representation that is correlated with hallucination risk, paving the way for guided query rewriting and future intervention studies.
