Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance

Abstract

Large Language Models (LLMs) have made significant strides in natural language generation but often struggle with tasks that require precise calculation and structural analysis. This paper investigates the performance of state-of-the-art LLMs on language complexity measurement tasks, namely computing the LIX readability metric and the Average Dependency Distance (ADD). Using Swedish high school and university-level essays, we evaluate the models' abilities to compute LIX scores and perform dependency parsing, comparing their results to established ground truths. Our findings reveal that while all models demonstrate some capacity for these tasks, ChatGPT-o1-mini performs most consistently, achieving the highest accuracy in both LIX computation and dependency parsing. Additionally, we observe a strong, statistically significant correlation of -0.875 (p = 0.026, N = 6) between the models' accuracy in computing LIX and their overall performance on the Massive Multitask Language Understanding (MMLU) benchmark. These results suggest that language complexity measurement abilities can serve as a noisy zero-shot proxy for assessing the general capabilities of LLMs, providing a practical method for model evaluation without the need for extensive benchmarking datasets.
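For readers unfamiliar with the two measures named in the abstract, the following is a minimal Python sketch of how they are commonly defined. LIX is the average sentence length plus the percentage of long words (more than six letters); ADD is the mean absolute distance between each token and its syntactic head. The function names, the regex-based tokenizer and sentence splitter, and the choice to skip the root token are assumptions made for illustration; the abstract does not specify the exact procedure used to produce the ground truths.

```python
import re


def lix(text: str) -> float:
    """LIX = words / sentences + 100 * long_words / words.

    A word counts as long when it has more than six characters.
    Tokenization and sentence splitting here are simple regex
    heuristics (an assumption, not the paper's procedure).
    """
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?:]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


def average_dependency_distance(heads: list[int]) -> float:
    """Mean absolute distance between each token and its head.

    heads[i] is the 1-based position of the head of token i + 1,
    with 0 marking the root. The root is skipped (an assumption;
    some definitions also exclude punctuation).
    """
    distances = [abs(pos - head) for pos, head in enumerate(heads, start=1) if head != 0]
    if not distances:
        raise ValueError("need at least one non-root token")
    return sum(distances) / len(distances)
```

In use, `lix` takes the raw essay text, while `average_dependency_distance` takes the head indices produced by a dependency parser for one sentence, for example `[2, 0, 2]` for a three-token sentence whose second token is the root.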
