Are Today's LLMs Ready to Explain Well-Being Concepts?

Abstract

Well-being encompasses mental, physical, and social dimensions essential to personal growth and informed life decisions. As individuals increasingly consult Large Language Models (LLMs) to understand well-being, a key challenge emerges: Can LLMs generate explanations that are not only accurate but also tailored to diverse audiences? High-quality explanations require both factual correctness and the ability to meet the expectations of users with varying expertise. In this work, we construct a large-scale dataset comprising 43,880 explanations of 2,194 well-being concepts, generated by ten diverse LLMs. We introduce a principle-guided LLM-as-a-judge evaluation framework, employing dual judges to assess explanation quality. Furthermore, we show that fine-tuning an open-source LLM using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) can significantly enhance the quality of generated explanations. Our results reveal: (1) the proposed LLM judges align well with human evaluations; (2) explanation quality varies significantly across models, audiences, and categories; and (3) DPO- and SFT-fine-tuned models outperform their larger counterparts, demonstrating the effectiveness of preference-based learning for specialized explanation tasks.

