Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

Abstract

Large Language Models (LLMs) hold promise for automating data analysis tasks, yet open-source models face significant limitations in such reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. We curate a seed dataset of diverse, realistic scenarios and evaluate models along three dimensions: data understanding, code generation, and strategic planning. Our analysis yields three key findings: (1) strategic planning quality is the primary determinant of model performance; (2) interaction design and task complexity significantly influence reasoning capabilities; (3) data quality has a greater impact than data diversity on achieving optimal performance. We leverage these insights to develop a data synthesis methodology and demonstrate significant improvements in the analytical reasoning capabilities of open-source LLMs.
