LML: Language Model Learning a Dataset for Data-Augmented Prediction

Abstract

This paper introduces a new approach to using Large Language Models (LLMs) for classification tasks, which are typically handled with Machine Learning (ML) models. Unlike ML models, which rely heavily on data cleaning and feature engineering, this method streamlines the process by using LLMs. The paper proposes a new concept called "Language Model Learning (LML)", powered by a new method called "Data-Augmented Prediction (DAP)". The LLM performs classification much as a human would: by exploring and understanding the data, then deciding on a class with the data as a reference. The training data is summarized and evaluated to determine which features contribute most to the classification of each label. In DAP, the system uses the data summary to automatically create a query, which is then used to retrieve relevant rows from the dataset. The LLM generates a classification from the data summary and the relevant rows, which yields satisfactory accuracy even with complex data. Using the data summary and similar data in DAP ensures context-aware decision-making. The proposed method includes the words "Act as an Explainable Machine Learning Model" in the prompt to enhance the interpretability of the predictions, allowing users to review the logic behind each prediction. In some test cases, the system achieved an accuracy above 90%, demonstrating its effectiveness and its potential to outperform conventional ML models in various scenarios. The code is available at https://github.com/Pro-GenAI/LML-DAP
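The DAP flow described in the abstract (summarize the data, retrieve relevant rows, then prompt the LLM with both) can be illustrated with a minimal sketch. This is not the authors' implementation, which is in the linked repository. Several pieces are assumptions made so the example runs offline: the summary is a plain per-label mean rather than an LLM-written summary, retrieval uses nearest neighbours instead of an automatically generated query, `fake_llm` is a stub standing in for a real LLM call, and the toy rows and all function names are hypothetical. Only the prompt phrase "Act as an Explainable Machine Learning Model" comes from the paper.

```python
import math
from collections import Counter

LABEL = "species"  # hypothetical label column for the toy data below

# Toy rows invented for illustration; not data from the paper.
ROWS = [
    {"length": 1.4, "width": 0.2, "species": "setosa"},
    {"length": 1.3, "width": 0.2, "species": "setosa"},
    {"length": 1.5, "width": 0.3, "species": "setosa"},
    {"length": 4.7, "width": 1.4, "species": "versicolor"},
    {"length": 4.5, "width": 1.5, "species": "versicolor"},
    {"length": 4.9, "width": 1.5, "species": "versicolor"},
]
SAMPLE = {"length": 4.6, "width": 1.4}  # unlabeled row to classify


def summarize(rows, label_key):
    """Stand-in data summary (assumption): per-label mean of every feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row)
    lines = []
    for label in sorted(groups):
        group = groups[label]
        features = [name for name in group[0] if name != label_key]
        means = ", ".join(
            f"{name}={sum(r[name] for r in group) / len(group):.2f}"
            for name in features
        )
        lines.append(f"{label}: {means}")
    return "\n".join(lines)


def retrieve(rows, sample, k=3):
    """Stand-in retrieval (assumption): the k rows nearest to the sample."""
    def distance(row):
        return math.sqrt(sum((row[name] - sample[name]) ** 2 for name in sample))
    return sorted(rows, key=distance)[:k]


def build_prompt(summary, relevant, sample, label_key):
    """Combine the summary, the retrieved rows, and the row to classify."""
    shown = []
    for row in relevant:
        features = {name: row[name] for name in sample}
        shown.append(f"{features} -> {row[label_key]}")
    return "\n".join([
        "Act as an Explainable Machine Learning Model.",
        "Data summary:",
        summary,
        "Relevant rows:",
        *shown,
        f"Classify this row: {sample}",
        "Reply with 'label: <label>' on the first line, then your reasoning.",
    ])


def fake_llm(prompt):
    """Offline stub for a real LLM call: majority vote over the retrieved rows."""
    votes = Counter(
        line.rsplit(" -> ", 1)[1] for line in prompt.splitlines() if " -> " in line
    )
    label, count = votes.most_common(1)[0]
    return f"label: {label}\n{count} of the retrieved rows carry this label."


def predict(rows, sample, label_key, llm, k=3):
    """Run summary, retrieval, and prompting, then parse the label from the reply."""
    prompt = build_prompt(
        summarize(rows, label_key), retrieve(rows, sample, k), sample, label_key
    )
    reply = llm(prompt)
    label = reply.splitlines()[0].partition(":")[2].strip()
    return label, reply


label, explanation = predict(ROWS, SAMPLE, LABEL, fake_llm)
print(label)
print(explanation)
```

Passing the LLM as a plain callable keeps the sketch independent of any particular provider: replacing `fake_llm` with a function that sends the prompt to a real model and returns its text reply is the only change needed, provided the reply keeps the `label: ...` first line that `predict` parses.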
