LLM-FE: Automated Feature Engineering for Tabular Data with Large Language Models as Evolutionary Optimizers

Abstract

Automated feature engineering plays a critical role in improving predictive model performance for tabular learning tasks. Traditional automated feature engineering methods are limited by their reliance on pre-defined transformations within fixed, manually designed search spaces, and they often neglect domain knowledge. Recent advances in Large Language Models (LLMs) have enabled the integration of domain knowledge into the feature engineering process. However, existing LLM-based approaches either use direct prompting or rely solely on validation scores for feature selection; they fail to leverage insights from prior feature discovery experiments and do not establish a meaningful reasoning link between feature generation and data-driven performance. To address these challenges, we propose LLM-FE, a novel framework that combines evolutionary search with the domain knowledge and reasoning capabilities of LLMs to automatically discover effective features for tabular learning tasks. LLM-FE formulates feature engineering as a program search problem, where LLMs iteratively propose new feature transformation programs and data-driven feedback guides the search process. Our results demonstrate that LLM-FE consistently outperforms state-of-the-art baselines, significantly enhancing the performance of tabular prediction models across diverse classification and regression benchmarks.
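
The loop described above (propose a feature program, score it on data, feed the scored history back into the next proposal) can be illustrated with a minimal sketch. This is not the LLM-FE implementation: the `propose` function is a hypothetical random stand-in for the LLM call, the score is the absolute Pearson correlation between the new feature and the target rather than a prediction model's validation score, and the toy columns `a` and `b` and all function names are assumptions made for illustration only.

```python
import random
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
# A candidate "feature program": a readable name plus a row -> value function.
Program = Tuple[str, Callable[[Row], float]]

OPS: Dict[str, Callable[[float, float], float]] = {
    "+": lambda x, y: x + y,
    "-": lambda x, y: x - y,
    "*": lambda x, y: x * y,
    "/": lambda x, y: x / y,  # safe here: toy columns are always >= 1
}
COLUMNS = ["a", "b"]  # hypothetical toy columns


def make_data(n: int, seed: int = 0) -> Tuple[List[Row], List[float]]:
    """Toy regression data whose target depends on the product a * b."""
    rng = random.Random(seed)
    rows = [{"a": rng.uniform(1, 5), "b": rng.uniform(1, 5)} for _ in range(n)]
    target = [r["a"] * r["b"] + rng.gauss(0, 0.1) for r in rows]
    return rows, target


def abs_pearson(xs: List[float], ys: List[float]) -> float:
    """Absolute Pearson correlation; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def propose(history: List[Tuple[float, str]], rng: random.Random) -> Program:
    """Stand-in for the LLM: sample a program, biased by the scored history."""
    op = rng.choice(list(OPS))
    if history and rng.random() < 0.5:
        # Reuse the operator of the best program so far, e.g. "*" from "a * b".
        op = max(history)[1].split()[1]
    left, right = rng.choice(COLUMNS), rng.choice(COLUMNS)
    fn = OPS[op]
    return f"{left} {op} {right}", lambda row: fn(row[left], row[right])


def evolve(rows: List[Row], target: List[float],
           iterations: int = 30, seed: int = 0) -> List[Tuple[float, str]]:
    """Propose, score, and record programs; return them best-first."""
    rng = random.Random(seed)
    history: List[Tuple[float, str]] = []
    for _ in range(iterations):
        name, program = propose(history, rng)
        feature = [program(r) for r in rows]
        history.append((abs_pearson(feature, target), name))
    return sorted(history, reverse=True)


if __name__ == "__main__":
    rows, target = make_data(200)
    for score, name in evolve(rows, target)[:3]:
        print(f"{name}: {score:.3f}")
```

In a faithful setup, `propose` would prompt an LLM with the task description and the best-scoring prior programs, and the score would come from training a tabular model on the augmented features and measuring validation performance.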