LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers
March 18, 2025
Authors: Nikhil Abhyankar, Parshin Shojaee, Chandan K. Reddy
cs.AI
Abstract
Automated feature engineering plays a critical role in improving predictive
model performance for tabular learning tasks. Traditional automated feature
engineering methods are limited by their reliance on pre-defined
transformations within fixed, manually designed search spaces, often neglecting
domain knowledge. Recent advances using Large Language Models (LLMs) have
enabled the integration of domain knowledge into the feature engineering
process. However, existing LLM-based approaches use direct prompting or rely
solely on validation scores for feature selection, failing to leverage insights
from prior feature discovery experiments or establish meaningful reasoning
between feature generation and data-driven performance. To address these
challenges, we propose LLM-FE, a novel framework that combines evolutionary
search with the domain knowledge and reasoning capabilities of LLMs to
automatically discover effective features for tabular learning tasks. LLM-FE
formulates feature engineering as a program search problem, where LLMs propose
new feature transformation programs iteratively, and data-driven feedback
guides the search process. Our results demonstrate that LLM-FE consistently
outperforms state-of-the-art baselines, significantly enhancing the performance
of tabular prediction models across diverse classification and regression
benchmarks.
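The program-search formulation in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `propose` function below is a hypothetical stand-in that recombines earlier programs at random where LLM-FE would prompt an LLM, and the absolute Pearson correlation with the target is an assumed substitute for the validation score of a trained tabular model. All names (`Program`, `evaluate`, `propose`, `search`) and the synthetic data are illustrative assumptions.

```python
import random
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
# A "program" here is a readable name plus a function mapping a row to a feature value.
Program = Tuple[str, Callable[[Row], float]]

OPS = {
    "+": lambda x, y: x + y,
    "-": lambda x, y: x - y,
    "*": lambda x, y: x * y,
}


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = (var_x * var_y) ** 0.5
    return cov / denom if denom > 0 else 0.0


def evaluate(program: Program, rows: List[Row], target: List[float]) -> float:
    """Data-driven feedback (assumed proxy): |correlation| of the feature with the target."""
    values = [program[1](row) for row in rows]
    return abs(pearson(values, target))


def propose(scored: List[Tuple[float, Program]], rng: random.Random) -> Program:
    """Hypothetical stand-in for the LLM step: combine two earlier programs."""
    _, (name_a, fn_a) = rng.choice(scored)
    _, (name_b, fn_b) = rng.choice(scored)
    symbol = rng.choice(sorted(OPS))
    op = OPS[symbol]
    return (f"({name_a} {symbol} {name_b})", lambda row: op(fn_a(row), fn_b(row)))


def search(rows, target, columns, iterations=30, keep=5, seed=0):
    """Iteratively propose programs, score them, and keep the best few."""
    rng = random.Random(seed)
    population = [(c, (lambda row, c=c: row[c])) for c in columns]
    scored = [(evaluate(p, rows, target), p) for p in population]
    for _ in range(iterations):
        candidate = propose(scored, rng)
        scored.append((evaluate(candidate, rows, target), candidate))
        scored.sort(key=lambda item: item[0], reverse=True)
        scored = scored[:keep]  # survivors seed the next round of proposals
    return scored[0]


# Synthetic data (an assumption for the demo): the target is the product of two columns.
data_rng = random.Random(1)
rows = [{c: data_rng.random() for c in ("a", "b", "c")} for _ in range(200)]
target = [row["a"] * row["b"] for row in rows]

baseline = max(
    evaluate((c, lambda row, c=c: row[c]), rows, target) for c in ("a", "b", "c")
)
best_score, (best_name, _) = search(rows, target, ["a", "b", "c"])
print(best_name, round(best_score, 3), "baseline:", round(baseline, 3))
```

Because the best programs are retained between iterations, the score of the returned program can never fall below that of the best raw column. In a closer analogue of the framework described above, `propose` would show the LLM the retained programs together with their scores and a description of the dataset, and `evaluate` would train a prediction model on the augmented table.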