Paper published in a book (Scientific congresses, symposiums and conference proceedings)
Feature Generation Using LLMs: An Evolutionary Algorithm Approach
NOURBAKHSH, Aria; ALCARAZ, Benoît; SCHOMMER, Christoph
2025 • In Mualla, Yazan (Ed.) Advances in Explainability, Agents, and Large Language Models - 1st International Workshop on Causality, Agents and Large Models, CALM 2024, Proceedings
Feature Generation; Large Language Model; Machine Learning; Feature Engineering
Abstract :
[en] A crucial step in machine learning pipelines is to present each entity with features or attributes that are representative of the characteristics of the processed entities. Feature engineering is an important step in finding a relation among attributes that otherwise may not be processed by the ML algorithms. Meanwhile, Large Language Models have shown promising abilities in coding, mathematical reasoning, and processing world knowledge. In this work, we utilize an LLM for the problem of feature generation from tabular data based on the previously given features. We have created a pipeline that takes a set of attributes and a prompt to generate new features. Then, our selection algorithm selects the best-performing sets of attributes. We apply our method to eight datasets from different domains and data types. Our results show that, in most cases, the language model can produce new features based on mathematical and logical operators that are useful for the given tasks and can improve the classification result.
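The generate-then-select loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `propose_feature` is a hypothetical stand-in for the LLM call (it only combines two columns with a random arithmetic operator), the nearest-centroid fitness function is an assumption chosen to keep the example dependency-free, and fitness is measured on the training rows rather than on a held-out split.

```python
import random
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Feature = Tuple[str, Callable[[Row], float]]  # (readable name, function of one row)

OPERATORS = [
    ("+", lambda x, y: x + y),
    ("-", lambda x, y: x - y),
    ("*", lambda x, y: x * y),
]


def propose_feature(columns: List[str], rng: random.Random) -> Feature:
    """Hypothetical stand-in for the LLM: combine two columns with one operator."""
    a, b = rng.sample(columns, 2)
    symbol, op = rng.choice(OPERATORS)
    return (f"({a} {symbol} {b})", lambda row: op(row[a], row[b]))


def nearest_centroid_accuracy(rows: List[Row], labels: List[int],
                              features: List[Feature]) -> float:
    """Fitness (an assumption): training accuracy of a nearest-centroid classifier."""
    vectors = [[fn(row) for _, fn in features] for row in rows]
    centroids = {}
    for label in set(labels):
        members = [v for v, y in zip(vectors, labels) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def predict(vector: List[float]) -> int:
        return min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(vector, centroids[c])))

    hits = sum(predict(v) == y for v, y in zip(vectors, labels))
    return hits / len(labels)


def evolve(rows: List[Row], labels: List[int], columns: List[str],
           generations: int = 10, population_size: int = 6,
           seed: int = 0) -> Tuple[List[str], float]:
    """Grow feature sets by mutation and keep the best-scoring ones."""
    rng = random.Random(seed)
    base: List[Feature] = [(c, lambda row, c=c: row[c]) for c in columns]
    population = [base]
    for _ in range(generations):
        # Mutation: each parent yields two children, each with one new feature.
        children = [parent + [propose_feature(columns, rng)]
                    for parent in population for _ in range(2)]
        # Selection with elitism: parents compete with their children.
        candidates = population + children
        candidates.sort(
            key=lambda fs: nearest_centroid_accuracy(rows, labels, fs),
            reverse=True)
        population = candidates[:population_size]
    best = population[0]
    return [name for name, _ in best], nearest_centroid_accuracy(rows, labels, best)
```

Calling `evolve(rows, labels, ["x", "y"])` on a list of row dictionaries with numeric columns `x` and `y` returns the names of the best feature set found and its fitness. Because parents are kept in the candidate pool, the returned fitness is never lower than that of the original columns alone.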
Disciplines :
Computer science
Author, co-author :
NOURBAKHSH, Aria ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
ALCARAZ, Benoît ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
SCHOMMER, Christoph ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
External co-authors :
no
Language :
English
Title :
Feature Generation Using LLMs: An Evolutionary Algorithm Approach
Publication date :
2025
Event name :
CALM2024
Event organizer :
PRIMA2024
Event place :
Kyoto, Japan
Event date :
18-11-2024 => 19-11-2024
Main work title :
Advances in Explainability, Agents, and Large Language Models - 1st International Workshop on Causality, Agents and Large Models, CALM 2024, Proceedings
Editor :
Mualla, Yazan
Publisher :
Springer Science and Business Media Deutschland GmbH
We thank the Luxembourg National Research Fund (FNR) for the funding of this research as part of the project C21-Collaboration 21: IPBG2020/IS/14839977/C21.