Feature Generation Using LLMs: An Evolutionary Algorithm Approach

NOURBAKHSH, Aria; ALCARAZ, Benoît; SCHOMMER, Christoph

doi:10.1007/978-3-031-89103-8_4

Paper published in a book (Scientific congresses, symposiums and conference proceedings)

2025 • In Mualla, Yazan (Ed.) Advances in Explainability, Agents, and Large Language Models - 1st International Workshop on Causality, Agents and Large Models, CALM 2024, Proceedings

Peer reviewed

Permalink
https://hdl.handle.net/10993/67649

Files

Full Text

LLM_Paper.pdf

Author postprint (477.61 kB)

Details

Keywords :

Feature generation; Feature engineering; Large language model; Machine learning

Abstract :

A crucial step in machine learning pipelines is to present each entity with features or attributes that are representative of the characteristics of the processed entities. Feature engineering is an important step in finding a relation among attributes that otherwise may not be processed by the ML algorithms. Meanwhile, Large Language Models have shown promising abilities in coding, mathematical reasoning, and processing world knowledge. In this work, we utilize an LLM for the problem of feature generation from tabular data based on the previously given features. We have created a pipeline that takes a set of attributes and a prompt to generate new features. Then, our selection algorithm selects the best-performing sets of attributes. We apply our method to eight datasets from different domains and data types. Our results show that, in most cases, the language model can produce new features based on mathematical and logical operators that are useful for the given tasks and can improve the classification result.
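The generate-then-select loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the LLM call is replaced by a hypothetical stand-in, `propose_features`, that returns random arithmetic combinations of two columns, and the fitness function is assumed to be the hold-out accuracy of a simple nearest-centroid classifier. All names (`propose_features`, `fitness`, `evolve`, `make_toy_data`) and the synthetic data are assumptions made for illustration only.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, float]
Feature = Tuple[str, Callable[[Row], float]]  # (readable name, function of a row)
Sample = Tuple[Row, int]                      # (row, class label)

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}


def propose_features(columns: Sequence[str], rng: random.Random, n: int = 2) -> List[Feature]:
    """Hypothetical stand-in for the LLM: random arithmetic combinations of two columns."""
    feats: List[Feature] = []
    for _ in range(n):
        a, b = rng.sample(list(columns), 2)
        sym = rng.choice(sorted(OPS))
        op = OPS[sym]
        feats.append((f"({a} {sym} {b})", lambda row, a=a, b=b, op=op: op(row[a], row[b])))
    return feats


def fitness(features: List[Feature], train: List[Sample], valid: List[Sample]) -> float:
    """Validation accuracy of a nearest-centroid classifier (an assumed fitness measure)."""
    def vec(row: Row) -> List[float]:
        return [f(row) for _, f in features]

    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for row, label in train:
        v = vec(row)
        acc = sums.setdefault(label, [0.0] * len(v))
        for i, x in enumerate(v):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

    def predict(row: Row) -> int:
        v = vec(row)
        return min(centroids, key=lambda lab: sum((x - c) ** 2 for x, c in zip(v, centroids[lab])))

    return sum(predict(row) == label for row, label in valid) / len(valid)


def evolve(columns: Sequence[str], train: List[Sample], valid: List[Sample],
           generations: int = 5, population: int = 6, keep: int = 2,
           seed: int = 0) -> Tuple[float, List[str]]:
    """Grow feature sets over several generations, keeping the best-scoring sets each time."""
    rng = random.Random(seed)
    base: List[Feature] = [(c, lambda row, c=c: row[c]) for c in columns]
    survivors = [base]
    for _ in range(generations):
        candidates = list(survivors)  # elitism: previous survivors compete again
        while len(candidates) < population:
            parent = rng.choice(survivors)
            candidates.append(parent + propose_features(columns, rng))
        candidates.sort(key=lambda fs: fitness(fs, train, valid), reverse=True)
        survivors = candidates[:keep]
    best = survivors[0]
    return fitness(best, train, valid), [name for name, _ in best]


def make_toy_data(n: int, seed: int) -> List[Sample]:
    """Synthetic data: the label depends on the sign of x * y, not on x or y alone."""
    rng = random.Random(seed)
    data: List[Sample] = []
    for _ in range(n):
        row = {"x": rng.uniform(-1, 1), "y": rng.uniform(-1, 1)}
        data.append((row, int(row["x"] * row["y"] > 0)))
    return data
```

Calling `evolve(["x", "y"], make_toy_data(200, 1), make_toy_data(200, 2))` returns the best validation accuracy found and the names of the selected features. Because the surviving sets are carried into the next generation, the returned score can never fall below that of the original columns; in a real pipeline the selection split would be kept separate from the final test split.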

Disciplines :

Computer science

Author, co-author :

NOURBAKHSH, Aria ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)

ALCARAZ, Benoît ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)

SCHOMMER, Christoph ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)

Language :

English

Publication date :

2025

Event name :

CALM2024

Event organizer :

PRIMA2024

Event place :

Kyoto, Japan

Event date :

18-11-2024 to 19-11-2024

Main work title :

Advances in Explainability, Agents, and Large Language Models - 1st International Workshop on Causality, Agents and Large Models, CALM 2024, Proceedings

Editor :

Mualla, Yazan

Publisher :

Springer Science and Business Media Deutschland GmbH

ISBN/EAN :

978-3-03-189102-1

Additional URL :

https://link.springer.com/content/pdf/10.1007/978-3-031-89103-8_4

Funding text :

We thank the Luxembourg National Research Fund (FNR) for the funding of this research as part of the project C21-Collaboration 21: IPBG2020/IS/14839977/C21.

Available on ORBilu :

since 02 February 2026
