Keywords :
Anonymization; Diversity in AI; Generalization; LLMs; De-anonymization
Abstract :
Text anonymization aims to enable the secure sharing of information between parties. One of its main challenges is balancing data privacy against data utility. To address this challenge, recent studies have explored the use of Large Language Models (LLMs), which have shown improved performance on datasets from Europe. Building on these findings, this paper creates a dataset from a less explored part of the world, specifically Africa, to assess the relevance of LLMs on diverse datasets and to discuss how well the results generalize. Additionally, this paper proposes an evaluation framework for assessing various anonymization techniques, including those utilizing LLMs. The performance of these techniques is assessed using several metrics, such as BERTScore for semantic evaluation and Information Loss for utility preservation.
Research center :
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > FINATRAX - Digital Financial Services and Cross-organizational Digital Transformations
Disciplines :
Management information systems; Computer science
Author, co-author :
KOKMEL, Meliane Angele ; University of Luxembourg > Faculty of Science, Technology and Medicine > Department of Computer Science > Team Leon VAN DER TORRE
ABBAS, Antragama Ewa ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > FINATRAX
TCHAPPI HAMAN, Igor ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > FINATRAX
External co-authors :
no
Language :
English
Title :
Striking the Balance: Generalization vs. Memorization in Anonymization and De-anonymization through LLMs
Publication date :
2025
Event name :
Proceedings of The 8th International Conference on Emerging Data and Industry (EDI40)
This research was funded in whole by the Luxembourg National Research Fund (FNR) and PayPal, PEARL grant reference 13342933/Gilbert Fridgen. For the purpose of open access, and in fulfillment of the obligations arising from the grant agreement, the author has applied a Creative Commons Attribution 4.0 International (CC BY 4.0) license to any Author Accepted Manuscript version arising from this submission.