AI-enabled Automation for Completeness Checking of Privacy Policies

Requirements Engineering; Legal Compliance; Privacy Policies; The General Data Protection Regulation (GDPR); Artificial Intelligence (AI); Conceptual Modeling; Qualitative Research

Abstract :

Technological advances in information sharing have raised concerns about data protection. Privacy policies contain privacy-related requirements about how the personal data of individuals will be handled by an organization or a software system (e.g., a web service or an app). In Europe, privacy policies are subject to compliance with the General Data Protection Regulation (GDPR). A prerequisite for GDPR compliance checking is to verify whether the content of a privacy policy is complete according to the provisions of GDPR. Incomplete privacy policies might result in large fines for the violating organization as well as incomplete privacy-related software specifications. Manual completeness checking is both time-consuming and error-prone. In this paper, we propose AI-based automation for the completeness checking of privacy policies. Through systematic qualitative methods, we first build two artifacts to characterize the privacy-related provisions of GDPR, namely a conceptual model and a set of completeness criteria. Then, we develop an automated solution on top of these artifacts by leveraging a combination of natural language processing and supervised machine learning. Specifically, we identify the GDPR-relevant information content in privacy policies and subsequently check it against the completeness criteria. To evaluate our approach, we collected 234 real privacy policies from the fund industry. Over a set of 48 unseen privacy policies, our approach correctly detected 300 of the total of 334 violations of the completeness criteria, while producing 23 false positives. The approach thus has a precision of 92.9% and a recall of 89.8%. Compared to a baseline that applies keyword search only, our approach results in an improvement of 24.5% in precision and 38% in recall.
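The reported precision and recall follow directly from the counts given in the abstract. The short sketch below reproduces that arithmetic. The function name `precision_recall` and its parameter names are illustrative assumptions, not part of the authors' tooling, and the standard definitions of precision and recall are assumed.

```python
def precision_recall(true_positives, false_positives, total_actual):
    """Return (precision, recall) as fractions.

    Hypothetical helper for illustration only; it uses the standard
    definitions precision = TP / (TP + FP) and recall = TP / (TP + FN).
    """
    # Violations that exist but were not detected.
    false_negatives = total_actual - true_positives
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Counts taken from the abstract: 300 of 334 violations detected
# correctly, with 23 false positives.
precision, recall = precision_recall(
    true_positives=300, false_positives=23, total_actual=334
)
print(f"precision = {precision:.1%}, recall = {recall:.1%}")
```

With these counts, 34 violations are missed (334 - 300), so precision is 300/323 and recall is 300/334, which round to the 92.9% and 89.8% stated in the abstract.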

Research center :

Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SVV - Software Verification and Validation

Disciplines :

Computer science

Author, co-author :

AMARAL CEJAS, Orlando ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SVV

ABUALHAIJA, Sallam ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SVV

Torre, Damiano ; Texas A&M University > Department of Computer Information Systems

SABETZADEH, Mehrdad ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SVV

BRIAND, Lionel ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SVV

External co-authors :

yes

Language :

English

Title :

AI-enabled Automation for Completeness Checking of Privacy Policies

Publication date :

November 2021

Journal title :

IEEE Transactions on Software Engineering

ISSN :

0098-5589

eISSN :

1939-3520

Publisher :

Institute of Electrical and Electronics Engineers, New York, United States

Peer reviewed :

Peer Reviewed verified by ORBi

Focus Area :

Security, Reliability and Trust

FnR Project :

FNR13759068 - Artificial Intelligence-enabled Automation for GDPR Compliance, 2019 (01/01/2020-31/12/2022) - Lionel Briand

Funders :

FNR - Luxembourg National Research Fund

Available on ORBilu :

since 26 October 2021

Statistics

Number of views

612 (113 by Unilu)

Number of downloads

492 (26 by Unilu)

