data quality; data anomaly detection; categorical data; machine learning
Abstract :
[en] Data quality is crucial in modern software systems, like data-driven decision support systems. However, data quality is affected by data anomalies, which represent instances that deviate from most of the data. These anomalies affect the reliability and trustworthiness of software systems, and may propagate and cause more issues. Although many anomaly detection approaches have been proposed, they mainly focus on numerical data. Moreover, the few approaches targeting anomaly detection for categorical data do not yield consistent results across datasets.
In this paper, we propose a novel anomaly detection approach for categorical data named LAFF-AD (LAFF-based Anomaly Detection), which takes advantage of the learning ability of a state-of-the-art form filling tool (LAFF) to perform value inference on suspicious data. LAFF-AD runs a variant of LAFF that predicts the possible values of a suspicious categorical field in the suspicious instance. LAFF-AD then compares the output of LAFF to the recorded values in the suspicious instance, and uses a heuristic-based strategy to detect categorical data anomalies.
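The predict-then-compare idea described above can be illustrated with a minimal sketch. It is not LAFF or LAFF-AD: a simple co-occurrence counter stands in for the form-filling model, and the function names, the `top_k` parameter, the abstain-when-unseen rule, and the sample records are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_value_model(records, target, context):
    """Count how often each target value co-occurs with each combination
    of context values (a hypothetical stand-in for a trained filling model)."""
    counts = defaultdict(Counter)
    for rec in records:
        key = tuple(rec[c] for c in context)
        counts[key][rec[target]] += 1
    return counts


def predict_values(model, record, context, top_k=2):
    """Return up to top_k plausible target values for the record's context."""
    key = tuple(record[c] for c in context)
    seen = model.get(key)
    if not seen:
        return []  # no evidence for this context: the model abstains
    return [value for value, _ in seen.most_common(top_k)]


def is_anomalous(model, record, target, context, top_k=2):
    """Heuristic (assumed): flag the recorded value when the model offers
    candidates and the recorded value is not among them."""
    candidates = predict_values(model, record, context, top_k)
    if not candidates:
        return False  # abstain rather than flag without evidence
    return record[target] not in candidates


# Hypothetical historical records used to fit the stand-in model.
history = (
    [{"country": "LU", "currency": "EUR"}] * 3
    + [{"country": "CA", "currency": "CAD"}] * 3
)
model = train_value_model(history, target="currency", context=["country"])

suspicious = {"country": "LU", "currency": "CAD"}
regular = {"country": "LU", "currency": "EUR"}
unseen = {"country": "IE", "currency": "EUR"}
```

Calling `is_anomalous(model, suspicious, "currency", ["country"])` flags the record because the recorded currency is not among the values predicted for its context, while `regular` passes and `unseen` is left unflagged because the stand-in model has no evidence for that context.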
We evaluated LAFF-AD by assessing its effectiveness and efficiency on six datasets. Our experimental results show that LAFF-AD can accurately detect a wide range of data anomalies, with recall values between 0.6 and 1 and a precision value of at least 0.808. Furthermore, LAFF-AD is efficient, taking at most 7000 s and 735 ms to perform training and prediction, respectively.
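For readers less familiar with the two metrics reported above, the following sketch shows how precision and recall are computed from binary anomaly flags. The function name and the toy labels are hypothetical and are not taken from the paper's evaluation.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for boolean anomaly flags.

    predicted[i] is True when the detector flags instance i;
    actual[i] is True when instance i really is an anomaly.
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    # Guard against division by zero when nothing is flagged or nothing is anomalous.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels (made up): 2 true positives, 1 false positive, 1 false negative.
predicted_flags = [True, True, False, True, False]
actual_flags = [True, False, False, True, True]
precision, recall = precision_recall(predicted_flags, actual_flags)
```

Precision is the share of flagged instances that are real anomalies, and recall is the share of real anomalies that were flagged; on the toy labels both equal 2/3.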
Research center :
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SVV - Software Verification and Validation
Disciplines :
Computer science
Author, co-author :
Belgacem, Hichem ; Luxembourg Institute of Science and Technology, Esch-sur-Alzette, Luxembourg
Li, Xiaochen ; Dalian University of Technology, Dalian, China
Funders :
FNR - Fonds National de la Recherche ; NSERC - Natural Sciences and Engineering Research Council ; SFI - Science Foundation Ireland
Funding number :
C22/IS/17373407/LOGODOR
Funding text :
This research was funded in whole, or in part, by the Luxembourg National Research Fund (FNR), grant reference C22/IS/17373407/LOGODOR. Lionel Briand was in part supported by the Canada Research Chair and Discovery Grant programs of the Natural Sciences and Engineering Research Council of Canada (NSERC), and the Science Foundation Ireland grant 13/RC/2094-2. For the purpose of open access, and in fulfillment of the obligations arising from the grant agreement, the authors have applied a Creative Commons Attribution 4.0 International (CC BY 4.0) license to any Author Accepted Manuscript version arising from this submission.