Data quality; Data-driven AI; Drift detection; Machine learning; MLOps; Multidisciplinary
Abstract :
This paper introduces a novel end-to-end framework that efficiently integrates data quality assessment with machine learning (ML) model operations in real-time production environments. While existing approaches treat data quality assessment and ML systems as isolated processes, our framework addresses the critical gap between theoretical methods and practical implementation by combining dynamic drift detection, adaptive data quality metrics, and MLOps into a cohesive, lightweight system. The key innovation lies in its operational efficiency, enabling real-time, quality-driven ML decision-making with minimal computational overhead. We validate the framework in a steel manufacturing company’s Electroslag Remelting (ESR) vacuum pumping process, demonstrating a 12 % improvement in model performance (R² = 94 %) and a fourfold reduction in prediction latency. By exploring the impact of data quality acceptability thresholds, we provide actionable insights into balancing data quality standards and predictive performance in industrial applications. This framework represents a significant advancement in MLOps, offering a robust solution for time-sensitive, data-driven decision-making in dynamic industrial environments.
Disciplines :
Computer science
Author, co-author :
BAYRAM, Firas ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal ; Department of Mathematics and Computer Science, Karlstad University, Karlstad, Sweden
Ahmed, Bestoun S.; Department of Mathematics and Computer Science, Karlstad University, Karlstad, Sweden ; American University of Bahrain, Riffa, Bahrain
Hallin, Erik; Uddeholms AB, Hagfors, Sweden
External co-authors :
yes
Language :
English
Title :
End-to-end data quality-driven framework for machine learning in production environment