Impact of Log Parsing on Deep Learning-Based Anomaly Detection

Software systems log massive amounts of data, recording important runtime information. Such logs are used, for example, for log-based anomaly detection, which aims to automatically detect abnormal behaviors of the system under analysis by processing the information recorded in its logs. Many log-based anomaly detection techniques based on deep learning models include a pre-processing step called log parsing. However, the impact of log parsing on the accuracy of anomaly detection techniques has received surprisingly little attention so far. Investigating which key properties log parsing techniques should ideally have to help anomaly detection is therefore warranted. In this paper, we report on a comprehensive empirical study on the impact of log parsing on anomaly detection accuracy, using 13 log parsing techniques and seven anomaly detection techniques (five based on deep learning and two based on traditional machine learning) on three publicly available log datasets. Our empirical results show that, despite what is widely assumed, there is no strong correlation between log parsing accuracy and anomaly detection accuracy, regardless of the metric used for measuring log parsing accuracy. Moreover, we experimentally confirm existing theoretical results showing that a property of log parsing results that we refer to as distinguishability, as opposed to their accuracy, plays an essential role in achieving accurate anomaly detection.

Research center :

Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SVV - Software Verification and Validation

Disciplines :

Computer science

Author, co-author :

KHAN, Zanis Ali ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust > SVV > Team Domenico BIANCULLI

Shin, Donghwan

BIANCULLI, Domenico ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SVV

BRIAND, Lionel ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SVV

External co-authors :

yes

Language :

English

Title :

Impact of Log Parsing on Deep Learning-Based Anomaly Detection

Publication date :

2024

Journal title :

Empirical Software Engineering

ISSN :

1382-3256

eISSN :

1573-7616

Publisher :

Kluwer Academic Publishers, Netherlands

Volume :

Pages :

139:1--139:33

Peer reviewed :

Peer Reviewed verified by ORBi

Focus Area :

Security, Reliability and Trust

FnR Project :

FNR17373407 - Automated Log Smell Detection And Removal, 2022 (01/09/2023-31/08/2026) - Domenico Bianculli

Name of the research project :

LOGODOR - Automated Log Smell Detection and Removal

Funders :

FNR - Luxembourg National Research Fund

Funding number :

C22/IS/17373407/LOGODOR

Funding text :

This research was funded in whole, or in part, by the Luxembourg National Research Fund (FNR), grant reference C22/IS/17373407/LOGODOR. Lionel Briand was in part supported by the Canada Research Chair and Discovery Grant programs of the Natural Sciences and Engineering Research Council of Canada (NSERC), and the Science Foundation Ireland grant 13/RC/2094-2. For the purpose of open access, and in fulfillment of the obligations arising from the grant agreement, the authors have applied a Creative Commons Attribution 4.0 International (CC BY 4.0) license to any Author Accepted Manuscript version arising from this submission.

Data Set :

Replication package for "Impact of Log Parsing on Deep Learning-Based Anomaly Detection"

Available on ORBilu :

since 12 August 2024

