Paper published in a book (Scientific congresses, symposiums and conference proceedings)
Auto-clustering of Financial Reports Based on Formatting Style and Author’s Fingerprint
BLANCO, Braulio; BRORSSON, Mats Håkan; ZURAD, Maciej
2023 • In Koprinska, Irena (Ed.) Machine Learning and Principles and Practice of Knowledge Discovery in Databases - International Workshops of ECML PKDD 2022, Proceedings
[en] We present a new algorithm for clustering financial reports based on their formatting and style. The algorithm uses layout and content information to automatically generate as many clusters as needed. This reduces the effort of labeling reports in order to train text-based machine learning models that extract person or company names, addresses, financial categories, etc. The algorithm also produces a set of sub-clusters inside each cluster, where each sub-cluster corresponds to a set of reports made by the same author (person or firm). The sub-cluster information allows us to evaluate changes of author over time. We applied the algorithm to a dataset of over 38,000 financial reports (the last Annual Account presented by each company) from the Luxembourg Business Registers (LBR) and found 2,165 clusters containing between 2 and 850 documents each, with a median of 4 and an average of 14. When adding 2,500 new documents (previous annual accounts presented by companies) to the existing cluster set, we found that 67.3% of the financial reports were placed in the correct cluster and sub-cluster. Of the remaining documents, 65% were placed in a different sub-cluster because the company changed its formatting style, which is expected and correct behavior. Finally, by labeling 11% of the entire dataset, we can replicate these labels to up to 72% of the dataset while keeping a high feature coverage.
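The idea of generating "as many clusters as needed" and later adding new documents to an existing cluster set can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify; it is a simple threshold-based (leader-style) clustering swapped in for illustration. The three layout features per document, the `threshold` value, and the names `Cluster` and `assign` are all hypothetical.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class Cluster:
    centroid: tuple[float, ...]
    members: list[str] = field(default_factory=list)


def assign(doc_id, features, clusters, threshold):
    """Place a document in the nearest cluster, or open a new one.

    Returns the index of the cluster the document ends up in.
    """
    best_idx, best_d = None, threshold
    for i, cluster in enumerate(clusters):
        d = dist(features, cluster.centroid)
        if d <= best_d:
            best_idx, best_d = i, d

    if best_idx is None:
        # No cluster is close enough: the number of clusters grows on demand.
        clusters.append(Cluster(centroid=tuple(features), members=[doc_id]))
        return len(clusters) - 1

    cluster = clusters[best_idx]
    n = len(cluster.members)
    # Update the centroid as a running mean of its members' features.
    cluster.centroid = tuple(
        (m * n + x) / (n + 1) for m, x in zip(cluster.centroid, features)
    )
    cluster.members.append(doc_id)
    return best_idx


# Hypothetical layout features per report, for example table coverage,
# normalized font size, and normalized left margin (assumed, not from the paper).
docs = {
    "a": (0.10, 0.50, 0.20),
    "b": (0.12, 0.52, 0.21),
    "c": (0.80, 0.30, 0.05),
    "d": (0.82, 0.31, 0.06),
    "e": (0.45, 0.90, 0.60),
}

clusters = []
assignment = {
    doc_id: assign(doc_id, feats, clusters, threshold=0.1)
    for doc_id, feats in docs.items()
}

# A report arriving later is matched against the existing cluster set.
new_idx = assign("f", (0.11, 0.49, 0.20), clusters, threshold=0.1)
```

Under these assumptions, sub-clusters could be obtained by running the same routine again inside each cluster with a tighter threshold or with author-specific features, and a label assigned to one member of a cluster could be copied to the other members.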
Research center :
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SEDAN - Service and Data Management in Distributed Systems; ULHPC - University of Luxembourg: High Performance Computing
Disciplines :
Computer science
Author, co-author :
BLANCO, Braulio ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SEDAN
BRORSSON, Mats Håkan ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SEDAN