Auto-clustering of Financial Reports Based on Formatting Style and Author’s Fingerprint

BLANCO, Braulio; BRORSSON, Mats Håkan; ZURAD, Maciej

doi:10.1007/978-3-031-23633-4_9

Paper published in a book (Scientific congresses, symposiums and conference proceedings)

2023 • In Koprinska, Irena (Ed.) Machine Learning and Principles and Practice of Knowledge Discovery in Databases - International Workshops of ECML PKDD 2022, Proceedings

Peer reviewed

Permalink
https://hdl.handle.net/10993/57026

Files

Full Text

001_FeatureBased_Autoclustering.pdf

Author postprint (736.93 kB)

Details

Keywords :

Clustering; Financial reports; Machine learning; NLP; Unstructured data; Clusterings; Content information; Labelings; Layout information; Machine learning models; Machine-learning; New clustering algorithms; Sub-clusters; Computer Science (all); Mathematics (all)

Abstract :

We present a new algorithm for clustering financial reports based on their formatting and style. The algorithm uses layout and content information to automatically generate as many clusters as needed. This reduces the effort of labeling the reports in order to train text-based machine learning models that extract person or company names, addresses, financial categories, etc. The algorithm also produces a set of sub-clusters inside each cluster, where each sub-cluster corresponds to a set of reports made by the same author (person or firm). The sub-cluster information allows us to track changes of author over time. We applied the algorithm to a dataset of over 38,000 financial reports (the last Annual Account presented by each company) from the Luxembourg Business Registers (LBR) and found 2,165 clusters containing between 2 and 850 documents each, with a median of 4 and an average of 14. When adding 2,500 new documents (previous annual accounts presented by companies) to the existing cluster set, we found that 67.3% of the financial reports were placed in the correct cluster and sub-cluster. Of the remaining documents, 65% were placed in a different sub-cluster because the company changed the formatting style, which is expected and correct behavior. Finally, by labeling 11% of the entire dataset, we can replicate these labels to up to 72% of the dataset while keeping a high feature coverage.
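
The two-level grouping described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm, whose features and matching rules are not given on this page. It assumes, purely for illustration, that each report has already been reduced to a hypothetical `layout` dictionary and a hypothetical `fingerprint` string, and that reports with identical layout values belong to the same cluster. The field names and sample data are invented.

```python
from collections import defaultdict


def cluster_reports(reports):
    """Group reports by layout signature, then by author fingerprint.

    Assumption: exact matching on hypothetical features. No cluster count
    is fixed in advance; a new cluster appears for each new layout.
    """
    clusters = defaultdict(lambda: defaultdict(list))
    for report in reports:
        # A sorted tuple of (feature, value) pairs is a hashable layout key.
        layout_key = tuple(sorted(report["layout"].items()))
        clusters[layout_key][report["fingerprint"]].append(report["id"])
    # Return plain dicts so that lookups never create empty entries.
    return {key: dict(subs) for key, subs in clusters.items()}


def propagate_labels(clusters, labeled):
    """Copy a label from one labeled report to every report in its cluster."""
    result = {}
    for subclusters in clusters.values():
        members = [doc for docs in subclusters.values() for doc in docs]
        known = [labeled[doc] for doc in members if doc in labeled]
        if known:
            for doc in members:
                result[doc] = known[0]
    return result


# Invented sample data: ids, layouts and fingerprints are hypothetical.
reports = [
    {"id": "A-2020", "layout": {"columns": 2, "font": "Arial"}, "fingerprint": "firm-x"},
    {"id": "B-2020", "layout": {"columns": 2, "font": "Arial"}, "fingerprint": "firm-y"},
    {"id": "C-2020", "layout": {"columns": 1, "font": "Times"}, "fingerprint": "firm-x"},
    {"id": "A-2019", "layout": {"columns": 2, "font": "Arial"}, "fingerprint": "firm-x"},
]

clusters = cluster_reports(reports)
labels = propagate_labels(clusters, {"A-2020": "template-1"})
```

In this toy run, labeling the single report `A-2020` is enough to label the three reports that share its layout, which mirrors the idea of labeling a small fraction of the dataset and replicating the labels across clusters. A real system would need approximate rather than exact matching of features.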

Research center :

Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SEDAN - Service and Data Management in Distributed Systems
ULHPC - University of Luxembourg: High Performance Computing
NCER-FT - FinTech National Centre of Excellence in Research

Disciplines :

Computer science

Author, co-author :

BLANCO, Braulio ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SEDAN

BRORSSON, Mats Håkan ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SEDAN

ZURAD, Maciej ; Yoba S.A., Esch-sur-Alzette, Luxembourg

External co-authors :

yes

Language :

English

Publication date :

January 2023

Event name :

ECML PKDD 2022

Event place :

Grenoble, France

Event date :

19-09-2022 to 23-09-2022

Audience :

International

Main work title :

Machine Learning and Principles and Practice of Knowledge Discovery in Databases - International Workshops of ECML PKDD 2022, Proceedings

Editor :

Koprinska, Irena

Publisher :

Springer Science and Business Media Deutschland GmbH

ISBN/EAN :

978-3-031-23632-7

Focus Area :

Finance

Development Goals :

9. Industry, innovation and infrastructure

Additional URL :

https://link.springer.com/content/pdf/10.1007/978-3-031-23633-4_9

FNR Project :

FNR15403349 - SME Credit Risk Platform, 2020 (01/04/2021-31/03/2024) - Radu State

Name of the research project :

U-AGR-7012 - BRIDGES2020/IS/15403349/SCRiPT_Yoba Cont (01/04/2021 - 31/03/2024) - BRORSSON Mats Håkan

Funding text :

This work has been partly funded by the Luxembourg National Research Fund (FNR) under contract number 15403349.

Available on ORBilu :

since 29 September 2023

Statistics

Number of views

152 (15 by Unilu)

Number of downloads

129 (1 by Unilu)
