Paper published in a book (Scientific congresses, symposiums and conference proceedings)
Auto-clustering of Financial Reports Based on Formatting Style and Author’s Fingerprint
BLANCO, Braulio; BRORSSON, Mats Håkan; ZURAD, Maciej
2023. In Koprinska, Irena (Ed.) Machine Learning and Principles and Practice of Knowledge Discovery in Databases - International Workshops of ECML PKDD 2022, Proceedings
Peer reviewed
 

Files


Full Text
001_FeatureBased_Autoclustering.pdf
Author postprint (736.93 kB)



Details



Keywords :
Clustering; Financial reports; Machine learning; NLP; Unstructured data; Clusterings; Content information; Labelings; Layout information; Machine learning models; Machine-learning; New clustering algorithms; Sub-clusters; Computer Science (all); Mathematics (all)
Abstract :
[en] We present a new clustering algorithm for financial reports that is based on the reports’ formatting and style. The algorithm uses layout and content information to automatically generate as many clusters as needed. This allows us to reduce the effort of labeling the reports in order to train text-based machine learning models for extracting person or company names, addresses, financial categories, etc. In addition, the algorithm produces a set of sub-clusters inside each cluster, where each sub-cluster corresponds to a set of reports made by the same author (person or firm). The information about sub-clusters allows us to evaluate how a company's report author changes over time. We have applied the algorithm to a dataset of over 38,000 financial reports (the last Annual Account presented by each company) from the Luxembourg Business Registers (LBR) and found 2,165 clusters containing between 2 and 850 documents each, with a median of 4 and an average of 14. When adding 2,500 new documents (previous annual accounts presented by companies) to the existing cluster set, we found that 67.3% of the financial reports were placed in the correct cluster and sub-cluster. Of the remaining documents, 65% were placed in a different sub-cluster because the company changed the formatting style, which is expected and correct behavior. Finally, by labeling 11% of the entire dataset, we can propagate these labels to up to 72% of the dataset while keeping a high feature coverage.
Research center :
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SEDAN - Service and Data Management in Distributed Systems
ULHPC - University of Luxembourg: High Performance Computing
Disciplines :
Computer science
Author, co-author :
BLANCO, Braulio; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SEDAN
BRORSSON, Mats Håkan; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SEDAN
ZURAD, Maciej; Yoba S.A., Esch-sur-Alzette, Luxembourg
External co-authors :
yes
Language :
English
Title :
Auto-clustering of Financial Reports Based on Formatting Style and Author’s Fingerprint
Publication date :
January 2023
Event name :
ECML PKDD 2022
Event place :
Grenoble, Fra
Event date :
19-09-2022 => 23-09-2022
Audience :
International
Main work title :
Machine Learning and Principles and Practice of Knowledge Discovery in Databases - International Workshops of ECML PKDD 2022, Proceedings
Editor :
Koprinska, Irena
Publisher :
Springer Science and Business Media Deutschland GmbH
ISBN/EAN :
978-3-03-123632-7
Peer reviewed :
Peer reviewed
Focus Area :
Finance
Development Goals :
9. Industry, innovation and infrastructure
FNR Project :
FNR15403349 - SME Credit Risk Platform, 2020 (01/04/2021-31/03/2024) - Radu State
Name of the research project :
U-AGR-7012 - BRIDGES2020/IS/15403349/SCRiPT_Yoba Cont (01/04/2021 - 31/03/2024) - BRORSSON Mats Hakan
Funding text :
This work has been partly funded by the Luxembourg National Research Fund (FNR) under contract number 15403349.
Available on ORBilu :
since 29 September 2023
