Paper published in a book (Colloquia, congresses, scientific conferences and proceedings)
Auto-clustering of Financial Reports Based on Formatting Style and Author’s Fingerprint
BLANCO, Braulio; BRORSSON, Mats Håkan; ZURAD, Maciej
2023. In Koprinska, Irena (Ed.), Machine Learning and Principles and Practice of Knowledge Discovery in Databases - International Workshops of ECML PKDD 2022, Proceedings
Peer reviewed
 

Details



Keywords:
Clustering; Financial reports; Machine learning; NLP; Unstructured data; Clusterings; Content information; Labelings; Layout information; Machine learning models; Machine-learning; New clustering algorithms; Sub-clusters; Computer Science (all); Mathematics (all)
Abstract:
[en] We present a new clustering algorithm for financial reports that is based on the reports’ formatting and style. The algorithm uses layout and content information to automatically generate as many clusters as needed. This reduces the effort of labeling reports in order to train text-based machine learning models that extract person or company names, addresses, financial categories, etc. The algorithm also produces a set of sub-clusters inside each cluster, where each sub-cluster corresponds to a set of reports made by the same author (person or firm). The sub-cluster information allows us to evaluate changes of author over time. We applied the algorithm to a dataset of over 38,000 financial reports (the last Annual Account presented by each company) from the Luxembourg Business Registers (LBR) and found 2,165 clusters containing between 2 and 850 documents, with a median of 4 and an average of 14. When adding 2,500 new documents (previous annual accounts presented by companies) to the existing cluster set, we found that 67.3% of the financial reports were placed in the correct cluster and sub-cluster. Of the remaining documents, 65% were placed in a different sub-cluster because the company changed the formatting style, which is expected and correct behavior. Finally, by labeling 11% of the entire dataset, we can replicate these labels to up to 72% of the dataset while keeping a high feature coverage.
Research centre:
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SEDAN - Service and Data Management in Distributed Systems
ULHPC - University of Luxembourg: High Performance Computing
NCER-FT - FinTech National Centre of Excellence in Research
Disciplines:
Computer science
Author, co-author:
BLANCO, Braulio; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SEDAN
BRORSSON, Mats Håkan; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SEDAN
ZURAD, Maciej; Yoba S.A., Esch-sur-Alzette, Luxembourg
External co-authors:
yes
Document language:
English
Title:
Auto-clustering of Financial Reports Based on Formatting Style and Author’s Fingerprint
Publication date:
January 2023
Event name:
ECML PKDD 2022
Event location:
Grenoble, France
Event dates:
19-09-2022 to 23-09-2022
Event scope:
International
Title of the main work:
Machine Learning and Principles and Practice of Knowledge Discovery in Databases - International Workshops of ECML PKDD 2022, Proceedings
Editor:
Koprinska, Irena
Publisher:
Springer Science and Business Media Deutschland GmbH
ISBN/EAN:
978-3-03-123632-7
Peer reviewed:
Peer reviewed
Focus area:
Finance
Sustainable Development Goal (SDG):
9. Industry, innovation and infrastructure
FNR project:
FNR15403349 - Sme Credit Risk Platform, 2020 (01/04/2021-31/03/2024) - Radu State
Research project title:
U-AGR-7012 - BRIDGES2020/IS/15403349/SCRiPT_Yoba Cont (01/04/2021 - 31/03/2024) - BRORSSON Mats Hakan
Funding (details):
This work has been partly funded by the Luxembourg National Research Fund (FNR) under contract number 15403349.
Available on ORBilu:
since 29 September 2023
