Abstract :
[en] This thesis examines how various data sources impact the risk profiles of small and medium-sized enterprises (SMEs) in Luxembourg. Effective risk analysis and prediction are crucial for societal development, as they improve decisions about the allocation of financial resources. By identifying and managing risks, businesses can make informed investment choices, fostering growth and optimizing returns. This contributes not only to the success of individual enterprises but also to broader economic stability and development. Most current risk prediction methods rely on financial statements, ratios, and other numerical data. Some researchers also analyze short texts, such as tweets, news articles, and headlines. Additionally, certain studies focus on specific sections of annual reports that discuss risk evaluation. However, these approaches are often limited to large companies.
Financial data is often seen as providing a limited perspective on a company's risk level, so additional variables that influence risk need to be considered. Many factors within a business can contribute to risk. In this work, based on the data that can be collected, extracted, and generated, potential risk information is derived from text-based insights found in annual reports, from people networks, and from geographic location. The proposed multidimensional risk model incorporates diverse information from reliable sources, such as the Luxembourgish Business Registry (LBR).
The work is presented following a data pipeline process. Information is extracted from data provided by the partner company, with textual content obtained from various official company documents through OCR and PDF reading tools. Relationships between companies, audit firms, auditors, and notaries are built from information extracted from these textual sources and from additional data sources. The proposed Long-Text BERT model is applied to predict bankruptcy risk from the annexes of annual accounts and to categorize pages for subsequent information extraction with a fine-tuned GPT-based model. With a proposed auto-clustering algorithm, clusters of hidden accountants or consultancy firms are identified and added to the network of companies and people. Geolocation is performed using addresses found through information extraction and those registered in the company profile within the LBR, from which latitude and longitude coordinates are obtained. This information is integrated into a graph network, where relationships between companies are analyzed to identify various risk factors and complement the text-based risk assessment.
As a first outcome of this thesis, a dataset containing both financial and non-financial information was created. The existing data was enriched using NLP tools, and a network of companies and individuals was established. Additionally, valuable insights were extracted from the textual information, achieving approximately 80% precision in risk prediction based solely on the textual data from financial annexes. Companies were also clustered based on the hidden accountant concept. Information from the various data sources was integrated using graphs to calculate dimensional risk for each company. An initial user interface has been proposed to enable users to navigate and explore part of the data more effectively.
In conclusion, this thesis developed a data pipeline to process information on Luxembourgish SMEs, leveraging publicly available information. The data was enriched using advanced Machine Learning and Deep Learning techniques to assess company risk from multiple dimensions. This approach provides decision-makers with deeper insights, enabling more informed and strategic decisions. The findings suggest that these models can be adapted for application in other countries or scaled to analyze larger enterprises. Furthermore, the analysis could be significantly enhanced by integrating additional data sources, such as social networks, and by employing more sophisticated methods, such as Graph Neural Networks, for data integration.
Institution :
Unilu - University of Luxembourg [Doctoral School in Science and Engineering], LUXEMBOURG, Luxembourg
Name of the research project :
U-AGR-7012 - BRIDGES2020/IS/15403349/SCRiPT_Yoba Cont - BRORSSON Mats Hakan