Abstract:
Bankruptcy prediction is an important research area that heavily relies on
data science. It aims to help investors, managers, and regulators better
understand the operational status of corporations and predict potential
financial risks in advance. To improve prediction, researchers and
practitioners have begun to use many types of data, ranging from traditional
financial indicators to unstructured data, to build and optimize bankruptcy
forecasting models. Over time, both the data themselves and the methods used
to structure, clean, and analyze them have improved. With the aid of advanced
analytical techniques that deploy machine learning and deep learning
algorithms, bankruptcy assessment has become more accurate. However, due
to the sensitivity of financial data, the scarcity of valid public datasets
remains a key bottleneck for the rapid modeling and evaluation of machine
learning algorithms for targeted tasks. This study therefore introduces a
taxonomy of datasets for bankruptcy research, and summarizes their
characteristics. This paper also proposes a set of metrics to measure the
quality and informativeness of public datasets. The taxonomy, coupled with
the informativeness measure, aims to help researchers and practitioners
develop applications for various aspects of credit assessment and decision
making by pointing them to appropriate datasets for their studies.