Exploiting Prototypical Explanations for Undersampling Imbalanced Datasets

Among the reported solutions to the class imbalance problem, undersampling approaches, which remove samples deemed insignificant from the majority class, are quite prevalent. However, undersampling may discard significant patterns in the dataset. A prototype, which is always an actual sample from the data, represents a group of samples in the dataset. Our hypothesis is that prototypes can restore the significant patterns discarded by undersampling methods and thereby help improve model performance. To test this hypothesis, we combine prototypes with undersampling methods in the machine learning pipeline. We show that there is a statistically significant difference between the AUPR and AUROC results of the undersampling methods alone and those of our approach.
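The idea in the abstract, undersampling the majority class and then adding prototypes back, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract names neither the undersampling method nor the prototype-selection method, so plain random undersampling and a greedy k-medoid-style selection stand in for them here, and the function names `select_prototypes` and `undersample_with_prototypes` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_prototypes(samples, k):
    """Greedy k-medoid-style selection (a stand-in, not the paper's method).

    Returns indices of up to k actual samples, each chosen to minimise the
    total squared distance from every sample to its closest chosen sample.
    """
    n = len(samples)
    nearest = [float("inf")] * n  # distance to the closest prototype so far
    chosen = []
    for _ in range(min(k, n)):
        best_idx, best_cost = None, float("inf")
        for c in range(n):
            if c in chosen:
                continue
            cost = sum(
                min(nearest[i], sq_dist(samples[i], samples[c])) for i in range(n)
            )
            if cost < best_cost:
                best_idx, best_cost = c, cost
        chosen.append(best_idx)
        nearest = [
            min(nearest[i], sq_dist(samples[i], samples[best_idx])) for i in range(n)
        ]
    return chosen


def undersample_with_prototypes(X, y, majority_label, n_keep, n_prototypes, seed=0):
    """Randomly undersample the majority class, then add majority prototypes back."""
    rng = random.Random(seed)
    maj = [i for i, label in enumerate(y) if label == majority_label]
    rest = [i for i, label in enumerate(y) if label != majority_label]
    # Step 1: random undersampling of the majority class (assumed method).
    kept = set(rng.sample(maj, min(n_keep, len(maj))))
    # Step 2: prototypes are real majority samples, so they can be re-added as-is.
    protos = select_prototypes([X[i] for i in maj], n_prototypes)
    kept.update(maj[p] for p in protos)
    idx = sorted(rest + list(kept))
    return [X[i] for i in idx], [y[i] for i in idx]


# Toy data: two majority clusters (label 0) and two minority samples (label 1).
X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5), (5, 6)]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_res, y_res = undersample_with_prototypes(
    X, y, majority_label=0, n_keep=2, n_prototypes=2
)
```

Because prototypes are actual samples, re-adding them never introduces synthetic points. The resampled majority class holds between `n_keep` and `n_keep + n_prototypes` samples, depending on how many prototypes the random draw had already kept.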

Disciplines :

Computer science

Author, co-author :

ARSLAN, Yusuf ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX

ALLIX, Kevin ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX

Lefebvre, Clément

Boytsov, Andrey

BISSYANDE, Tegawendé François D Assise ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX

KLEIN, Jacques ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX

Language :

English

Title :

Exploiting Prototypical Explanations for Undersampling Imbalanced Datasets

Publication date :

2022

Event name :

21st IEEE International Conference on Machine Learning and Applications

Event date :

from 12-12-2022 to 14-12-2022

Audience :

International

Main work title :

2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA)

Pages :

1449-1454

Peer reviewed :

Peer reviewed

FnR Project :

FNR13778825 - Explainable Machine Learning In Fintech, 2019 (01/07/2019-30/06/2022) - Jacques Klein

Funders :

FNR - Fonds National de la Recherche

Available on ORBilu :

since 13 February 2023
