Among the reported solutions to the class imbalance problem, undersampling approaches, which remove insignificant samples from the majority class, are quite prevalent. However, undersampling may also discard significant patterns in the dataset. A prototype, which is always an actual sample from the data, represents a group of samples in the dataset. Our hypothesis is that prototypes can restore the significant patterns discarded by undersampling methods and thereby improve model performance. To test this intuition, we combine prototypes with undersampling methods in the machine learning pipeline. We show that there is a statistically significant difference between the AUPR and AUROC results of plain undersampling methods and of our approach.
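The following is a minimal sketch of the pipeline described above, assuming scikit-learn and imbalanced-learn. It uses nearest-to-centroid majority-class samples as a simple stand-in for a dedicated prototype-selection method; the synthetic dataset, the classifier, and all variable names are illustrative assumptions, not the exact setup of the paper.

# Sketch: re-inject prototypes (actual samples) after undersampling and
# compare AUPR / AUROC against undersampling alone.  Illustrative only.
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score,
                             pairwise_distances_argmin, roc_auc_score)
from sklearn.model_selection import train_test_split

# Imbalanced binary dataset (roughly 5% positives).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) Plain random undersampling of the majority class.
rus = RandomUnderSampler(random_state=0)
X_rus, y_rus = rus.fit_resample(X_tr, y_tr)

# 2) Prototypes: actual majority-class samples closest to k-means centroids,
#    intended to restore patterns the undersampling step may have discarded.
X_maj = X_tr[y_tr == 0]
centroids = KMeans(n_clusters=20, n_init=10,
                   random_state=0).fit(X_maj).cluster_centers_
proto_idx = pairwise_distances_argmin(centroids, X_maj)
X_proto = X_maj[proto_idx]
y_proto = np.zeros(len(proto_idx), dtype=int)

# 3) Augment the undersampled training set with the prototypes.
X_aug = np.vstack([X_rus, X_proto])
y_aug = np.concatenate([y_rus, y_proto])

# 4) Compare AUPR / AUROC of both variants on the held-out test set.
for name, (Xs, ys) in {"undersampling only": (X_rus, y_rus),
                       "undersampling + prototypes": (X_aug, y_aug)}.items():
    clf = LogisticRegression(max_iter=1000).fit(Xs, ys)
    scores = clf.predict_proba(X_te)[:, 1]
    print(f"{name}: AUPR={average_precision_score(y_te, scores):.3f} "
          f"AUROC={roc_auc_score(y_te, scores):.3f}")

In the paper's setting, the comparison above would be repeated over multiple runs or datasets so that the AUPR/AUROC differences can be tested for statistical significance.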
Disciplines:
Computer science
Author, co-author:
ARSLAN, Yusuf ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
ALLIX, Kevin ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX