The underlying paradigm of big data-driven machine learning reflects the desire to derive better conclusions simply by analyzing more data, without the need to consider theory and models. Is having more data always helpful? In 1936, The Literary Digest collected 2.3 million completed questionnaires to predict the outcome of that year's US presidential election. This big data prediction proved to be entirely wrong, whereas George Gallup needed only 3,000 handpicked people to make an accurate prediction. In general, biases occur in machine learning whenever the distributions of the training set and the test set differ. In this work, we provide a review of different sorts of biases in (big) data sets in machine learning. We provide definitions and discussions of the most commonly appearing biases in machine learning, class imbalance and covariate shift, and show how these biases can be quantified and corrected. This work is an introductory text that aims to make both researchers and practitioners more aware of this topic and thus to help them derive more reliable models for their learning problems.
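The two biases named in the abstract can be illustrated in a few lines. The following is a minimal sketch on synthetic data (all distributions and parameters below are illustrative assumptions, not taken from the paper): class imbalance is measured as the minority-class proportion, covariate shift is quantified with a histogram-based estimate of the Kullback-Leibler divergence between the training and test feature distributions, and the shift is corrected with importance weights w(x) ≈ p_test(x) / p_train(x), which reweight the training loss toward the test distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Class imbalance: a binary training set with ~5% positives. ---
y_train = rng.choice([0, 1], size=1000, p=[0.95, 0.05])
imbalance_ratio = y_train.mean()  # proportion of the minority class

# --- Covariate shift: the feature distribution differs between sets. ---
x_train = rng.normal(0.0, 1.0, size=10_000)  # p_train(x) = N(0, 1)
x_test = rng.normal(0.5, 1.0, size=10_000)   # p_test(x)  = N(0.5, 1)

# Quantify the shift via KL(p_test || p_train) on histogram estimates.
bins = np.linspace(-5.0, 6.0, 50)            # 49 equal-width bins
p, _ = np.histogram(x_train, bins=bins, density=True)
q, _ = np.histogram(x_test, bins=bins, density=True)
eps = 1e-12                                   # avoid log(0) / division by 0
bin_width = np.diff(bins)[0]
kl = np.sum(q * np.log((q + eps) / (p + eps))) * bin_width

# Correct the shift: per-example importance weights for the training set,
# w(x) = p_test(x) / p_train(x), looked up via each point's histogram bin.
idx = np.digitize(x_train, bins[1:-1])        # bin index in [0, 48]
weights = (q + eps)[idx] / (p + eps)[idx]

print(f"imbalance ratio: {imbalance_ratio:.3f}")
print(f"estimated KL divergence: {kl:.3f}")   # analytic value here is 0.125
```

A learner would then use `weights` as per-sample weights when fitting on the training set, so that the weighted empirical risk approximates the risk under the test distribution.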
Disciplines :
Computer science
Author, co-author :
GLAUNER, Patrick ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT)
Valtchev, Petko; University of Quebec in Montreal
STATE, Radu ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT)
External co-authors :
yes
Language :
English
Title :
Impact of Biases in Big Data
Publication date :
2018
Event name :
26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2018)
Event date :
25-04-2018 to 27-04-2018
Audience :
International
Main work title :
Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2018)