Paper published in a book (Colloquia, congresses, scientific conferences and proceedings)
Debugging machine learning pipelines
DE PAULA LOURENCO, Raoni; Freire, Juliana; Shasha, Dennis
2019 • In Proceedings of the 3rd Workshop on Data Management for End-To-End Machine Learning, DEEM 2019 - In conjunction with the 2019 ACM SIGMOD/PODS Conference
Error prones; Experimental evaluation; New approaches; Reproducibilities; Root cause; Root cause of failures; Source codes; State of the art; Software; Information Systems; Computer Science - Learning; Computer Science - Databases; Statistics - Machine Learning
Abstract:
[en] Machine learning tasks entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous or uninformative outputs, the pipeline may fail or produce incorrect results. Inferring the root cause of failures and unexpected behavior is challenging, usually requiring much human thought, and is both time consuming and error prone. We propose a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures. Through a detailed experimental evaluation, we assess the cost, precision, and recall of our approach compared to the state of the art. Our source code and experimental data will be available for reproducibility and enhancement.
Disciplines:
Computer science
Author, co-author:
DE PAULA LOURENCO, Raoni ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal ; NYU - New York University [US-NY]
Freire, Juliana; New York University, United States
Shasha, Dennis; New York University, United States
External co-authors:
yes
Document language:
English
Title:
Debugging machine learning pipelines
Publication date:
30 June 2019
Event name:
Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning
Event place:
Amsterdam, Netherlands
Event date:
30-06-2019
Main work title:
Proceedings of the 3rd Workshop on Data Management for End-To-End Machine Learning, DEEM 2019 - In conjunction with the 2019 ACM SIGMOD/PODS Conference