DataPrism: Exposing Disconnect between Data and Systems

causal testing; data profiles; debugging; root-cause identification; Central component; Data driven; Driven system; Health monitoring system; Property; Root cause; Software; Information Systems

Abstract :

As data is a central component of many modern systems, the cause of a system malfunction may reside in the data and, specifically, in particular properties of the data. For example, a health-monitoring system that is designed under the assumption that weight is reported in lbs will malfunction when it encounters weight reported in kilograms. Like software debugging, which aims to find bugs in the source code or runtime conditions, our goal is to debug data to identify potential sources of disconnect between the assumptions about some data and the systems that operate on that data. We propose DataPrism, a framework to identify data properties (profiles) that are the root causes of performance degradation or failure of a data-driven system. Such identification is necessary to repair data and resolve the disconnect between data and systems. Our technique is based on causal reasoning through interventions: when a system malfunctions for a dataset, DataPrism alters the data profiles and observes changes in the system's behavior due to the alteration. Unlike statistical observational analysis that reports mere correlations, DataPrism reports causally verified root causes of the system malfunction, expressed in terms of data profiles. We empirically evaluate DataPrism on seven real-world and several synthetic data-driven systems that fail on certain datasets for a diverse set of reasons. In all cases, DataPrism identifies the root causes precisely while requiring orders of magnitude fewer interventions than prior techniques.
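The intervention idea in the abstract can be illustrated with a minimal sketch. It is not DataPrism's algorithm: it naively tries one intervention per candidate profile, whereas the paper reports needing far fewer interventions than prior techniques. The toy system, the malfunction score, the threshold, and all names (`system_malfunction`, `kg_to_lbs`, `clip_age`, `find_root_causes`) are assumptions made up for this example.

```python
from statistics import mean

def system_malfunction(rows):
    """Hypothetical system: flags a record as underweight assuming lbs.
    The fraction of flagged records serves as a malfunction score."""
    return mean(1.0 if r["weight"] < 90 else 0.0 for r in rows)

def kg_to_lbs(rows):
    # Intervention: alter the data so the weight profile matches the lbs assumption.
    return [{**r, "weight": r["weight"] * 2.20462} for r in rows]

def clip_age(rows):
    # Intervention: alter the data so the age profile stays within [0, 120].
    return [{**r, "age": min(max(r["age"], 0), 120)} for r in rows]

# Candidate profiles, each paired with an intervention that enforces it.
INTERVENTIONS = {
    "weight range matches lbs": kg_to_lbs,
    "age within [0, 120]": clip_age,
}

def find_root_causes(rows, system, interventions, threshold=0.1):
    """Report a profile as a cause if enforcing it alone removes the malfunction."""
    baseline = system(rows)
    causes = []
    for profile, alter in interventions.items():
        score = system(alter(rows))
        if baseline > threshold and score <= threshold:
            causes.append(profile)
    return causes

# Failing dataset: weights were recorded in kilograms.
failing = [{"weight": w, "age": a} for w, a in [(70, 34), (82, 51), (65, 28), (95, 45)]]
causes = find_root_causes(failing, system_malfunction, INTERVENTIONS)
print(causes)
```

Because each verdict comes from altering the data and re-running the system, a reported profile is backed by an observed change in behavior rather than by a correlation alone.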

Disciplines :

Computer science

Author, co-author :

Galhotra, Sainyam; University of Chicago, Chicago, United States

Fariha, Anna; Microsoft, Seattle, United States

DE PAULA LOURENCO, Raoni ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal ; NYU - New York University [US-NY]

Freire, Juliana; New York University, New York, United States

Meliou, Alexandra; University of Massachusetts Amherst, Amherst, United States

Srivastava, Divesh; AT&T Chief Data Office, Bedminster, United States

External co-authors :

yes

Language :

English

Title :

DataPrism: Exposing Disconnect between Data and Systems

Publication date :

10 June 2022

Event name :

Proceedings of the 2022 International Conference on Management of Data

Event place :

Philadelphia, USA

Event date :

12-06-2022 => 17-06-2022

Main work title :

SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data

Publisher :

Association for Computing Machinery

ISBN/EAN :

978-1-4503-9249-5

Peer reviewed :

Peer reviewed

Additional URL :

https://dl.acm.org/doi/pdf/10.1145/3514221.3517864

Funders :

ACM SIGMOD

Available on ORBilu :

since 22 November 2023
