RALEIGH, Thomas; University of Luxembourg, Faculty of Humanities, Education and Social Sciences (FHSE), Department of Humanities (DHUM), Philosophy
KNOKS, Aleks; University of Luxembourg, Faculty of Humanities, Education and Social Sciences (FHSE), Department of Humanities (DHUM), Philosophy
Adadi, A., & Berrada, M. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6, 52138–52160.
Babic, B., & Cohen, I. The algorithmic explainability “Bait and Switch”. Minnesota Law Review, 108, 857–909.
Babic, B., Gerke, S., Evgeniou, T., & Cohen, I. Beware explanations from AI in healthcare. Science, 373, 284–286.
Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., & Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7), 1–46.
Bereska, L., & Gavves, E. (2024). Mechanistic interpretability for AI safety: A review. Retrieved September 3, 2024, from https://arxiv.org/html/2404.14082v2
Boge, F. Two dimensions of opacity and the deep learning predicament. Minds and Machines, 32(1), 43–75.
Boge, F., & Mosig, A. Put it to the test: Getting serious about explanation in Explainable Artificial Intelligence. Minds and Machines, 35(26), 1–28.
Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Tamkin, A., Nguyen, K., McLean, B., Burke, J., Hume, T., Carter, S., Henighan, T., & Olah, C. (2023). Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread. Retrieved September 3, 2024, from https://transformer-circuits.pub/2023/monosemantic-features/index.html
Browning, J., & Theunissen, M. Putting explainable AI in context: Institutional explanations for medical AI. Ethics and Information Technology, 24(2), 1–10.
Burrell, J. How the machine “thinks”: Understanding opacity in machine learning algorithms. Big Data & Society, 3(1), 1–12.
Cao, R. Putting representations to use. Synthese, 200(2), 151.
Cappelen, H., & Dever, J. Making AI intelligible: Philosophical foundations. Oxford University Press.
Chalmers, D. Connectionism and compositionality: Why Fodor and Pylyshyn were wrong. Philosophical Psychology, 6(3), 305–319.
Creel, K. Transparency in complex computational systems. Philosophy of Science, 87(4), 568–589.
Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4), 303–314.
Davies, X., Nadeau, M., Prakash, N., Rott Shaham, T., & Bau, D. (2023). Discovering variable binding circuitry with desiderata. In Proceedings of the ICML 2023 Workshop on Challenges of Deploying Generative AI.
Durairaj, J., Waterhouse, A., Mets, T., Brodiazhenko, T., Abdullah, M., Studer, G., Tauriello, G., Akdel, M., Andreeva, A., Bateman, A., Tenson, T., Hauryliuk, V., Schwede, T., & Pereira, J. Uncovering new families and folds in the natural protein universe. Nature, 622, 646–653.
Durán, J., & Jongsma, K. Who is afraid of black box algorithms? On the epistemological and ethical basis of trust in medical AI. Journal of Medical Ethics, 47, 329–335.
Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., & Olah, C. (2022). Toy models of superposition. Transformer Circuits Thread. Retrieved September 3, 2024, from https://arxiv.org/pdf/2209.10652
Favela, L., & Machery, E. Investigating the concept of representation in the neural and psychological sciences. Frontiers in Psychology, 14, 1165622. https://doi.org/10.3389/fpsyg.2023.1165622
Fleisher, W. Understanding, idealization, and explainable AI. Episteme, 19(4), 534–560.
Fodor, J., & Pylyshyn, Z. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1–2), 3–71.
Fong, R., & Vedaldi, A. (2017). Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (pp. 1–9).
Gingerich, O. From Copernicus to Kepler: Heliocentrism as model and as reality. Proceedings of the American Philosophical Society, 117(6), 513–522.
Grindrod, J. Large Language Models and linguistic intentionality. Synthese, 204(2), 71.
Grzankowski, A. (2024). Real sparks of artificial intelligence and the importance of inner interpretability. Inquiry, 1–27.
Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., & Pedreschi, D. A survey of methods for explaining black box models. ACM Computing Surveys, 51(5), 1–42.
Hanin, B., & Sellke, M. (2018). Approximating continuous functions by ReLU nets of minimal width. Retrieved September 3, 2024, from https://arxiv.org/pdf/1710.11278
Hanna, M., Liu, O., & Variengien, A. (2023). How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In NeurIPS 2023. Retrieved September 3, 2024, from https://openreview.net/pdf?id=p4PckNQR8k
Hatna, E., & Benenson, I. The Schelling model of ethnic residential dynamics: Beyond the integrated-segregated dichotomy of patterns. Journal of Artificial Societies and Social Simulation, 15(1), 1–6.
Heimersheim, S., & Janiak, J. (2023). A circuit for Python docstrings in a 4-layer attention-only transformer. Retrieved August 15, 2024, from https://www.lesswrong.com/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only
Hernandez, E., Sen Sharma, A., Haklay, T., Meng, K., Wattenberg, M., Andreas, J., Belinkov, Y., & Bau, D. (2023). Linearity of relation decoding in transformer language models. In Proceedings of the 12th International Conference on Learning Representations (ICLR 2024). Retrieved September 3, 2024, from https://openreview.net/pdf?id=w7LU2s14kE
Hornik, K., Stinchcombe, M., & White, H. Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., & Viégas, F. (2018). Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In Proceedings of the 35th International Conference on Machine Learning (ICML 2018) (pp. 2668–2677).
Kuhn, T. The structure of scientific revolutions. University of Chicago Press.
Li, K., Hopkins, A., Bau, D., Viégas, F., Pfister, H., & Wattenberg, M. (2023). Emergent world representations: Exploring a sequence model trained on a synthetic task. In Proceedings of the 11th International Conference on Learning Representations (ICLR 2023). Retrieved September 3, 2024, from https://openreview.net/pdf?id=DeG07_TcZvT
Lindsey, J., Gurnee, W., Ameisen, E., Chen, B., Pearce, A., Turner, N. L., Citro, C., Abrahams, D., Carter, S., Hosmer, B., & Marcus, J. (2025). On the biology of a large language model. Transformer Circuits Thread.
Lundberg, S. M., & Lee, S. (2017). A unified approach to interpreting model predictions. In NeurIPS (pp. 4765–4774).
Mandik, P. Varieties of representation in evolved and embodied neural networks. Biology and Philosophy, 18(1), 95–130.
Marr, D. Artificial intelligence: A personal view. Artificial Intelligence, 9, 37–48.
Marr, D. Vision: A computational investigation into the human representation and processing of visual information. W. H. Freeman and Company.
Molnar, C. (2021). Interpretable machine learning: A guide for making black box models explainable. Retrieved September 3, 2024, from https://originalstatic.aminer.cn/misc/pdf/Molnar-interpretable-machine-learning_compressed.pdf
Montavon, G., Bach, S., Binder, A., Samek, W., & Müller, K.-R. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65, 211–222.
Nanay, B. Entity realism about mental representations. Erkenntnis, 87(1), 75–91.
Nanda, N., Chan, L., Lieberum, T., & Steinhardt, J. (2023). Progress measures for grokking via mechanistic interpretability. In Proceedings of the 11th International Conference on Learning Representations (ICLR 2023). Retrieved September 3, 2024, from https://openreview.net/forum?id=9XFSbDPmdW
Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., & Carter, S. (2020). Zoom in: An introduction to circuits. Distill. Retrieved August 15, 2024, from https://distill.pub/2020/circuits/zoom-in
O’Mahony, L., Andrearczyk, V., Müller, H., & Graziani, M. (2023). Disentangling neuron representations with concept vectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (pp. 3770–3775).
Páez, A. The pragmatic turn in explainable artificial intelligence (XAI). Minds and Machines, 29, 441–459.
Quirke, P., & Barez, F. (2024). Understanding addition in transformers. In Proceedings of the 12th International Conference on Learning Representations (ICLR 2024). Retrieved September 3, 2024, from https://arxiv.org/pdf/2310.13121
Ratti, E., & Graves, M. Explainable machine learning practices: Opening another black box for reliable medical AI. AI and Ethics, 2, 801–814.
Räz, T., & Beisbart, C. The importance of understanding deep learning. Erkenntnis, 89(5), 1823–1840.
Ribeiro, M., Singh, S., & Guestrin, C. (2016). Why should I trust you? Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16) (pp. 1135–1144). ACM Press.
Ribeiro, M., Singh, S., & Guestrin, C. (2018). Anchors: High-precision model-agnostic explanations. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 30th Innovative Applications of Artificial Intelligence Conference, and 8th AAAI Symposium on Educational Advances in Artificial Intelligence (pp. 1527–1535). AAAI Press.
Rogers, T., & McKane, A. A unified framework for Schelling’s model of segregation. Journal of Statistical Mechanics: Theory and Experiment. https://doi.org/10.1088/1742-5468/2011/07/P07006
Rowbottom, D., Peden, W., & Curtis-Trudel, A. Does the no miracles argument apply to AI? Synthese, 203(173), 1–20.
Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1, 206–215.
Searle, J. Minds, brains and programs. Behavioral and Brain Sciences, 3(3), 417–457.
Schelling, T. Dynamic models of segregation. The Journal of Mathematical Sociology, 1(2), 143–186.
Sharkey, L. (2024). Sparsify: A mechanistic interpretability research agenda. AI Alignment Forum. Retrieved August 15, 2024, from https://www.alignmentforum.org/posts/64MizJXzyvrYpeKqm/sparsify-a-mechanistic-interpretability-research-agenda
Smolensky, P. Connectionism, constituency, and the language of thought. In B. M. Loewer & G. Rey (Eds.), Meaning in mind: Fodor and his critics. Blackwell.
Søgaard, A. On the opacity of deep neural networks. Canadian Journal of Philosophy, 53(3), 224–239.
Speith, T. (2022). A review of taxonomies of explainable artificial intelligence (XAI) methods. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT 2022) (pp. 2239–2250).
Sullivan, E. Understanding from machine learning models. The British Journal for the Philosophy of Science, 73(1), 109–133.
Sullivan, E. Inductive risk, understanding, and opaque machine learning models. Philosophy of Science, 89(5), 1065–1074.
Tan, J., & Zhang, Y. (2023). ExplainableFold: Understanding AlphaFold prediction with explainable AI. In Proceedings of the 29th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’23). ACM Press.
Thomson, E., & Piccinini, G. Neural representations observed. Minds and Machines, 28(1), 191–235.
Tigges, C., Hollingsworth, O., Geiger, A., & Nanda, N. (2023). Linear representations of sentiment in Large Language Models. Retrieved September 3, 2024, from https://arxiv.org/pdf/2310.15154
Titus, L. Does ChatGPT have semantic understanding? Cognitive Systems Research, 83, 101174.
Wachter, S., Mittelstadt, B., & Floridi, L. Transparent, explainable, and accountable AI for robotics. Science Robotics. https://doi.org/10.1126/scirobotics.aan6080
Wang, K., Variengien, A., Conmy, A., Shlegeris, B., & Steinhardt, J. (2023). Interpretability in the wild: A circuit for indirect object identification in GPT-2 small. In Proceedings of the 11th International Conference on Learning Representations (ICLR 2023). Retrieved September 3, 2024, from https://arxiv.org/pdf/2211.00593
Zednik, C. Solving the black box problem: A normative framework for explainable artificial intelligence. Philosophy & Technology, 34(2), 265–288.
Zerilli, J. Explaining machine learning decisions. Philosophy of Science, 89(1), 1–19.
Zintgraf, L., Cohen, T., Adel, T., & Welling, M. (2017). Visualizing deep neural network decisions: Prediction difference analysis. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017). Retrieved September 3, 2024, from https://openreview.net/pdf?id=BJ5UeU9xx