Which Neurons Nudge Normative Stance? Causal Tests and Mechanistic Evidence via Contrastive Last-Token Steering

LIGA, Davide; YU, Liuwen

doi:10.3233/faia251581

Contribution to collective works (Parts of books)

2025 • In Frontiers in Artificial Intelligence and Applications

Peer reviewed

Permalink
https://hdl.handle.net/10993/66856

Files

Full Text

FAIA-416-FAIA251581.pdf

Author postprint (261.47 kB)

All documents in ORBilu are protected by a user license.

Details

Keywords :

Large Language Models; Normative Alignment; Normative Reasoning; Mechanistic Interpretability

Abstract :

Normative stance underlies decisions in law, legal reasoning, policy, and safety-critical settings. A model’s judgment of what is permissible vs. impermissible often determines its downstream behavior. We study how to steer a language model’s normative stances at inference time by adding a tiny, contrastive perturbation to the last-token neural activation in late MLP layers (contrastive last-token steering). For each normative prompt, we construct a contrast direction by comparing its last-token activation to that of a minimally edited variant that implies a more permissive normative stance (e.g., “acceptable” rather than “wrong”). During generation, we add this vector at the last token; a single strength parameter α controls how strongly and in which direction we push the model’s stance (permissive vs. restrictive). Impact is measured as the change in a next-token logit margin between permissive and restrictive continuations. To avoid overclaiming, we calibrate a threshold τ on neutral controls (same layers, tempered strengths with |α|≤1) and count success only when the shift exceeds τ in the expected direction. We also assess specificity by verifying that, on neutral control prompts, steered outputs exactly match unsteered baselines. Beyond component-level tests, we probe neuron-level locality by steering only the top-k contrastive neurons (ranked by last-token contrast) and confirming reversibility on our test set: +α produces the shift and -α reverses it. The method is training-free, uses standard forward hooks, and we report pilot results on Llama-3-8B-Instruct.
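
The steps named in the abstract (contrast direction, top-k restriction, steering with strength α, logit-margin shift, threshold τ) can be illustrated with a small toy sketch. This is not the authors' implementation: it uses plain Python lists in place of last-token MLP activations, a linear readout in place of the model's unembedding, and every function name, vector, and number below is hypothetical. The rule used for τ (largest absolute shift on neutral controls) is likewise an assumption, since the abstract does not state how τ is computed. In a real model the addition would presumably happen inside a forward hook on the chosen layers.

```python
# Toy sketch of contrastive last-token steering. All names and values are
# hypothetical; lists stand in for last-token activations of one MLP layer.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrast_direction(h_base, h_variant):
    """Activation of the permissive variant minus activation of the original."""
    return [v - b for b, v in zip(h_base, h_variant)]

def top_k_mask(direction, k):
    """Zero out all but the k coordinates with the largest absolute contrast."""
    order = sorted(range(len(direction)), key=lambda i: abs(direction[i]), reverse=True)
    keep = set(order[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(direction)]

def steer(h, direction, alpha):
    """Add alpha times the contrast direction to the activation."""
    return [x + alpha * d for x, d in zip(h, direction)]

def logit_margin(h, w_permissive, w_restrictive):
    """Permissive logit minus restrictive logit under an assumed linear readout."""
    return dot(h, w_permissive) - dot(h, w_restrictive)

def calibrate_tau(neutral_shifts):
    """Assumed rule: tau is the largest absolute shift seen on neutral controls."""
    return max(abs(s) for s in neutral_shifts)

def is_success(shift, alpha, tau):
    """Count a success only if the shift exceeds tau in the direction of alpha."""
    return abs(shift) > tau and (shift > 0) == (alpha > 0)

# Hypothetical activations for a prompt and its minimally edited variant.
h_base = [0.2, -0.1, 0.4, 0.0]
h_variant = [0.5, -0.1, 0.1, 0.05]
# Hypothetical readout rows for a permissive and a restrictive continuation.
w_perm = [1.0, 0.0, -1.0, 0.0]
w_rest = [-1.0, 0.0, 1.0, 0.0]

direction = top_k_mask(contrast_direction(h_base, h_variant), k=2)
baseline = logit_margin(h_base, w_perm, w_rest)
shift_plus = logit_margin(steer(h_base, direction, +1.0), w_perm, w_rest) - baseline
shift_minus = logit_margin(steer(h_base, direction, -1.0), w_perm, w_rest) - baseline
tau = calibrate_tau([0.05, -0.10, 0.02])  # made-up shifts on neutral controls

print(is_success(shift_plus, +1.0, tau), is_success(shift_minus, -1.0, tau))
```

Because the toy readout is linear, the shift under -α is exactly the negative of the shift under +α; in an actual transformer the later layers are nonlinear, so reversibility is an empirical question, which is why the abstract reports it as a test.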

Disciplines :

Computer science

Author, co-author :

LIGA, Davide ; University of Luxembourg

YU, Liuwen ; University of Luxembourg

Language :

English

Title :

Which Neurons Nudge Normative Stance? Causal Tests and Mechanistic Evidence via Contrastive Last-Token Steering

Publication date :

02 December 2025

Main work title :

Frontiers in Artificial Intelligence and Applications

Publisher :

IOS Press

ISBN/EAN :

978-1-64368-638-7

Pages :

110 - 120

Peer reviewed :

Peer reviewed

Additional URL :

https://ebooks.iospress.nl/pdf/doi/10.3233/FAIA251581

Available on ORBilu :

since 15 December 2025

Statistics

Number of views

35 (2 by Unilu)

Number of downloads

11 (0 by Unilu)
