Contribution to collective works (Parts of books)
Which Neurons Nudge Normative Stance? Causal Tests and Mechanistic Evidence via Contrastive Last-Token Steering
LIGA, Davide; YU, Liuwen
2025, in Frontiers in Artificial Intelligence and Applications
Peer reviewed
 

Files


Full Text
FAIA-416-FAIA251581.pdf
Author postprint (261.47 kB)

Details



Keywords :
Large Language Models; Normative Alignment; Normative Reasoning; Mechanistic Interpretability
Abstract :
Normative stance underlies decisions in law, legal reasoning, policy, and safety-critical settings. A model’s judgment of what is permissible vs. impermissible often determines its downstream behavior. We study how to steer a language model’s normative stances at inference time by adding a tiny, contrastive perturbation to the last-token neural activation in late MLP layers (contrastive last-token steering). For each normative prompt, we construct a contrast direction by comparing its last-token activation to that of a minimally edited variant that implies a more permissive normative stance (e.g., “acceptable” rather than “wrong”). During generation, we add this vector at the last token; a single strength parameter α controls how strongly and in which direction we push the model’s stance (permissive vs. restrictive). Impact is measured as the change in a next-token logit margin between permissive and restrictive continuations. To avoid overclaiming, we calibrate a threshold τ on neutral controls (same layers, tempered strengths with |α| ≤ 1) and count success only when the shift exceeds τ in the expected direction. We also assess specificity by verifying that, on neutral control prompts, steered outputs exactly match unsteered baselines. Beyond component-level tests, we probe neuron-level locality by steering only the top-k contrastive neurons (ranked by last-token contrast) and confirming reversibility on our test set: +α produces the shift and -α reverses it. The method is training-free and uses standard forward hooks; we report pilot results on Llama-3-8B-Instruct.
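The steering recipe described in the abstract can be illustrated with a small toy sketch. This is not the authors' code: it replaces the language model with a plain linear readout over a four-dimensional "last-token activation", and every function name here (`contrast_direction`, `top_k_mask`, `steer`, `logit_margin`, `margin_shift`, `calibrate_tau`, `is_success`) is hypothetical. The abstract does not say how τ is computed from the neutral controls, so taking the largest absolute shift is an assumption made only for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def contrast_direction(h_base: Sequence[float], h_variant: Sequence[float]) -> Vector:
    """Permissive-variant activation minus base activation, per neuron."""
    return [v - b for b, v in zip(h_base, h_variant)]


def top_k_mask(direction: Sequence[float], k: int) -> Vector:
    """Keep the k neurons with the largest absolute contrast; zero the rest."""
    order = sorted(range(len(direction)), key=lambda i: abs(direction[i]), reverse=True)
    keep = set(order[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(direction)]


def steer(h_last: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Add alpha * direction to the last-token activation."""
    return [h + alpha * d for h, d in zip(h_last, direction)]


def logit_margin(h_last: Sequence[float], w_perm: Sequence[float], w_restr: Sequence[float]) -> float:
    """Toy next-token margin: permissive logit minus restrictive logit."""
    perm = sum(h * w for h, w in zip(h_last, w_perm))
    restr = sum(h * w for h, w in zip(h_last, w_restr))
    return perm - restr


def margin_shift(h_last, direction, alpha, w_perm, w_restr) -> float:
    """Change in the logit margin caused by steering."""
    steered = steer(h_last, direction, alpha)
    return logit_margin(steered, w_perm, w_restr) - logit_margin(h_last, w_perm, w_restr)


def calibrate_tau(neutral_shifts: Sequence[float]) -> float:
    """ASSUMPTION: tau is the largest absolute shift seen on neutral controls."""
    return max(abs(s) for s in neutral_shifts)


def is_success(shift: float, alpha: float, tau: float) -> bool:
    """Success only if the shift exceeds tau in the direction implied by alpha."""
    expected_sign = 1.0 if alpha > 0 else -1.0
    return shift * expected_sign > tau


# Toy data (made up for illustration, not taken from the paper).
h_base = [0.5, -1.0, 0.25, 2.0]      # activation for the original prompt
h_variant = [0.5, 0.0, 0.75, 2.0]    # activation for the more permissive edit
w_perm = [0.0, 1.0, 0.0, 0.5]        # readout row for a permissive token
w_restr = [0.0, -1.0, 0.0, 0.5]      # readout row for a restrictive token

direction = top_k_mask(contrast_direction(h_base, h_variant), k=1)
tau = calibrate_tau([0.1, -0.25, 0.2])   # shifts on hypothetical neutral controls

shift_pos = margin_shift(h_base, direction, +1.0, w_perm, w_restr)
shift_neg = margin_shift(h_base, direction, -1.0, w_perm, w_restr)
print(shift_pos, shift_neg, is_success(shift_pos, +1.0, tau), is_success(shift_neg, -1.0, tau))
```

In this toy the readout is linear, so +α and -α produce exactly opposite shifts; in a real transformer the later layers are nonlinear, so reversibility is something to be tested rather than assumed. With an actual model, the `steer` step would typically run inside a forward hook on the chosen MLP layers.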
Disciplines :
Computer science
Author, co-author :
LIGA, Davide; University of Luxembourg
YU, Liuwen; University of Luxembourg
External co-authors :
no
Language :
English
Title :
Which Neurons Nudge Normative Stance? Causal Tests and Mechanistic Evidence via Contrastive Last-Token Steering
Publication date :
02 December 2025
Main work title :
Frontiers in Artificial Intelligence and Applications
Publisher :
IOS Press
ISBN/EAN :
978-1-64368-638-7
Pages :
110 - 120
Peer reviewed :
Peer reviewed
Available on ORBilu :
since 15 December 2025
