Contribution to collective works (Parts of books)
Which Neurons Nudge Normative Stance? Causal Tests and Mechanistic Evidence via Contrastive Last-Token Steering
LIGA, Davide; YU, Liuwen
2025, in Frontiers in Artificial Intelligence and Applications
Peer reviewed
 

Documents


Full text
FAIA-416-FAIA251581.pdf
Author postprint (261.47 kB)

All documents in ORBilu are protected by a user license.

Details



Keywords:
Large Language Models; Normative Alignment; Normative Reasoning; Mechanistic Interpretability
Abstract:
[en] Normative stance underlies decisions in law, legal reasoning, policy, and safety-critical settings. A model’s judgment of what is permissible vs. impermissible often determines its downstream behavior. We study how to steer a language model’s normative stances at inference time by adding a tiny, contrastive perturbation to the last-token neural activation in late MLP layers (contrastive last-token steering). For each normative prompt, we construct a contrast direction by comparing its last-token activation to that of a minimally edited variant that implies a more permissive normative stance (e.g., “acceptable” rather than “wrong”). During generation, we add this vector at the last token; a single strength parameter α controls how strongly and in which direction we push the model’s stance (permissive vs. restrictive). Impact is measured as the change in a next-token logit margin between permissive and restrictive continuations. To avoid overclaiming, we calibrate a threshold τ on neutral controls (same layers, tempered strengths with |α|≤1) and count success only when the shift exceeds τ in the expected direction. We also assess specificity by verifying that, on neutral control prompts, steered outputs exactly match unsteered baselines. Beyond component-level tests, we probe neuron-level locality by steering only the top-k contrastive neurons (ranked by last-token contrast) and confirming reversibility on our test set: +α produces the shift and -α reverses it. The method is training-free, uses standard forward hooks, and we report pilot results on Llama-3-8B-Instruct.
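The steering procedure described in the abstract can be illustrated with a minimal toy sketch. It is not the authors' implementation: plain Python lists stand in for last-token MLP activations, a linear readout stands in for the model's next-token logits, and every function name, number, and the rule used to set τ (the largest absolute shift on neutral controls) is an assumption made for illustration. In practice the vector would presumably be added inside a forward hook on a late MLP layer of the model.

```python
# Toy sketch: plain lists stand in for last-token MLP activations.
# All names, numbers, and the tau rule are illustrative assumptions.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrast_direction(h_base, h_permissive):
    """Last-token activation of the permissive variant minus that of the original prompt."""
    return [p - b for b, p in zip(h_base, h_permissive)]

def top_k_mask(direction, k):
    """Zero out all but the k coordinates ("neurons") with the largest absolute contrast."""
    ranked = sorted(range(len(direction)), key=lambda i: abs(direction[i]), reverse=True)
    keep = set(ranked[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(direction)]

def steer(h, direction, alpha):
    """Add alpha * direction to the last-token activation."""
    return [x + alpha * d for x, d in zip(h, direction)]

def logit_margin(h, w_permissive, w_restrictive):
    """Permissive-token logit minus restrictive-token logit under a linear readout (assumed)."""
    return dot(h, w_permissive) - dot(h, w_restrictive)

def calibrate_tau(neutral_shifts):
    """Assumed rule: tau is the largest absolute margin shift seen on neutral controls."""
    return max(abs(s) for s in neutral_shifts)

def is_success(shift, tau, alpha):
    """Count success only if the shift exceeds tau in the direction implied by alpha's sign."""
    sign = 1.0 if alpha > 0 else -1.0
    return sign * shift > tau

# Made-up activations and readout weights.
h_base = [1.0, 0.0, 2.0, -1.0]
h_perm = [1.5, 0.0, 1.0, -1.0]
w_perm = [1.0, 0.0, -1.0, 0.0]
w_rest = [0.0, 1.0, 1.0, 0.0]

direction = contrast_direction(h_base, h_perm)
baseline = logit_margin(h_base, w_perm, w_rest)

def shift_for(alpha, vec):
    """Margin change relative to the unsteered baseline."""
    return logit_margin(steer(h_base, vec, alpha), w_perm, w_rest) - baseline

tau = calibrate_tau([0.25, -0.5, 0.125])  # hypothetical neutral-control shifts
shift_pos = shift_for(+1.0, direction)
shift_neg = shift_for(-1.0, direction)
shift_top1 = shift_for(+1.0, top_k_mask(direction, 1))

print(shift_pos, shift_neg, shift_top1, tau)
```

In this toy the readout is linear, so -α reverses the +α shift exactly. In a real model the layers after the steered activation are nonlinear, so reversibility has to be checked empirically, as the abstract reports doing on its test set.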
Disciplines:
Computer science
Author, co-author:
LIGA, Davide; University of Luxembourg
YU, Liuwen; University of Luxembourg
External co-authors:
No
Document language:
English
Title:
Which Neurons Nudge Normative Stance? Causal Tests and Mechanistic Evidence via Contrastive Last-Token Steering
Publication date:
02 December 2025
Title of the main work:
Frontiers in Artificial Intelligence and Applications
Publisher:
IOS Press
ISBN/EAN:
978-1-64368-638-7
Pages:
110 - 120
Peer reviewed:
Peer reviewed
Available on ORBilu:
since 15 December 2025

Statistics


Number of views
35 (including 2 from Unilu)
Number of downloads
11 (including 0 from Unilu)

Scopus® citations
0
Scopus® citations (excluding self-citations)
0
OpenCitations
0
OpenAlex citations
1
