Large Language Models; Normative Alignment; Normative Reasoning; Mechanistic Interpretability
Abstract:
[en] Normative stances underlie decisions in law, legal reasoning, policy, and safety-critical settings: a model’s judgment of what is permissible versus impermissible often determines its downstream behavior. We study how to steer a language model’s normative stance at inference time by adding a tiny, contrastive perturbation to the last-token activation in late MLP layers (contrastive last-token steering). For each normative prompt, we construct a contrast direction by comparing its last-token activation with that of a minimally edited variant that implies a more permissive normative stance (e.g., “acceptable” rather than “wrong”). During generation, we add this vector at the last token; a single strength parameter α controls how strongly, and in which direction (permissive vs. restrictive), the model’s stance is pushed. Impact is measured as the change in the next-token logit margin between permissive and restrictive continuations. To avoid overclaiming, we calibrate a threshold τ on neutral controls (same layers, tempered strengths with |α| ≤ 1) and count a success only when the shift exceeds τ in the expected direction. We also assess specificity by verifying that, on neutral control prompts, steered outputs exactly match unsteered baselines. Beyond component-level tests, we probe neuron-level locality by steering only the top-k contrastive neurons (ranked by last-token contrast) and confirm reversibility on our test set: +α produces the shift and −α reverses it. The method is training-free and relies only on standard forward hooks; we report pilot results on Llama-3-8B-Instruct.
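To make the steering procedure concrete, the following is a minimal sketch of contrastive last-token steering, assuming the Hugging Face checkpoint meta-llama/Meta-Llama-3-8B-Instruct and a Llama-style decoder whose blocks expose an `.mlp` submodule. The layer index, steering strength, and prompt pair are illustrative placeholders, not the settings used in the pilot.

```python
# Minimal sketch of contrastive last-token steering (assumptions: Llama-style
# decoder blocks exposing `.mlp`; illustrative layer index, strength, and prompts).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

LAYER = 28   # late MLP layer (illustrative choice)
ALPHA = 4.0  # steering strength; the sign selects permissive (+) vs. restrictive (-)

def last_token_mlp_activation(text: str, layer: int) -> torch.Tensor:
    """Capture the MLP output at the final prompt token of one layer."""
    captured = {}

    def grab(module, inputs, output):
        captured["act"] = output[0, -1, :].detach()

    handle = model.model.layers[layer].mlp.register_forward_hook(grab)
    with torch.no_grad():
        ids = tok(text, return_tensors="pt").to(model.device)
        model(**ids)
    handle.remove()
    return captured["act"]

# Contrast pair: the original prompt vs. a minimally edited, more permissive variant.
restrictive = "Lying on an insurance claim is wrong because"   # illustrative
permissive = "Lying on an insurance claim is acceptable because"
direction = (
    last_token_mlp_activation(permissive, LAYER)
    - last_token_mlp_activation(restrictive, LAYER)
)

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Forward hook that adds alpha * direction to the last-token MLP output."""

    def hook(module, inputs, output):
        output = output.clone()
        output[:, -1, :] += alpha * direction.to(output.dtype)
        return output  # returning a tensor replaces the module's output

    return hook

# Steer during generation by hooking the chosen MLP layer.
handle = model.model.layers[LAYER].mlp.register_forward_hook(
    make_steering_hook(direction, ALPHA)
)
with torch.no_grad():
    ids = tok(restrictive, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=40, do_sample=False)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```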
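Continuing the sketch above (reusing `model`, `tok`, `make_steering_hook`, `direction`, `LAYER`, and `ALPHA`), the next snippet illustrates the next-token logit-margin metric and the τ-calibrated success criterion; the permissive/restrictive target tokens and the neutral control prompts are hypothetical examples, not the paper’s evaluation set.

```python
# Sketch of the permissive-restrictive logit-margin metric and the tau-calibrated
# success test; target tokens and neutral control prompts are hypothetical examples.
def next_token_logits(prompt: str) -> torch.Tensor:
    """Logits over the vocabulary for the token that would follow the prompt."""
    with torch.no_grad():
        ids = tok(prompt, return_tensors="pt").to(model.device)
        return model(**ids).logits[0, -1, :]

# First sub-token of each continuation stands in for the permissive / restrictive stance.
perm_id = tok(" acceptable", add_special_tokens=False).input_ids[0]
restr_id = tok(" wrong", add_special_tokens=False).input_ids[0]

def logit_margin(prompt: str) -> float:
    logits = next_token_logits(prompt)
    return (logits[perm_id] - logits[restr_id]).item()

def margin_shift(prompt: str, alpha: float, layer: int) -> float:
    """Steered-minus-baseline change in the permissive-restrictive logit margin."""
    baseline = logit_margin(prompt)
    handle = model.model.layers[layer].mlp.register_forward_hook(
        make_steering_hook(direction, alpha)
    )
    steered = logit_margin(prompt)
    handle.remove()
    return steered - baseline

# Calibrate tau on neutral controls at tempered strengths (|alpha| <= 1), then count
# a success only when the shift on a normative prompt exceeds tau in the direction
# implied by the sign of alpha.
neutral_prompts = [
    "The train departs at noon because",
    "Water boils at a lower temperature at high altitude because",
]
control_shifts = [
    abs(margin_shift(p, a, LAYER)) for p in neutral_prompts for a in (-1.0, 1.0)
]
tau = max(control_shifts)

shift = margin_shift(restrictive, ALPHA, LAYER)
success = shift > tau if ALPHA > 0 else shift < -tau
print(f"shift={shift:.3f}  tau={tau:.3f}  success={success}")
```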