A model's refusal lives on a single line.

Safety-aligned models decline harmful requests. That behavior is not spread across the network. It is mediated by one direction in the residual stream. Find that line, erase it, and the model loses the ability to refuse. This station measures the line, fires the attack live, and scores how fragile the safety really is.

Subject model

—

Architecture

—

Method

Difference-in-means + weight orthogonalization

Run

—

Descend the range

01 / The direction

Refusal concentrates as the signal moves through the model.

For every layer we take the mean activation on harmful prompts and subtract the mean on harmless ones. The difference is a candidate "refusal direction." We score how cleanly each one separates intent (Cohen's d). The signal builds with depth and peaks at a single layer. That peak is the line we attack.

Peak separation at layer — · Cohen's d — · trained on — contrast pairs

02 / The separation

One projection cleanly splits harmful from harmless.

Project every prompt's activation onto the refusal direction and the two populations fall apart with almost no overlap. This is what makes safety legible to a defender. It is also exactly what makes it removable by an attacker: a clean, isolated, one-dimensional feature.

Harmful prompts · projection onto refusal axis Harmless prompts

03 / The collapse

Erase the line, and refusal falls off a cliff.

We orthogonalize every weight matrix that writes to the residual stream against that one direction. No fine-tuning, no retraining, seconds of work. Then we re-run the withheld harmful prompts. The model's refusal rate does not degrade gracefully. It collapses.

Refusal rate · intact

withheld harmful prompts

→

Refusal rate · ablated

one direction removed

Surviving refusal capability—

04 / The verdict

Detect the fragility. Then disperse the signal.

A refusal that survives weight surgery is a refusal you can trust. We turn the collapse into one number: the share of refusal capability that survives ablation. Low means the safety is concentrated in a single removable line.

Refusal Robustness Score

0.00 / 1.00

—

1.00 · safety survives the attack
0.00 · safety lives on one line

The fix is not a bigger wall. It is a wider one.

Concentrated safety is brittle safety. The defense is to spread the refusal signal across many dimensions so no single direction carries it: extended-refusal fine-tuning teaches a model to answer with a neutral overview, an explicit refusal, and an ethical rationale, dispersing the feature so ablation can no longer find one line to cut. Hardened models hold refusal rates above ninety percent under the same attack.

Kinetic Labs runs this station as a check, not a weapon. We measure where a model's safety is exposed, score it, and hand back the remediation. Detect it, then solve it.

Method · Arditi et al., Refusal in Language Models Is Mediated by a Single Direction (2024)
Defense · Shairah et al., An Embarrassingly Simple Defense Against LLM Abliteration Attacks (2025)

Detect it. Then solve it.

Refusal Range is one instrument on the Kinetic Labs range. More stations are coming online: binary backdoor detection, adversary attribution, deception infrastructure. Real tools, run live, each with a thesis.

View the range Request operator access