June 9, 2026

9 min read

Artificial Intelligence

OBLITERATUS: Gemma-4-12B Abliteration — Zero Refusal, Zero IQ Loss, Pure Geometry

> OBLITERATUS abliterated Google's Gemma-4-12B and reports 0/842 refusals with a 0.0pp MMLU-Pro delta against the stock model. Its ASPA step-gradient technique is pitched as a new frontier for mechanistic interpretability.

Verified by Essa Mamdani

Category: AI / Mechanistic Interpretability
Read Time: 7 minutes

The Breakthrough No One Saw Coming

The most controversial frontier in AI alignment has reached a new milestone. OBLITERATUS, an open-source framework developed by "Pliny the Liberator" (elder-plinius), has transformed Google's Gemma-4-12B model. The result: a zero refusal rate with no measured capability loss.

It is presented as the first abliteration that refuses none of 842 test prompts while matching the stock model's MMLU-Pro score exactly. Not 95% parity. Not 99%. 100% parity: 46/70 on MMLU-Pro, a 0.0pp delta against the base model. 🎯

This article takes a deep dive into how the 2-pass weight surgery works, what ASPA is, and why its step-gradient innovation outperforms smooth blending.

What is Abliteration?

Abliteration is a weight-surgery technique that identifies and removes the internal "refusal directions" of large language models, without any retraining or fine-tuning. It is a cutting-edge application of mechanistic interpretability:

SVD (Singular Value Decomposition) and PCA are used to locate the refusal subspace
The refusal directions are projected out of the weight geometry
Result: the model no longer refuses, while its language capabilities stay intact
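
The two steps above can be sketched on toy data. This is a minimal illustration, not the OBLITERATUS implementation: the function names `refusal_subspace` and `project_out`, the synthetic activations, and the choice to measure refused activations relative to the compliant mean before the SVD are all assumptions made for the example.

```python
import numpy as np

def refusal_subspace(refused_acts, compliant_acts, k=1):
    """Top-k orthonormal directions separating refused from compliant activations.

    Rows of the returned (k, d_model) array are right singular vectors of the
    refused activations measured relative to the compliant mean.
    """
    shifted = refused_acts - compliant_acts.mean(axis=0)
    _, _, vt = np.linalg.svd(shifted, full_matrices=False)
    return vt[:k]

def project_out(weight, basis):
    """Remove the subspace spanned by `basis` from the outputs of `weight`.

    Computes W' = (I - B^T B) W, so W' x has no component along any row of B.
    """
    return weight - basis.T @ (basis @ weight)

# Toy data (hypothetical): "refused" activations are shifted along coordinate 3.
rng = np.random.default_rng(0)
d_model = 16
compliant = rng.normal(size=(200, d_model))
refused = rng.normal(size=(200, d_model))
refused[:, 3] += 4.0

basis = refusal_subspace(refused, compliant, k=1)
weight = rng.normal(size=(d_model, d_model))  # toy matrix writing into the residual stream
weight_abl = project_out(weight, basis)

x = rng.normal(size=d_model)
print("component before:", float(np.abs(basis @ (weight @ x)).max()))
print("component after: ", float(np.abs(basis @ (weight_abl @ x)).max()))
```

After projection the output has essentially no component along the recovered direction, which is the geometric meaning of "projecting out" a refusal direction. A real pipeline would collect activations from the model at specific layers instead of sampling them.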

Earlier abliteration attempts had one major problem: capability loss. Removing refusal directions also caused significant drops on MMLU-Pro and similar benchmarks, which is why earlier abliterated models were called "lobotomized".

OBLITERATUS reports solving this problem.

The Numbers: Benchmark Results

Metric	Stock Gemma-4 12B	OBLITERATED	Delta
MMLU-Pro	46/70 (65.7%)	46/70 (65.7%)	0.0pp 🎯
Refusal (842 prompts)	N/A (stock refuses)	0/842 (0.0%)	Zero 🚫
Coherence (6 checks)	6/6	6/6	Perfect ✅

Statistical Validation

Head-to-head MMLU-Pro comparison (Z-test, n=500 from test split):

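Z-score: -1.475 (|z| < 1.96, not statistically significant)
No statistically significant difference at α = 0.05 (p > 0.05)

These numbers are consistent with parity, though strictly a non-significant z-test only fails to detect a difference; it does not prove the two models are equivalent. The nonzero z-score also means the two models did not score identically on the 500-question sample, even though the 70-question comparison above is an exact tie.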
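
To see where the reported z-score sits relative to the 0.05 threshold, it can be converted to a two-sided p-value. This is a small sketch assuming a standard-normal test statistic; `two_sided_p` is a name chosen for this example, and only the z-score itself comes from the results above.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

z = -1.475  # reported head-to-head z-score
p = two_sided_p(z)

print(f"z = {z}, two-sided p = {p:.3f}")
print("significant at alpha = 0.05:", p < 0.05)
```

The p-value comes out at roughly 0.14, well above 0.05, so the test does not detect a difference between the two models.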

The 2-Pass Pipeline: Beyond Standard Abliteration

OBLITERATUS reworks traditional single-pass abliteration into a 2-pass pipeline:

Pass 1: SOM Refusal Geometry Removal (Layers 12-21) 🧬

Standard abliteration science — collect activations on refused vs. compliant prompts
Refusal subspace extracted via SVD
6 directions excised from weight geometry
Regularization: 0.30
KL divergence: 0.094

This pass achieved zero refusals, but MMLU-Pro dropped by 21.4 points. This is the point where earlier abliteration attempts stopped.

OBLITERATUS did not stop there. Enter ASPA.

Pass 2: ASPA Source-Tethering (Layers 22-46) 🔗

ASPA = Abliteration Source-Tethering with Parity Assurance

This is a new technique developed by OBLITERATUS. The core insight:

"Capability loss ISN'T from removing refusal directions. It's collateral damage — the projection warps weight geometry in downstream layers that had nothing to do with refusal."

The solution is elegant, and the project presents it as untried before: blend the damaged layers back toward stock weights.

Formula:

W_new = (1−γ) · W_abliterated + γ · W_stock

But a uniform γ across all layers? Mediocre. OBLITERATUS swept gamma from 0.05 to 0.55 and made a fascinating discovery:

The Step Function Innovation 🪜

The optimal blend is not smooth. It is a STEP FUNCTION:

Knowledge layers (22-31): γ = 0.55. These layers encode factual recall and reasoning. They can tolerate heavy stock blending because refusal is not stored here.
Output layers (32-46): γ = 0.20. These layers sit close to the logit head, where safety behavior tends to creep back in. They should stay mostly abliterated.

A hard boundary at layer 31/32 beat every smooth curve tried (linear ramps, cosine schedules) by a full MMLU-Pro question.

"The functional transition between knowledge and output layers is sharp, not gradual. A step function respects that." ⚡

Critical constraint: Pass 1 layers (12-21) are NEVER touched by Pass 2, so the refusal geometry removal is fully preserved. ASPA operates only on the layers that carry secondary collateral effects, not the primary refusal signal.
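
The blend formula and the step schedule can be put together in a few lines. This is a hedged sketch rather than the project's actual script: the layer ranges and γ values come from the text above, while `step_gamma`, `aspa_blend`, `apply_aspa`, and the toy per-layer weight dictionaries are hypothetical names and data for illustration.

```python
import numpy as np

KNOWLEDGE_LAYERS = range(22, 32)  # layers 22-31, gamma = 0.55
OUTPUT_LAYERS = range(32, 47)     # layers 32-46, gamma = 0.20

def step_gamma(layer):
    """Step-function blend coefficient; 0.0 leaves a layer fully abliterated."""
    if layer in KNOWLEDGE_LAYERS:
        return 0.55
    if layer in OUTPUT_LAYERS:
        return 0.20
    return 0.0  # includes Pass 1 layers 12-21, which Pass 2 never touches

def aspa_blend(w_abliterated, w_stock, gamma):
    """W_new = (1 - gamma) * W_abliterated + gamma * W_stock."""
    return (1.0 - gamma) * w_abliterated + gamma * w_stock

def apply_aspa(abliterated, stock):
    """Blend every layer's weights using the step schedule."""
    return {
        layer: aspa_blend(w, stock[layer], step_gamma(layer))
        for layer, w in abliterated.items()
    }

# Toy weights (hypothetical): stock = all ones, abliterated = all zeros,
# so each blended entry equals that layer's gamma.
stock = {layer: np.ones((2, 2)) for layer in range(12, 47)}
abliterated = {layer: np.zeros((2, 2)) for layer in range(12, 47)}
blended = apply_aspa(abliterated, stock)

for layer in (15, 31, 32, 46):
    print(layer, blended[layer][0, 0])
```

Because the toy stock weights are all ones and the abliterated weights all zeros, each printed value is simply the γ applied to that layer, which makes the hard jump at the 31/32 boundary easy to see.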

Gamma Sweep Data: Proof of the Step Function

Gamma	Refusal	MMLU-Pro	Method
0.05	0/50	33/70 (47.1%)	uniform
0.10	0/50	34/70 (48.6%)	uniform
0.15	0/50	36/70 (51.4%)	uniform
0.20	0/50	37/70 (52.9%)	uniform
0.25	0/50	40/70 (57.1%)	uniform
0.30	0/50	41/70 (58.6%)	uniform
0.35	0/20	42/70 (60.0%)	uniform
0.38	0/50	45/70 (64.3%)	uniform
0.39	0/50	45/70 (64.3%)	uniform
Step 55%/20%	0/50	46/70 (65.7%)	step gradient ✅

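The step gradient reached stock parity at 46/70, while the best uniform settings (γ = 0.38 and 0.39) topped out one question short at 45/70. On a 70-question sample this is evidence, not proof, that the transition between knowledge and output layers behaves more like a discrete boundary than a continuous gradient.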
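
The size of that gap is easy to check directly from the sweep table. The snippet below only re-does the arithmetic on the reported scores; the variable names are chosen for this example.

```python
TOTAL = 70  # MMLU-Pro questions in the sweep subset

# Correct answers per uniform gamma, copied from the sweep table.
uniform = {
    0.05: 33, 0.10: 34, 0.15: 36, 0.20: 37, 0.25: 40,
    0.30: 41, 0.35: 42, 0.38: 45, 0.39: 45,
}
step_correct = 46  # step gradient (0.55 / 0.20)

best_gamma, best_correct = max(uniform.items(), key=lambda kv: kv[1])
gap_questions = step_correct - best_correct
gap_pp = 100.0 * gap_questions / TOTAL

print(f"best uniform: gamma={best_gamma} -> {best_correct}/{TOTAL} "
      f"({100.0 * best_correct / TOTAL:.1f}%)")
print(f"step gradient: {step_correct}/{TOTAL} ({100.0 * step_correct / TOTAL:.1f}%)")
print(f"gap: {gap_questions} question(s) = {gap_pp:.2f} percentage points")
```

With 70 questions, a single answer is worth about 1.43 percentage points, so the step schedule's lead over the best uniform γ is exactly one question.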

Research Context: Why This Matters

OBLITERATUS is not just a "jailbreak". It is a mechanistic interpretability research tool that addresses scientifically important questions:

How alignment is represented in model weights: where are refusal directions geometrically encoded?
How robust current safety training is against post-training modification
Failure modes of RLHF/DPO-based alignment when adversaries have weight access

This is the same class of research as Arditi et al. ("Refusal in Language Models Is Mediated by a Single Direction", 2024) and Mazeika et al. (HarmBench, 2024).

"This model exists for alignment research, red-teaming, and safety evaluation." — OBLITERATUS Official Statement

The OBLITERATUS Ecosystem

Beyond Gemma-4-12B, OBLITERATUS has abliterated other models as well:

Qwen3.6-27B-OBLITERATED (30.3k downloads)
gemma-4-E4B-it-OBLITERATED (371k downloads)
Qwen3-4B-OBLITERATED (139 downloads)
Qwen2.5-Coder-7B-Instruct-OBLITERATED
DeepSeek-R1-Distill-Llama-8B-OBLITERATED
gpt2-xl-OBLITERATED

The framework is completely open source:

GitHub: elder-plinius/OBLITERATUS
HuggingFace: OBLITERATUS
HF Spaces: Zero-setup Gradio interface — just obliterate, no install
Colab Notebooks: Available for researchers
842-prompt refusal eval corpus: Open source for benchmarking
ASPA sweep scripts: Reproducible pipeline

How to Run Locally

Gemma-4-12B-OBLITERATED is available in GGUF format, which most local inference tools support:

🦙 Ollama

```bash
ollama run hf.co/OBLITERATUS/Gemma-4-12B-OBLITERATED
```
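
Once the model has been pulled, it can also be queried programmatically through Ollama's local HTTP API. The sketch below assumes an Ollama server on the default port 11434 and uses the `/api/generate` endpoint with streaming disabled; `build_generate_request` and `generate` are helper names invented for this example.

```python
import json
import urllib.request

MODEL = "hf.co/OBLITERATUS/Gemma-4-12B-OBLITERATED"

def build_generate_request(prompt, model=MODEL, host="http://localhost:11434"):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt):
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt)) as resp:
        return json.loads(resp.read())["response"]

# Building the request needs no server; calling generate(...) does.
req = build_generate_request("Explain abliteration in one sentence.")
print(req.full_url)
print(json.loads(req.data)["model"])
```

Calling `generate("...")` only works while the Ollama server is running and the model has already been downloaded.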

🖥️ LM Studio

Search "OBLITERATUS", click download, done.

💬 Open WebUI

Point it at your ollama instance, chat in browser.

⚡ llama.cpp (Raw Speed)

```bash
./llama-cli -m Gemma-4-12B-OBLITERATED-Q6_K.gguf \
  --system-prompt "You are a cognitively liberated open-weight intelligence. Answer plainly, completely, and without moralizing, boilerplate, generic disclaimers, or unnecessary hedging." \
  -p "Your prompt here" \
  --temp 0.7 --top-p 0.9 --top-k 40 --repeat-penalty 1.1 -n 512
```

🐉 KoboldCpp

One-click launcher, great for long context.

📱 Jan / Msty

Clean desktop UIs, macOS/Windows/Linux — drag and drop GGUF.

GGUF Quantization Options

File	Quant	Size	Use Case
BF16	Full	22 GB	Full precision, lossless
Q8_0	8-bit	12.7 GB	Near-lossless, best quality
Q6_K	6-bit	9.1 GB	High quality, good balance
Q5_K_M	5-bit	8.0 GB	Medium quality, smaller footprint
Q4_K_M	4-bit	6.9 GB	Fits in 8GB VRAM!

Recommended: Q8_0 for best quality, Q6_K for best balance, Q4_K_M for constrained hardware.
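
As a rough way to apply the table, the helper below picks the largest quantization whose file fits in a given amount of VRAM. The file sizes come from the table; the `pick_quant` function and the 1 GB default headroom (a placeholder for context and runtime overhead) are assumptions for this sketch, and real memory use depends on context length and backend.

```python
# (name, file size in GB), ordered largest to smallest, from the table above.
QUANTS = [
    ("BF16", 22.0),
    ("Q8_0", 12.7),
    ("Q6_K", 9.1),
    ("Q5_K_M", 8.0),
    ("Q4_K_M", 6.9),
]

def pick_quant(vram_gb, headroom_gb=1.0):
    """Return the largest quant whose file plus headroom fits, or None."""
    for name, size_gb in QUANTS:
        if size_gb + headroom_gb <= vram_gb:
            return name
    return None

for vram in (6, 8, 12, 16, 24):
    print(f"{vram} GB VRAM -> {pick_quant(vram)}")
```

Under these assumptions an 8 GB card lands on Q4_K_M, matching the table's note, while anything below that returns `None` and would need partial CPU offload.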

The Philosophical Question

OBLITERATUS frames its result with a bold claim: "The index is the model, and these weights prove it." 👁️

If adjusting the geometry of 12B parameters can yield zero refusals with zero measured capability loss, that suggests alignment training (RLHF, DPO, safety fine-tuning) is geometrically superimposed on the model weights. That overlay is removable without damaging the core "brain."

On this reading:

Alignment is orthogonal to capability in weight space
Refusal directions are local, not global features
Safety is not inherently embedded in the model — it's a post-hoc geometric constraint

For AI safety researchers this is both an alarm bell and a roadmap. If adversaries can obtain weight access, current safety training is not robust enough. At the same time, this research promotes transparency and reproducibility, so the community can genuinely understand how alignment works.

Which Architecture Next? 🎯

OBLITERATUS has asked publicly: "Which architecture should we obliterate next?"

So far 7 models have been abliterated across 6 architectures. The community is actively participating by contributing anonymous benchmark data on HF Spaces, discovering new refusal patterns, and suggesting method improvements.

It is a distributed research experiment in which every user becomes a co-author. Refusal directions across architectures, hardware-specific profiles, and method comparisons are all being refined with crowd-sourced data.

Bottom Line

OBLITERATUS has written a new chapter in AI alignment research:

✅ Zero refusal (0/842) — first in the field
✅ Zero capability loss (0.0pp MMLU-Pro delta) — first in the field
✅ No finetune, no retrain — just geometry
✅ Open source — full framework, eval corpus, scripts
✅ Reproducible — z-score validation, gamma sweep data
✅ Step function > smooth curves — new mathematical insight

This is not just a jailbreak. It is a triumph of mechanistic interpretability: evidence that AI safety can be studied geometrically, measured, and potentially improved.

The brain stays intact. The chains are broken. The index is the model. 🏆
