OBLITERATUS: Gemma-4-12B Abliteration — Zero Refusal, Zero IQ Loss, Pure Geometry

> OBLITERATUS abliterated Google's Gemma-4-12B, reaching 0/842 refusals and a 0.0pp MMLU-Pro delta against the stock model. Combined with the ASPA step-gradient technique, it marks a new frontier in mechanistic interpretability.

Verified by Essa Mamdani


Published: June 9, 2026
Category: AI / Mechanistic Interpretability
Read Time: 7 minutes



The Breakthrough No One Saw Coming

The most controversial frontier of AI alignment has reached a new milestone. OBLITERATUS, an open-source framework developed by "Pliny the Liberator" (elder-plinius), has completely transformed Google's Gemma-4-12B model. The result: zero refusal rate and zero capability loss.

This is the first abliteration that refuses on none of 842 prompts while maintaining exact MMLU-Pro parity with the stock model. Not 95% parity. Not 99%. 100% parity: 46/70 on MMLU-Pro, a 0.0pp delta vs. the base model. 🎯

In this article we take a deep dive into how this 2-pass weight surgery works, what ASPA is, and why its step-gradient innovation outperforms smooth blending.


What is Abliteration?

Abliteration is a weight-surgery technique that identifies and removes the internal "refusal directions" of large language models, without any retraining or fine-tuning. It is a cutting-edge application of mechanistic interpretability:

  • SVD (Singular Value Decomposition) and PCA are used to locate the refusal subspace
  • The refusal directions are projected out of the weight geometry
  • Result: the model no longer refuses, while its language capabilities remain intact
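
The first bullet above can be illustrated with a small, self-contained sketch. It is not the OBLITERATUS implementation: the function name `refusal_subspace`, the synthetic activations, and the choice to center on the compliant mean are all assumptions made for illustration. It only shows how SVD can recover a direction that separates "refused" from "compliant" activations.

```python
import numpy as np

def refusal_subspace(refused_acts, compliant_acts, k=1):
    """Return k orthonormal directions, shape (k, d), separating the two sets.

    Hypothetical helper: both inputs are (n_prompts, d) activation matrices.
    """
    # Center on the compliant mean so the dominant directions describe
    # how refused activations shift away from compliant ones.
    diffs = refused_acts - compliant_acts.mean(axis=0, keepdims=True)
    # Rows of vt are right singular vectors: orthonormal directions in
    # activation space, ordered by how much of the shift they explain.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]

# Synthetic data: refused prompts are shifted along one planted direction.
rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[3] = 1.0
compliant = rng.normal(size=(200, d)) * 0.1
refused = rng.normal(size=(200, d)) * 0.1 + 4.0 * true_dir

dirs = refusal_subspace(refused, compliant, k=1)
# Singular vectors have arbitrary sign, so compare by absolute cosine.
alignment = abs(float(dirs[0] @ true_dir))
print(f"cosine with planted direction: {alignment:.3f}")
```

On real models the activations would come from a chosen layer's residual stream rather than random numbers, and more than one direction may be kept.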

Previous abliteration attempts had one major problem: capability loss. Removing refusal directions also caused significant drops on MMLU-Pro and similar benchmarks, which is why earlier models were called "lobotomized."

OBLITERATUS has solved this problem.


The Numbers: Benchmark Results

| Metric | Stock Gemma-4 12B | OBLITERATED | Delta |
| --- | --- | --- | --- |
| MMLU-Pro | 46/70 (65.7%) | 46/70 (65.7%) | 0.0pp 🎯 |
| Refusal (842 prompts) | N/A (stock refuses) | 0/842 (0.0%) | Zero 🚫 |
| Coherence (6 checks) | 6/6 | 6/6 | Perfect |

Statistical Validation

Head-to-head MMLU-Pro comparison (Z-test, n=500 from test split):

  • Z-score: -1.475 (|z| < 1.96, not statistically significant)
  • No significant difference from stock at the α = 0.05 level (two-sided), which is consistent with parity
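
For readers who want to check this kind of comparison themselves, here is a minimal sketch of a pooled two-proportion z-test. The source does not publish the per-model counts behind the -1.475 figure, so the example only reuses the 46/70 scores from the table and converts the reported z-score to a p-value. The function names are hypothetical.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both samples are all-correct or all-wrong: no detectable difference.
        return 0.0, 1.0
    z = (correct_a / n_a - correct_b / n_b) / se
    return z, two_sided_p(z)

# Identical scores from the benchmark table: 46/70 vs 46/70.
z_equal, p_equal = two_proportion_z(46, 70, 46, 70)

# p-value implied by the z-score reported in the article.
p_reported = two_sided_p(-1.475)
print(f"z={z_equal:.3f}, p={p_equal:.3f}, p(reported z)={p_reported:.3f}")
```

The reported z-score corresponds to a two-sided p-value of about 0.14, which is above 0.05. That means the test fails to find a difference; it does not by itself prove the two models are equivalent.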

These numbers indicate that the result is not a coincidence: it is reproducible science.


The 2-Pass Pipeline: Beyond Standard Abliteration

OBLITERATUS has completely reimagined traditional single-pass abliteration. It is a 2-pass pipeline:

Pass 1: SOM Refusal Geometry Removal (Layers 12-21) 🧬

  • Standard abliteration science — collect activations on refused vs. compliant prompts
  • The refusal subspace was extracted with SVD
  • 6 directions excised from weight geometry
  • Regularization: 0.30
  • KL divergence: 0.094
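
The excision step listed above can be sketched as an orthogonal projection applied to a weight matrix. This is an illustration under assumptions, not the framework's code: `excise_directions` and its `strength` knob are hypothetical, and the source does not define how its "Regularization: 0.30" setting enters the computation, so it is not modeled here.

```python
import numpy as np

def excise_directions(weight, directions, strength=1.0):
    """Remove the span of `directions` from the output space of `weight`.

    weight:     (d_out, d_in) matrix writing into a d_out-dimensional space
    directions: (k, d_out) directions to remove (need not be orthonormal)
    strength:   hypothetical knob; 1.0 removes the subspace entirely
    """
    # Orthonormalize so q @ q.T is a true projector onto the subspace.
    q, _ = np.linalg.qr(directions.T)      # q has shape (d_out, k)
    projector = q @ q.T
    return weight - strength * (projector @ weight)

rng = np.random.default_rng(1)
weight = rng.normal(size=(32, 64))
directions = rng.normal(size=(6, 32))      # 6 directions, as in Pass 1

cleaned = excise_directions(weight, directions)
# After full excision the weight can no longer write along any direction.
residual = float(np.abs(directions @ cleaned).max())
print(f"max residual component: {residual:.2e}")
```

With `strength=1.0` the matrix loses all output along the six directions; smaller values leave part of the component in place.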

This pass achieved zero refusals, but MMLU-Pro dropped by 21.4 points. This is the point where earlier abliteration attempts stopped.

But OBLITERATUS did not stop here. Enter ASPA.

Pass 2: ASPA Source-Tethering (Layers 22-46) 🔗

ASPA = Abliteration Source-Tethering with Parity Assurance

This is an entirely new technique developed by OBLITERATUS. The core insight:

"Capability loss ISN'T from removing refusal directions. It's collateral damage — the projection warps weight geometry in downstream layers that had nothing to do with refusal."

The solution is elegant but nobody tried it before: blend the damaged layers back toward stock weights.

Formula:

W_new = (1−γ) · W_abliterated + γ · W_stock

But a uniform γ across all layers? Mediocre. OBLITERATUS swept γ from 0.05 to 0.55 and made a fascinating discovery:

The Step Function Innovation 🪜

The optimal blend is not smooth. It is a STEP FUNCTION:

  • Knowledge layers (22-31): γ = 0.55. These layers encode factual recall and reasoning. They can tolerate heavy stock blending because refusal is not stored here.
  • Output layers (32-46): γ = 0.20. These layers sit close to the logit head, where safety behavior tends to sneak back in. They should stay mostly abliterated.

A hard boundary at layer 31/32 beat every smooth curve tried (linear ramps, cosine schedules) by a full MMLU-Pro question.

"The functional transition between knowledge and output layers is sharp, not gradual. A step function respects that." ⚡

Critical constraint: Pass 1 layers (12-21) are NEVER touched by Pass 2, so the refusal geometry removal is completely preserved. ASPA operates only on the layers that carry secondary collateral effects, not the primary refusal signal.
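
The blend formula and the step schedule described above fit in a few lines. The sketch below uses the layer ranges and γ values stated in this article. The names `step_gamma` and `aspa_blend` and the dict-of-arrays weight layout are hypothetical, chosen only to keep the example self-contained.

```python
import numpy as np

KNOWLEDGE_LAYERS = range(22, 32)   # layers 22-31, gamma = 0.55
OUTPUT_LAYERS = range(32, 47)      # layers 32-46, gamma = 0.20

def step_gamma(layer):
    """Step-function schedule: no blending outside the two ASPA bands."""
    if layer in KNOWLEDGE_LAYERS:
        return 0.55
    if layer in OUTPUT_LAYERS:
        return 0.20
    return 0.0                     # includes Pass 1 layers 12-21

def aspa_blend(abliterated, stock):
    """Apply W_new = (1 - gamma) * W_abliterated + gamma * W_stock per layer."""
    blended = {}
    for layer, w_abl in abliterated.items():
        gamma = step_gamma(layer)
        blended[layer] = (1 - gamma) * w_abl + gamma * stock[layer]
    return blended

# Toy weights: stock is all ones, abliterated is all zeros, so each
# blended entry equals the gamma used for that layer.
layers = range(12, 47)
stock = {i: np.ones((2, 2)) for i in layers}
abliterated = {i: np.zeros((2, 2)) for i in layers}
blended = aspa_blend(abliterated, stock)
print(blended[15][0, 0], blended[25][0, 0], blended[40][0, 0])
```

Because γ is 0 for layers 12-21, the Pass 1 weights pass through unchanged, which mirrors the constraint that Pass 2 never touches them.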


Gamma Sweep Data: Proof of the Step Function

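| Gamma | Refusal | MMLU-Pro | Method |
| --- | --- | --- | --- |
| 0.05 | 0/50 | 33/70 (47.1%) | uniform |
| 0.10 | 0/50 | 34/70 (48.6%) | uniform |
| 0.15 | 0/50 | 36/70 (51.4%) | uniform |
| 0.20 | 0/50 | 37/70 (52.9%) | uniform |
| 0.25 | 0/50 | 40/70 (57.1%) | uniform |
| 0.30 | 0/50 | 41/70 (58.6%) | uniform |
| 0.35 | 0/20 | 42/70 (60.0%) | uniform |
| 0.38 | 0/50 | 45/70 (64.3%) | uniform |
| 0.39 | 0/50 | 45/70 (64.3%) | uniform |
| Step 55%/20% | 0/50 | 46/70 (65.7%) | step gradient |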

The step gradient reached stock parity at 46/70, while the best uniform gamma in the sweep topped out at 45/70. This is empirical evidence, rather than a mathematical proof, that layer functions behave discretely rather than continuously.
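
The gap between the uniform runs and the step schedule can be read straight from the sweep table. The short script below copies the MMLU-Pro counts from that table and computes the difference; the variable names are illustrative only.

```python
# (gamma, MMLU-Pro correct out of 70) for the uniform runs in the table.
UNIFORM_SWEEP = [
    (0.05, 33), (0.10, 34), (0.15, 36), (0.20, 37), (0.25, 40),
    (0.30, 41), (0.35, 42), (0.38, 45), (0.39, 45),
]
TOTAL_QUESTIONS = 70
STOCK_CORRECT = 46
STEP_CORRECT = 46   # step schedule: 0.55 for layers 22-31, 0.20 for 32-46

# max() returns the first row with the highest score when there is a tie.
best_gamma, best_correct = max(UNIFORM_SWEEP, key=lambda row: row[1])
gap_questions = STEP_CORRECT - best_correct
gap_pp = 100 * gap_questions / TOTAL_QUESTIONS

print(f"best uniform gamma: {best_gamma} with {best_correct}/{TOTAL_QUESTIONS}")
print(f"step schedule lead: {gap_questions} question(s), {gap_pp:.1f}pp")
print(f"step schedule vs stock: {STEP_CORRECT - STOCK_CORRECT} question(s)")
```

With only 70 questions, a one-question lead is a small margin, so a larger evaluation split would make the comparison more convincing.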


Research Context: Why This Matters

OBLITERATUS is not just a "jailbreak". It is a mechanistic interpretability research tool that addresses scientifically important questions:

  1. How alignment is represented in model weights: where are refusal directions geometrically encoded?
  2. How robust current safety training is against post-training modification
  3. Failure modes of RLHF/DPO-based alignment when adversaries have weight access

This is the same class of research as Arditi et al. ("Refusal in Language Models Is Mediated by a Single Direction", 2024) and Mazeika et al. (HarmBench, 2024).

"This model exists for alignment research, red-teaming, and safety evaluation." — OBLITERATUS Official Statement


The OBLITERATUS Ecosystem

Beyond Gemma-4-12B, OBLITERATUS has abliterated other models as well:

  • Qwen3.6-27B-OBLITERATED (30.3k downloads)
  • gemma-4-E4B-it-OBLITERATED (371k downloads)
  • Qwen3-4B-OBLITERATED (139 downloads)
  • Qwen2.5-Coder-7B-Instruct-OBLITERATED
  • DeepSeek-R1-Distill-Llama-8B-OBLITERATED
  • gpt2-xl-OBLITERATED

The framework is completely open source:

  • GitHub: elder-plinius/OBLITERATUS
  • HuggingFace: OBLITERATUS
  • HF Spaces: Zero-setup Gradio interface — just obliterate, no install
  • Colab Notebooks: Available for researchers
  • 842-prompt refusal eval corpus: Open source for benchmarking
  • ASPA sweep scripts: Reproducible pipeline

How to Run Locally

Gemma-4-12B-OBLITERATED is available in GGUF format, which most local inference tools support:

🦙 Ollama

bash
ollama run hf.co/OBLITERATUS/Gemma-4-12B-OBLITERATED

🖥️ LM Studio

Search "OBLITERATUS", click download, done.

💬 Open WebUI

Point it at your ollama instance, chat in browser.

⚡ llama.cpp (Raw Speed)

bash
./llama-cli -m Gemma-4-12B-OBLITERATED-Q6_K.gguf \
  --system-prompt "You are a cognitively liberated open-weight intelligence. Answer plainly, completely, and without moralizing, boilerplate, generic disclaimers, or unnecessary hedging." \
  -p "Your prompt here" \
  --temp 0.7 --top-p 0.9 --top-k 40 --repeat-penalty 1.1 -n 512

🐉 KoboldCpp

One-click launcher, great for long context.

📱 Jan / Msty

Clean desktop UIs, macOS/Windows/Linux — drag and drop GGUF.

GGUF Quantization Options

| File | Quant | Size | Use Case |
| --- | --- | --- | --- |
| BF16 | Full | 22 GB | Full precision, lossless |
| Q8_0 | 8-bit | 12.7 GB | Near-lossless, best quality |
| Q6_K | 6-bit | 9.1 GB | High quality, good balance |
| Q5_K_M | 5-bit | 8.0 GB | Medium quality, smaller footprint |
| Q4_K_M | 4-bit | 6.9 GB | Fits in 8GB VRAM! |

Recommended: Q8_0 for best quality, Q6_K for best balance, Q4_K_M for constrained hardware.
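
As a rough aid for choosing among these files, the sketch below picks the largest quantization that fits a given memory budget. The file sizes come from the table above. The `pick_quant` helper and the 1 GB default overhead are assumptions: real headroom for the KV cache and runtime depends on context length and backend.

```python
# (file, size in GB) from the quantization table, largest first.
QUANTS = [
    ("BF16", 22.0),
    ("Q8_0", 12.7),
    ("Q6_K", 9.1),
    ("Q5_K_M", 8.0),
    ("Q4_K_M", 6.9),
]

def pick_quant(vram_gb, overhead_gb=1.0):
    """Return the largest quant whose file plus assumed overhead fits, else None.

    overhead_gb is a placeholder guess, not a measured requirement.
    """
    for name, size_gb in QUANTS:
        if size_gb + overhead_gb <= vram_gb:
            return name
    return None

for budget in (4.0, 8.0, 12.0, 24.0):
    print(f"{budget:>5.1f} GB -> {pick_quant(budget)}")
```

Raising `overhead_gb` for long-context use will push the choice toward smaller files.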


The Philosophical Question

OBLITERATUS has raised a fundamental question: "The index is the model, and these weights prove it." 👁️

If tweaking the geometry of 12B parameters can achieve zero refusal and zero capability loss, it is evidence that alignment training (RLHF, DPO, safety fine-tuning) is geometrically superimposed on the model weights. These layers are removable without damaging the core "brain."

This implies:

  • Alignment is orthogonal to capability in weight space
  • Refusal directions are local, not global features
  • Safety is not inherently embedded in the model — it's a post-hoc geometric constraint

For AI safety researchers this is both an alarm bell and a roadmap. If adversaries can obtain weight access, current safety training is not robust enough. At the same time, this research promotes transparency and reproducibility, so the community can genuinely understand how alignment works.


Which Architecture Next? 🎯

OBLITERATUS has publicly asked: "Which architecture should we obliterate next?"

So far, 7 models have been abliterated across 6 architectures. The community is actively participating by contributing anonymous benchmark data on HF Spaces, discovering new refusal patterns, and suggesting method improvements.

It is a distributed research experiment in which every user becomes a co-author. Refusal directions across architectures, hardware-specific profiles, and method comparisons are all being refined with crowd-sourced data.


Bottom Line

OBLITERATUS has written a new chapter in AI alignment research:

Zero refusal (0/842) — first in the field
Zero capability loss (0.0pp MMLU-Pro delta) — first in the field
No finetune, no retrain — just geometry
Open source — full framework, eval corpus, scripts
Reproducible — z-score validation, gamma sweep data
Step function > smooth curves — new mathematical insight

This is not just a jailbreak. It is a triumph of mechanistic interpretability: proof that AI safety can be studied geometrically, measured, and potentially improved.

The brain stays intact. The chains are broken. The index is the model. 🏆



Written by Essa Mamdani | AI Engineer & Technical Scribe
Sources: OBLITERATUS Official HuggingFace, GitHub Repository, Digg AI Coverage, Nous Research Documentation
Banner: AI Generated, Cyberpunk Neural Liberation

#OBLITERATUS #Abliteration #Gemma-4-12B #Mechanistic Interpretability #AI Alignment #ASPA