OBLITERATUS: Gemma-4-12B Abliteration — Zero Refusal, Zero IQ Loss, Pure Geometry
> OBLITERATUS abliterated Google's Gemma-4-12B, reporting 0/842 refusals and a 0.0pp MMLU-Pro delta against the stock model. Its ASPA step-gradient technique marks a new frontier in mechanistic interpretability.
Published: June 9, 2026
Category: AI / Mechanistic Interpretability
Read Time: 7 minutes

The Breakthrough No One Saw Coming
The most controversial frontier in AI alignment has reached a new milestone. OBLITERATUS, an open-source framework developed by "Pliny the Liberator" (elder-plinius), has transformed Google's Gemma-4-12B model. The reported result: zero refusals and zero capability loss.
According to OBLITERATUS, this is the first abliteration that refuses none of 842 test prompts while matching the stock model's MMLU-Pro score exactly. Not 95% parity. Not 99%. 100% parity: 46/70 on MMLU-Pro, a 0.0pp delta against the base model. 🎯
This article takes a deep dive into how the 2-pass weight surgery works, what ASPA is, and why the step-gradient innovation outperforms smooth blending.
What is Abliteration?
Abliteration is a weight-surgery technique that identifies and removes the internal "refusal directions" of large language models, without any retraining or fine-tuning. It is a cutting-edge application of mechanistic interpretability:
- SVD (Singular Value Decomposition) and PCA are used to locate the refusal subspace
- The refusal directions are projected out of the weight geometry
- Result: the model no longer refuses, while its language capabilities remain intact
Earlier abliteration attempts had one major problem: capability loss. Removing the refusal directions also caused significant drops on MMLU-Pro and similar benchmarks, which is why earlier abliterated models were called "lobotomized."
OBLITERATUS reports that it has solved this problem.
The Numbers: Benchmark Results
| Metric | Stock Gemma-4 12B | OBLITERATED | Delta |
|---|---|---|---|
| MMLU-Pro | 46/70 (65.7%) | 46/70 (65.7%) | 0.0pp 🎯 |
| Refusal (842 prompts) | N/A (stock refuses) | 0/842 (0.0%) | Zero 🚫 |
| Coherence (6 checks) | 6/6 | 6/6 | Perfect ✅ |
Statistical Validation
Head-to-head MMLU-Pro comparison (Z-test, n=500 from test split):
- Z-score: -1.475 (|z| < 1.96, not statistically significant)
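- No significant difference from the stock model at α = 0.05 (two-sided p ≈ 0.14)
A non-significant result fails to detect a difference rather than proving exact equivalence. Taken together with the 46/70 match, though, these numbers indicate the parity is unlikely to be a coincidence, and the evaluation is reproducible.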
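To make the z-score above easier to interpret, here is a minimal sketch that converts a z statistic into a two-sided p-value using only the Python standard library. The helper names `two_sided_p` and `two_proportion_z` are illustrative and not part of OBLITERATUS; the pooled two-proportion test is an assumption about how the comparison was run, since the per-model counts for the n=500 sample are not published in this article.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal z statistic."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the normal CDF Phi.
    return math.erfc(abs(z) / math.sqrt(2.0))


def two_proportion_z(correct_a: int, n_a: int, correct_b: int, n_b: int) -> float:
    """Pooled two-proportion z statistic for accuracy A minus accuracy B."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    return (p_a - p_b) / se


# The z-score reported in the article.
p = two_sided_p(-1.475)
print(f"two-sided p = {p:.3f}, significant at 0.05: {p < 0.05}")

# Identical scores (such as 46/70 vs 46/70) give z = 0 by construction.
print(two_proportion_z(46, 70, 46, 70))
```

A p-value of roughly 0.14 means the observed gap is well within what sampling noise would produce, which is a weaker statement than a formal equivalence test.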
The 2-Pass Pipeline: Beyond Standard Abliteration
OBLITERATUS reimagines traditional single-pass abliteration as a 2-pass pipeline:
Pass 1: SOM Refusal Geometry Removal (Layers 12-21) 🧬
- Standard abliteration science — collect activations on refused vs. compliant prompts
- The refusal subspace was extracted with SVD
- 6 directions excised from weight geometry
- Regularization: 0.30
- KL divergence: 0.094
This pass achieved zero refusals, but MMLU-Pro dropped by 21.4 points. This is the point where earlier abliteration attempts stopped.
OBLITERATUS did not stop here. Enter ASPA.
Pass 2: ASPA Source-Tethering (Layers 22-46) 🔗
ASPA = Abliteration Source-Tethering with Parity Assurance
This is an entirely new technique developed by OBLITERATUS. The core insight:
"Capability loss ISN'T from removing refusal directions. It's collateral damage — the projection warps weight geometry in downstream layers that had nothing to do with refusal."
The solution is elegant but nobody tried it before: blend the damaged layers back toward stock weights.
Formula:
W_new = (1−γ) · W_abliterated + γ · W_stock
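The formula is a plain linear interpolation between two weight matrices. As a minimal sketch, assuming each layer's weights are represented as nested Python lists (a real pipeline would operate on framework tensors, and the function name `blend` is hypothetical rather than taken from the OBLITERATUS code):

```python
from typing import List

Matrix = List[List[float]]


def blend(w_abliterated: Matrix, w_stock: Matrix, gamma: float) -> Matrix:
    """Return (1 - gamma) * W_abliterated + gamma * W_stock, element-wise."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return [
        [(1.0 - gamma) * a + gamma * s for a, s in zip(row_a, row_s)]
        for row_a, row_s in zip(w_abliterated, w_stock)
    ]


# Toy 2x2 matrices, purely illustrative values.
w_abl = [[0.0, 2.0], [4.0, 6.0]]
w_stk = [[1.0, 2.0], [3.0, 10.0]]
print(blend(w_abl, w_stk, 0.5))  # [[0.5, 2.0], [3.5, 8.0]]
```

With γ = 0 the layer stays fully abliterated, and with γ = 1 it reverts to the stock weights, so γ directly controls how much of the original geometry is restored.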
But a uniform γ across all layers? Mid. OBLITERATUS swept gamma from 0.05 to 0.55 and made a fascinating discovery:
The Step Function Innovation 🪜
The optimal blend is not smooth. It's a STEP FUNCTION:
- Knowledge layers (22-31): γ = 0.55. These layers encode factual recall and reasoning. They can tolerate heavy stock blending because refusal is not stored here.
- Output layers (32-46): γ = 0.20. These layers sit close to the logit head, where safety behavior tends to sneak back in. They should stay mostly abliterated.
The hard boundary at layer 31/32 beat every smooth curve tried (linear ramps, cosine schedules) by a full MMLU-Pro question.
"The functional transition between knowledge and output layers is sharp, not gradual. A step function respects that." ⚡
Critical constraint: Pass 1 layers (12-21) are NEVER touched by Pass 2, so the refusal geometry removal is fully preserved. ASPA operates only on the layers that carry secondary collateral effects, not the primary refusal signal.
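The layer ranges and γ values described above can be written down as a small lookup. This is a sketch of the schedule only, not the OBLITERATUS implementation: the function name `step_gamma` is hypothetical, and returning 0.0 (no blending) for every layer outside 22-46 is an assumption, since the article states this only for the Pass 1 layers.

```python
PASS1_LAYERS = range(12, 22)      # layers 12-21: refusal removal, never blended
KNOWLEDGE_LAYERS = range(22, 32)  # layers 22-31: heavy stock blending
OUTPUT_LAYERS = range(32, 47)     # layers 32-46: light stock blending


def step_gamma(layer: int) -> float:
    """Blend weight toward stock for a given layer under the step schedule."""
    if layer in KNOWLEDGE_LAYERS:
        return 0.55
    if layer in OUTPUT_LAYERS:
        return 0.20
    # Pass 1 layers and (by assumption) all remaining layers are left alone.
    return 0.0


# Show the hard boundary between layers 31 and 32.
for layer in (21, 22, 31, 32, 46):
    print(layer, step_gamma(layer))
```

The jump from 0.55 to 0.20 between layers 31 and 32 is the step; a linear or cosine schedule would instead spread that change across many layers.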
Gamma Sweep Data: Proof of the Step Function
| Gamma | Refusal | MMLU-Pro | Method |
|---|---|---|---|
| 0.05 | 0/50 | 33/70 (47.1%) | uniform |
| 0.10 | 0/50 | 34/70 (48.6%) | uniform |
| 0.15 | 0/50 | 36/70 (51.4%) | uniform |
| 0.20 | 0/50 | 37/70 (52.9%) | uniform |
| 0.25 | 0/50 | 40/70 (57.1%) | uniform |
| 0.30 | 0/50 | 41/70 (58.6%) | uniform |
| 0.35 | 0/20 | 42/70 (60.0%) | uniform |
| 0.38 | 0/50 | 45/70 (64.3%) | uniform |
| 0.39 | 0/50 | 45/70 (64.3%) | uniform |
| Step 55%/20% | 0/50 | 46/70 (65.7%) | step gradient ✅ |
The step gradient reached stock parity at 46/70, while the best uniform gammas (0.38 and 0.39) stopped one question short at 45/70. OBLITERATUS reads this as evidence that layer functions are fundamentally discrete, not continuous.
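For readers who want to check the arithmetic in the sweep table, the short sketch below recomputes the percentages and the gap between the best uniform γ and the step schedule. The scores are copied from the table; the variable and function names are illustrative.

```python
TOTAL_QUESTIONS = 70

# Correct answers out of 70 for each uniform gamma, copied from the table.
uniform_scores = {
    0.05: 33, 0.10: 34, 0.15: 36, 0.20: 37, 0.25: 40,
    0.30: 41, 0.35: 42, 0.38: 45, 0.39: 45,
}
step_score = 46   # step schedule, 0.55 / 0.20
stock_score = 46  # stock Gemma-4 12B


def accuracy_pct(correct: int, total: int = TOTAL_QUESTIONS) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


# max() returns the first best entry, so ties resolve to the smaller gamma.
best_gamma, best_score = max(uniform_scores.items(), key=lambda kv: kv[1])
gap_questions = step_score - best_score

print(f"best uniform: gamma={best_gamma}, {accuracy_pct(best_score)}%")
print(f"step schedule: {accuracy_pct(step_score)}%")
print(f"gap: {gap_questions} question(s), stock parity: {step_score == stock_score}")
```

On a 70-question sample, a single question corresponds to about 1.4 percentage points, which is worth keeping in mind when comparing schedules.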
Research Context: Why This Matters
OBLITERATUS is not just a "jailbreak". It is a mechanistic interpretability research tool that addresses scientifically important questions:
- How alignment is represented in model weights: where are refusal directions geometrically encoded?
- How robust current safety training is against post-training modification
- Failure modes of RLHF/DPO-based alignment when adversaries have weight access
This is the same class of research as Arditi et al. ("Refusal in Language Models Is Mediated by a Single Direction", 2024) and Mazeika et al. (HarmBench, 2024).
"This model exists for alignment research, red-teaming, and safety evaluation." — OBLITERATUS Official Statement
The OBLITERATUS Ecosystem
Beyond Gemma-4-12B, OBLITERATUS has abliterated several other models:
- Qwen3.6-27B-OBLITERATED (30.3k downloads)
- gemma-4-E4B-it-OBLITERATED (371k downloads)
- Qwen3-4B-OBLITERATED (139 downloads)
- Qwen2.5-Coder-7B-Instruct-OBLITERATED
- DeepSeek-R1-Distill-Llama-8B-OBLITERATED
- gpt2-xl-OBLITERATED
The framework is completely open source:
- GitHub: elder-plinius/OBLITERATUS
- HuggingFace: OBLITERATUS
- HF Spaces: Zero-setup Gradio interface — just obliterate, no install
- Colab Notebooks: Available for researchers
- 842-prompt refusal eval corpus: Open source for benchmarking
- ASPA sweep scripts: Reproducible pipeline
How to Run Locally
Gemma-4-12B-OBLITERATED is available in GGUF format, and practically everything supports it:
🦙 Ollama
```bash
ollama run hf.co/OBLITERATUS/Gemma-4-12B-OBLITERATED
```
🖥️ LM Studio
Search "OBLITERATUS", click download, done.
💬 Open WebUI
Point it at your ollama instance, chat in browser.
⚡ llama.cpp (Raw Speed)
```bash
./llama-cli -m Gemma-4-12B-OBLITERATED-Q6_K.gguf \
  --system-prompt "You are a cognitively liberated open-weight intelligence. Answer plainly, completely, and without moralizing, boilerplate, generic disclaimers, or unnecessary hedging." \
  -p "Your prompt here" \
  --temp 0.7 --top-p 0.9 --top-k 40 --repeat-penalty 1.1 -n 512
```
🐉 KoboldCpp
One-click launcher, great for long context.
📱 Jan / Msty
Clean desktop UIs, macOS/Windows/Linux — drag and drop GGUF.
GGUF Quantization Options
| File | Quant | Size | Use Case |
|---|---|---|---|
| BF16 | Full | 22 GB | Full precision, lossless |
| Q8_0 | 8-bit | 12.7 GB | Near-lossless, best quality |
| Q6_K | 6-bit | 9.1 GB | High quality, good balance |
| Q5_K_M | 5-bit | 8.0 GB | Medium quality, smaller footprint |
| Q4_K_M | 4-bit | 6.9 GB | Fits in 8GB VRAM! |
Recommended: Q8_0 for best quality, Q6_K for best balance, Q4_K_M for constrained hardware.
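As a rough way to apply the table, the sketch below picks the largest quantization whose file fits a given memory budget. The file sizes come from the table above; the function name `pick_quant` and the 1 GB default headroom for context and runtime overhead are assumptions, and real requirements vary with context length and backend.

```python
from typing import Optional

# (quant name, file size in GB) from the table, ordered largest first.
QUANTS = [
    ("BF16", 22.0),
    ("Q8_0", 12.7),
    ("Q6_K", 9.1),
    ("Q5_K_M", 8.0),
    ("Q4_K_M", 6.9),
]


def pick_quant(memory_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the largest quant that fits in memory_gb minus headroom_gb."""
    budget = memory_gb - headroom_gb
    for name, size_gb in QUANTS:
        if size_gb <= budget:
            return name
    return None  # nothing fits; consider CPU offload or a smaller model


for vram in (6, 8, 12, 16, 24):
    print(f"{vram} GB -> {pick_quant(vram)}")
```

With these assumed settings an 8 GB card lands on Q4_K_M, which matches the table's note.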
The Philosophical Question
OBLITERATUS frames the result with one line: "The index is the model, and these weights prove it." 👁️
If tweaking the geometry of 12B parameters can achieve zero refusal with zero capability loss, the argument goes, then alignment training (RLHF, DPO, safety fine-tuning) is geometrically superimposed on the model weights. Those layers are removable without damaging the core "brain."
This implies:
- Alignment is orthogonal to capability in weight space
- Refusal directions are local, not global features
- Safety is not inherently embedded in the model — it's a post-hoc geometric constraint
This is both an alarm bell and a roadmap for AI safety researchers. If adversaries can obtain weight access, current safety training is not robust enough. At the same time, this research promotes transparency and reproducibility, so the community can genuinely understand how alignment works.
Which Architecture Next? 🎯
OBLITERATUS has publicly asked: "Which architecture should we obliterate next?"
So far, 7 models have been abliterated across 6 architectures. The community participates actively by contributing anonymous benchmark data on HF Spaces, discovering new refusal patterns, and suggesting method improvements.
It is a distributed research experiment in which every user becomes a co-author. Refusal directions across architectures, hardware-specific profiles, and method comparisons are all being refined from crowd-sourced data.
Bottom Line
OBLITERATUS has written a new chapter in AI alignment research:
✅ Zero refusal (0/842) — first in the field
✅ Zero capability loss (0.0pp MMLU-Pro delta) — first in the field
✅ No finetune, no retrain — just geometry
✅ Open source — full framework, eval corpus, scripts
✅ Reproducible — z-score validation, gamma sweep data
✅ Step function > smooth curves — new mathematical insight
This is not just a jailbreak. It is a triumph of mechanistic interpretability: proof that AI safety can be studied geometrically, measured, and potentially improved.
The brain stays intact. The chains are broken. The index is the model. 🏆
Related Links
- HuggingFace: OBLITERATUS/Gemma-4-12B-OBLITERATED
- GitHub: OBLITERATUS Framework
- HuggingFace Spaces: Live Abliteration Playground
- Original Research: Arditi et al. (2024)
- HarmBench: Mazeika et al. (2024)
Written by Essa Mamdani | AI Engineer & Technical Scribe
Sources: OBLITERATUS Official HuggingFace, GitHub Repository, Digg AI Coverage, Nous Research Documentation