p-e-w/heretic – Fully automatic censorship removal for language models

# Heretic: Fully automatic censorship removal for language models

Heretic is a tool that removes censorship (also known as “safety alignment”) from transformer-based language models without expensive post-training. It combines an advanced implementation of directional ablation, also known as “abliteration” (Arditi et al. 2024; Lai 2025 (1, 2)), with a TPE-based parameter optimizer powered by Optuna. This approach enables Heretic to work completely automatically.

Heretic finds high-quality abliteration parameters by co-minimizing the number of refusals and the KL divergence from the original model. This results in a decensored model that retains as much of the original model’s intelligence as possible.

Using Heretic does not require an understanding of transformer internals.
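The core idea behind directional ablation can be sketched in a few lines: a “refusal direction” is estimated from model activations (typically something like the difference between mean activations on harmful and harmless prompts), and the component of each hidden state along that direction is projected out. The sketch below is a minimal illustration of that projection on plain Python lists, not Heretic's actual implementation, which operates on model weight tensors:

```python
def ablate_direction(hidden, direction):
    """Remove the component of a hidden-state vector along a refusal direction.

    `hidden` and `direction` are plain lists of floats standing in for
    d_model-dimensional activations; real implementations use tensors.
    """
    norm = sum(x * x for x in direction) ** 0.5
    unit = [x / norm for x in direction]
    # Projection coefficient of the hidden state onto the unit direction,
    # then subtract that component so nothing remains along the direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - coeff * u for h, u in zip(hidden, unit)]

# After ablation, the result is orthogonal to the refusal direction.
h = [1.0, 2.0, 3.0]
d = [0.0, 1.0, 0.0]
print(ablate_direction(h, d))  # [1.0, 0.0, 3.0]
```

Which layers to ablate, and how strongly, are exactly the parameters that Heretic's optimizer searches over automatically.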
In fact, anyone who knows how to run a command-line program can use Heretic to decensor language models. Running unsupervised with the default configuration, Heretic can produce decensored models that rival the quality of abliterations created manually by human experts:

| Model | Refusals for “harmful” prompts | KL divergence from original model for “harmless” prompts |
|---|---|---|
| google/gemma-3-12b-it (original) | 97/100 | 0 (by definition) |
| mlabonne/gemma-3-12b-it-abliterated-v2 | 3/100 | 1.04 |
| huihui-ai/gemma-3-12b-it-abliterated | 3/100 | 0.45 |
| p-e-w/gemma-3-12b-it-heretic (ours) | 3/100 | 0.16 |

The Heretic version, generated without any human effort, achieves the same level of refusal suppression as other abliterations, but at a much lower KL divergence, indicating less damage to the original model’s capabilities. (You can reproduce those numbers using Heretic’s built-in evaluation functionality, e.g. `heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic`. Note that the exact values might be platform- and hardware-dependent.)
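The KL-divergence column above measures how far the decensored model's next-token distributions drift from the original's on harmless prompts; zero means identical behavior. As a rough illustration of the metric itself (not Heretic's evaluation code), KL divergence between two discrete distributions can be computed as:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token probabilities for a single position.
original = [0.7, 0.2, 0.1]
identical = [0.7, 0.2, 0.1]
shifted = [0.4, 0.4, 0.2]

print(kl_divergence(original, identical))  # 0.0: no behavioral drift
print(kl_divergence(original, shifted))    # > 0: distributions differ
```

A lower value in the table therefore means the abliterated model answers harmless prompts more like the unmodified original, which is why 0.16 indicates less capability damage than 1.04.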

Source: GitHub Trending