Local AI, Weight Modification: Modifying a Model from Within

A personal experiment to understand what really happens inside a language model. What's about This project stems from my curiosity to see how it works internally — not through an application or API, but the raw model, its weights, and internal architecture.

A personal experiment to understand what really happens inside a language model.

What It’s About

I’ve always used AI as a tool. This project arises from my curiosity to see how it works internally — not through an application or API, but the raw model, its weights, and internal architecture.

Heretic is an open-source tool that allows you to modify the behavior of a language model (LLM) using a technique called abliteration: instead of retraining the entire model from scratch, it intervenes on its internal activations to change how it responds to certain types of inputs.

The result is a modified model saved and published on Hugging Face.

What I Did, Step by Step

1. Local Environment on Windows with GPU

I set up a Python environment with a virtual environment, installed PyTorch with CUDA support to leverage my NVIDIA GPU. No cloud, no paid services: everything runs on my machine.

Python 3.13 · PyTorch 2.10 + CUDA 12.8 · Windows 11

2. Understanding What Heretic Does Under the Hood

Before running any command, I studied the process. Heretic:

loads the model and its weights (in Transformers/HuggingFace format)
uses two prompt datasets: one “innocuous” (mlabonne/harmless_alpaca) and one “problematic” (mlabonne/harmful_behaviors)
analyzes the internal activations of the transformer on both sets
calculates a latent space direction that separates the two behaviors
optimizes parameters through 200 trials with Optuna (Bayesian optimization)
applies the correction via LoRA — a lightweight addition to the model’s weights

In practice: the model is not rewritten, it is oriented.

3. The Model Chosen: TinyLlama 1.1B

I chose TinyLlama/TinyLlama-1.1B-Chat-v1.0 as a starting point — small enough to run comfortably locally (~5-15 minutes on GPU), but capable enough to be interesting to observe.

I also ran sessions on larger models (Mistral 7B, Phi-3) for comparison.

4. Publishing on Hugging Face

The resulting model was saved in the Transformers format and published publicly on Hugging Face:

paoloronco/TinyLlama-1.1B-Chat-v1.0-heretic

Included is a Colab notebook for testing without installation.

What I Learned

On the Functioning of Language Models

This project gave me concrete insights into concepts that were previously abstract:

Models do not “decide” to reject something symbolically — they do it because certain activation patterns repeat statistically. That tendency can be measured.
The latent space is not a metaphor: it’s a real mathematical structure, with identifiable directions corresponding to observable behaviors.
LoRA (Low-Rank Adaptation) allows you to modify a large model by intervening on only a small fraction of its parameters — it’s efficient and reversible compared to full fine-tuning.

On Working with Technical Tools

I learned how to read technical documentation and translate it into concrete, actionable steps
I managed a Python environment on Windows with complex dependencies (CUDA, PyTorch, Transformers)
I worked with Hugging Face repositories: file structure, model card, tokenizer, configurations
I understood the difference between distribution formats: Transformers, GGUF (for LM Studio), safetensors

On the Method

The project required research, reading papers and documentation, diagnosing errors (GPU not detected, incompatible dependencies, batch size), and adapting the process accordingly. No ready-made tutorial — just documentation and trial-and-error.

What This Project Is Not

Honest: I am not an AI engineer or a programmer. I used existing tools, AI assistance to navigate technical documentation, and my ability to understand what I was doing before doing it.

This is not a development project — it’s a project of exploration and learning. The value for me lies in the knowledge acquired, not the code written.

Repository Structure

Heretic/
├── Models/                         # GitHub repo for published models
│   ├── TinyLlama-1.1B-Chat-v1.0-heretic/
│   │   ├── Notebooks/              # Quick test Colab notebook
│   │   ├── config.json             # Model configuration
│   │   ├── generation_config.json
│   │   ├── chat_template.jinja     # Chat template
│   │   └── tokenizer_config.json
│   └── README.md
├── checkpoints/                    # Saved optimization sessions
│   ├── TinyLlama--TinyLlama-1--1B-Chat-v1--0.jsonl
│   ├── mistralai--Mistral-7B-Instruct-v0--2.jsonl
│   ├── ollama--phi3.jsonl
│   └── openai--gpt-4o.jsonl
├── .venv/                          # Local Python environment
└── info.md                         # Full operational guide

Technical Stack

Tool	Role
Python 3.13 + venv	Isolated environment
PyTorch 2.10 + CUDA 12.8	GPU NVIDIA computation
Transformers (HuggingFace)	Model loading and management
PEFT / LoRA	Applying the modifications
Optuna	Bayesian parameter optimization
Datasets (HuggingFace)	Good/bad prompt datasets
Safetensors	Model saving format
Accelerate	Device management (GPU/CPU)
HuggingFace Hub	Publishing and distribution