This is a comprehensive, step-by-step guide to fine-tuning DeepSeek models under 10B parameters using Google Colab or any machine with a 24–30 GB VRAM GPU and 300–500 GB of storage.
We leverage 4-bit quantization and LoRA adapters, cutting-edge parameter-efficient fine-tuning (PEFT) techniques that enable large-model training on modest hardware. You'll learn how to set up the environment, load the model with BitsAndBytesConfig, prepare data, convert the model for k-bit training, configure LoRA, and train effectively.
We also explore advanced options like gradient checkpointing and post-training quantization. By the end, you'll have a lightweight adapter that can be reused or shared without modifying the full 8B-parameter base model.
1. Environment and Prerequisites
- Python ≥ 3.10
- Hardware: a GPU with ≥ 16 GB VRAM (24 GB, e.g., an RTX 3090, gives more headroom for the 8B model) or a multi-GPU setup
- Libraries:
!pip install peft          # PEFT library for LoRA adapters
!pip install bitsandbytes  # 4-bit and 8-bit quantization
!pip install transformers  # Hugging Face Transformers
!pip install datasets      # Hugging Face Datasets for data loading
!pip install accelerate    # required for device_map="auto" model placement
- Authentication: run huggingface-cli login if pulling models or datasets from private repos.
2. Loading the Pre-trained Model in 4-bit
Use BitsAndBytesConfig to quantize weights to 4 bits, reducing weight memory by roughly 4× relative to fp16 with minimal accuracy loss.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    llm_int8_enable_fp32_cpu_offload=True
)  # double quantization + NF4 yields stable 4-bit training
model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16  # keep non-quantized modules in fp16 to match the compute dtype above
)
bnb_4bit_quant_type="nf4"
: Nibble-float quant (“NF4”) balances range and precision Hugging Facefp32_cpu_offload
: Offloads key weights to CPU in fp32 for stability Hugging Face
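As a quick sanity check (optional; this assumes the Llama-style module layout of the distilled 8B checkpoint), you can confirm that the 4-bit load shrank the footprint and that the attention projections were replaced with quantized layers:
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")   # roughly a quarter of the fp16 size
print(model.model.layers[0].self_attn.q_proj)                             # expect a bitsandbytes Linear4bit module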
3. Preparing and Tokenizing Your Dataset
3.1 Loading Data
If your data is a list of dicts ({"prompt": …, "response": …}), convert it to a Hugging Face Dataset:
from datasets import Dataset
import json
with open("files/data.json") as f:
data = json.load(f)
dataset = Dataset.from_list(data)
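If you don't have a files/data.json yet, the same structure can be built in memory; the two examples below are hypothetical and only illustrate the expected prompt/response fields:
from datasets import Dataset
data = [
    {"prompt": "What is LoRA?", "response": "A parameter-efficient fine-tuning method that trains small low-rank matrices."},
    {"prompt": "Why quantize to 4 bits?", "response": "It shrinks the model's memory footprint so it fits on a single GPU."},
]
dataset = Dataset.from_list(data)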
3.2 Tokenization Function
Customize prompt–response formatting to guide the model:
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # ensure a pad token exists for padding="max_length"
def tokenize(example):
    text = f"{example['prompt']}\n### Response:\n{example['response']}"
    return tokenizer(text, truncation=True, padding="max_length", max_length=512)
tokenized_dataset = dataset.map(tokenize, batched=False)
- Max length: 512 tokens is a good trade-off between context and memory for dialogue tasks.
- Batching: mapping with batched=False keeps the formatting logic simple and avoids shape errors from variable-length sequences.
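A quick check (optional) that the mapping produced fixed-length inputs alongside the original columns:
print(len(tokenized_dataset[0]["input_ids"]))   # 512, since every example is padded/truncated to max_length
print(tokenized_dataset.column_names)           # prompt/response plus input_ids and attention_mask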
4. Preparing the Model for k-bit Training
Quantized models need a small adaptation before training. PEFT’s helper does this:
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)
This function will:
- Freeze original weights.
- Enable gradient checkpointing where supported.
- Ensure all submodules use training-compatible dtypes (e.g., layer norms are upcast to fp32).
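A quick check (optional, and assuming the model prepared above) confirms the freeze: before LoRA is attached, no parameters should require gradients.
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters before LoRA: {n_trainable:,}")  # expected: 0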
5. Configuring and Applying LoRA Adapters
LoRA introduces trainable low-rank matrices into attention projections:
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)  # conservative dropout and bias-free LoRA
model = get_peft_model(model, lora_config)
- r (rank): the low-rank dimension (typically 16–64) controls adapter capacity vs. speed and memory.
- lora_alpha: scales the LoRA updates; commonly set to r or 2×r (here 32 with r=16).
- target_modules: adapts all query/key/value/output projections.
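After wrapping, PEFT can report how small the trainable footprint actually is; with the configuration above only the LoRA matrices (a fraction of a percent of the 8B weights) are updated:
model.print_trainable_parameters()
# prints the trainable vs. total parameter counts, e.g. a few tens of millions trainable out of ~8B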
6. Setting Up Training
Leverage the Hugging Face Trainer with DataCollatorForLanguageModeling:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
training_args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none"
)
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)
trainer.train()
- gradient_accumulation_steps: simulates a larger effective batch size (here 2 × 4 = 8) without extra VRAM.
- fp16: mixed-precision training speeds things up and saves memory on NVIDIA GPUs.
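Once training finishes, a short generation pass is a useful smoke test; the prompt below is arbitrary and simply mirrors the formatting used during tokenization:
model.eval()
prompt = "Explain LoRA in one sentence.\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))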
7. Saving and Sharing Your Adapter
After training, you only need to save the small LoRA adapter:
model.save_pretrained("DeepSeek-8B-LoRA")
tokenizer.save_pretrained("DeepSeek-8B-LoRA")
Users can then load your adapter on top of the base model:
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
peft_model = PeftModel.from_pretrained(base_model, "DeepSeek-8B-LoRA")
8. Advanced Options and Extensions
Gradient Checkpointing:
model.gradient_checkpointing_enable()
Reduces activation memory at the cost of recomputation during the backward pass.
QLoRA (“all-linear”): Apply LoRA to every linear layer:
LoraConfig(target_modules="all-linear", ...)
Improves coverage on non-standard architectures; a fuller config is sketched below.
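A fuller config using this shorthand might look like the sketch below (same hyperparameters as in section 5; requires a recent PEFT release that recognizes "all-linear"):
from peft import LoraConfig
lora_all_linear = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",   # every linear layer except the output head
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)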
GPTQ Post-Training: After LoRA, apply GPTQ quantization:
from transformers import GPTQConfig
gptq_conf = GPTQConfig(bits=4, group_size=128, dataset="wikitext2", tokenizer=tokenizer)
q_model = AutoModelForCausalLM.from_pretrained(..., quantization_config=gptq_conf)
Saves even more memory for inference.
HQQ Quantization: Half-Quadratic Quantization is supported through the hqq library (install it with pip install hqq) and works similarly to BitsAndBytes + PEFT; see the sketch below.
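A minimal sketch, assuming a recent transformers release that ships HqqConfig (the 4-bit setting mirrors the BitsAndBytes setup above; the group size is an arbitrary choice):
from transformers import HqqConfig
hqq_config = HqqConfig(nbits=4, group_size=64)
hqq_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=hqq_config,
    device_map="auto"
)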
By following these steps, you leverage the latest PEFT, BitsAndBytes, and Transformers features to fine-tune an 8-billion-parameter model on modest hardware, yielding a compact LoRA adapter you can easily share and deploy.