This is a comprehensive, step-by-step guide to fine-tuning DeepSeek models under 10B parameters using Google Colab or any machine with a 24–30 GB VRAM GPU and 300–500 GB of storage.
We leverage 4-bit quantization and LoRA adapters, cutting-edge parameter-efficient fine-tuning (PEFT) techniques that enable large-model training on modest hardware. You'll learn how to set up the environment, load the model with BitsAndBytesConfig, prepare data, convert the model for k-bit training, configure LoRA, and train effectively.
We also explore advanced options like gradient checkpointing and post-training quantization. By the end, you'll have a lightweight adapter that can be reused or shared without modifying the full 8B-parameter base model.
1. Environment and Prerequisites
- Python ≥ 3.10
- Hardware: a GPU with ≥ 16 GB VRAM (24 GB, e.g., an RTX 3090, gives more headroom for the 8B model) or a multi-GPU setup
- Libraries:
!pip install peft          # PEFT library for LoRA adapters
!pip install bitsandbytes  # 4-bit and 8-bit quantization
!pip install transformers  # Hugging Face Transformers
!pip install datasets      # Hugging Face Datasets for data loading
!pip install accelerate    # required for device_map="auto" model placement
- Authentication: run huggingface-cli login if pulling models or datasets from private repos.
2. Loading the Pre-trained Model in 4-bit
Use BitsAndBytesConfig to quantize weights to 4 bits, reducing weight memory by roughly 4× relative to fp16 with minimal accuracy loss.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    llm_int8_enable_fp32_cpu_offload=True
)  # double quantization + NF4 yields stable 4-bit training
model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16  # keep non-quantized modules in fp16 to match the compute dtype above
)
bnb_4bit_quant_type="nf4"
: Nibble-float quant (“NF4”) balances range and precision Hugging Facefp32_cpu_offload
: Offloads key weights to CPU in fp32 for stability Hugging Face
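As a quick sanity check (optional; this assumes the Llama-style module layout of the distilled 8B checkpoint), you can confirm that the 4-bit load shrank the footprint and that the attention projections were replaced with quantized layers:
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")   # roughly a quarter of the fp16 size
print(model.model.layers[0].self_attn.q_proj)                             # expect a bitsandbytes Linear4bit module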
3. Preparing and Tokenizing Your Dataset
3.1 Loading Data
If your data is a list of dicts ({"prompt": …, "response": …}), convert it to a Hugging Face Dataset:
from datasets import Dataset
import json
with open("files/data.json") as f:
data = json.load(f)
dataset = Dataset.from_list(data)
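If you don't have a files/data.json yet, the same structure can be built in memory; the two examples below are hypothetical and only illustrate the expected prompt/response fields:
from datasets import Dataset
data = [
    {"prompt": "What is LoRA?", "response": "A parameter-efficient fine-tuning method that trains small low-rank matrices."},
    {"prompt": "Why quantize to 4 bits?", "response": "It shrinks the model's memory footprint so it fits on a single GPU."},
]
dataset = Dataset.from_list(data)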
3.2 Tokenization Function
Customize prompt–response formatting to guide the model:
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # ensure a pad token exists for padding="max_length"
def tokenize(example):
    text = f"{example['prompt']}\n### Response:\n{example['response']}"
    return tokenizer(text, truncation=True, padding="max_length", max_length=512)
tokenized_dataset = dataset.map(tokenize, batched=False)
- Max length: 512 tokens is a good trade-off between context and memory for dialogue tasks.
- Batching: mapping with batched=False keeps the formatting logic simple and avoids shape errors from variable-length sequences.
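A quick check (optional) that the mapping produced fixed-length inputs alongside the original columns:
print(len(tokenized_dataset[0]["input_ids"]))   # 512, since every example is padded/truncated to max_length
print(tokenized_dataset.column_names)           # prompt/response plus input_ids and attention_mask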
4. Preparing the Model for k-bit Training
Quantized models need a small adaptation before training. PEFT’s helper does this:
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)
This function will:
- Freeze original weights.
- Enable gradient checkpointing where supported.
- Ensure all submodules use training-compatible dtypes (e.g., layer norms are upcast to fp32).
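A quick check (optional, and assuming the model prepared above) confirms the freeze: before LoRA is attached, no parameters should require gradients.
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters before LoRA: {n_trainable:,}")  # expected: 0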
5. Configuring and Applying LoRA Adapters
LoRA introduces trainable low-rank matrices into attention projections:
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)  # conservative dropout and bias-free LoRA
model = get_peft_model(model, lora_config)
- r (rank): the low-rank dimension (typically 16–64) controls adapter capacity vs. speed and memory.
- lora_alpha: scales the LoRA updates; commonly set to r or 2×r (here 32 with r=16).
- target_modules: adapts all query/key/value/output projections.
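After wrapping, PEFT can report how small the trainable footprint actually is; with the configuration above only the LoRA matrices (a fraction of a percent of the 8B weights) are updated:
model.print_trainable_parameters()
# prints the trainable vs. total parameter counts, e.g. a few tens of millions trainable out of ~8B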
6. Setting Up Training
Leverage the Hugging Face Trainer with DataCollatorForLanguageModeling:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
training_args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none"
)
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)
trainer.train()
- gradient_accumulation_steps: simulates a larger effective batch size (here 2 × 4 = 8) without extra VRAM.
- fp16: mixed-precision training speeds things up and saves memory on NVIDIA GPUs.
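Once training finishes, a short generation pass is a useful smoke test; the prompt below is arbitrary and simply mirrors the formatting used during tokenization:
model.eval()
prompt = "Explain LoRA in one sentence.\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))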
7. Saving and Sharing Your Adapter
After training, you only need to save the small LoRA adapter:
model.save_pretrained("DeepSeek-8B-LoRA")
tokenizer.save_pretrained("DeepSeek-8B-LoRA")
Users can then load your adapter on top of the base model:
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
peft_model = PeftModel.from_pretrained(base_model, "DeepSeek-8B-LoRA")
8. Advanced Options and Extensions
Gradient Checkpointing:
model.gradient_checkpointing_enable()
Reduces activation memory at the cost of recomputation during the backward pass.
QLoRA (“all-linear”): Apply LoRA to every linear layer:
LoraConfig(target_modules="all-linear", ...)
Improves coverage on non-standard architectures; a fuller config is sketched below.
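A fuller config using this shorthand might look like the sketch below (same hyperparameters as in section 5; requires a recent PEFT release that recognizes "all-linear"):
from peft import LoraConfig
lora_all_linear = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",   # every linear layer except the output head
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)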
GPTQ Post-Training: After LoRA, apply GPTQ quantization:
from transformers import GPTQConfig
gptq_conf = GPTQConfig(bits=4, group_size=128, dataset="wikitext2", tokenizer=tokenizer)
q_model = AutoModelForCausalLM.from_pretrained(..., quantization_config=gptq_conf)
Saves even more memory for inference.
HQQ Quantization: Half-Quadratic Quantization is supported through the hqq library (install it with pip install hqq) and works similarly to BitsAndBytes + PEFT; see the sketch below.
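A minimal sketch, assuming a recent transformers release that ships HqqConfig (the 4-bit setting mirrors the BitsAndBytes setup above; the group size is an arbitrary choice):
from transformers import HqqConfig
hqq_config = HqqConfig(nbits=4, group_size=64)
hqq_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=hqq_config,
    device_map="auto"
)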
By following these steps, you leverage the latest PEFT, BitsAndBytes, and Transformers features to fine-tune an 8-billion-parameter model on modest hardware, yielding a compact LoRA adapter you can easily share and deploy.