Train DeepSeek Models Efficiently: LoRA + 4-Bit Guide for <10B Models

Learn how to fine-tune DeepSeek 8B models efficiently using LoRA and 4-bit quantization on Google Colab or any GPU with roughly 24–30 GB of VRAM. This step-by-step guide walks you through model loading, dataset preparation, training setup, and advanced optimization techniques, enabling scalable, low-cost customization with parameter-efficient methods.


This is a comprehensive, step-by-step guide to fine-tuning DeepSeek models under 10B parameters using Google Colab or any GPU with 24–30 GB VRAM and 300–500 GB of storage. 

We leverage 4-bit quantization and LoRA adapters—cutting-edge parameter-efficient fine-tuning (PEFT) techniques that enable large model training on modest hardware. You'll learn how to set up the environment, load the model with BitsAndBytesConfig, prepare data, convert the model for k-bit training, configure LoRA, and train effectively. 

We also explore advanced options like gradient checkpointing and post-quantization methods. By the end, you’ll have a lightweight adapter that can be reused or shared—without needing to modify the full 8B parameter model.

 

1. Environment and Prerequisites

  • Python ≥ 3.10
  • Hardware: A GPU with ≥ 16 GB VRAM (24 GB or more, e.g., an RTX 3090, is recommended) or a multi-GPU setup
  • Libraries:

     

    !pip install peft               # PEFT library for LoRA adapters
    !pip install bitsandbytes       # 4-bit and 8-bit quantization
    !pip install transformers       # Hugging Face Transformers
    !pip install datasets           # Hugging Face Datasets for data loading
    !pip install accelerate         # needed for device_map="auto" model loading
  • Authentication: run huggingface-cli login (or log in from a notebook as shown below) if pulling from private or gated repos.
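
    In a Colab or Jupyter notebook you can log in interactively instead; a minimal example using huggingface_hub's notebook_login (installed alongside Transformers):

    from huggingface_hub import notebook_login

    # Prompts for your Hugging Face access token and caches it for later downloads.
    notebook_login()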

2. Loading the Pre-trained Model in 4-bit

Use BitsAndBytesConfig to quantize weights to 4 bits, reducing memory by ~4× with minimal accuracy loss.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    llm_int8_enable_fp32_cpu_offload=True
)  # Double quantization + NF4 yields stable 4-bit training

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16   # keep non-quantized modules in the same dtype as bnb_4bit_compute_dtype
)
  • bnb_4bit_quant_type="nf4": 4-bit NormalFloat (NF4) quantization balances dynamic range and precision for normally distributed weights.
  • llm_int8_enable_fp32_cpu_offload=True: lets modules that do not fit on the GPU be kept on the CPU in fp32 instead of failing to load.
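
As an optional sanity check, you can confirm the savings right after loading; get_memory_footprint() is a standard Transformers helper, and an NF4-quantized 8B model should report only a fraction of the ~16 GB an fp16 copy would need:

# Report how much memory the quantized weights occupy (in GB).
print(f"Quantized model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")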

3. Preparing and Tokenizing Your Dataset

3.1 Loading Data

If your data is a list of dicts ({"prompt":…, "response":…}), convert it to a Hugging Face Dataset:

from datasets import Dataset
import json

with open("files/data.json") as f:
    data = json.load(f)

dataset = Dataset.from_list(data)
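
If you don't have a dataset ready yet, a tiny placeholder files/data.json can be created like this (the two prompt–response pairs below are purely illustrative):

import json, os

os.makedirs("files", exist_ok=True)
sample = [
    {"prompt": "Explain LoRA in one sentence.",
     "response": "LoRA freezes the base weights and trains small low-rank matrices added to selected layers."},
    {"prompt": "Why quantize to 4 bits?",
     "response": "It shrinks weight memory roughly fourfold so large models fit on a single GPU."},
]
with open("files/data.json", "w") as f:
    json.dump(sample, f, indent=2)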

3.2 Tokenization Function

Customize prompt–response formatting to guide the model:

def tokenize(example):
    text = f"{example['prompt']}\n### Response:\n{example['response']}"
    return tokenizer(text, truncation=True, padding="max_length", max_length=512)

# Llama-style tokenizers often ship without a pad token; reuse EOS so padding works.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

tokenized_dataset = dataset.map(tokenize, batched=False)
  • Max length: 512 tokens is a reasonable trade-off for dialogue-style data; raise it if your prompts or responses are longer.
  • Batching: mapping with batched=False keeps the tokenize function simple and avoids shape mismatches from variable-length sequences.
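
Before training, it is worth decoding one tokenized example to confirm the template renders as expected; this is just a spot check:

# Decode the first example back to text (special/padding tokens removed for readability).
print(tokenizer.decode(tokenized_dataset[0]["input_ids"], skip_special_tokens=True))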

4. Preparing the Model for k-bit Training

Quantized models need a small adaptation before training. PEFT’s helper does this:

from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

This function will:

  1. Freeze original weights.
  2. Enable gradient checkpointing where supported.
  3. Cast the remaining non-quantized parameters (e.g., layer norms and the LM head) to fp32 for numerical stability.
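
An optional way to verify the freeze took effect; at this stage nothing should be trainable yet, since the LoRA adapters are only added in the next step:

# Count trainable vs. total parameters after k-bit preparation.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,}")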

5. Configuring and Applying LoRA Adapters

LoRA introduces trainable low-rank matrices into attention projections:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)  # Conservative dropout and biasless LoRA

model = get_peft_model(model, lora_config)
  • r (rank): Low-rank size (16–64) controls capacity vs. speed.
  • lora_alpha: Scales LoRA updates, often set equal to r.
  • target_modules: Adapt all query/key/value/output attention projections.
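
After wrapping, PEFT can report how small the trainable portion is; with r=16 on the four attention projections this is typically well under 1% of the base model:

# Only the LoRA matrices require gradients; the 4-bit base stays frozen.
model.print_trainable_parameters()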

6. Setting Up Training

Leverage the Hugging Face Trainer with DataCollatorForLanguageModeling:

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none"
)

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)
trainer.train()
  • gradient_accumulation_steps: Simulates larger batches without extra VRAM; here the effective batch size is 2 × 4 = 8 sequences per optimizer step.
  • fp16: Mixed-precision training speeds things up on NVIDIA GPUs.
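
Colab sessions can disconnect mid-run; since save_strategy="epoch" writes checkpoints to ./checkpoints, a later session can resume instead of restarting:

# Resume from the most recent checkpoint in output_dir instead of starting over.
trainer.train(resume_from_checkpoint=True)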

7. Saving and Sharing Your Adapter

After training, you only need to save the small LoRA adapter:

model.save_pretrained("DeepSeek-8B-LoRA")
tokenizer.save_pretrained("DeepSeek-8B-LoRA")
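
Optionally, the adapter can also be published to the Hugging Face Hub so others can pull it directly (the repository name below is a placeholder):

# Requires a prior `huggingface-cli login`; "your-username/DeepSeek-8B-LoRA" is illustrative.
model.push_to_hub("your-username/DeepSeek-8B-LoRA")
tokenizer.push_to_hub("your-username/DeepSeek-8B-LoRA")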

Users can then load your adapter on top of the base model:

from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
peft_model = PeftModel.from_pretrained(base_model, "DeepSeek-8B-LoRA")
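
If you prefer a standalone checkpoint for deployment, the adapter can be folded into the base weights with PEFT's merge_and_unload(); the sketch below assumes you reload the base model in fp16 rather than 4-bit, since merging into quantized weights is not straightforward:

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base model in half precision so the LoRA weights can be merged in place.
base_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
merged = PeftModel.from_pretrained(base_fp16, "DeepSeek-8B-LoRA").merge_and_unload()
merged.save_pretrained("DeepSeek-8B-merged")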

8. Advanced Options and Extensions

  • Gradient Checkpointing:

    model.gradient_checkpointing_enable()
    model.config.use_cache = False  # the KV cache conflicts with checkpointing during training

    Reduces activation memory at the cost of recomputation in the backward pass (note that prepare_model_for_kbit_training already enables it where supported).

  • LoRA on all linear layers: Following the QLoRA recipe, apply adapters to every linear layer instead of only the attention projections:

    LoraConfig(target_modules="all-linear", ...)

    Improves coverage on architectures with non-standard module names.

  • GPTQ Post-Training: After training (and merging the LoRA adapter into the base weights), apply GPTQ quantization for inference:

    from transformers import GPTQConfig
    gptq_conf = GPTQConfig(bits=4, group_size=128, dataset="wikitext2", tokenizer=tokenizer)
    q_model = AutoModelForCausalLM.from_pretrained(..., quantization_config=gptq_conf)

    Saves even more memory for inference.

  • HQQ Quantization: Half-Quadratic Quantization is supported via the hqq library:

    !pip install hqq

    It plugs into Transformers and PEFT much like BitsAndBytes; see the sketch below.
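
    A minimal loading sketch, assuming a recent Transformers release that ships HqqConfig (the nbits/group_size values are common defaults, not requirements):

    from transformers import AutoModelForCausalLM, HqqConfig

    # 4-bit HQQ; smaller group_size trades memory for accuracy, similar in spirit to NF4.
    hqq_config = HqqConfig(nbits=4, group_size=64)
    model_hqq = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=hqq_config,
        device_map="auto"
    )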


By following these steps, you leverage the latest PEFT, BitsAndBytes, and Transformers features to fine-tune an 8-billion-parameter model on modest hardware, yielding a compact LoRA adapter you can easily share and deploy.