Running Inference on a LoRA Fine-Tuned DeepSeek 8B Model (4-bit Quantized)

This guide shows how to run inference on a LoRA fine-tuned DeepSeek 8B model using 4-bit quantization for efficient performance on modest GPUs. Learn how to load the base model, apply your trained LoRA adapter, and generate responses interactively—perfect for local testing, prototyping, or building lightweight AI tools.

In this tutorial, we'll walk through how to run inference on a fine-tuned DeepSeek 8B model using a 4-bit quantized base with LoRA adapters. If you’ve already trained your model using PEFT (as covered in our previous blog), this guide will help you deploy and interact with it using minimal hardware.

Whether you're using Google Colab, an RTX 3090/4090, or a cloud instance with 24–30 GB of VRAM, this setup gives you fast, low-cost text generation: the 4-bit quantized base keeps memory usage modest, while the LoRA adapter supplies the task-specific behavior you trained.


What You'll Learn

  • Loading the base DeepSeek model with 4-bit quantization
  • Attaching your trained LoRA adapter
  • Tokenizing and running generation efficiently
  • Prompt loop for interactive inference

Requirements

Install the following (the same stack used for training; accelerate is required for device_map="auto"):

!pip install peft transformers bitsandbytes accelerate
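Optionally, run a quick sanity check that a GPU is visible and the libraries import cleanly. This is purely diagnostic and not required by the rest of the guide:

# Optional: confirm CUDA is available and report library versions.
import torch
import transformers, peft, bitsandbytes

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("transformers", transformers.__version__,
      "| peft", peft.__version__,
      "| bitsandbytes", bitsandbytes.__version__)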


Step 1: Setup and Model Loading

We define a simple PreTunedModel wrapper (a full class sketch appears at the end of Step 2) that:

  • Loads the base DeepSeek-R1-Distill-Llama-8B model in 4-bit.
  • Loads your fine-tuned LoRA adapter.
  • Sets up a reusable inference loop.

Start with the imports:

from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch


Quantization Configuration

# 4-bit NF4 quantization with double quantization; compute runs in fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_compute_dtype=torch.float16,   # do the matmuls in fp16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    llm_int8_enable_fp32_cpu_offload=True   # allow modules that don't fit on GPU to sit on CPU in fp32
)


Loading Base + LoRA Model

# Load the quantized base model; device_map="auto" spreads layers across the
# available GPU(s) and, if needed, the CPU (enabled by the offload flag above).
base_model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=bnb_config
)

# Attach the trained LoRA adapter on top of the frozen 4-bit base.
model = PeftModel.from_pretrained(base_model, "path/to/your/lora/adapter")
model.eval()

# The generation step below also needs the tokenizer; load it from the base
# checkpoint (or from your adapter directory if you saved tokenizer files there).
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")


Step 2: Running Inference Interactively

The class reads an initial prompt from a file and then allows dynamic interaction through a terminal input loop. The core of each turn is a single generate call:

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=512,     # cap on generated tokens per turn
    temperature=0.7,        # sampling temperature
    top_p=0.9,              # nucleus sampling
    repetition_penalty=1.1,
    num_beams=1,
    do_sample=True,
    eos_token_id=128001     # hard-coded EOS id; tokenizer.eos_token_id is a safer, model-agnostic choice
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

This yields a high-quality, fast response without exhausting your GPU memory.
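The surrounding input loop isn't shown above; a minimal sketch of it, assuming the model and tokenizer objects loaded in Step 1, looks like this:

# Minimal interactive loop (sketch): keep prompting until the user types "exit".
while True:
    prompt = input("ENTER PROMPT HERE OR TYPE EXIT TO LEAVE: ")
    if prompt.strip().lower() == "exit":
        break
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.1,
            do_sample=True,
            eos_token_id=tokenizer.eos_token_id
        )
    # Print only the newly generated tokens, not the echoed prompt.
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))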


Input/Output Example

ENTER PROMPT HERE OR TYPE EXIT TO LEAVE: What is PEFT in machine learning?
[Model Response...]

You can also use this in a web API or notebook interface—wrap the loading and generation code in a reusable object and call something like obj.inference() to capture the output.
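The full class isn't reproduced here, but a minimal sketch of such a wrapper—reusing the bnb_config from Step 1; the class name, method name, and defaults are illustrative—could look like this:

class PreTunedModel:
    """Sketch of a reusable wrapper: load once in __init__, generate via inference()."""

    def __init__(self, base_id, adapter_path):
        self.tokenizer = AutoTokenizer.from_pretrained(base_id)
        base = AutoModelForCausalLM.from_pretrained(
            base_id,
            device_map="auto",
            torch_dtype=torch.float16,
            quantization_config=bnb_config
        )
        self.model = PeftModel.from_pretrained(base, adapter_path)
        self.model.eval()

    def inference(self, prompt, max_new_tokens=512):
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            output = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.7,
                top_p=0.9,
                repetition_penalty=1.1,
                do_sample=True,
                eos_token_id=self.tokenizer.eos_token_id
            )
        # Return only the generated continuation, without the echoed prompt.
        return self.tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )

obj = PreTunedModel("deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "path/to/your/lora/adapter")
print(obj.inference("What is PEFT in machine learning?"))

Loading happens once in __init__, so repeated inference() calls stay fast.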


Why This Works Well

  • 4-bit quantization keeps GPU memory usage low (a quick check is shown below)
  • LoRA adds task-specific knowledge without bloating the memory footprint
  • You avoid full-model reloading or retraining
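To verify the savings on your own hardware, here is a quick optional check you can run after loading the model (numbers vary by GPU and settings):

# Rough memory check after loading the quantized base + adapter.
print(f"Model footprint:  {base_model.get_memory_footprint() / 1e9:.2f} GB")
print(f"Peak CUDA memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")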


Final Thoughts

This setup gives you a fully operational DeepSeek 8B chatbot or task-specific assistant with only a few lines of Python—ideal for R&D, MVPs, or local testing before production scaling.