Running Inference on a LoRA Fine-Tuned DeepSeek 8B Model (4-bit Quantized)

This guide shows how to run inference on a LoRA fine-tuned DeepSeek 8B model using 4-bit quantization for efficient performance on modest GPUs. Learn how to load the base model, apply your trained LoRA adapter, and generate responses interactively—perfect for local testing, prototyping, or building lightweight AI tools.

In this tutorial, we'll walk through how to run inference on a fine-tuned DeepSeek 8B model using a 4-bit quantized base with LoRA adapters. If you’ve already trained your model using PEFT (as covered in our previous blog), this guide will help you deploy and interact with it using minimal hardware.

Whether you're using Google Colab, an RTX 3090/4090, or a cloud instance with 24–30 GB of VRAM, this setup gives you fast, low-cost text generation: the 4-bit quantized base keeps memory usage modest, while the LoRA adapter supplies the task-specific behavior you trained.


What You'll Learn

  • Loading the base DeepSeek model with 4-bit quantization
  • Attaching your trained LoRA adapter
  • Tokenizing and running generation efficiently
  • Prompt loop for interactive inference

Requirements

Install the following (the same stack used for training; accelerate is required for device_map="auto"):

!pip install peft transformers bitsandbytes accelerate
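Optionally, run a quick sanity check that a GPU is visible and the libraries import cleanly. This is purely diagnostic and not required by the rest of the guide:

# Optional: confirm CUDA is available and report library versions.
import torch
import transformers, peft, bitsandbytes

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("transformers", transformers.__version__,
      "| peft", peft.__version__,
      "| bitsandbytes", bitsandbytes.__version__)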


Step 1: Setup and Model Loading

We define a simple PreTunedModel wrapper (a full class sketch appears at the end of Step 2) that:

  • Loads the base DeepSeek-R1-Distill-Llama-8B model in 4-bit.
  • Loads your fine-tuned LoRA adapter.
  • Sets up a reusable inference loop.

Start with the imports:

from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch


Quantization Configuration

# 4-bit NF4 quantization with double quantization; compute runs in fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_compute_dtype=torch.float16,   # do the matmuls in fp16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    llm_int8_enable_fp32_cpu_offload=True   # allow modules that don't fit on GPU to sit on CPU in fp32
)


Loading Base + LoRA Model

# Load the quantized base model; device_map="auto" spreads layers across the
# available GPU(s) and, if needed, the CPU (enabled by the offload flag above).
base_model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=bnb_config
)

# Attach the trained LoRA adapter on top of the frozen 4-bit base.
model = PeftModel.from_pretrained(base_model, "path/to/your/lora/adapter")
model.eval()

# The generation step below also needs the tokenizer; load it from the base
# checkpoint (or from your adapter directory if you saved tokenizer files there).
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")


Step 2: Running Inference Interactively

The class reads an initial prompt from a file and then allows dynamic interaction through a terminal input loop. The core of each turn is a single generate call:

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=512,     # cap on generated tokens per turn
    temperature=0.7,        # sampling temperature
    top_p=0.9,              # nucleus sampling
    repetition_penalty=1.1,
    num_beams=1,
    do_sample=True,
    eos_token_id=128001     # hard-coded EOS id; tokenizer.eos_token_id is a safer, model-agnostic choice
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

This yields a high-quality, fast response without exhausting your GPU memory.
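The surrounding input loop isn't shown above; a minimal sketch of it, assuming the model and tokenizer objects loaded in Step 1, looks like this:

# Minimal interactive loop (sketch): keep prompting until the user types "exit".
while True:
    prompt = input("ENTER PROMPT HERE OR TYPE EXIT TO LEAVE: ")
    if prompt.strip().lower() == "exit":
        break
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.1,
            do_sample=True,
            eos_token_id=tokenizer.eos_token_id
        )
    # Print only the newly generated tokens, not the echoed prompt.
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))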


Input/Output Example

ENTER PROMPT HERE OR TYPE EXIT TO LEAVE: What is PEFT in machine learning?
[Model Response...]

You can also use this in a web API or notebook interface—wrap the loading and generation code in a reusable object and call something like obj.inference() to capture the output.
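The full class isn't reproduced here, but a minimal sketch of such a wrapper—reusing the bnb_config from Step 1; the class name, method name, and defaults are illustrative—could look like this:

class PreTunedModel:
    """Sketch of a reusable wrapper: load once in __init__, generate via inference()."""

    def __init__(self, base_id, adapter_path):
        self.tokenizer = AutoTokenizer.from_pretrained(base_id)
        base = AutoModelForCausalLM.from_pretrained(
            base_id,
            device_map="auto",
            torch_dtype=torch.float16,
            quantization_config=bnb_config
        )
        self.model = PeftModel.from_pretrained(base, adapter_path)
        self.model.eval()

    def inference(self, prompt, max_new_tokens=512):
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            output = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.7,
                top_p=0.9,
                repetition_penalty=1.1,
                do_sample=True,
                eos_token_id=self.tokenizer.eos_token_id
            )
        # Return only the generated continuation, without the echoed prompt.
        return self.tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )

obj = PreTunedModel("deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "path/to/your/lora/adapter")
print(obj.inference("What is PEFT in machine learning?"))

Loading happens once in __init__, so repeated inference() calls stay fast.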


Why This Works Well

  • 4-bit quantization keeps GPU memory usage low (a quick check is shown below)
  • LoRA adds task-specific knowledge without bloating the memory footprint
  • You avoid full-model reloading or retraining
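To verify the savings on your own hardware, here is a quick optional check you can run after loading the model (numbers vary by GPU and settings):

# Rough memory check after loading the quantized base + adapter.
print(f"Model footprint:  {base_model.get_memory_footprint() / 1e9:.2f} GB")
print(f"Peak CUDA memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")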


Final Thoughts

This setup gives you a fully operational DeepSeek 8B chatbot or task-specific assistant with only a few lines of Python—ideal for R&D, MVPs, or local testing before production scaling.