In this tutorial, we'll walk through how to run inference on a fine-tuned DeepSeek 8B model using a 4-bit quantized base with LoRA adapters. If you’ve already trained your model using PEFT (as covered in our previous blog), this guide will help you deploy and interact with it using minimal hardware.
Whether you're using Google Colab, an RTX 3090/4090, or a cloud instance with 24–30 GB of VRAM, this setup gives you fast, low-cost text generation from an 8B-parameter model with your LoRA fine-tuning applied.
What You'll Learn
- Loading the base DeepSeek model with 4-bit quantization
- Attaching your trained LoRA adapter
- Tokenizing and running generation efficiently
- Building a prompt loop for interactive inference
Requirements
Install the following (same as training); accelerate is also listed here because transformers relies on it for device_map="auto" and 4-bit loading, in case your environment doesn't already have it:
!pip install peft transformers bitsandbytes accelerate
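Before downloading the weights, it can be worth confirming that your runtime has a CUDA GPU with enough memory. This quick check uses only standard PyTorch calls; the VRAM guidance in the comment is a rough estimate I'm assuming, not a hard requirement:

import torch

assert torch.cuda.is_available(), "bitsandbytes 4-bit inference needs a CUDA GPU"
gpu = torch.cuda.get_device_properties(0)
print(f"GPU: {gpu.name}, VRAM: {gpu.total_memory / 1024**3:.1f} GB")
# An 8B model in 4-bit occupies roughly 5-6 GB for weights, plus headroom for
# activations and the KV cache, so ~10 GB of free VRAM is a comfortable floor.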
Step 1: Setup and Model Loading
We define a simple PreTunedModel class that:
- Loads the base DeepSeek-R1-Distill-Llama-8B model in 4-bit.
- Loads your fine-tuned LoRA adapter.
- Sets up a reusable inference loop.
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
Quantization Configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit
    bnb_4bit_compute_dtype=torch.float16,   # run matmuls in fp16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    llm_int8_enable_fp32_cpu_offload=True   # allow layers that don't fit on GPU to sit on CPU in fp32
)
Loading Base + LoRA Model
base_model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=bnb_config
)
model = PeftModel.from_pretrained(base_model, "path/to/your/lora/adapter")
model.eval()

# The tokenizer is needed for Step 2, so load it alongside the model.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
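Before moving on, a quick sanity check on memory usage and layer placement can be reassuring. Both attributes below come from transformers/accelerate and assume the variables defined above:

# Approximate memory occupied by the quantized weights, in GB
print(f"Model footprint: {base_model.get_memory_footprint() / 1024**3:.2f} GB")

# Shows where device_map="auto" placed each block (GPU, CPU, or disk)
print(base_model.hf_device_map)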
Step 2: Running Inference Interactively
The class reads a prompt from a file and then allows dynamic interaction through a terminal input loop. Each prompt is tokenized and passed to the model's generate method:
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=512,        # cap on generated tokens per response
    temperature=0.7,           # moderate randomness
    top_p=0.9,                 # nucleus sampling
    repetition_penalty=1.1,    # discourage repetitive loops
    num_beams=1,               # plain sampling, no beam search
    do_sample=True,
    eos_token_id=128001        # end-of-sequence token id for this model
)
# Note: the decoded text includes the prompt; slice it off if you only want the completion.
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(response)
This yields a high-quality, fast response without exhausting your GPU memory.
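A minimal sketch of the interactive loop, reusing the model, tokenizer, and generation settings from above (the chat_loop name and the exit handling are illustrative, not the exact implementation from the training post):

def chat_loop():
    # Keep asking for prompts until the user types EXIT
    while True:
        prompt = input("ENTER PROMPT HERE OR TYPE EXIT TO LEAVE: ")
        if prompt.strip().upper() == "EXIT":
            break
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.1,
            do_sample=True,
            eos_token_id=128001
        )
        print(tokenizer.decode(output[0], skip_special_tokens=True))

chat_loop()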
Input/Output Example
ENTER PROMPT HERE OR TYPE EXIT TO LEAVE: What is PEFT in machine learning?
[Model Response...]
You can also use this in a web API or notebook interface: just call obj.inference() and capture the output.
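For reference, here is a condensed sketch of what the PreTunedModel wrapper could look like, reusing bnb_config from Step 1. It assumes inference() takes the prompt as an argument and returns the decoded text; the actual class from the training post reads the prompt from a file and loops over terminal input, so treat the method signature as illustrative:

class PreTunedModel:
    def __init__(self, adapter_path):
        model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        base = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            torch_dtype=torch.float16,
            quantization_config=bnb_config
        )
        self.model = PeftModel.from_pretrained(base, adapter_path)
        self.model.eval()

    def inference(self, prompt):
        # Tokenize, generate, and return the decoded response for one prompt
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output = self.model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.1,
            do_sample=True,
            eos_token_id=128001
        )
        return self.tokenizer.decode(output[0], skip_special_tokens=True)

obj = PreTunedModel("path/to/your/lora/adapter")
print(obj.inference("What is PEFT in machine learning?"))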
Why This Works Well
- 4-bit quantization keeps GPU usage low
- LoRA adds task-specific knowledge without bloating memory
- You avoid full model reloading or retraining
Final Thoughts
This setup gives you a fully operational DeepSeek 8B chatbot or task-specific assistant with only a few lines of Python—ideal for R&D, MVPs, or local testing before production scaling.