How to Fine-Tune LLMs for Custom Tasks 2026 — A Practical Step-by-Step Guide

Overview

Fine-tuning is the most practical way to make a general-purpose language model excel at your specific task. While prompt engineering works for broad use cases, fine-tuning delivers measurably better results when you need consistent behavior in a narrow domain — medical coding, customer support classification, legal document review, or custom code generation in proprietary languages.

The landscape has changed dramatically since 2024. In 2026, fine-tuning a model is affordable, fast, and accessible to anyone with basic Python skills. The cost to fine-tune a 7B-parameter model dropped from ~$300 in 2024 to under $5 in 2026 (using Unsloth on a single consumer GPU). The entire process — from dataset preparation to deployed model — takes 2-4 hours.

What you’ll learn:

Dataset preparation and formatting for supervised fine-tuning (SFT)
LoRA/QLoRA configuration for memory-efficient training
Training with Unsloth, Axolotl, and MLX frameworks
Evaluation and iteration strategies
Model export and deployment (OpenAI-compatible API, Ollama, ONNX)
Real example: fine-tuning a model for customer email classification

Time investment: 2-4 hours for a production-ready fine-tune

Prerequisites

Requirement	Minimum	Recommended
Python	3.11+	3.12+
GPU memory	8GB VRAM (7B model)	24GB VRAM (13B-30B models; 70B needs 40GB+)
GPU type	RTX 3070 / MPS (Apple Silicon)	RTX 4090 / A10G / M3 Max
Framework	Unsloth (easiest)	Axolotl / MLX
Storage	50GB free	100GB+ SSD
Dataset	500+ examples	2,000-10,000 examples
Budget	$0 (local) / $3-5 (RunPod/TensorDock)	$10-50 (Lambda/AWS)

Vendor costs (cloud, per fine-tune session):

Provider	GPU	7B Model	13B Model	70B Model
RunPod	RTX 4090	$0.54/hr	$0.54/hr	—
TensorDock	A100 80GB	$1.35/hr	$1.35/hr	$1.35/hr
Lambda Labs	A100 80GB	$1.10/hr	$1.10/hr	$1.10/hr
Google Colab (Free / Pro)	T4/L4	Free-$10/mo	$10/mo	—
Apple Silicon	M3 Max (128GB)	Free (local)	—	—

Cost breakdown for a typical run:

Training time: ~45 min (7B model, 1,000 examples, 3 epochs)
Cloud cost: ~$0.50 on RunPod
Total: Under $1 per fine-tune
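The arithmetic behind this estimate is GPU time multiplied by the hourly rate. A minimal sketch, using the RunPod RTX 4090 rate from the table above; the function name is illustrative, and pro-rata billing by the minute is an assumption:

```python
def run_cost(minutes: float, usd_per_hour: float) -> float:
    """Cloud cost of one training run, assuming pro-rata billing by the minute."""
    return minutes / 60 * usd_per_hour

# 45 minutes on a $0.54/hr RTX 4090
cost = run_cost(45, 0.54)
print(f"Training-time cost: about ${cost:.3f}")
```

This gives roughly $0.40 for the training time alone, which is consistent with the ~$0.50 figure once you allow for setup time on the instance.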

Step 1: Prepare Your Dataset

Dataset quality is the single most important factor in fine-tuning success. A clean, consistently labeled dataset of 500 examples often outperforms a noisy dataset of 50,000.

1.1 Data Format (ChatML)

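The standard format in 2026 is the chat-messages format, often loosely called ChatML: each example is a list of messages, and each message has a role (system, user, or assistant) and a content string: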

{"messages": [
  {"role": "system", "content": "You are a support agent for AcmeCorp. Classify customer emails into: billing, technical, account, or general."},
  {"role": "user", "content": "I was charged $49.99 but your plan page shows $29.99. Can you fix this?"},
  {"role": "assistant", "content": "billing"}
]}

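Each line in your JSONL file is one example, so every example must sit on a single line; the example above is wrapped across several lines only for readability. For more complex tasks, have the assistant turn return structured JSON: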

{"messages": [
  {"role": "system", "content": "Extract structured data from customer support emails."},
  {"role": "user", "content": "Order #12345 arrived damaged. I need a replacement or refund. My order was placed on May 15."},
  {"role": "assistant", "content": "{\"intent\": \"return_or_refund\", \"order_id\": \"12345\", \"issue_type\": \"damaged_goods\", \"customer_request\": \"replacement_or_refund\", \"order_date\": \"2026-05-15\"}"}
]}
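Because JSONL requires one JSON object per line, it is safer to write the file with a serializer than by hand. A minimal sketch using only the standard library; the helper names and the output file name are assumptions for illustration:

```python
import json

examples = [
    {"messages": [
        {"role": "system", "content": "Classify customer emails into: billing, technical, account, or general."},
        {"role": "user", "content": "I was charged $49.99 but your plan page shows $29.99."},
        {"role": "assistant", "content": "billing"},
    ]},
]

def to_jsonl(records) -> str:
    """Serialize each record as compact JSON on its own line."""
    # json.dumps escapes embedded newlines, so each record stays on one line
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"

def write_jsonl(path: str, records) -> None:
    """Write records to a JSONL file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(to_jsonl(records))

print(to_jsonl(examples), end="")
```

Calling `write_jsonl("training_data.jsonl", examples)` would produce a file that `load_dataset("json", ...)` can read directly.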

1.2 Dataset Size Guidelines

Task Type	Min Examples	Recommended	Notes
Classification (3-5 labels)	100	500-2,000	Fewer labels = fewer examples needed
Extraction (structured output)	200	1,000-3,000	Diverse formats matter more than volume
Generation (custom responses)	500	2,000-10,000	Need to cover range of tones, lengths, styles
Code generation (proprietary lang)	300	1,000-5,000	Include error cases and edge cases
Chat/Roleplay	1,000	5,000-50,000	Quality over quantity — hand-curated is better

1.3 Data Quality Checklist

Before training, validate your dataset:

No duplicate entries (run sort -u on the JSONL file; plain uniq only removes adjacent duplicates)
Balanced label distribution (if classification)
At least 10% edge cases and tricky examples
Consistent format — assistant role should always contain the target output
No truncated examples — each message fits within target context length
Diverse language — avoid overusing the same sentence templates
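Several of these checks can be automated. The sketch below covers duplicates, a missing or empty assistant target, and label distribution; it is one possible validator rather than a standard tool, and the function name and messages are illustrative:

```python
import json
from collections import Counter

def check_dataset(lines):
    """Return (problems, label_counts) for an iterable of JSONL lines."""
    problems, labels, seen = [], Counter(), set()
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            msgs = json.loads(line)["messages"]
        except (json.JSONDecodeError, KeyError, TypeError):
            problems.append((n, "not a JSON object with a 'messages' key"))
            continue
        if not msgs or msgs[-1].get("role") != "assistant":
            problems.append((n, "last message is not from the assistant"))
            continue
        target = msgs[-1].get("content", "").strip()
        if not target:
            problems.append((n, "empty assistant target"))
            continue
        key = json.dumps(msgs, sort_keys=True)  # canonical form for duplicate detection
        if key in seen:
            problems.append((n, "duplicate example"))
            continue
        seen.add(key)
        labels[target] += 1
    return problems, labels
```

To use it, pass an open file, for example `check_dataset(open("training_data.jsonl", encoding="utf-8"))`, then inspect the label counts for imbalance before training.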

1.4 Synthetic Data Generation

If you don’t have enough real data, generate synthetic examples using a strong model:

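# Build the prompt for synthetic data generation.
# Send the returned string to the strong model of your choice.
def build_synthetic_prompt(topic: str, labels: list[str], n: int = 50) -> str:
    label_list = ", ".join(labels)
    return f"""
Generate {n} examples of customer emails about "{topic}" with correct classification labels.
Each example should include:
- A realistic email text (2-5 sentences)
- The correct label from: {label_list}
- Edge cases that might confuse a classifier

Format as JSONL: one JSON object per line, each with a 'messages' array.

Examples of edge cases:
- Mixed billing and technical issues
- Customer using technical terminology incorrectly
- Urgent language ("I need this NOW!")
"""

prompt = build_synthetic_prompt("billing disputes", ["billing", "technical", "account", "general"])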

Use GPT-5.5 or Claude to generate, then manually review 10-20% for quality before training.
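Drawing the review subset at random, rather than reading the first few rows, avoids checking only the generator's earliest output. A minimal sketch; the function name, default fraction, and seed are assumptions:

```python
import random

def review_sample(examples, fraction=0.15, seed=0):
    """Pick a reproducible random subset of examples for manual review."""
    examples = list(examples)
    if not examples:
        return []
    k = max(1, round(len(examples) * fraction))
    # A seeded generator returns the same subset on every run
    return random.Random(seed).sample(examples, k)
```

With 1,000 synthetic examples and the default fraction, this returns 150 examples to read by hand.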

Step 2: Choose Your Base Model

Model	Size	Best For	VRAM (QLoRA)	Quality
Llama 4 (Meta)	8B	General purpose, good English	6GB	★★★★☆
DeepSeek V4 Flash	14B	Cost-effective, 1M context	10GB	★★★★☆
Mistral Small 3	7B	Fast inference, multilingual	5GB	★★★★☆
Qwen 4	7B / 14B	Strong Chinese + English	6-10GB	★★★★★
Gemma 3	12B	Google ecosystem, reasoning	8GB	★★★★☆
Llama 4 (Meta)	70B	Best quality, full capability	40GB+ (4-bit weights alone are ~35GB)	★★★★★

Recommendation for first-time fine-tuners: Start with DeepSeek V4 Flash (14B) or Mistral Small 3 (7B). Both are accessible, well-supported by fine-tuning frameworks, and produce excellent results for domain-specific tasks. The 7-14B range is the sweet spot: good quality without requiring enterprise-grade hardware.
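A quick way to sanity-check whether a model fits your GPU is to estimate the memory taken by the quantized weights alone. This is a back-of-the-envelope lower bound, not a measurement: it assumes 4-bit weights, counts 1 GB as 10^9 bytes, and ignores activations, LoRA adapters, optimizer state, and framework overhead.

```python
def weight_memory_gb(params_billion: float, bits: int = 4) -> float:
    """Memory for the quantized base weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits / 8 / 1e9

for size in (7, 14, 70):
    # Lower bound only: real QLoRA training needs headroom on top of this
    print(f"{size}B at 4-bit: ~{weight_memory_gb(size):.1f} GB for weights")
```

Actual training usage is higher than these figures, so leave several gigabytes of headroom when choosing between model sizes.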

Step 3: Set Up Training Environment

3.1 Local Setup (Apple Silicon M-series)

Apple’s MLX framework is the most efficient option for M-series Macs:

# Install MLX + MLX-LM
pip install mlx mlx-lm

# Or use Unsloth (supports Apple Silicon via MPS partially)
pip install unsloth

3.2 Local Setup (NVIDIA GPU)

# Install Unsloth (recommended: simplest API)
# CUDA- and torch-specific extras are listed in the Unsloth README
pip install unsloth

# Or use Axolotl (more configurable)
pip install axolotl

3.3 Cloud Setup (RunPod, TensorDock)

# On RunPod template: "unsloth-training"
# Pre-installed with CUDA 12.6, torch 2.7, unsloth
# Just upload your dataset and config

Step 4: Run the Fine-Tune

4.1 Using Unsloth (Easiest Path)

Unsloth reduces memory usage by 50-70% compared to standard Hugging Face training. Here’s a complete fine-tuning script:

from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# 1. Load base model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B-bnb-4bit",  # Pre-quantized; confirm the exact repo name on the Hugging Face Hub
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# 2. Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                      # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,             # Scaling factor
    lora_dropout=0,            # 0 uses Unsloth's optimized path; try ~0.05 if you see overfitting
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
    use_rslora=True,           # Rank-Stabilized LoRA (better quality)
    loftq_config=None,         # LoftQ quantization (skip, use QLoRA)
)

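# 3. Load your dataset and render each conversation to one training string
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

def to_text(example):
    # Apply the model's chat template; tokenize=False returns a plain string
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

dataset = dataset.map(to_text)

# 4. Configure training
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text", # Column produced by to_text above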
    max_seq_length=2048,
    dataset_num_proc=4,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=4,    # Increase with more VRAM
        gradient_accumulation_steps=4,     # Effective batch = 16
        warmup_steps=20,
        num_train_epochs=3,               # 3 epochs is typical sweet spot
        learning_rate=2e-4,               # LoRA learning rate
        logging_steps=10,
        save_steps=100,
        output_dir="outputs",
        report_to="none",
    ),
)

# 5. Train
trainer.train()

# 6. Save adapter weights
model.save_pretrained("finetuned-lora-adapters")
tokenizer.save_pretrained("finetuned-lora-adapters")

4.2 Using MLX (for Apple Silicon)

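# mlx-lm exposes LoRA training through a command-line entry point.
# --data points to a directory containing train.jsonl and valid.jsonl,
# both in the same {"messages": [...]} format shown in Step 1.
# validation_data.jsonl is a placeholder name for a held-out slice of your data.
mkdir -p data
cp training_data.jsonl data/train.jsonl
cp validation_data.jsonl data/valid.jsonl

# 500 iterations x batch size 3 = 1,500 examples seen,
# about 1.5 passes over a 1,000-example dataset.
# --num-layers 20 applies LoRA to the last 20 layers.
mlx_lm.lora \
    --model mlx-community/DeepSeek-V4-Flash-mlx \
    --train \
    --data ./data \
    --batch-size 3 \
    --iters 500 \
    --learning-rate 1e-4 \
    --num-layers 20 \
    --save-every 250 \
    --adapter-path ./adapters

# LoRA rank, scale, and dropout are set under lora_parameters in a YAML
# config passed with --config. Run mlx_lm.lora --help for the options in
# your installed version (older releases use --lora-layers instead of
# --num-layers).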

4.3 Key Training Hyperparameters

Parameter	Recommended Value	Notes
LoRA rank (r)	16-32	Higher = more adaptivity, more memory
LoRA alpha	16	Typically equals rank
Learning rate	1e-4 to 3e-4	Higher for LoRA than full fine-tune
Batch size	4-16 (effective)	Gradient accumulation for larger effective
Epochs	2-5	3 is good starting point
Max sequence length	2048	Must cover your longest examples; anything longer is truncated
Warmup steps	20-50	Prevents early instability
Weight decay	0.01	Standard regularization
Optimizer	AdamW (8-bit)	Memory-efficient with bitsandbytes
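Batch size, gradient accumulation, and epochs together determine how many optimizer steps a run takes, which in turn tells you whether the warmup and save intervals are sensible. A small sketch of that arithmetic, using the values from the Unsloth script above; the function name is illustrative, and the trainer's own rounding may differ by a step or two:

```python
import math

def training_steps(n_examples, per_device_batch, grad_accum, epochs, n_gpus=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

eff, total = training_steps(1000, per_device_batch=4, grad_accum=4, epochs=3)
print(eff, total)  # 16 189
```

With about 189 steps in total, 20 warmup steps is roughly 10% of the run, and a save interval of 100 steps produces one intermediate checkpoint.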

Step 5: Evaluate Your Fine-Tune

5.1 Qualitative Evaluation

Run test prompts and compare with base model:

from unsloth import FastLanguageModel

# Load fine-tuned adapters
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="finetuned-lora-adapters",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Test
test_prompts = [
    {"role": "user", "content": "My account was hacked. Someone changed my password."},
    {"role": "user", "content": "Can you tell me your pricing for enterprise?"},
    {"role": "user", "content": "I want to cancel my subscription."},
]

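# Switch Unsloth to its inference mode before generating
FastLanguageModel.for_inference(model)

for prompt in test_prompts:
    # Use the same system prompt you trained with for a fair comparison
    messages = [
        {"role": "system", "content": "Classify customer emails."},
        prompt
    ]
    # add_generation_prompt=True appends the assistant header so the model answers
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to("cuda")
    outputs = model.generate(input_ids=inputs, max_new_tokens=64)
    # Decode only the newly generated tokens, not the echoed prompt
    response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
    print(f"Prompt: {prompt['content']}")
    print(f"Response: {response.strip()}")
    print("---")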

5.2 Quantitative Evaluation

For classification tasks, build a test set and measure:

from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

def evaluate(test_data_path, model, tokenizer):
    test_data = load_dataset("json", data_files=test_data_path, split="train")
    predictions = []
    ground_truth = []

    for example in test_data:
        messages = example["messages"]
        true_label = messages[-1]["content"]
        ground_truth.append(true_label)

        # Feed everything except the gold answer, then ask for the assistant turn
        inputs = tokenizer.apply_chat_template(
            messages[:-1], add_generation_prompt=True, return_tensors="pt"
        ).to("cuda")
        # Greedy decoding keeps the evaluation deterministic
        outputs = model.generate(input_ids=inputs, max_new_tokens=16, do_sample=False)
        # Keep only the generated tokens and drop end-of-turn markers
        pred = tokenizer.decode(
            outputs[0][inputs.shape[1]:], skip_special_tokens=True
        ).strip()
        predictions.append(pred)

    print(f"Accuracy: {accuracy_score(ground_truth, predictions):.3f}")
    print(f"F1 Score: {f1_score(ground_truth, predictions, average='weighted'):.3f}")
    print(f"Confusion Matrix:\n{confusion_matrix(ground_truth, predictions)}")

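Expected results: On narrow classification tasks, fine-tuning often lifts accuracy by 20-40 percentage points over the base model, and a well-tuned model can reach 90-95% on a held-out test set from the same domain. Actual gains depend on your data and on how well the base model already handles the task.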
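Exact string comparison penalizes outputs such as "Billing." that are correct but formatted differently, so it helps to map raw generations onto the closed label set before scoring. A minimal sketch, assuming the four labels from the Step 1 example; the function name and the "unknown" fallback are illustrative choices:

```python
LABELS = ("billing", "technical", "account", "general")

def normalize_label(raw: str, labels=LABELS, fallback="unknown") -> str:
    """Map free-form model output onto a closed label set."""
    text = raw.strip().lower()
    if text in labels:
        return text
    # Accept outputs like "Billing." or "Label: technical" that mention exactly one label
    hits = [label for label in labels if label in text]
    return hits[0] if len(hits) == 1 else fallback
```

Applying this to each prediction before calling the metric functions keeps formatting noise out of the accuracy figure, while ambiguous outputs still count as errors through the fallback label.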

Step 6: Export and Deploy

6.1 Merge LoRA Weights

For production, merge LoRA adapters into the base model:

from unsloth import FastLanguageModel

# Load the fine-tuned adapters; the base model is resolved from the adapter config
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="finetuned-lora-adapters",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Merge the LoRA weights into the base model and save a standalone 16-bit checkpoint
model.save_pretrained_merged("final-model", tokenizer, save_method="merged_16bit")

6.2 Deploy with vLLM (OpenAI-Compatible API)

# Install vLLM
pip install vllm

# Serve the model (--served-model-name sets the name clients pass as "model")
python -m vllm.entrypoints.openai.api_server \
    --model ./final-model \
    --served-model-name final-model \
    --port 8000 \
    --max-model-len 4096 \
    --tensor-parallel-size 1

# Now you can use any OpenAI-compatible client
# curl http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d '{"model": "final-model", "messages": [{"role": "user", "content": "Classify: ..."}]}'
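The same request can be sent from Python without extra dependencies. This sketch uses only the standard library and assumes the server is reachable at localhost:8000 and reports the model as "final-model" (check /v1/models if unsure); the function names and the system prompt are illustrative:

```python
import json
import urllib.request

def build_request(email: str, base_url: str = "http://localhost:8000",
                  model: str = "final-model") -> urllib.request.Request:
    """Build a chat-completions request for an OpenAI-compatible server."""
    payload = {
        "model": model,  # assumed name; must match what the server reports
        "messages": [
            {"role": "system", "content": "Classify customer emails."},
            {"role": "user", "content": email},
        ],
        "max_tokens": 16,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def classify(email: str) -> str:
    """Send the request; requires the vLLM server to be running."""
    with urllib.request.urlopen(build_request(email), timeout=30) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip()
```

With the server up, `classify("I was charged twice this month.")` returns the model's label as a string.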

6.3 Deploy with Ollama (Local)

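# Convert to GGUF format with the scripts from the llama.cpp repository
# (Unsloth's model.save_pretrained_gguf is an alternative)
git clone https://github.com/ggml-org/llama.cpp
pip install -r llama.cpp/requirements.txt
python llama.cpp/convert_hf_to_gguf.py ./final-model --outfile final-model-f16.gguf --outtype f16
# Quantize with the llama-quantize binary built from llama.cpp
llama-quantize final-model-f16.gguf final-model-q4_k_m.gguf Q4_K_M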

# Create Ollama Modelfile
echo 'FROM ./final-model-q4_k_m.gguf
PARAMETER temperature 0.2' > Modelfile

# Create and run
ollama create my-finetuned-model -f Modelfile
ollama run my-finetuned-model

6.4 Export to ONNX (Cross-Platform)

# Use optimum to export (the output directory is a positional argument)
pip install "optimum[onnxruntime]"
optimum-cli export onnx \
    --model ./final-model \
    --task text-generation-with-past \
    ./onnx-model

Real Example: Customer Email Classifier

We built a customer email classifier for an e-commerce company. Here are the results:

Base model: DeepSeek V4 Flash (14B)
Training data: 2,500 manually classified customer emails
Labels: billing (35%), technical (25%), account (20%), shipping (12%), other (8%)
Training time: 52 minutes on a single RTX 4090
Cost: $0.47 (RunPod RTX 4090)

Metric	Before Fine-Tune	After Fine-Tune	Improvement
Accuracy	72.3%	94.1%	+21.8 pts
F1 (weighted)	0.68	0.93	+0.25
Billing precision	0.76	0.96	+0.20
Technical precision	0.71	0.92	+0.21
Edge case handling	45%	88%	+43 pts
Inference latency	120ms	125ms	Negligible

The fine-tuned model now processes 10,000+ support emails per day in production with 94% accuracy, reducing manual triage by 83%.

Best Practices

Data Quality

More ≠ better: 1,000 high-quality examples beat 50,000 noisy ones
Balance your labels: If 80% of your data is one class, the model will over-predict it
Include edge cases: The model learns what you show it — include tricky examples explicitly
Format consistently: Every example should use the exact same ChatML structure

Training Configuration

Start small: Run a quick 100-iteration test to check for errors before doing a full run
Monitor loss: Loss should decrease steadily. Spikes indicate bad data or learning rate issues
Early stopping: If validation loss plateaus for 3+ evaluation steps, stop training
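Packing vs no packing: Packing concatenates short examples into one sequence to cut padding and speed up training. Leave it off for short classification examples or while debugging, so that each example is trained in isolation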
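The early-stopping rule above can be expressed as a simple patience check over the recorded validation losses. A minimal sketch of that logic; the function name and the `min_delta` parameter are illustrative, and a validation set is assumed:

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """True if the last `patience` evaluations failed to beat the earlier best loss."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    # Stop only if none of the recent evaluations improved on the earlier best
    return all(loss >= best_before - min_delta for loss in val_losses[-patience:])

history = [1.00, 0.80, 0.70, 0.71, 0.70, 0.72]
print(should_stop(history))  # True
```

If you train with the Hugging Face Trainer, its built-in `EarlyStoppingCallback` implements the same idea, provided you pass an evaluation dataset and enable `load_best_model_at_end`.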

Deployment

Test in staging first: Deploy to a shadow endpoint that receives 5% of traffic
Monitor drift: Output distribution can shift after deployment. Track label distribution weekly
Quantize for production: 4-bit quantization has negligible quality loss for most classification tasks
Keep a fallback: Always have the base model (or prompt-engineered version) as a fallback
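One simple way to track drift is to compare this week's predicted label distribution with a baseline using total variation distance. This is a sketch of one possible monitor, not a prescribed method; the function names are illustrative and any alert threshold is something you would need to calibrate on your own traffic:

```python
from collections import Counter

def label_shares(labels):
    """Convert a list of predicted labels into a {label: share} distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

def total_variation(p, q):
    """Total variation distance between two distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

baseline = label_shares(["billing"] * 2 + ["technical"] * 2)
this_week = label_shares(["billing"] * 4)
print(total_variation(baseline, this_week))  # 0.5
```

A distance near zero means the label mix is stable; a sustained rise suggests that incoming emails, or the model's behavior on them, have shifted and a fresh evaluation is due.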

Troubleshooting

Issue	Likely Cause	Solution
Model outputs garbage	Learning rate too high	Reduce lr to 1e-4 or 5e-5
Model repeats same output	Overfitting on training data	Reduce epochs (2 max), add more data
VRAM out of memory	Sequence length or batch size too large	Reduce max_seq_length or per_device_train_batch_size (raise gradient_accumulation_steps to keep the effective batch)
No improvement over base	Dataset too small	Add more diverse examples (500+ min)
Loss diverges	Bad data point	Check for empty strings, wrong formats
Slow training	Batch size too small for GPU	Increase per_device_train_batch_size until VRAM is nearly full; gradient accumulation alone does not speed up training
LoRA merge fails	Version mismatch	Ensure base model version matches adapter version

FAQ

How much data do I need for fine-tuning?

For classification tasks, 500-1,000 examples is a great starting point. For generation tasks, 2,000-10,000 examples. More is better only if the quality is consistent.

Can I fine-tune on a MacBook?

Yes! MLX framework on Apple Silicon (M2+) can fine-tune 7B models with 16GB+ unified memory. Training takes 2-3 hours for 1,000 examples on an M3 Max.

Can I fine-tune GPT-5.5 or Claude?

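Not with the tools in this guide. These are closed-weight models, so you cannot download and train them yourself. Some vendors offer hosted fine-tuning for selected models through their own APIs, but you never receive the weights. For local fine-tuning, use open-weight models: Llama, Qwen, DeepSeek, Mistral, Gemma. Otherwise, use prompt engineering or RAG with closed models.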

What’s the difference between RAG and fine-tuning?

RAG retrieves relevant documents at inference time; fine-tuning permanently modifies the model’s weights. Use RAG when the knowledge changes frequently (current events, product docs). Use fine-tuning when you need consistent output patterns (classification, extraction, writing style).

Can I combine RAG and fine-tuning?

Yes — this is the most powerful combination. Fine-tune the model to follow retrieval instructions and format RAG responses, then use RAG for knowledge injection. Many production systems use this hybrid: fine-tuned behavior + RAG for knowledge.
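In a hybrid setup, the retrieved passages are placed in the prompt and the fine-tuned model is trained to answer in that layout. A minimal sketch of the prompt-assembly step; the layout, the default system prompt, and the function name are assumptions, and retrieval itself is out of scope here:

```python
def build_rag_messages(question, passages,
                       system="Answer using only the provided context."):
    """Assemble chat messages that place retrieved passages ahead of the question."""
    # Number the passages so the model can refer to them
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

messages = build_rag_messages(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Shipping takes 3-5 days."],
)
print(messages[1]["content"])
```

Using the same layout in the fine-tuning data and at inference time is what lets the fine-tuned behavior carry over to retrieved content.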

How often should I re-fine-tune?

Every 2-4 months, or when your data distribution shifts significantly. Monitor production performance — if accuracy drops below 85%, it’s time for a refresh.

Conclusion

Fine-tuning in 2026 is accessible, affordable, and practical. The barrier to entry has dropped from expensive cloud clusters to a single consumer GPU — or even a MacBook — for a few dollars per run.

The key success factors haven’t changed: data quality is everything, start with small models, and iterate quickly. A fine-tuned 7B model can match or outperform GPT-5.5 on your specific task, at a fraction of the cost and latency.

Start with a classification task (the easiest to evaluate), use Unsloth for training, and deploy with vLLM for an OpenAI-compatible endpoint. Once you’ve built that pipeline, scaling to more complex tasks is straightforward.