Fine-Tune GPT-4o-mini 2026 — Complete Step-by-Step Guide

Why This Matters

Fine-tuning GPT-4o-mini gives you a model that speaks your domain’s language. A base model has broad general knowledge; a fine-tuned version also follows your writing style, your codebase conventions, or your customer support playbook, and uses your product terminology consistently.

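OpenAI’s fine-tuning API supports GPT-4o-mini. At the rates OpenAI published when GPT-4o-mini fine-tuning launched, training costs $3 per 1M tokens and inference on the fine-tuned model costs $0.30 per 1M input tokens and $1.20 per 1M output tokens. That makes training roughly 8x cheaper than GPT-4o ($25 per 1M training tokens); confirm current rates on OpenAI’s pricing page before budgeting. On narrow, well-defined tasks a fine-tuned GPT-4o-mini can approach GPT-4o’s accuracy, but verify this on your own test set. Companies like Zapier, Replit, and Intercom use fine-tuned GPT-4o-mini models to handle millions of domain-specific requests per day.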

The workflow is: prepare data → upload → create job → monitor → evaluate → deploy. This guide covers each step with production-ready code.

Prerequisites

An OpenAI API account with billing enabled
Python 3.9+ with openai package installed: pip install openai pandas scikit-learn
A labeled dataset in JSONL format (at least 50 examples, ideally 200+)
Basic familiarity with Python and command line

Step-by-Step

Step 1: Prepare Your Dataset

GPT-4o-mini is fine-tuned on chat-formatted data. Each training example is a JSON object with a messages list containing the system, user, and assistant turns:

import json

training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a customer support agent for CloudSync Pro, a SaaS file-sync product. Keep answers under 3 sentences. Reference the knowledge base when available."},
            {"role": "user", "content": "My files aren't syncing between my Mac and iPhone. What should I check?"},
            {"role": "assistant", "content": "First, ensure both devices are on the same Wi-Fi network. Then open the CloudSync Pro app and check the sync status icon — it should show a green checkmark. If it shows a red X, tap 'Resync Now' in Settings > Sync Status."}
        ]
    },
    # ... 200+ more examples
]

Dataset quality rules:

Coverage: Include examples for every intent the model should handle. For support, cover login issues, billing, syncing, account management, and feature questions.
Consistency: Maintain the same tone and format across all examples. If your brand voice uses “we’d love to help,” use it in every assistant response.
Edge cases: Include 10-15 examples where the correct answer is “I don’t know, let me transfer you to a human.”
Multi-turn: Make roughly 20% of your examples multi-turn conversations (2-4 exchanges) so the model learns context tracking.
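
The multi-turn rule can be checked mechanically before training. The sketch below is an illustration only: is_multi_turn and multi_turn_share are hypothetical helper names, the two sample conversations are invented for the CloudSync Pro scenario, and "multi-turn" is assumed to mean two or more assistant replies in one example.

```python
def is_multi_turn(example: dict) -> bool:
    """True when the example contains at least two assistant replies."""
    assistant_turns = sum(1 for m in example["messages"] if m["role"] == "assistant")
    return assistant_turns >= 2


def multi_turn_share(examples: list) -> float:
    """Fraction of examples that are multi-turn (0.0 for an empty list)."""
    if not examples:
        return 0.0
    return sum(is_multi_turn(ex) for ex in examples) / len(examples)


# Invented sample data: one single-turn and one two-exchange conversation
single_turn = {"messages": [
    {"role": "system", "content": "You are a customer support agent for CloudSync Pro."},
    {"role": "user", "content": "My files aren't syncing."},
    {"role": "assistant", "content": "Please check the sync status icon in the app."},
]}
multi_turn = {"messages": [
    {"role": "system", "content": "You are a customer support agent for CloudSync Pro."},
    {"role": "user", "content": "My files aren't syncing."},
    {"role": "assistant", "content": "Which devices are affected?"},
    {"role": "user", "content": "My Mac and my iPhone."},
    {"role": "assistant", "content": "Please check the sync status icon on both devices."},
]}

share = multi_turn_share([single_turn, multi_turn])
print(f"Multi-turn share: {share:.0%}")
```

Running multi_turn_share over the full dataset gives a quick way to see whether you are near the 20% target.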

Save as JSONL:

with open("training_data.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")

print(f"Dataset: {len(training_data)} examples")
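
Step 5 needs examples the model has never seen, so it can help to split the data before anything is uploaded. The following is a minimal sketch using only the standard library; split_examples is a hypothetical helper, and the 15% test fraction and fixed seed are assumptions rather than recommendations from OpenAI.

```python
import random


def split_examples(examples: list, test_fraction: float = 0.15, seed: int = 42):
    """Shuffle a copy of the examples and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(examples)              # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder examples, only to demonstrate the split sizes
demo = [
    {"messages": [
        {"role": "user", "content": f"question {i}"},
        {"role": "assistant", "content": f"answer {i}"},
    ]}
    for i in range(20)
]
train_part, test_part = split_examples(demo)
print(len(train_part), len(test_part))
```

With a split like this, only the training portion would be written to training_data.jsonl; the test portion goes to a separate file that is never uploaded for training.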

Step 2: Validate Your Data

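Catch formatting problems locally before you upload. OpenAI’s cookbook includes a more thorough data-preparation notebook; the script below is a simplified check of structure, size, and consistency: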

import json
from collections import Counter

with open("training_data.jsonl") as f:
    examples = [json.loads(line) for line in f]

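# Check 1: message structure
for i, ex in enumerate(examples):
    assert "messages" in ex, f"Example {i}: missing 'messages' key"
    assert ex["messages"], f"Example {i}: empty 'messages' list"
    for msg in ex["messages"]:
        assert "role" in msg, f"Example {i}: missing 'role'"
        assert "content" in msg, f"Example {i}: missing 'content'"
        assert msg["role"] in ["system", "user", "assistant"], f"Example {i}: invalid role '{msg['role']}'"
    # Each example needs at least one assistant message for the model to learn from
    assert any(m["role"] == "assistant" for m in ex["messages"]), f"Example {i}: no assistant message"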

# Check 2: token count (approximate: 1 token ≈ 4 chars)
token_counts = [sum(len(m["content"]) / 4 for m in ex["messages"]) for ex in examples]
print(f"Avg tokens: {sum(token_counts)/len(token_counts):.0f}")
print(f"Max tokens: {max(token_counts):.0f}")
print(f"Min tokens: {min(token_counts):.0f}")

# Check 3: role distribution (how many system/user/assistant messages overall)
roles = Counter()
for ex in examples:
    for msg in ex["messages"]:
        roles[msg["role"]] += 1
print(f"Role distribution: {dict(roles)}")

# Check 4: system prompt consistency
system_prompts = [ex["messages"][0]["content"] for ex in examples if ex["messages"][0]["role"] == "system"]
unique_prompts = set(system_prompts)
if len(unique_prompts) > 1:
    print(f"⚠️  Warning: {len(unique_prompts)} different system prompts detected")
else:
    print("✅ Consistent system prompt across all examples")

Step 3: Upload and Create the Fine-Tuning Job

from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY from environment

# Upload the training file
with open("training_data.jsonl", "rb") as f:
    file = client.files.create(
        file=f,
        purpose="fine-tune"
    )
print(f"✅ File uploaded: {file.id}")

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",  # Fine-tunable snapshot at the time of writing; check OpenAI's docs for newer ones
    hyperparameters={
        "n_epochs": 3,           # Start with 3; small datasets may need 5-8 (see the guidelines below)
        "batch_size": 8,         # Auto-determined if omitted
        "learning_rate_multiplier": 0.5  # Conservative learning rate
    },
    suffix="cloudsync-support-v1"  # Custom model identifier
)
print(f"✅ Job created: {job.id}")
print(f"   Status: {job.status}")

Hyperparameter guidelines:

Dataset Size	n_epochs	Learning Rate Multiplier	Batch Size
50-100	5-8	0.8-1.0	4
200-500	3-5	0.5-0.8	8
1000+	1-3	0.2-0.5	16
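
The table can be turned into a small lookup so the starting values live next to the job-creation code. This is only a sketch: suggest_hyperparameters is a hypothetical helper, it returns the table's ranges rather than a single tuned value, and the table does not cover 101-199 or 501-999 examples, so assigning those sizes to the row below them is an assumption.

```python
def suggest_hyperparameters(n_examples: int) -> dict:
    """Return starting ranges (low, high) taken from the guideline table."""
    if n_examples < 50:
        raise ValueError("the guideline table starts at 50 examples")
    if n_examples < 200:   # table row "50-100"; 101-199 mapped here by assumption
        return {"n_epochs": (5, 8), "learning_rate_multiplier": (0.8, 1.0), "batch_size": 4}
    if n_examples < 1000:  # table row "200-500"; 501-999 mapped here by assumption
        return {"n_epochs": (3, 5), "learning_rate_multiplier": (0.5, 0.8), "batch_size": 8}
    # table row "1000+"
    return {"n_epochs": (1, 3), "learning_rate_multiplier": (0.2, 0.5), "batch_size": 16}


print(suggest_hyperparameters(300))
```

Treat the returned ranges as a first guess and adjust them based on the training loss and your held-out evaluation.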

Step 4: Monitor Training Progress

import time

# Assumes `client` and `job` from Step 3 are still in scope.
def monitor_job(job_id: str, poll_interval: int = 60):
    """Poll a fine-tuning job until it ends; return the model name, or None on failure."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        status = job.status

        if status == "running":
            # estimated_finish is a Unix timestamp and may be missing early in the run
            eta = getattr(job, "estimated_finish", None) or "unknown"
            print(f"⏳ Running | ETA (Unix time): {eta}")

            # Print the most recent training loss, if a metrics event exists yet
            events = client.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id, limit=10)
            for event in events.data:  # events are returned newest first
                if event.type == "metrics" and isinstance(event.data, dict):
                    print(f"   Loss: {event.data.get('train_loss', 'N/A')}")
                    break

        elif status == "succeeded":
            print("✅ Fine-tuning complete!")
            print(f"   Model: {job.fine_tuned_model}")
            print(f"   Total tokens: {job.trained_tokens:,}")
            print(f"   Training time: {job.finished_at - job.created_at} seconds")
            return job.fine_tuned_model

        elif status in ("failed", "cancelled"):
            # Stop polling on any terminal non-success state
            print(f"❌ Job {status}: {job.error}")
            return None

        else:
            # e.g. validating_files or queued
            print(f"Status: {status}")

        time.sleep(poll_interval)

model_id = monitor_job(job.id)

Step 5: Evaluate the Fine-Tuned Model

Create a test set of 30-50 examples the model hasn’t seen:

import json

def batch_evaluate(test_examples: list, model_id: str, temperature: float = 0.0):
    results = []
    for ex in test_examples:
        # Remove the final assistant message; we want the model to generate it
        messages = ex["messages"][:-1]
        expected = ex["messages"][-1]["content"]

        response = client.chat.completions.create(
            model=model_id,
            messages=messages,
            temperature=temperature,
            max_tokens=150
        )

        generated = response.choices[0].message.content or ""
        results.append({
            "expected": expected,
            "generated": generated,
            "exact_match": expected.strip() == generated.strip()
        })

    accuracy = sum(r["exact_match"] for r in results) / len(results) if results else 0.0
    print(f"Exact match accuracy: {accuracy:.1%}")

    # Show up to 3 mismatches for review
    mismatches = [r for r in results if not r["exact_match"]]
    for r in mismatches[:3]:
        print(f"\nExpected: {r['expected'][:100]}...")
        print(f"Got: {r['generated'][:100]}...")

    return results

# Load the held-out examples (same format as the training data).
# "test_data.jsonl" is an assumed file name; use wherever you saved your test split.
with open("test_data.jsonl") as f:
    test_set = [json.loads(line) for line in f]

evaluation = batch_evaluate(test_set, model_id)

For more useful evaluation, use semantic similarity (embedding cosine distance) instead of exact match — fine-tuned models often produce correct answers with different phrasing.
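
A minimal sketch of that comparison is shown here. It assumes the expected and generated answers have already been converted to embedding vectors; cosine_similarity and semantic_match_rate are hypothetical helpers, the short vectors are made-up stand-ins for real embeddings, and the 0.85 threshold is an arbitrary assumption that should be tuned on your own data.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_match_rate(vector_pairs: list, threshold: float = 0.85) -> float:
    """Share of (expected, generated) vector pairs whose similarity meets the threshold."""
    if not vector_pairs:
        return 0.0
    hits = sum(cosine_similarity(e, g) >= threshold for e, g in vector_pairs)
    return hits / len(vector_pairs)


# Toy vectors standing in for embeddings of expected/generated answers
pairs = [
    ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),  # nearly the same direction
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # orthogonal, so unrelated
]
print(f"Semantic match rate: {semantic_match_rate(pairs):.0%}")
```

In practice the vectors would come from an embeddings model, for example through client.embeddings.create; that call is left out so the sketch runs offline.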

Step 6: Deploy to Production

Once evaluated, set up inference:

from typing import Optional

def query_fine_tuned(model_id: str, user_message: str, system_prompt: Optional[str] = None):
    messages = []
    
    # Use the same system prompt as training
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    
    messages.append({"role": "user", "content": user_message})
    
    response = client.chat.completions.create(
        model=model_id,
        messages=messages,
        temperature=0.3,  # Lower temp for consistent outputs
        max_tokens=200
    )
    
    return response.choices[0].message.content

# Example usage
response = query_fine_tuned(
    model_id=model_id,
    user_message="How do I cancel my subscription?",
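    # Same system prompt as the training data in Step 1
    system_prompt="You are a customer support agent for CloudSync Pro, a SaaS file-sync product. Keep answers under 3 sentences. Reference the knowledge base when available."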
)
print(response)

Step 7: Iterate Based on Production Logs

Deployment is only the starting point; most of the long-term value comes from continuous improvement:

import json
from datetime import datetime, timezone

def log_feedback(user_message, model_response, was_helpful: bool):
    """Log production interactions for dataset augmentation."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # actual time of the interaction
        "user_message": user_message,
        "model_response": model_response,
        "helpful": was_helpful
    }
    with open("production_feedback.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

# Flag unhelpful responses for priority retraining
def extract_hard_examples(feedback_file: str, limit: int = 100):
    """Return up to `limit` interactions that users marked as unhelpful."""
    hard_examples = []
    with open(feedback_file) as f:
        for line in f:
            entry = json.loads(line)
            if not entry["helpful"]:
                hard_examples.append(entry)

    print(f"Found {len(hard_examples)} hard examples for retraining")
    return hard_examples[:limit]
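
Logged feedback entries are not yet in the training format, and the unhelpful response itself should not be trained on. One possible approach, sketched below, is to pair each hard example with a corrected answer written by a human reviewer; feedback_to_training_example is a hypothetical helper, and the sample entry and corrected answer are invented placeholders.

```python
def feedback_to_training_example(entry: dict, corrected_response: str, system_prompt: str) -> dict:
    """Build a chat-format training example from a logged interaction.

    The logged model_response is deliberately dropped and replaced by a
    corrected answer supplied by a reviewer.
    """
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": entry["user_message"]},
            {"role": "assistant", "content": corrected_response},
        ]
    }


# Invented placeholder data for illustration
logged = {
    "timestamp": "2026-06-04T12:00:00Z",
    "user_message": "How do I cancel my subscription?",
    "model_response": "I'm not sure.",
    "helpful": False,
}
example = feedback_to_training_example(
    logged,
    corrected_response="<reviewer-written answer goes here>",
    system_prompt="<the same system prompt used in training>",
)
print(example["messages"][1]["content"])
```

Examples built this way can be appended to the training set for the next fine-tuning run, after passing the same validation checks as the original data.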

Tips

Start small, iterate fast. A 100-example dataset can improve accuracy by 20-40% on narrow tasks. Add more data only where the model still fails.
Use a validation split. Hold out 10-20% of your data for evaluation. Never test on training data.
Cost estimation. 500 training examples at 500 tokens each × 3 epochs = 750K training tokens. At $3 per 1M training tokens, that is about $2.25 per fine-tune run.
Model naming convention. Use {app}-{domain}-{version} pattern (e.g., cloudsync-support-v2). Never reuse a model name.
One system prompt per model. Don’t train with multiple different system prompts in the same dataset — the model will blend them.
Temperature for production. Use 0.0-0.3 for deterministic tasks (classification, extraction), 0.5-0.7 for creative tasks (content generation).
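
The cost estimate in the tips above is simple arithmetic, and a small function makes it easy to rerun when the dataset or the price changes. This is a sketch: estimate_training_cost is a hypothetical helper, and the rate of $3 per 1M training tokens used in the example is an assumption that should be checked against OpenAI's current pricing.

```python
def estimate_training_cost(
    n_examples: int,
    avg_tokens_per_example: int,
    n_epochs: int,
    price_per_million_tokens: float,
) -> float:
    """Estimated training cost in dollars: billed tokens = examples x tokens x epochs."""
    billed_tokens = n_examples * avg_tokens_per_example * n_epochs
    return billed_tokens / 1_000_000 * price_per_million_tokens


# 500 examples x 500 tokens x 3 epochs at an assumed $3 per 1M training tokens
cost = estimate_training_cost(500, 500, 3, price_per_million_tokens=3.00)
print(f"Estimated cost: ${cost:.2f}")
```

Keeping the price as a parameter avoids baking a rate into the code that may no longer be current.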

FAQ

Q: How many examples do I need to start?
A: The API accepts as few as 10 examples, but plan on at least 50 for a noticeable improvement and 200-500 for reliable performance. More is better, but quality matters more than quantity.

Q: Can I fine-tune on PDFs or Word documents?
A: No. The fine-tuning API accepts JSONL files only. You must convert documents to the conversation format first.

Q: How long does training take?
A: 500 examples × 3 epochs typically finishes in 15-30 minutes. Larger datasets (5000+ examples) can take 2-6 hours.

Q: What’s the difference between fine-tuning and RAG?
A: Fine-tuning changes how the model behaves: its tone, output format, and task patterns. It is a poor fit for facts that change often. RAG retrieves external documents at query time. Use both: fine-tune for behavior, RAG for up-to-date facts.

Q: Can I fine-tune GPT-4o?
A: Yes, but it costs $25 per 1M training tokens (about 8x more than GPT-4o-mini). Start with GPT-4o-mini; if it hits a ceiling, graduate to GPT-4o for the final production model.

Q: Does OpenAI keep my training data?
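A: OpenAI states that data sent through the API, including fine-tuning data, is not used to train its own models by default. Uploaded files are not deleted automatically when training completes; they stay in your account until you delete them, for example with client.files.delete.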