Fine-Tune GPT-4o-mini 2026 — Complete Step-by-Step Guide
Why This Matters
Fine-tuning GPT-4o-mini gives you a model that speaks your domain’s language — literally. A base model knows general information, but a fine-tuned version knows your product catalog, your writing style, your codebase conventions, or your customer support playbook.
In 2026, OpenAI’s fine-tuning API supports GPT-4o-mini at $3/100K tokens for training and $0.60/1M tokens for inference, which makes training roughly 8x cheaper than GPT-4o. On narrow, well-defined tasks, a fine-tuned GPT-4o-mini can approach GPT-4o’s accuracy. Companies like Zapier, Replit, and Intercom use fine-tuned GPT-4o-mini models to handle millions of domain-specific requests per day.
The workflow is: prepare data → upload → create job → monitor → evaluate → deploy. This guide covers each step with production-ready code.
Prerequisites
- An OpenAI API account with billing enabled
- Python 3.9+ with the openai package installed: pip install openai pandas scikit-learn
- A labeled dataset in JSONL format (at least 50 examples, ideally 200+)
- Basic familiarity with Python and command line
Step-by-Step
Step 1: Prepare Your Dataset
GPT-4o-mini is fine-tuned on chat-formatted data. Each training example is a JSON object whose messages list holds the system, user, and assistant turns of one conversation:
import json

training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a customer support agent for CloudSync Pro, a SaaS file-sync product. Keep answers under 3 sentences. Reference the knowledge base when available."},
            {"role": "user", "content": "My files aren't syncing between my Mac and iPhone. What should I check?"},
            # Two sentences, so the answer obeys the "under 3 sentences" rule in the system prompt
            {"role": "assistant", "content": "First, ensure both devices are on the same Wi-Fi network, then open the CloudSync Pro app and check that the sync status icon shows a green checkmark. If it shows a red X, tap 'Resync Now' in Settings > Sync Status."}
        ]
    },
    # ... 200+ more examples
]
Dataset quality rules:
- Coverage: Include examples for every intent the model should handle. For support, cover login issues, billing, syncing, account management, and feature questions.
- Consistency: Maintain the same tone and format across all examples. If your brand voice uses “we’d love to help,” use it in every assistant response.
- Edge cases: Include 10-15 examples where the correct answer is “I don’t know, let me transfer you to a human.”
- Multi-turn: Include 20% of examples as multi-turn conversations (2-4 exchanges) so the model learns context tracking.
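The multi-turn and edge-case targets in the rules above can be measured rather than estimated by eye. The following is a minimal sketch, assuming the chat format from Step 1. The function name `dataset_composition` and the marker phrase used to spot hand-off answers are hypothetical, so adjust the marker to whatever wording your fallback responses use.

```python
from collections import Counter

# Hypothetical marker phrase for "hand off to a human" answers; adjust to your own wording.
FALLBACK_MARKER = "transfer you to a human"


def dataset_composition(examples, fallback_marker=FALLBACK_MARKER):
    """Summarize the multi-turn share and fallback-example count of a chat dataset."""
    multi_turn = 0
    fallback = 0
    for ex in examples:
        roles = Counter(msg["role"] for msg in ex["messages"])
        # More than one user message means the example spans several exchanges.
        if roles["user"] > 1:
            multi_turn += 1
        # The final message is the answer the model is trained to produce.
        last = ex["messages"][-1]
        if last["role"] == "assistant" and fallback_marker.lower() in last["content"].lower():
            fallback += 1
    total = len(examples)
    return {
        "total": total,
        "multi_turn_share": multi_turn / total if total else 0.0,
        "fallback_examples": fallback,
    }
```

Calling `dataset_composition(training_data)` returns a small dictionary you can compare against the 20% multi-turn and 10-15 fallback-example targets.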
Save as JSONL:
with open("training_data.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")

print(f"Dataset: {len(training_data)} examples")
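Step 5 evaluates on examples the model has not seen, so it helps to set those aside before anything is uploaded. Here is one possible sketch of a reproducible hold-out split. The helper names `split_dataset` and `write_jsonl`, the 15% default, and the seed are all assumptions, not part of the OpenAI tooling.

```python
import json
import random


def split_dataset(examples, test_fraction=0.15, seed=42):
    """Shuffle a copy of the examples and split off a held-out test set."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def write_jsonl(path, examples):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")


if __name__ == "__main__":
    # Stand-in data; in practice pass the training_data list built in Step 1.
    demo = [{"messages": [{"role": "user", "content": f"question {i}"}]} for i in range(20)]
    train_set, test_set = split_dataset(demo)
    print(f"train: {len(train_set)}, test: {len(test_set)}")
```

If you adopt a split like this, write the held-out part to its own file (for example `test_data.jsonl`) and upload only the training portion in Step 3.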
Step 2: Validate Your Data
OpenAI’s cookbook provides a full validation script. The lighter check below catches the most common problems, so run it before uploading:
import json
from collections import Counter

with open("training_data.jsonl") as f:
    examples = [json.loads(line) for line in f if line.strip()]

# Check 1: message structure
for i, ex in enumerate(examples):
    assert "messages" in ex, f"Example {i}: missing 'messages' key"
    for msg in ex["messages"]:
        assert "role" in msg, f"Example {i}: missing 'role'"
        assert "content" in msg, f"Example {i}: missing 'content'"
        assert msg["role"] in ["system", "user", "assistant"], f"Example {i}: invalid role '{msg['role']}'"

# Check 2: token count (approximate: 1 token ≈ 4 chars)
token_counts = [sum(len(m["content"]) / 4 for m in ex["messages"]) for ex in examples]
print(f"Avg tokens: {sum(token_counts)/len(token_counts):.0f}")
print(f"Max tokens: {max(token_counts):.0f}")
print(f"Min tokens: {min(token_counts):.0f}")

# Check 3: role distribution (number of messages per role)
roles = Counter()
for ex in examples:
    for msg in ex["messages"]:
        roles[msg["role"]] += 1
print(f"Role distribution: {dict(roles)}")

# Check 4: system prompt consistency
system_prompts = [ex["messages"][0]["content"] for ex in examples if ex["messages"][0]["role"] == "system"]
unique_prompts = set(system_prompts)
if len(unique_prompts) > 1:
    print(f"⚠️ Warning: {len(unique_prompts)} different system prompts detected")
elif len(system_prompts) < len(examples):
    # Some examples start without a system message at all
    print(f"⚠️ Warning: {len(examples) - len(system_prompts)} examples have no system prompt")
else:
    print("✅ Consistent system prompt across all examples")
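One structural property the script above does not test is whether every example ends with an assistant reply. Assuming the job learns only from assistant turns, an example that ends on a user message contributes nothing to training. The sketch below flags such cases along with empty message content. The function name `find_structure_problems` is hypothetical.

```python
def find_structure_problems(examples):
    """Return (index, problem) pairs for examples that are unlikely to train well."""
    problems = []
    for i, ex in enumerate(examples):
        messages = ex.get("messages", [])
        if not messages:
            problems.append((i, "no messages"))
            continue
        # The last turn is the target the model is supposed to reproduce.
        if messages[-1].get("role") != "assistant":
            problems.append((i, "last message is not from the assistant"))
        # Whitespace-only content gives the model nothing to learn from.
        if any(not str(m.get("content", "")).strip() for m in messages):
            problems.append((i, "empty message content"))
    return problems
```

An empty return value means no problems were found. Otherwise each tuple gives the zero-based index of the offending example.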
Step 3: Upload and Create the Fine-Tuning Job
from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY from environment

# Upload the training file
with open("training_data.jsonl", "rb") as f:
    file = client.files.create(
        file=f,
        purpose="fine-tune"
    )
print(f"✅ File uploaded: {file.id}")

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",  # Pinned snapshot; check the models page for newer fine-tunable versions
    hyperparameters={
        "n_epochs": 3,  # Start with 3; small datasets may need 5-8 (see table below)
        "batch_size": 8,  # Auto-determined if omitted
        "learning_rate_multiplier": 0.5  # Conservative learning rate
    },
    suffix="cloudsync-support-v1"  # Custom model identifier
)
print(f"✅ Job created: {job.id}")
print(f"   Status: {job.status}")
Hyperparameter guidelines:
| Dataset Size | n_epochs | Learning Rate Multiplier | Batch Size |
|---|---|---|---|
| 50-100 | 5-8 | 0.8-1.0 | 4 |
| 200-500 | 3-5 | 0.5-0.8 | 8 |
| 1000+ | 1-3 | 0.2-0.5 | 16 |
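The table can be turned into a small lookup so that the starting values follow the dataset size automatically. This is a sketch only. The function name `suggest_hyperparameters` is hypothetical, it picks the lower bound of each range as the starting point, and it assigns sizes the table does not list (for example 101-199 or 501-999) to the nearest smaller tier, which is an assumption rather than guidance from the table.

```python
def suggest_hyperparameters(n_examples):
    """Return starting hyperparameters for a dataset of n_examples.

    Values are the lower bound of each range in the table. Assumption: sizes
    the table does not list are assigned to the nearest smaller tier, and
    anything under 200 examples uses the smallest tier.
    """
    if n_examples < 200:
        return {"n_epochs": 5, "learning_rate_multiplier": 0.8, "batch_size": 4}
    if n_examples < 1000:
        return {"n_epochs": 3, "learning_rate_multiplier": 0.5, "batch_size": 8}
    return {"n_epochs": 1, "learning_rate_multiplier": 0.2, "batch_size": 16}
```

The returned dictionary has the same keys as the `hyperparameters` argument used when creating the job, so it can be passed in directly and tuned from there.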
Step 4: Monitor Training Progress
import time

def monitor_job(job_id: str, poll_interval: int = 60):
    """Poll a fine-tuning job until it ends; return the model ID, or None on failure."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        status = job.status
        if status == "running":
            # trained_tokens can be None while the job is still in progress
            progress = f"{job.trained_tokens or 0:,} tokens trained"
            eta = getattr(job, "estimated_finish", None) or "unknown"
            print(f"⏳ {progress} | ETA: {eta}")
            # Print the training loss from the most recent metrics event
            events = client.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id, limit=5)
            for event in events.data:
                if event.type == "metrics":
                    metrics = event.data or {}
                    print(f"   Loss: {metrics.get('train_loss', 'N/A')}")
                    break
        elif status == "succeeded":
            print("✅ Fine-tuning complete!")
            print(f"   Model: {job.fine_tuned_model}")
            print(f"   Total tokens: {job.trained_tokens:,}")
            print(f"   Training time: {job.finished_at - job.created_at} seconds")
            return job.fine_tuned_model
        elif status in ("failed", "cancelled"):
            # Stop polling on any terminal state, not only on failure
            print(f"❌ Job {status}: {job.error}")
            return None
        else:
            print(f"Status: {status}")
        time.sleep(poll_interval)

model_id = monitor_job(job.id)
Step 5: Evaluate the Fine-Tuned Model
Create a test set of 30-50 examples the model hasn’t seen:
import json

# Load the held-out test set; "test_data.jsonl" is an assumed file name in the training format
with open("test_data.jsonl") as f:
    test_set = [json.loads(line) for line in f if line.strip()]

def batch_evaluate(test_examples: list, model_id: str, temperature: float = 0.0):
    results = []
    for ex in test_examples:
        # Drop the final assistant message so the model has to generate it
        messages = ex["messages"][:-1]
        expected = ex["messages"][-1]["content"]
        response = client.chat.completions.create(
            model=model_id,
            messages=messages,
            temperature=temperature,
            max_tokens=150
        )
        generated = response.choices[0].message.content or ""
        results.append({
            "expected": expected,
            "generated": generated,
            "exact_match": expected.strip() == generated.strip()
        })
    accuracy = sum(r["exact_match"] for r in results) / len(results)
    print(f"Exact match accuracy: {accuracy:.1%}")

    # Sample 3 mismatches for review
    mismatches = [r for r in results if not r["exact_match"]]
    for r in mismatches[:3]:
        print(f"\nExpected: {r['expected'][:100]}...")
        print(f"Got: {r['generated'][:100]}...")
    return results

evaluation = batch_evaluate(test_set, model_id)
For more useful evaluation, use semantic similarity (embedding cosine distance) instead of exact match — fine-tuned models often produce correct answers with different phrasing.
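A minimal sketch of that idea follows, working on the `results` list returned by `batch_evaluate`. The function names, the injected `embed` callable, and the 0.85 threshold are all assumptions for illustration. The threshold in particular should be calibrated on a handful of answers you have judged by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_accuracy(results, embed, threshold=0.85):
    """Share of results whose expected and generated texts embed close together.

    `embed` is any callable that maps a string to a list of floats.
    The 0.85 default is an illustrative value, not a recommended setting.
    """
    if not results:
        return 0.0
    hits = 0
    for r in results:
        score = cosine_similarity(embed(r["expected"]), embed(r["generated"]))
        if score >= threshold:
            hits += 1
    return hits / len(results)
```

With the OpenAI client from Step 3, `embed` could be a small wrapper that returns `client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding`. Passing the embedder in as an argument keeps the scoring logic testable without network access.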
Step 6: Deploy to Production
Once evaluated, set up inference:
from typing import Optional

def query_fine_tuned(model_id: str, user_message: str, system_prompt: Optional[str] = None):
    messages = []
    # Use the same system prompt as training
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model=model_id,
        messages=messages,
        temperature=0.3,  # Lower temp for consistent outputs
        max_tokens=200
    )
    return response.choices[0].message.content

# Example usage: pass the exact system prompt that appears in the training data
response = query_fine_tuned(
    model_id=model_id,
    user_message="How do I cancel my subscription?",
    system_prompt="You are a customer support agent for CloudSync Pro, a SaaS file-sync product. Keep answers under 3 sentences. Reference the knowledge base when available."
)
print(response)
Step 7: Iterate Based on Production Logs
Deployment is the starting point, not the finish. Log production interactions so that each retraining round targets the cases the model still gets wrong:
import json
from datetime import datetime, timezone

def log_feedback(user_message, model_response, was_helpful: bool):
    """Log production interactions for dataset augmentation."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # actual time of the interaction
        "user_message": user_message,
        "model_response": model_response,
        "helpful": was_helpful
    }
    with open("production_feedback.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

# Flag unhelpful responses for priority retraining
def extract_hard_examples(feedback_file: str, limit: int = 100):
    """Collect up to `limit` unhelpful responses as candidates for retraining."""
    hard_examples = []
    with open(feedback_file) as f:
        for line in f:
            entry = json.loads(line)
            if not entry["helpful"]:
                hard_examples.append(entry)
    print(f"Found {len(hard_examples)} hard examples for retraining")
    return hard_examples[:limit]
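The entries collected this way contain the model's unhelpful answer, so they cannot go back into training unchanged: someone has to supply the answer the model should have given. The sketch below shows one way to assemble a training example from a logged entry plus a human-written correction. The function name `feedback_to_training_example` and the review step it implies are assumptions.

```python
def feedback_to_training_example(entry, corrected_response, system_prompt):
    """Combine a logged interaction with a human-written correction.

    `entry` is one record written by log_feedback. The unhelpful
    model_response is discarded and replaced by `corrected_response`.
    """
    if not corrected_response.strip():
        raise ValueError("corrected_response must not be empty")
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": entry["user_message"]},
            {"role": "assistant", "content": corrected_response},
        ]
    }
```

Pass the same system prompt used in the original dataset so the new examples stay consistent with the old ones.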
Tips
- Start small, iterate fast. A 100-example dataset can improve accuracy by 20-40% on narrow tasks. Add more data only where the model still fails.
- Use a validation split. Hold out 10-20% of your data for evaluation. Never test on training data.
- Cost estimation. 500 training examples at 500 tokens each × 3 epochs = 750K training tokens × $3/100K = $22.50 per fine-tune run.
- Model naming convention. Use the {app}-{domain}-{version} pattern (e.g., cloudsync-support-v2). Never reuse a model name.
- One system prompt per model. Don’t train with multiple different system prompts in the same dataset, because the model will blend them.
- Temperature for production. Use 0.0-0.3 for deterministic tasks (classification, extraction), 0.5-0.7 for creative tasks (content generation).
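The cost estimate in the tips is a simple product, which makes it easy to rerun for other dataset sizes. The sketch below uses a hypothetical helper, `estimate_training_cost`, and takes the price as a parameter. The example call plugs in the rate quoted in this guide, so substitute the figure from the current pricing page.

```python
def estimate_training_cost(n_examples, avg_tokens_per_example, n_epochs,
                           price_usd, per_tokens=100_000):
    """Return (billed_training_tokens, cost_in_usd).

    Training tokens are counted once per epoch, so the total is
    examples x tokens per example x epochs.
    """
    total_tokens = n_examples * avg_tokens_per_example * n_epochs
    return total_tokens, total_tokens / per_tokens * price_usd


# The example from the tips: 500 examples, 500 tokens each, 3 epochs, $3 per 100K tokens
tokens, cost = estimate_training_cost(500, 500, 3, price_usd=3.0)
print(f"{tokens:,} training tokens, about ${cost:.2f}")
```

For the inputs above, this reproduces the 750K tokens and $22.50 from the cost estimation tip.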
FAQ
Q: How many examples do I need to start?
A: 50 minimum for noticeable improvement, 200-500 for reliable performance. More is better, but quality matters more than quantity.
Q: Can I fine-tune on PDFs or Word documents?
A: No. The fine-tuning API accepts JSONL files only. You must convert documents to the conversation format first.
Q: How long does training take?
A: 500 examples × 3 epochs typically finishes in 15-30 minutes. Larger datasets (5000+ examples) can take 2-6 hours.
Q: What’s the difference between fine-tuning and RAG?
A: Fine-tuning changes how the model behaves: tone, output format, and task-specific patterns. RAG retrieves external documents at query time, so it is the better tool for facts that change. Use both: fine-tune for behavior, RAG for up-to-date facts.
Q: Can I fine-tune GPT-4o?
A: Yes, but it costs $25/100K training tokens (8x more than GPT-4o-mini). Start with GPT-4o-mini — if it hits a ceiling, graduate to GPT-4o for the final production model.
Q: Does OpenAI keep my training data?
A: By default, OpenAI does not use data submitted through the API, including fine-tuning files, to train its own models. Uploaded files and the resulting fine-tuned model stay in your account until you delete them.