You’ve invested countless hours and significant compute resources into fine-tuning a large language model (LLM). You've tailored it to your specific domain, fed it proprietary data, and tweaked its hyperparameters. It feels more accurate, more relevant, and more aligned with your business needs. But how do you prove it? How do you quantify that improvement against industry giants like GPT-4 or Claude 3 Opus?
Ad-hoc testing with a few sample prompts is a start, but it's not scientific, scalable, or reproducible. To make confident, data-driven decisions, you need standardized evaluation.
This is where Benchmarks.do steps in. Our agentic workflow platform provides AI performance testing as a simple, API-driven service. This tutorial will show you exactly how to integrate your own fine-tuned or proprietary models into our standardized testing framework to get objective, comparative performance reports.
Before diving into the "how," let's establish the "why." Moving from subjective "gut feelings" about model performance to objective metrics is a game-changer for any AI team.
With Benchmarks.do, you don't need to build and maintain a complex evaluation infrastructure. All you need is an API endpoint for your model. Our service handles the rest.
For Benchmarks.do to evaluate your model, it needs to be accessible over the internet. You'll need to create a simple, secure REST API endpoint.
This endpoint should:
- Accept a POST request containing the prompt (and, optionally, generation parameters such as a max token count).
- Return the model's generated text in the response body.
- Require authentication, such as a bearer token, so that only Benchmarks.do can invoke it.
Most model serving frameworks, such as TGI or vLLM, and managed cloud services like Amazon SageMaker and Google Vertex AI make this step straightforward.
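If you're building the wrapper yourself, a minimal sketch might look like the following. It assumes FastAPI and a simple prompt-in, completion-out contract; the exact request and response schema Benchmarks.do expects is described in its documentation, so treat the field names and the inference stub here as placeholders.

# A minimal model-serving endpoint sketch, assuming FastAPI and a
# {"prompt": ...} -> {"completion": ...} contract. Field names and the
# inference stub are placeholders, not the required Benchmarks.do schema.
import os

from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()
API_TOKEN = os.environ["MODEL_API_TOKEN"]  # the bearer token you'll register with Benchmarks.do


class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = 512


class GenerationResponse(BaseModel):
    completion: str


def run_my_finetuned_model(prompt: str, max_tokens: int) -> str:
    # Placeholder: call into your serving stack here (TGI, vLLM, SageMaker, ...).
    raise NotImplementedError


@app.post("/v1/models/llama3-code-invoke", response_model=GenerationResponse)
def generate(request: GenerationRequest, authorization: str = Header(default="")) -> GenerationResponse:
    # Reject callers that don't present the expected bearer token.
    if authorization != f"Bearer {API_TOKEN}":
        raise HTTPException(status_code=401, detail="Invalid token")
    return GenerationResponse(completion=run_my_finetuned_model(request.prompt, request.max_tokens))

Whatever framework you use, the essentials are the same: a stable HTTPS URL and an auth credential you can safely share in the benchmark request.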
This is where the magic happens. You define the entire benchmark in a single JSON request. You tell us which standard models to test, which tasks to run, and—most importantly—where to find your custom model.
Let's say you've fine-tuned a Llama-3 model for better code generation and want to compare it against Claude 3 Opus and GPT-4 on the HumanEval dataset.
Here’s how you would define your request:
{
  "name": "My Fine-Tuned Llama-3 Code-Gen Test",
  "tasks": ["code-generation"],
  "models": [
    "claude-3-opus",
    "gpt-4"
  ],
  "customModels": [
    {
      "modelId": "my-finetuned-llama3-v2",
      "name": "My Llama-3 8B (Code Fine-Tune)",
      "endpoint": {
        "url": "https://api.my-company.com/v1/models/llama3-code-invoke",
        "auth": {
          "type": "bearer",
          "token": "sk-my-secret-api-key-for-my-model"
        }
      }
    }
  ]
}
In this request:
- models lists the standard, hosted models you want to benchmark against (here, Claude 3 Opus and GPT-4).
- tasks selects the evaluation to run; the code-generation task uses the HumanEval dataset.
- customModels registers your own model: a unique modelId, a human-readable name that will appear in the report, and an endpoint block containing your API's url and auth credentials.
With your request body defined, simply send it to the Benchmarks.do API to kick off the evaluation process.
curl -X POST https://api.benchmarks.do/v1/run \
-H "Authorization: Bearer YOUR_BENCHMARKS_DO_API_KEY" \
-H "Content-Type: application/json" \
-d '{ ... your JSON request from Step 2 ... }'
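If you prefer to submit runs from code rather than the command line, the same call in Python might look like this. It assumes the requests library and the payload from Step 2 saved to a local file; the benchmarkId in the response is an assumption based on the report shown below.

# Submit the benchmark run from Python instead of curl (a sketch, not an official SDK).
import json
import os

import requests

API_KEY = os.environ["BENCHMARKS_DO_API_KEY"]

# The JSON request body from Step 2, saved locally.
with open("benchmark_request.json") as f:
    payload = json.load(f)

response = requests.post(
    "https://api.benchmarks.do/v1/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
response.raise_for_status()
run = response.json()
print(run.get("benchmarkId"))  # keep this ID to check on the run later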
Our agentic workflow platform now takes over. It will:
- Run the code-generation task against each of the standard models you selected.
- Send the same prompts to your custom endpoint, authenticating with the credentials you provided.
- Score every model's outputs against the benchmark dataset.
- Compile everything into a single comparative report.
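Because the run executes asynchronously, you'll typically poll for its status before fetching results. Here is a minimal sketch; the GET /v1/runs/{benchmarkId} path is hypothetical, so check the Benchmarks.do API reference for the actual results endpoint.

# Poll until the benchmark job finishes (sketch; the results path is hypothetical).
import os
import time

import requests

API_KEY = os.environ["BENCHMARKS_DO_API_KEY"]
benchmark_id = "bmk-x1y2z3a4b5c6"  # returned when you submitted the run

while True:
    resp = requests.get(
        f"https://api.benchmarks.do/v1/runs/{benchmark_id}",  # hypothetical endpoint path
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    resp.raise_for_status()
    job = resp.json()
    if job["status"] == "completed":
        report = job["report"]
        break
    time.sleep(30)  # evaluating several models can take a while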
Once the job's status is completed, you will receive a detailed JSON report. It places your model's performance right alongside the industry leaders, allowing for immediate analysis.
{
  "benchmarkId": "bmk-x1y2z3a4b5c6",
  "name": "My Fine-Tuned Llama-3 Code-Gen Test",
  "status": "completed",
  "report": {
    "code-generation": {
      "dataset": "humaneval",
      "results": [
        { "model": "claude-3-opus", "pass@1": 74.4 },
        { "model": "gpt-4", "pass@1": 72.9 },
        { "model": "My Llama-3 8B (Code Fine-Tune)", "pass@1": 71.5 }
      ]
      // ...other metrics like pass@10 would also be here
    }
  }
}
From this report, you can instantly see that your fine-tuned model (pass@1: 71.5) is performing competitively, closing the gap with much larger models like GPT-4 on this specific task. This is the objective, actionable data you need to drive your AI strategy forward.
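If you want to track these numbers over time, or gate a deployment on them, you can pull the scores straight out of the report. A minimal sketch, reusing the report object from the polling example above:

# Extract pass@1 scores from the report for dashboards or CI checks.
results = report["code-generation"]["results"]
scores = {entry["model"]: entry["pass@1"] for entry in results}

mine = scores["My Llama-3 8B (Code Fine-Tune)"]
best_baseline = max(score for model, score in scores.items() if model != "My Llama-3 8B (Code Fine-Tune)")
print(f"Fine-tuned model: {mine} pass@1, {best_baseline - mine:.1f} points behind the best baseline")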
Fine-tuning is a powerful technique, but its value can only be unlocked through rigorous, standardized testing. The Benchmarks.do API provides the simplest path to achieving this. By integrating your custom models, you can move beyond subjective assessments and start optimizing with data-driven confidence.
Ready to see how your model truly performs? Visit Benchmarks.do to get started and run your first comparative benchmark in minutes.