You’ve invested countless hours and significant compute resources into fine-tuning a large language model (LLM). You've tailored it to your specific domain, fed it proprietary data, and tweaked its hyperparameters. It feels more accurate, more relevant, and more aligned with your business needs. But how do you prove it? How do you quantify that improvement against industry giants like GPT-4 or Claude 3 Opus?
Ad-hoc testing with a few sample prompts is a start, but it's not scientific, scalable, or reproducible. To make confident, data-driven decisions, you need standardized evaluation.
This is where Benchmarks.do steps in. Our agentic workflow platform provides AI performance testing as a simple, API-driven service. This tutorial will show you exactly how to integrate your own fine-tuned or proprietary models into our standardized testing framework to get objective, comparative performance reports.
Before diving into the "how," let's establish the "why." Moving from subjective "gut feelings" about model performance to objective metrics is a game-changer for any AI team.
With Benchmarks.do, you don't need to build and maintain a complex evaluation infrastructure. All you need is an API endpoint for your model. Our service handles the rest.
For Benchmarks.do to evaluate your model, it needs to be accessible over the internet. You'll need to create a simple, secure REST API endpoint.
This endpoint should:
- Accept a POST request containing the prompt (and, optionally, generation parameters such as a max token count).
- Return the model's generated text in the response body.
- Require authentication, such as a bearer token, so that only Benchmarks.do can invoke it.
Most model serving frameworks, such as TGI or vLLM, and managed cloud services like Amazon SageMaker and Google Vertex AI make this step straightforward.
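If you're building the wrapper yourself, a minimal sketch might look like the following. It assumes FastAPI and a simple prompt-in, completion-out contract; the exact request and response schema Benchmarks.do expects is described in its documentation, so treat the field names and the inference stub here as placeholders.

# A minimal model-serving endpoint sketch, assuming FastAPI and a
# {"prompt": ...} -> {"completion": ...} contract. Field names and the
# inference stub are placeholders, not the required Benchmarks.do schema.
import os

from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()
API_TOKEN = os.environ["MODEL_API_TOKEN"]  # the bearer token you'll register with Benchmarks.do


class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = 512


class GenerationResponse(BaseModel):
    completion: str


def run_my_finetuned_model(prompt: str, max_tokens: int) -> str:
    # Placeholder: call into your serving stack here (TGI, vLLM, SageMaker, ...).
    raise NotImplementedError


@app.post("/v1/models/llama3-code-invoke", response_model=GenerationResponse)
def generate(request: GenerationRequest, authorization: str = Header(default="")) -> GenerationResponse:
    # Reject callers that don't present the expected bearer token.
    if authorization != f"Bearer {API_TOKEN}":
        raise HTTPException(status_code=401, detail="Invalid token")
    return GenerationResponse(completion=run_my_finetuned_model(request.prompt, request.max_tokens))

Whatever framework you use, the essentials are the same: a stable HTTPS URL and an auth credential you can safely share in the benchmark request.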
This is where the magic happens. You define the entire benchmark in a single JSON request. You tell us which standard models to test, which tasks to run, and—most importantly—where to find your custom model.
Let's say you've fine-tuned a Llama-3 model for better code generation and want to compare it against Claude 3 Opus and GPT-4 on the HumanEval dataset.
Here’s how you would define your request:
{
  "name": "My Fine-Tuned Llama-3 Code-Gen Test",
  "tasks": ["code-generation"],
  "models": [
    "claude-3-opus",
    "gpt-4"
  ],
  "customModels": [
    {
      "modelId": "my-finetuned-llama3-v2",
      "name": "My Llama-3 8B (Code Fine-Tune)",
      "endpoint": {
        "url": "https://api.my-company.com/v1/models/llama3-code-invoke",
        "auth": {
          "type": "bearer",
          "token": "sk-my-secret-api-key-for-my-model"
        }
      }
    }
  ]
}
In this request:
- models lists the standard, hosted models you want to benchmark against (here, Claude 3 Opus and GPT-4).
- tasks selects the evaluation to run; the code-generation task uses the HumanEval dataset.
- customModels registers your own model: a unique modelId, a human-readable name that will appear in the report, and an endpoint block containing your API's url and auth credentials.
With your request body defined, simply send it to the Benchmarks.do API to kick off the evaluation process.
curl -X POST https://api.benchmarks.do/v1/run \
-H "Authorization: Bearer YOUR_BENCHMARKS_DO_API_KEY" \
-H "Content-Type: application/json" \
-d '{ ... your JSON request from Step 2 ... }'
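If you prefer to submit runs from code rather than the command line, the same call in Python might look like this. It assumes the requests library and the payload from Step 2 saved to a local file; the benchmarkId in the response is an assumption based on the report shown below.

# Submit the benchmark run from Python instead of curl (a sketch, not an official SDK).
import json
import os

import requests

API_KEY = os.environ["BENCHMARKS_DO_API_KEY"]

# The JSON request body from Step 2, saved locally.
with open("benchmark_request.json") as f:
    payload = json.load(f)

response = requests.post(
    "https://api.benchmarks.do/v1/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
response.raise_for_status()
run = response.json()
print(run.get("benchmarkId"))  # keep this ID to check on the run later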
Our agentic workflow platform now takes over. It will:
- Run the code-generation task against each of the standard models you selected.
- Send the same prompts to your custom endpoint, authenticating with the credentials you provided.
- Score every model's outputs against the benchmark dataset.
- Compile everything into a single comparative report.
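Because the run executes asynchronously, you'll typically poll for its status before fetching results. Here is a minimal sketch; the GET /v1/runs/{benchmarkId} path is hypothetical, so check the Benchmarks.do API reference for the actual results endpoint.

# Poll until the benchmark job finishes (sketch; the results path is hypothetical).
import os
import time

import requests

API_KEY = os.environ["BENCHMARKS_DO_API_KEY"]
benchmark_id = "bmk-x1y2z3a4b5c6"  # returned when you submitted the run

while True:
    resp = requests.get(
        f"https://api.benchmarks.do/v1/runs/{benchmark_id}",  # hypothetical endpoint path
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    resp.raise_for_status()
    job = resp.json()
    if job["status"] == "completed":
        report = job["report"]
        break
    time.sleep(30)  # evaluating several models can take a while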
Once the job's status is completed, you will receive a detailed JSON report. It places your model's performance right alongside the industry leaders, allowing for immediate analysis.
{
  "benchmarkId": "bmk-x1y2z3a4b5c6",
  "name": "My Fine-Tuned Llama-3 Code-Gen Test",
  "status": "completed",
  "report": {
    "code-generation": {
      "dataset": "humaneval",
      "results": [
        { "model": "claude-3-opus", "pass@1": 74.4 },
        { "model": "gpt-4", "pass@1": 72.9 },
        { "model": "My Llama-3 8B (Code Fine-Tune)", "pass@1": 71.5 }
      ]
      // ...other metrics like pass@10 would also be here
    }
  }
}
From this report, you can instantly see that your fine-tuned model (pass@1: 71.5) is performing competitively, closing the gap with much larger models like GPT-4 on this specific task. This is the objective, actionable data you need to drive your AI strategy forward.
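If you want to track these numbers over time, or gate a deployment on them, you can pull the scores straight out of the report. A minimal sketch, reusing the report object from the polling example above:

# Extract pass@1 scores from the report for dashboards or CI checks.
results = report["code-generation"]["results"]
scores = {entry["model"]: entry["pass@1"] for entry in results}

mine = scores["My Llama-3 8B (Code Fine-Tune)"]
best_baseline = max(score for model, score in scores.items() if model != "My Llama-3 8B (Code Fine-Tune)")
print(f"Fine-tuned model: {mine} pass@1, {best_baseline - mine:.1f} points behind the best baseline")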
Fine-tuning is a powerful technique, but its value can only be unlocked through rigorous, standardized testing. The Benchmarks.do API provides the simplest path to achieving this. By integrating your custom models, you can move beyond subjective assessments and start optimizing with data-driven confidence.
Ready to see how your model truly performs? Visit Benchmarks.do to get started and run your first comparative benchmark in minutes.