Standardized AI Testing

AI Model Benchmarking as a Service

Compare, evaluate, and optimize your AI models with standardized performance testing and detailed comparative analysis, all through a simple API.

Join waitlist

benchmarks.do

{
  "benchmarkId": "bm_a1b2c3d4e5f6",
  "name": "LLM Performance Comparison",
  "status": "completed",
  "results": [
    {
      "model": "claude-3-opus",
      "text-summarization": {
        "rouge-1": 0.48,
        "rouge-2": 0.26,
        "rouge-l": 0.45
      },
      "question-answering": {
        "exact-match": 85.5,
        "f1-score": 91.2
      },
      "code-generation": {
        "pass@1": 0.82,
        "pass@10": 0.96
      }
    },
    {
      "model": "gpt-4",
      "text-summarization": {
        "rouge-1": 0.46,
        "rouge-2": 0.24,
        "rouge-l": 0.43
      },
      "question-answering": {
        "exact-match": 86.1,
        "f1-score": 90.8
      },
      "code-generation": {
        "pass@1": 0.85,
        "pass@10": 0.97
      }
    },
    {
      "model": "llama-3-70b",
      "text-summarization": {
        "rouge-1": 0.45,
        "rouge-2": 0.23,
        "rouge-l": 0.42
      },
      "question-answering": {
        "exact-match": 84.9,
        "f1-score": 89.5
      },
      "code-generation": {
        "pass@1": 0.78,
        "pass@10": 0.94
      }
    }
  ]
}
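A response like the one above can be post-processed client-side. Below is a minimal Python sketch, not an official SDK, that assumes only the field names shown in the sample payload (the `rank_by` helper is illustrative) and ranks models by code-generation pass@1:

```python
import json

# A trimmed copy of the sample benchmark payload above.
payload = json.loads("""
{
  "results": [
    {"model": "claude-3-opus", "code-generation": {"pass@1": 0.82, "pass@10": 0.96}},
    {"model": "gpt-4",         "code-generation": {"pass@1": 0.85, "pass@10": 0.97}},
    {"model": "llama-3-70b",   "code-generation": {"pass@1": 0.78, "pass@10": 0.94}}
  ]
}
""")

def rank_by(results, task, metric):
    """Sort result entries descending by one metric, e.g. pass@1."""
    return sorted(results, key=lambda r: r[task][metric], reverse=True)

ranking = rank_by(payload["results"], "code-generation", "pass@1")
for entry in ranking:
    print(entry["model"], entry["code-generation"]["pass@1"])
# gpt-4 0.85
# claude-3-opus 0.82
# llama-3-70b 0.78
```

The same helper works for any task/metric pair in the payload, such as `("question-answering", "f1-score")`.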

Deliver economically valuable work

Frequently Asked Questions

Do Work. With AI.