How to Use Raw Data

A guide to using the benchmark's raw data for research and analysis.

Overview

We publish the raw data for transparency and reproducibility, so researchers can verify the reported results and run their own analyses.

Data Structure

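Each result record includes the problem ID, model name, success status, attempt count, total response time, token usage, and cost, along with the prompt mode, the termination reason, and a per-attempt history. A sample record: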

{
  "problem_id": "h01-longest-substring",
  "model": "Gemini 2.5 Flash",
  "success": true,
  "attempts": 2,
  "first_attempt_success": false,
  "total_time_ms": 9599,
  "cost_usd": 0.00094,
  "input_tokens": 4330,
  "output_tokens": 484,
  "prompt_mode": "careti",
  "termination_reason": "success",
  "attempt_history": [
    {
      "attempt": 1,
      "success": false,
      "latency_ms": 6229,
      "error": "SyntaxError: invalid syntax"
    },
    {
      "attempt": 2,
      "success": true,
      "latency_ms": 3370
    }
  ]
}
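As a quick sanity check on these fields, the sketch below parses the sample record and derives a few values from it. It assumes, based only on this one sample, that total_time_ms equals the sum of the per-attempt latency_ms values and that attempts equals the length of attempt_history. The summarize helper is illustrative and not part of the benchmark repository.

```python
import json

# Sample record from the example above, trimmed to the fields used here.
RECORD_JSON = """
{
  "problem_id": "h01-longest-substring",
  "model": "Gemini 2.5 Flash",
  "success": true,
  "attempts": 2,
  "first_attempt_success": false,
  "total_time_ms": 9599,
  "cost_usd": 0.00094,
  "input_tokens": 4330,
  "output_tokens": 484,
  "attempt_history": [
    {"attempt": 1, "success": false, "latency_ms": 6229,
     "error": "SyntaxError: invalid syntax"},
    {"attempt": 2, "success": true, "latency_ms": 3370}
  ]
}
"""


def summarize(record):
    """Derive convenience values from one result record (hypothetical helper)."""
    history = record["attempt_history"]
    total_tokens = record["input_tokens"] + record["output_tokens"]
    return {
        "total_tokens": total_tokens,
        # Assumption: per-attempt latencies add up to total_time_ms.
        "latency_sum_ms": sum(a["latency_ms"] for a in history),
        # Assumption: one history entry is stored per attempt.
        "attempts_match_history": record["attempts"] == len(history),
        "usd_per_1k_tokens": record["cost_usd"] / total_tokens * 1000,
    }


record = json.loads(RECORD_JSON)
summary = summarize(record)
print(summary)
```

If either assumption fails on the full dataset, treat that as a property of the data to investigate rather than as an error in a given record.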

Problem Lookup

Hard Suite problem definitions are stored in hard-suite.json in the GitHub repository. To look up a problem, match the problem_id from a result (e.g., h01-longest-substring) against the id field of each problem definition.

# Python - Hard Suite problem lookup
import json
import urllib.request

BASE = "https://raw.githubusercontent.com/caretive-ai/careti-benchmark/main/results/2026-02-hard-suite"


def fetch_json(name):
    """Download and parse one JSON file from the results directory."""
    with urllib.request.urlopen(f"{BASE}/{name}", timeout=30) as response:
        return json.load(response)


# Problem definitions (prompts, test code)
problems = fetch_json("hard-suite.json")

# Benchmark results (2100 entries)
results = fetch_json("results.json")

# Search by problem_id (e.g., h01-longest-substring)
problem_id = "h01-longest-substring"
problem = next((p for p in problems if p["id"] == problem_id), None)
if problem is None:
    raise KeyError(f"Unknown problem_id: {problem_id}")
print(problem["prompt"])
print(problem["test_code"])

GitHub: caretive-ai/careti-benchmark
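Once both files are loaded, results can be grouped by problem so that every model's outcome on one problem sits next to its definition. The sketch below uses small in-memory stand-ins instead of the real files. It assumes results.json is a list of records shaped like the sample in Data Structure and that each problem definition has an id key, as in the lookup code. The model names, the second problem ID, and the helper names are made-up placeholders, not real benchmark data.

```python
from collections import defaultdict

# Placeholder stand-ins for hard-suite.json and results.json (not real data).
problems = [
    {"id": "h01-longest-substring", "prompt": "(placeholder prompt)",
     "test_code": "(placeholder test code)"},
]
results = [
    {"problem_id": "h01-longest-substring", "model": "model-a", "success": True, "attempts": 1},
    {"problem_id": "h01-longest-substring", "model": "model-b", "success": False, "attempts": 3},
    {"problem_id": "h02-placeholder", "model": "model-a", "success": True, "attempts": 2},
]


def index_by_problem(records):
    """Group result records by their problem_id (hypothetical helper)."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["problem_id"]].append(record)
    return dict(grouped)


def solve_rate(records):
    """Fraction of records whose success flag is true."""
    return sum(1 for r in records if r["success"]) / len(records)


by_problem = index_by_problem(results)
for problem in problems:
    runs = by_problem.get(problem["id"], [])
    if runs:
        print(problem["id"], f"solved by {solve_rate(runs):.0%} of runs")
```

Building the index once avoids rescanning the full results list for every problem.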

Model Behavior Analysis

The attempt_history field records each attempt individually, so you can analyze error messages and retry patterns when a model fails. The most useful fields for this are:

  • termination_reason: why the run ended; one of success, max_attempts, timeout, oscillation, same_error
  • attempt_history: per-attempt success or failure, response time, and token usage
  • first_attempt_success: whether the problem was solved on the first attempt, without retries
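One way to use these fields is to tally termination reasons, compute a first-attempt success rate per model, and count the error messages seen on failed attempts. The sketch below runs on made-up placeholder records, not real benchmark results. It assumes each record is shaped like the sample in Data Structure and that failed attempts carry an error string. The function names are illustrative.

```python
from collections import Counter, defaultdict

# Made-up placeholder records, shaped like the sample record.
SAMPLE_RESULTS = [
    {"model": "model-a", "termination_reason": "success", "first_attempt_success": True,
     "attempt_history": [{"attempt": 1, "success": True, "latency_ms": 1000}]},
    {"model": "model-a", "termination_reason": "same_error", "first_attempt_success": False,
     "attempt_history": [
         {"attempt": 1, "success": False, "latency_ms": 1200, "error": "SyntaxError: invalid syntax"},
         {"attempt": 2, "success": False, "latency_ms": 1100, "error": "SyntaxError: invalid syntax"}]},
    {"model": "model-b", "termination_reason": "success", "first_attempt_success": False,
     "attempt_history": [
         {"attempt": 1, "success": False, "latency_ms": 900, "error": "AssertionError"},
         {"attempt": 2, "success": True, "latency_ms": 800}]},
]


def termination_counts(results):
    """Count how often each termination_reason occurs."""
    return Counter(r["termination_reason"] for r in results)


def first_attempt_rate_by_model(results):
    """Fraction of records per model that were solved on the first attempt."""
    totals, wins = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["model"]] += 1
        wins[r["model"]] += bool(r["first_attempt_success"])
    return {model: wins[model] / totals[model] for model in totals}


def failed_attempt_errors(results):
    """Count error messages across all failed attempts that include one."""
    return Counter(
        attempt["error"]
        for r in results
        for attempt in r.get("attempt_history", [])
        if not attempt["success"] and "error" in attempt
    )


print(termination_counts(SAMPLE_RESULTS))
print(first_attempt_rate_by_model(SAMPLE_RESULTS))
print(failed_attempt_errors(SAMPLE_RESULTS).most_common())
```

Counting raw error strings is a coarse first pass. Grouping by the exception name before the colon may give cleaner categories if messages vary.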

Download

Raw data can be downloaded from the bottom of each benchmark detail page.
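After downloading, it can be convenient to flatten the top-level fields into CSV for a spreadsheet or statistics package. The sketch below assumes the downloaded file parses to a JSON list of records shaped like the sample in Data Structure. The column selection and the to_csv helper are illustrative, and nested fields such as attempt_history are deliberately left out.

```python
import csv
import io

# Top-level columns to export; nested attempt_history is skipped.
FIELDS = ["problem_id", "model", "success", "attempts", "total_time_ms", "cost_usd"]


def to_csv(results, fields=FIELDS):
    """Serialize selected top-level fields of each record to CSV text."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys that are not listed in fields.
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


# One record taken from the sample above; in practice, pass the parsed download.
sample = [{
    "problem_id": "h01-longest-substring",
    "model": "Gemini 2.5 Flash",
    "success": True,
    "attempts": 2,
    "total_time_ms": 9599,
    "cost_usd": 0.00094,
    "attempt_history": [],
}]

csv_text = to_csv(sample)
print(csv_text)
```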

Hard Suite 100 Results

results/2026-02-hard-suite/

GitHub Repository

caretive-ai/careti-benchmark

Full raw data, verification scripts, examples