How to Use Raw Data

A guide to using the benchmark's raw data for research and analysis.

Overview

We publish the raw benchmark data for transparency and reproducibility, so researchers can verify the reported results and run their own analyses.

Data Structure

Each result record includes the problem ID, model name, attempt count, success status, response time, token usage, and cost, along with the prompt mode, the termination reason, and a per-attempt history.

{
  "problem_id": "h01-longest-substring",
  "model": "Gemini 2.5 Flash",
  "success": true,
  "attempts": 2,
  "first_attempt_success": false,
  "total_time_ms": 9599,
  "cost_usd": 0.00094,
  "input_tokens": 4330,
  "output_tokens": 484,
  "prompt_mode": "careti",
  "termination_reason": "success",
  "attempt_history": [
    {
      "attempt": 1,
      "success": false,
      "latency_ms": 6229,
      "error": "SyntaxError: invalid syntax"
    },
    {
      "attempt": 2,
      "success": true,
      "latency_ms": 3370
    }
  ]
}
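As a quick sanity check when loading records, a sketch along these lines compares the top-level fields against attempt_history. The relationships it tests (attempts equals the number of history entries, the first attempt decides first_attempt_success, the last attempt decides success, and the attempt latencies add up to total_time_ms) are assumptions inferred from the single sample record above, not documented guarantees. check_record is a hypothetical helper name.

```python
import json

# Sample record copied from the example above.
SAMPLE = json.loads("""
{
  "problem_id": "h01-longest-substring",
  "model": "Gemini 2.5 Flash",
  "success": true,
  "attempts": 2,
  "first_attempt_success": false,
  "total_time_ms": 9599,
  "cost_usd": 0.00094,
  "input_tokens": 4330,
  "output_tokens": 484,
  "prompt_mode": "careti",
  "termination_reason": "success",
  "attempt_history": [
    {"attempt": 1, "success": false, "latency_ms": 6229,
     "error": "SyntaxError: invalid syntax"},
    {"attempt": 2, "success": true, "latency_ms": 3370}
  ]
}
""")


def check_record(record):
    """Return a list of inconsistencies found in one result record.

    The checks encode assumptions inferred from the sample record,
    not a documented schema.
    """
    issues = []
    history = record.get("attempt_history", [])
    if len(history) != record["attempts"]:
        issues.append("attempts != len(attempt_history)")
    if history:
        if history[0]["success"] != record["first_attempt_success"]:
            issues.append("first_attempt_success disagrees with attempt 1")
        if history[-1]["success"] != record["success"]:
            issues.append("success disagrees with the last attempt")
        if sum(a["latency_ms"] for a in history) != record["total_time_ms"]:
            issues.append("total_time_ms != sum of attempt latencies")
    return issues


print(check_record(SAMPLE))  # [] for the sample record
```

Records that fail one of these checks are not necessarily wrong; they may simply show that an assumed relationship does not hold across the full dataset.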

Problem Lookup

Hard Suite problem definitions are stored in hard-suite.json in the GitHub repository. Look up a problem by its ID (e.g., h01-longest-substring), which is the same value that appears in the problem_id field of each result record.

# Python - Hard Suite problem lookup
import json
import urllib.request

BASE = "https://raw.githubusercontent.com/caretive-ai/careti-benchmark/main/results/2026-02-hard-suite"


def fetch_json(name):
    """Download one JSON file from the results directory and parse it."""
    with urllib.request.urlopen(f"{BASE}/{name}") as response:
        return json.load(response)


# Problem definitions (prompts, test code)
problems = fetch_json("hard-suite.json")

# Benchmark results (2100 entries)
results = fetch_json("results.json")

# Search by problem_id (e.g., h01-longest-substring)
problem_id = "h01-longest-substring"
problem = next((p for p in problems if p["id"] == problem_id), None)
if problem is None:
    raise KeyError(f"No problem with id {problem_id!r}")
print(problem["prompt"])
print(problem["test_code"])

GitHub: caretive-ai/careti-benchmark
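Once a problem has been found, the matching entries can be pulled out of the results in the same way. The sketch below assumes results.json parses to a flat list of records shaped like the sample in Data Structure; that layout is not stated here, so adjust the access pattern if the file is nested differently. results_for_problem is a hypothetical helper, and the inline rows (including the model names and the second problem ID) are made-up placeholders so the snippet runs without a download.

```python
def results_for_problem(results, problem_id):
    """Return the records for one problem, best outcomes first.

    Assumes `results` is a flat list of dicts shaped like the sample record.
    """
    matching = [r for r in results if r["problem_id"] == problem_id]
    # Successes first, then fewer attempts, then shorter total time.
    return sorted(
        matching,
        key=lambda r: (not r["success"], r["attempts"], r["total_time_ms"]),
    )


# Made-up placeholder rows for illustration; real rows come from results.json.
demo_results = [
    {"problem_id": "h01-longest-substring", "model": "model-a",
     "success": False, "attempts": 3, "total_time_ms": 15000},
    {"problem_id": "h01-longest-substring", "model": "model-b",
     "success": True, "attempts": 2, "total_time_ms": 9599},
    {"problem_id": "h02-placeholder", "model": "model-a",
     "success": True, "attempts": 1, "total_time_ms": 4000},
]

for row in results_for_problem(demo_results, "h01-longest-substring"):
    status = "pass" if row["success"] else "fail"
    print(f'{row["model"]}: {status} after {row["attempts"]} attempt(s)')
```

With the real data, pass the downloaded results list in place of demo_results.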

Model Behavior Analysis

The attempt_history field records each attempt in order. Use it to analyze error messages and retry patterns when a model fails.

termination_reason: why the run ended; one of success, max_attempts, timeout, oscillation, same_error
attempt_history: per-attempt success/failure, response time (latency_ms), and token usage
first_attempt_success: whether the problem was solved on the first attempt, without retries
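One possible way to start on this kind of analysis is to tally termination reasons per model and error types across failed attempts. The sketch below assumes a flat list of records shaped like the sample in Data Structure, and it assumes error strings follow the "ErrorType: message" form seen in that sample. Both function names are hypothetical, and the inline rows are made-up placeholders.

```python
from collections import Counter, defaultdict


def termination_by_model(results):
    """Count termination_reason values separately for each model."""
    counts = defaultdict(Counter)
    for record in results:
        counts[record["model"]][record["termination_reason"]] += 1
    return counts


def error_types(results):
    """Count error types across all failed attempts.

    Assumes error strings look like "SyntaxError: invalid syntax", so the
    text before the first colon is treated as the error type.
    """
    errors = Counter()
    for record in results:
        for attempt in record.get("attempt_history", []):
            if not attempt["success"] and "error" in attempt:
                errors[attempt["error"].split(":", 1)[0].strip()] += 1
    return errors


# Made-up placeholder rows for illustration; real rows come from results.json.
demo_results = [
    {"model": "model-a", "termination_reason": "success",
     "attempt_history": [
         {"attempt": 1, "success": False, "latency_ms": 6229,
          "error": "SyntaxError: invalid syntax"},
         {"attempt": 2, "success": True, "latency_ms": 3370},
     ]},
    {"model": "model-a", "termination_reason": "same_error",
     "attempt_history": [
         {"attempt": 1, "success": False, "latency_ms": 5000,
          "error": "AssertionError: expected 3"},
         {"attempt": 2, "success": False, "latency_ms": 5100,
          "error": "AssertionError: expected 3"},
     ]},
]

print(dict(termination_by_model(demo_results)["model-a"]))
print(error_types(demo_results).most_common())
```

Grouping by error type rather than by full message keeps the tally readable; switch to the full string if the distinction between messages matters.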

Download

Raw data can be downloaded from the bottom of each benchmark detail page.

Hard Suite 100 Results

results/2026-02-hard-suite/

hard-suite.json - 100 problems (prompts, test code)
results.json - 2100 test results
summary.json - Aggregated stats by model
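Per-model statistics can also be recomputed directly from results.json as an independent check. The schema of summary.json is not described here, so the sketch below does not try to match its field names: the output keys, the summarize_by_model helper, and the inline rows are all hypothetical, and the input is assumed to be a flat list of records shaped like the sample in Data Structure.

```python
from collections import defaultdict


def summarize_by_model(results):
    """Aggregate run count, success rates, cost, and mean time per model."""
    grouped = defaultdict(list)
    for record in results:
        grouped[record["model"]].append(record)

    summary = {}
    for model, rows in grouped.items():
        n = len(rows)
        summary[model] = {
            "runs": n,
            # Booleans sum as 0/1, so these are fractions of successful runs.
            "success_rate": sum(r["success"] for r in rows) / n,
            "first_attempt_rate": sum(r["first_attempt_success"] for r in rows) / n,
            "total_cost_usd": sum(r["cost_usd"] for r in rows),
            "mean_time_ms": sum(r["total_time_ms"] for r in rows) / n,
        }
    return summary


# Made-up placeholder rows for illustration; real rows come from results.json.
demo_results = [
    {"model": "model-a", "success": True, "first_attempt_success": True,
     "cost_usd": 0.001, "total_time_ms": 4000},
    {"model": "model-a", "success": False, "first_attempt_success": False,
     "cost_usd": 0.003, "total_time_ms": 12000},
    {"model": "model-b", "success": True, "first_attempt_success": False,
     "cost_usd": 0.00094, "total_time_ms": 9599},
]

for model, stats in summarize_by_model(demo_results).items():
    print(model, stats)
```

Comparing these numbers with summary.json by hand is a reasonable first step before relying on either source.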

GitHub Repository

caretive-ai/careti-benchmark

Full raw data, verification scripts, examples