Overview
We publish the raw benchmark data for transparency and reproducibility, so researchers can verify the results and run their own analyses.
Data Structure
Each result includes problem ID, model name, attempt count, success status, response time, token usage, and cost.
{
  "problem_id": "h01-longest-substring",
  "model": "Gemini 2.5 Flash",
  "success": true,
  "attempts": 2,
  "first_attempt_success": false,
  "total_time_ms": 9599,
  "cost_usd": 0.00094,
  "input_tokens": 4330,
  "output_tokens": 484,
  "prompt_mode": "careti",
  "termination_reason": "success",
  "attempt_history": [
    {
      "attempt": 1,
      "success": false,
      "latency_ms": 6229,
      "error": "SyntaxError: invalid syntax"
    },
    {
      "attempt": 2,
      "success": true,
      "latency_ms": 3370
    }
  ]
}
Problem Lookup
Hard Suite problems can be looked up in hard-suite.json in the GitHub repository. Search by the problem_id from a result (e.g., h01-longest-substring); in hard-suite.json the same value is stored under each problem's id key.
# Python - Hard Suite problem lookup
import json
import urllib.request
BASE = "https://raw.githubusercontent.com/caretive-ai/careti-benchmark/main/results/2026-02-hard-suite"
# Problem definitions (prompts, test code)
problems = json.loads(urllib.request.urlopen(f"{BASE}/hard-suite.json").read())
# Benchmark results (2100 entries)
results = json.loads(urllib.request.urlopen(f"{BASE}/results.json").read())
# Search by problem_id (e.g., h01-longest-substring)
problem = next(p for p in problems if p["id"] == "h01-longest-substring")
print(problem["prompt"])
print(problem["test_code"])
GitHub: caretive-ai/careti-benchmark
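Once a problem has been found, the matching benchmark runs can be pulled out of the results by filtering on problem_id. The following is a minimal sketch, assuming results.json decodes to a list of records shaped like the example under Data Structure. The helper name results_for_problem and the inline sample records (including the h02-example ID) are hypothetical and only serve to keep the example runnable offline.

```python
def results_for_problem(results, problem_id):
    """Return every result record whose problem_id matches."""
    return [r for r in results if r.get("problem_id") == problem_id]


# Hypothetical sample records standing in for the downloaded results list.
sample_results = [
    {"problem_id": "h01-longest-substring", "model": "Gemini 2.5 Flash",
     "success": True, "attempts": 2},
    {"problem_id": "h02-example", "model": "Gemini 2.5 Flash",
     "success": True, "attempts": 1},
]

# Print one line per run recorded for the chosen problem.
for r in results_for_problem(sample_results, "h01-longest-substring"):
    print(r["model"], r["success"], r["attempts"])
```

With the real data, pass the results list loaded above in place of sample_results.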
Model Behavior Analysis
The attempt_history field records the details of each attempt. Use it to analyze error messages and retry patterns when a run fails.
- termination_reason: success, max_attempts, timeout, oscillation, same_error
- attempt_history: success/failure, response time, token usage per attempt
- first_attempt_success: whether solved without retries
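The three fields above can be combined into a quick behavior profile. The sketch below is one possible approach, not an official analysis script: it assumes each record looks like the example under Data Structure, and the function names are hypothetical. The sample record is copied from that example.

```python
from collections import Counter


def termination_counts(results):
    """Count how often each termination_reason occurs."""
    return Counter(r.get("termination_reason", "unknown") for r in results)


def failed_attempt_errors(results):
    """Count the error messages recorded on failed attempts."""
    errors = Counter()
    for r in results:
        for attempt in r.get("attempt_history", []):
            if not attempt.get("success") and "error" in attempt:
                errors[attempt["error"]] += 1
    return errors


def first_attempt_rate(results):
    """Fraction of records solved without any retry."""
    if not results:
        return 0.0
    solved = sum(1 for r in results if r.get("first_attempt_success"))
    return solved / len(results)


# Sample record taken from the Data Structure example.
sample = [{
    "problem_id": "h01-longest-substring",
    "model": "Gemini 2.5 Flash",
    "success": True,
    "first_attempt_success": False,
    "termination_reason": "success",
    "attempt_history": [
        {"attempt": 1, "success": False, "latency_ms": 6229,
         "error": "SyntaxError: invalid syntax"},
        {"attempt": 2, "success": True, "latency_ms": 3370},
    ],
}]

print(termination_counts(sample))
print(failed_attempt_errors(sample))
print(first_attempt_rate(sample))
```

Filtering the input list by model before calling these helpers gives a per-model view of the same counts.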
Download
Raw data can be downloaded from the bottom of each benchmark detail page.
Hard Suite 100 Results
results/2026-02-hard-suite/
- hard-suite.json - 100 problems (prompts, test code)
- results.json - 2100 test results
- summary.json - Aggregated stats by model
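Per-model aggregates can also be recomputed directly from results.json as a cross-check. The schema of summary.json is not described here, so the output keys below (runs, success_rate, mean_time_ms, total_cost_usd) are assumptions chosen for illustration and may not match that file. The function name and the second sample record are likewise hypothetical.

```python
from collections import defaultdict


def summarize_by_model(results):
    """Group result records by model and compute simple aggregates."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["model"]].append(r)

    summary = {}
    for model, rows in buckets.items():
        n = len(rows)
        summary[model] = {
            "runs": n,
            "success_rate": sum(1 for r in rows if r["success"]) / n,
            "mean_time_ms": sum(r["total_time_ms"] for r in rows) / n,
            "total_cost_usd": sum(r["cost_usd"] for r in rows),
        }
    return summary


# First record uses values from the Data Structure example;
# the second is a made-up record for a made-up model.
sample = [
    {"model": "Gemini 2.5 Flash", "success": True,
     "total_time_ms": 9599, "cost_usd": 0.00094},
    {"model": "Hypothetical Model", "success": False,
     "total_time_ms": 12000, "cost_usd": 0.002},
]

for model, stats in summarize_by_model(sample).items():
    print(model, stats)
```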
