Feb 4, 2026

HumanEval Agent Mode Benchmark

Gemini 2.5 Flash - Careti prompt mode 97.6% first-attempt pass rate, 5.3s avg response

Careti Agent Benchmark

Hard Suite Report (9 Models)

2026-02-02 ~ 2026-02-05

⚠️ Note: Benchmark results may differ from real-world experience. Solving algorithm problems and developing real projects are separate skills.

Summary

  • Gemini 2.5 Flash, Gemini 2.5 Pro, Claude Code CLI, GLM-4.7, and Gemini 3 Pro are all excellent for coding (97-98% final pass rate)
  • Gemini 2.5 Flash best value (98% pass, $0.05)
  • Need harder tests to differentiate top models (Hard++ Suite)
  • Latest models aren't always best (Gemini 3 Pro ≈ 2.5 Pro)
  • Korean models (Solar, HyperCLOVA X) need coding optimization

1. Benchmark Overview

Method

  • HumanEval: Problem → Code → Score (1 attempt)
  • Careti Agent: Problem → Code → Test → [Error] → Retry (max 5)

Parameters

  • Problems: 100
  • Max Attempts: 5
  • Timeout: 300s
  • Difficulty: Hard

Termination

  • ✓ success: Test passed
  • ✗ max_attempts: Failed after 5 attempts
  • ⏱ timeout: Exceeded 300s
  • ↺ same_error: Same error repeated
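
The agent loop and its four termination states can be sketched roughly as follows. This is a minimal illustration, not Careti's actual implementation: `run_problem`, `generate`, and `run_tests` are hypothetical names, the 300s limit is assumed to be a per-problem wall-clock budget checked before each attempt, and `same_error` is assumed to fire when two consecutive attempts fail with an identical error string.

```python
import time
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 5        # "Max Attempts" from the Parameters list
TIMEOUT_SECONDS = 300   # "Timeout" from the Parameters list


def run_problem(
    generate: Callable[[str, Optional[str]], str],
    run_tests: Callable[[str], Optional[str]],
    problem: str,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, int]:
    """Return (termination_reason, attempts_used) for one problem.

    generate(problem, last_error) -> candidate code (hypothetical hook).
    run_tests(code) -> None on success, else an error string (hypothetical hook).
    """
    start = clock()
    last_error: Optional[str] = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        # Assumption: the timeout is a per-problem budget checked before each attempt.
        if clock() - start > TIMEOUT_SECONDS:
            return "timeout", attempt - 1
        code = generate(problem, last_error)
        error = run_tests(code)
        if error is None:
            return "success", attempt
        # Assumption: an identical error on consecutive attempts ends the run early.
        if error == last_error:
            return "same_error", attempt
        last_error = error  # fed back to the model on the next attempt
    return "max_attempts", MAX_ATTEMPTS
```

With `MAX_ATTEMPTS = 1` and no error feedback, the same function reduces to the single-attempt HumanEval flow, which is why the first-attempt and final pass rates are reported separately.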

2. Model Rankings

Sorted by: Final pass rate → 1st attempt rate → Cost (lowest first)

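| Rank | Model | Final | 1st Pass | Avg Time | Cost | Total Time | same_error |
|------|-------|-------|----------|----------|------|------------|------------|
| 🥇 | Gemini 2.5 Flash | 98% | 92% | 17.6s | $0.05 | ~39m | 1 |
| 🥈 | Gemini 2.5 Pro | 97% | 95% | 40.5s | $0.33 | ~70m | 3 |
| 🥉 | Claude Code CLI* | 97% | 94% | 35.2s | ~$0.17 | ~60m | 3 |
| 4 | GLM-4.7 | 97% | 90% | 15.6s | $0.18 | ~40m | 2 |
| 5 | Gemini 3 Pro | 97% | 90% | 55.8s | $0.24 | ~100m | 2 |
| 6 | Gemini 3 Flash† | 91% | 82% | 22.1s | $0.03 | ~45m | 0 |
| 7 | Solar Pro2 | 83% | 61% | 18.3s | $0.79 | ~48m | 11 |
| 8 | Solar Pro3 | 75% | 70% | 45.2s | $1.35 | ~85m | 25 |
| 9 | HyperCLOVA X‡ | 0% | 0% | - | - | ~3m | 100 |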

* Claude Code CLI: Cost is based on Careti usage with a Max subscription (20x cheaper than API)
† Gemini 3 Flash: 9 failures were caused by Preview API timeouts
‡ HyperCLOVA X: HCX-003 (2%) and HCX-007 (0%) don't support coding. Retest needed
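
The stated sort order (final pass rate, then first-attempt rate, then lowest cost) can be reproduced with a single composite sort key. The sketch below is illustrative only: the `Result` type and `rank` function are hypothetical names, the figures are copied from the rankings above, and placing a missing cost last among ties is an assumption.

```python
from typing import List, NamedTuple, Optional


class Result(NamedTuple):
    model: str
    final: int              # final pass rate, %
    first: int              # first-attempt pass rate, %
    cost: Optional[float]   # USD; None where the report shows "-"


# Deliberately shuffled so the sort has work to do.
RESULTS = [
    Result("Gemini 3 Pro", 97, 90, 0.24),
    Result("HyperCLOVA X", 0, 0, None),
    Result("Solar Pro3", 75, 70, 1.35),
    Result("GLM-4.7", 97, 90, 0.18),
    Result("Claude Code CLI", 97, 94, 0.17),  # reported as an approximate cost
    Result("Gemini 2.5 Flash", 98, 92, 0.05),
    Result("Solar Pro2", 83, 61, 0.79),
    Result("Gemini 3 Flash", 91, 82, 0.03),
    Result("Gemini 2.5 Pro", 97, 95, 0.33),
]


def rank(results: List[Result]) -> List[Result]:
    """Sort by final rate (desc), first-attempt rate (desc), then cost (asc)."""
    def key(r: Result):
        # Assumption: a missing cost sorts after any known cost.
        cost = r.cost if r.cost is not None else float("inf")
        return (-r.final, -r.first, cost)
    return sorted(results, key=key)


for position, r in enumerate(rank(RESULTS), start=1):
    print(position, r.model, f"{r.final}%", f"{r.first}%")
```

The cost tiebreak only matters for GLM-4.7 and Gemini 3 Pro, which share both 97% final and 90% first-attempt rates.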

3. Visualization

[Chart: Pass Rate Comparison]

[Chart: Cost vs Performance]

4. Models Tested

Google Gemini

2.5 Flash/Pro, 3 Pro/Flash. Google's multimodal LLM.

Claude Code CLI

CLI tool based on Anthropic's Claude Opus 4.5.

GLM-4.7

Zhipu AI (China). Coding-optimized.

Solar Pro2/3

Upstage (Korea). Pro2 gains the most from error feedback (61% first attempt → 83% final).

HyperCLOVA X

Naver (Korea). The tested versions (HCX-003, HCX-007) don't support coding tasks.
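
One way to quantify how much each model benefits from the retry loop is the gap between its final and first-attempt pass rates. The snippet below is a small sketch using the reported figures; "retry gain" is a label introduced here for illustration, not a metric defined by the benchmark.

```python
# (model, first-attempt pass %, final pass %) copied from the rankings
RATES = [
    ("Gemini 2.5 Flash", 92, 98),
    ("Gemini 2.5 Pro", 95, 97),
    ("Claude Code CLI", 94, 97),
    ("GLM-4.7", 90, 97),
    ("Gemini 3 Pro", 90, 97),
    ("Gemini 3 Flash", 82, 91),
    ("Solar Pro2", 61, 83),
    ("Solar Pro3", 70, 75),
    ("HyperCLOVA X", 0, 0),
]

# Retry gain: percentage points recovered by attempts 2 through 5.
gains = {model: final - first for model, first, final in RATES}
best = max(gains, key=gains.get)

for model, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{model:18s} +{gain} pts")
```

A large gain means the model often fails first but recovers when shown the error; a small gain alongside a high first-attempt rate means it rarely needed the retries.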

5. Future Improvements

  • Top 5 models tied at 97-98% → Need Hard++ Suite
  • Retest HyperCLOVA X when a coding-capable version is released
  • Consider adding real production bugs/refactoring problems