Feb 4, 2026

HumanEval Agent Mode Benchmark

Gemini 2.5 Flash - Careti prompt mode 97.6% first-attempt pass rate, 5.3s avg response

Models tested: gemini-2.5-flash, solar-pro2, solar-pro3

Careti Agent Benchmark

Hard Suite Report (9 Models)

2026-02-02 ~ 2026-02-05

⚠️ Note: Benchmark results may differ from real-world experience. Algorithm problem-solving ability and actual project development skills are separate.

Summary

• Gemini 2.5 Flash, Gemini 2.5 Pro, Claude Code CLI, GLM-4.7 all excellent for coding (97-98%)
• Gemini 2.5 Flash best value (98% pass, $0.05)
• Need harder tests to differentiate top models (Hard++ Suite)
• Latest models aren't always best (Gemini 3 Pro ≈ 2.5 Pro)
• Korean models (Solar, HyperCLOVA X) need coding optimization

1. Benchmark Overview

Method

• HumanEval: Problem → Code → Score (1 attempt)
• Careti Agent: Problem → Code → Test → [Error] → Retry (max 5)

Parameters

• Problems: 100
• Max attempts: 5
• Timeout: 300s
• Difficulty: Hard

Termination

• ✓ success: Test passed
• ✗ max_attempts: Failed after 5 attempts
• ⏱ timeout: Exceeded 300s
• ↺ same_error: Same error repeated
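
The retry loop and its four termination states can be sketched as follows. This is a minimal illustration under stated assumptions, not Careti's actual harness: the `generate` and `run_tests` hooks are hypothetical, "same error" is assumed to mean an identical error message on two consecutive attempts, and the timeout is only checked between attempts.

```python
import time
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 5    # "Retry (max 5)" in the Method section
TIMEOUT_S = 300.0   # "Exceeded 300s" in the Termination section


def run_agent(
    generate: Callable[[str, Optional[str]], str],
    run_tests: Callable[[str], Optional[str]],
    problem: str,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, int]:
    """Return (termination_reason, attempts_used) for one problem.

    generate(problem, last_error) -> candidate code   (hypothetical hook)
    run_tests(code) -> None on success, else an error message (hypothetical hook)
    """
    start = clock()
    last_error: Optional[str] = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        # Simplification: the budget is checked before each attempt only.
        if clock() - start > TIMEOUT_S:
            return "timeout", attempt - 1
        code = generate(problem, last_error)
        error = run_tests(code)
        if error is None:
            return "success", attempt
        # Assumption: an identical message twice in a row counts as same_error.
        if error == last_error:
            return "same_error", attempt
        last_error = error
    return "max_attempts", MAX_ATTEMPTS
```

Stopping early on a repeated error avoids spending the remaining attempts on a model that is not using the feedback, which is presumably why the report tracks same_error as its own column.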

2. Model Rankings

Sorted by: Final pass rate → 1st attempt rate → Cost (lowest first)

Rank	Model	Final	1st Pass	Avg Time	Cost	Total Time	same_error
🥇	Gemini 2.5 Flash	98%	92%	17.6s	$0.05	~39m	1
🥈	Gemini 2.5 Pro	97%	95%	40.5s	$0.33	~70m	3
🥉	Claude Code CLI*	97%	94%	35.2s	~$0.17	~60m	3
4	GLM-4.7	97%	90%	15.6s	$0.18	~40m	2
5	Gemini 3 Pro	97%	90%	55.8s	$0.24	~100m	2
6	Gemini 3 Flash†	91%	82%	22.1s	$0.03	~45m	0
7	Solar Pro2	83%	61%	18.3s	$0.79	~48m	11
8	Solar Pro3	75%	70%	45.2s	$1.35	~85m	25
9	HyperCLOVA X‡	0%	0%	-	-	~3m	100

* Claude Code CLI: Cost estimated from Careti usage on a Max subscription (about 20x cheaper than the API)
† Gemini 3 Flash: Preview API timeouts caused 9 failures
‡ HyperCLOVA X: HCX-003 (2%) and HCX-007 (0%) don't support coding. Retest needed
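
The stated sort order (final pass rate, then first-attempt rate, then lowest cost) can be reproduced from the table with a three-part sort key. The sketch below is illustrative only: the `Row` type and function names are hypothetical, the rows are deliberately listed out of order, HyperCLOVA X is left out because the table gives no cost for it, and the Claude Code CLI cost is the approximate figure from the table.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    model: str
    final: int    # final pass rate, %
    first: int    # first-attempt pass rate, %
    cost: float   # cost in USD, as listed in the table


ROWS = [
    Row("Gemini 3 Pro", 97, 90, 0.24),
    Row("GLM-4.7", 97, 90, 0.18),
    Row("Claude Code CLI", 97, 94, 0.17),
    Row("Gemini 2.5 Pro", 97, 95, 0.33),
    Row("Gemini 2.5 Flash", 98, 92, 0.05),
    Row("Gemini 3 Flash", 91, 82, 0.03),
    Row("Solar Pro2", 83, 61, 0.79),
    Row("Solar Pro3", 75, 70, 1.35),
]


def rank(rows: List[Row]) -> List[Row]:
    """Sort by final rate (desc), first-attempt rate (desc), cost (asc)."""
    return sorted(rows, key=lambda r: (-r.final, -r.first, r.cost))


def retry_gain(row: Row) -> int:
    """Percentage points gained between the first attempt and the final result."""
    return row.final - row.first


for position, row in enumerate(rank(ROWS), start=1):
    print(position, row.model, f"+{retry_gain(row)} pts from retries")
```

The cost tiebreak is what places GLM-4.7 ahead of Gemini 3 Pro: both have 97% final and 90% first-attempt rates. The same data shows Solar Pro2 gaining the most from retries (61% to 83%).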

3. Visualization

Figure: Pass Rate Comparison

Figure: Cost vs Performance

4. Models Tested

• Google Gemini: 2.5 Flash/Pro, 3 Pro/Flash. Google's multimodal LLM.
• Claude Code CLI: CLI tool based on Anthropic Opus 4.5.
• GLM-4.7: Zhipu AI (China). Coding-optimized.
• Solar Pro2/3: Upstage (Korea). Pro2 benefits most from retry feedback (61% first pass → 83% final).
• HyperCLOVA X: Naver (Korea). Doesn't support coding tasks.

5. Future Improvements

• Top 5 models tied at 97-98% → Need Hard++ Suite
• Retest when HyperCLOVA X releases coding version
• Consider adding real production bugs/refactoring problems
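
A rough calculation shows why a 97% vs 98% gap on 100 problems cannot separate the top models. The sketch below treats each problem as an independent pass/fail trial, which is a simplifying assumption and not something the report states, and computes the binomial standard error of a pass rate.

```python
import math


def pass_rate_std_error(pass_rate: float, n_problems: int) -> float:
    """Binomial standard error of a pass rate (assumes independent problems)."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_problems)


se = pass_rate_std_error(0.97, 100)
print(f"standard error at 97% on 100 problems: {se * 100:.1f} points")
```

Under that assumption the standard error is about 1.7 percentage points, larger than the 1-point gap (a single problem) between the top five models. A harder suite that spreads pass rates further apart, or a larger problem set, would be needed to rank them reliably.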

Download raw data

GitHub: hard-suite.json, results.json, summary.json