Feb 4, 2026
HumanEval Agent Mode Benchmark
Gemini 2.5 Flash - Careti prompt mode 97.6% first-attempt pass rate, 5.3s avg response
Careti Agent Benchmark
Hard Suite Report (9 Models)
2026-02-02 ~ 2026-02-05
⚠️ Note: Benchmark results may differ from real-world experience. Algorithmic problem-solving ability and real-world project development are separate skills.
Summary
- Gemini 2.5 Flash, Gemini 2.5 Pro, Claude Code CLI, GLM-4.7 all excellent for coding (97-98%)
- Gemini 2.5 Flash best value (98% pass, $0.05)
- Need harder tests to differentiate top models (Hard++ Suite)
- Latest models aren't always best (Gemini 3 Pro ≈ 2.5 Pro)
- Korean models (Solar, HyperCLOVA X) need coding optimization
1. Benchmark Overview
Method
- HumanEval: Problem → Code → Score (1 attempt)
- Careti Agent: Problem → Code → Test → [Error] → Retry (max 5)
Parameters
- Problems: 100
- Max Attempts: 5
- Timeout: 300s
- Difficulty: Hard
Termination
- ✓ success: Test passed
- ✗ max_attempts: Failed after 5 attempts
- ⏱ timeout: Exceeded 300s
- ↺ same_error: Same error repeated
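The method, parameters, and termination conditions above can be read as a single retry loop. The following is a minimal sketch of that loop, not the actual Careti harness: `generate` and `run_tests` are hypothetical callables standing in for the model call and the test runner, and the rule that two identical consecutive errors trigger `same_error` is an assumption, since the report does not say how many repeats are required.

```python
import time
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 5    # "Max Attempts" from the parameters above
TIMEOUT_S = 300.0   # "Timeout" from the parameters above


def run_problem(
    generate: Callable[[str, Optional[str]], str],
    run_tests: Callable[[str], Optional[str]],
    problem: str,
    max_attempts: int = MAX_ATTEMPTS,
    timeout_s: float = TIMEOUT_S,
) -> Tuple[str, int]:
    """Return (termination_reason, attempts_used) for one problem.

    generate(problem, last_error) -> candidate code (hypothetical model call)
    run_tests(code) -> None if the tests pass, otherwise an error message
    """
    start = time.monotonic()
    last_error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        # Check the time budget before spending another model call.
        if time.monotonic() - start > timeout_s:
            return "timeout", attempt - 1
        code = generate(problem, last_error)
        error = run_tests(code)
        if error is None:
            return "success", attempt
        if error == last_error:
            # Assumption: two identical consecutive errors end the run.
            return "same_error", attempt
        last_error = error  # feed the error back into the next attempt
    return "max_attempts", max_attempts
```

With stub callables, a model that fixes its bug after seeing one error ends with `("success", 2)`, while a model that keeps producing the same failure stops early with `same_error` instead of using all five attempts.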
2. Model Rankings
Sorted by: Final pass rate → 1st attempt rate → Cost (lowest first)
| Rank | Model | Final | 1st Pass | Avg Time | Cost | Total Time | same_error |
|---|---|---|---|---|---|---|---|
| 🥇 | Gemini 2.5 Flash | 98% | 92% | 17.6s | $0.05 | ~39m | 1 |
| 🥈 | Gemini 2.5 Pro | 97% | 95% | 40.5s | $0.33 | ~70m | 3 |
| 🥉 | Claude Code CLI* | 97% | 94% | 35.2s | ~$0.17 | ~60m | 3 |
| 4 | GLM-4.7 | 97% | 90% | 15.6s | $0.18 | ~40m | 2 |
| 5 | Gemini 3 Pro | 97% | 90% | 55.8s | $0.24 | ~100m | 2 |
| 6 | Gemini 3 Flash† | 91% | 82% | 22.1s | $0.03 | ~45m | 0 |
| 7 | Solar Pro2 | 83% | 61% | 18.3s | $0.79 | ~48m | 11 |
| 8 | Solar Pro3 | 75% | 70% | 45.2s | $1.35 | ~85m | 25 |
| 9 | HyperCLOVA X‡ | 0% | 0% | - | - | ~3m | 100 |
* Claude Code CLI: Based on Careti usage (Max subscription, 20x cheaper than API)
† Gemini 3 Flash: Preview API timeout caused 9 failures
‡ HyperCLOVA X: HCX-003 (2%) and HCX-007 (0%) don't support coding tasks. Retest needed.
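The ranking rule stated for the table (final pass rate, then first-attempt rate, then lowest cost) can be expressed as a single tuple sort key. This is an illustrative sketch using the five rows tied near the top of the table; the `Result` type and `rank` function are hypothetical names, and the approximate Claude Code CLI cost (~$0.17) is treated as an exact number for sorting purposes.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    model: str
    final_pct: int    # final pass rate, percent
    first_pct: int    # first-attempt pass rate, percent
    cost_usd: float   # cost as listed in the table


def rank(results: List[Result]) -> List[Result]:
    # Negate the two rates so higher is better; cost stays ascending.
    return sorted(results, key=lambda r: (-r.final_pct, -r.first_pct, r.cost_usd))


# Rows copied from the table, deliberately shuffled.
ROWS = [
    Result("Gemini 3 Pro", 97, 90, 0.24),
    Result("GLM-4.7", 97, 90, 0.18),
    Result("Claude Code CLI", 97, 94, 0.17),
    Result("Gemini 2.5 Pro", 97, 95, 0.33),
    Result("Gemini 2.5 Flash", 98, 92, 0.05),
]

RANKED = [r.model for r in rank(ROWS)]
```

GLM-4.7 and Gemini 3 Pro tie on both pass rates (97% final, 90% first attempt), so cost is what places GLM-4.7 ahead.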
3. Visualization
[Chart: Pass Rate Comparison]
[Chart: Cost vs Performance]
4. Models Tested
- Google Gemini: 2.5 Flash/Pro and 3 Pro/Flash. Google's multimodal LLM family.
- Claude Code CLI: CLI tool based on Anthropic Opus 4.5.
- GLM-4.7: Zhipu AI (China). Coding-optimized.
- Solar Pro2/3: Upstage (Korea). Pro2 gains the most from retry feedback (61% first attempt → 83% final).
- HyperCLOVA X: Naver (Korea). Doesn't support coding tasks.
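One way to quantify how much a model benefits from retry feedback is the gap between its final and first-attempt pass rates in the rankings table. The sketch below computes that gap in percentage points; "retry gain" is an illustrative metric introduced here, not one the report defines.

```python
# model: (first-attempt %, final %), copied from the rankings table
PASS_RATES = {
    "Gemini 2.5 Flash": (92, 98),
    "Gemini 2.5 Pro": (95, 97),
    "Claude Code CLI": (94, 97),
    "GLM-4.7": (90, 97),
    "Gemini 3 Pro": (90, 97),
    "Gemini 3 Flash": (82, 91),
    "Solar Pro2": (61, 83),
    "Solar Pro3": (70, 75),
    "HyperCLOVA X": (0, 0),
}

# Retry gain: percentage points recovered after the first attempt.
GAINS = {model: final - first for model, (first, final) in PASS_RATES.items()}

# Model that recovers the most problems through retries.
BEST = max(GAINS, key=GAINS.get)

for model, gain in sorted(GAINS.items(), key=lambda kv: -kv[1]):
    print(f"{model:18s} +{gain} pts")
```

By this measure Solar Pro2 recovers 22 points through retries, the largest gain in the table, while Solar Pro3 recovers only 5.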
5. Future Improvements
- Top 5 models tied at 97-98% → Need Hard++ Suite
- Retest when HyperCLOVA X releases coding version
- Consider adding real production bugs/refactoring problems
