# LLM Models

Language model guides, benchmarks, and selection advice: the underlying AI models that power coding tools and agentic platforms.
## What Are LLM Models?

Large Language Models (LLMs) are the underlying AI systems that power coding tools and agentic platforms. This section covers model capabilities, pricing, benchmarks, and selection guidance, not the tools that use them.
## Available Model Guides
### Claude Opus 4.5

Anthropic's flagship reasoning model, with the highest SWE-bench score covered here at 80.9%.

### Kimi k2.5

Moonshot AI's high-performing model with free access options. 76.8% on SWE-bench.

### Gemini 3 Flash

Google's 1M-context-window model with free input tokens. 78.0% on SWE-bench. (For a rough way to size a codebase against these context windows, see the sketch after these guides.)

### GLM 4.7

Zhipu AI's model, available through OpenCode and other platforms. 73.8% on SWE-bench, with free access via OpenCode Zen and Z.AI.
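Context windows like Gemini 3 Flash's 1M tokens or Kimi k2.5's 256K are easier to evaluate against a concrete codebase. Here is a minimal sketch, assuming the common rough heuristic of ~4 characters per token; real tokenizers vary by language and code style, so treat the output as an order-of-magnitude estimate, not a precise count:

```python
import os

CHARS_PER_TOKEN = 4  # rough heuristic; real tokenizers vary


def estimate_repo_tokens(root: str, exts: tuple[str, ...] = (".py", ".ts", ".md")) -> int:
    """Walk a source tree and estimate its size in tokens (chars / 4 heuristic)."""
    total_chars = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(exts):
                try:
                    with open(os.path.join(dirpath, name), encoding="utf-8", errors="ignore") as f:
                        total_chars += len(f.read())
                except OSError:
                    continue
    return total_chars // CHARS_PER_TOKEN


# Compare a repo estimate against the context windows mentioned in this section.
tokens = estimate_repo_tokens(".")
for model, window in [("Gemini 3 Flash", 1_000_000), ("Kimi k2.5", 256_000)]:
    verdict = "fits in" if tokens <= window else "exceeds"
    print(f"{model}: repo ~{tokens:,} tokens, {verdict} the {window:,}-token window")
```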
## Quick Selection Guide

| Model | SWE-bench | Price per 1M tokens (input/output) | Best For |
|---|---|---|---|
| Claude Opus 4.5 | 80.9% | $5 / $25 | Maximum reasoning, safety-critical work |
| Gemini 3 Flash | 78.0% | Free / $0.15 | Large context, cost-sensitive work |
| Kimi k2.5 | 76.8% | $3 / $3 | Best value, free access available |
| GLM 4.7 | 73.8% | Free tier (OpenCode Zen, Z.AI) | Thinking-mode visibility, budget usage |
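To make the per-million-token prices concrete, here is a minimal cost sketch using the table's figures. It models Gemini 3 Flash's free input tier as $0; the numbers are illustrative, not live pricing:

```python
# Per-1M-token prices (input, output) in USD, taken from the table above.
PRICES = {
    "claude-opus-4.5": (5.00, 25.00),
    "gemini-3-flash": (0.00, 0.15),
    "kimi-k2.5": (3.00, 3.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for a workload with the given token counts."""
    in_price, out_price = PRICES[model]
    return (input_tokens / 1_000_000) * in_price + (output_tokens / 1_000_000) * out_price


# Example workload: an agent session reading 2M tokens and writing 200K.
for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 2_000_000, 200_000):.2f}")
```

For read-heavy agent workloads the input price dominates: the example prints $15.00 for Claude Opus 4.5, $0.03 for Gemini 3 Flash, and $6.60 for Kimi k2.5.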
See Budget Tier, Mid-Range, and Premium comparisons for detailed analysis.
## Related Categories
- Coding Tools — Tools that use these models for coding assistance
- Agentic Tools — Autonomous platforms powered by these models
- Compare — Model-to-model comparisons
## Latest Updates

- 2026-02-01 | **Gemini 3 Flash: 78% SWE-bench, $0.50/1M Input, 8x Cheaper Than Claude Opus**
  Google's Gemini 3 Flash delivers frontier coding performance with $0.50/1M input tokens, 65K output limit, and 1M context. Full benchmarks, pricing breakdown, and when to choose it over Claude Opus 4.5 or Kimi k2.5.
- 2026-02-01 | **GLM 4.7: Thinking Mode, 73.8% SWE-bench, Free via OpenCode & Z.AI**
  Z.AI's GLM 4.7 delivers 73.8% SWE-bench coding performance with unique thinking mode visibility. Free via OpenCode Zen and Z.AI tier, with $3/month coding plan for heavy usage.
- 2026-02-01 | **Kimi k2.5: Capabilities, Benchmarks & Free Access**
  Moonshot AI's Kimi k2.5 model: 76.8% SWE-bench score, 256K context window, multimodal capabilities, and how to access it free through Kilo Code and OpenCode Zen.
- 2026-01-30 | **Claude Opus 4.5: 80.9% SWE-bench, Pricing & When to Use**
  Anthropic's flagship reasoning model delivers the highest verified SWE-bench score at 80.9%. Complete pricing breakdown ($5/1M input), subscription vs API analysis, and comparison to Kimi k2.5 and Gemini 3 Flash.