The practical answer: test ZCode first if you want an integrated, GLM-5.2-native desktop workflow and long-running goals; choose Pi for a minimal, programmable terminal harness; choose OpenCode for a polished, general-purpose open-source agent with broad provider support, IDE/GitHub integration, and granular permissions.

That is a workflow recommendation, not an empirical ranking. We have not run controlled same-task tests proving that one of these harnesses wins. Use the test card below before standardizing.

| Harness | Start here when you want | Main trade-off |
| --- | --- | --- |
| ZCode | GLM-5.2-native desktop workflow, Goal Mode, built-in subagents, Git/change review | Tighter Z.AI product fit; less provider-neutral |
| Pi | Small terminal core, sessions, compaction, custom tools and TypeScript extensions | You assemble more of the workflow; Coding Plan authorization is unclear |
| OpenCode | Broad providers, terminal/desktop/IDE/GitHub use, agents and explicit permissions | Zen, BYO-provider, and Z.AI Coding Plan are separate billing paths |

The Model Is Only Half the System

A coding model does not select files, expose tools, decide when to retry, or recover a long session by itself. The harness does. Six levers materially affect useful output:

  1. Context selection: which files, instructions, diffs, and tool results enter the prompt.
  2. Tool interfaces: whether the model gets precise read/edit/test primitives or a vague adapter.
  3. Agent loop: how the harness plans, acts, observes results, and decides what to try next.
  4. Verification and stopping: whether it runs the requested tests, checks the diff, and stops on evidence.
  5. Permissions: what it may read, edit, execute, publish, or delete without approval.
  6. Recovery and compaction: how it preserves decisions when context fills or a tool call fails.
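As a rough illustration of levers 3 and 4, the sketch below shows a loop that acts, observes, and stops on test evidence or when its retry budget runs out. It is not how Pi, ZCode, or OpenCode implement their loops; `propose_patch`, `apply_and_test`, and the toy stand-ins are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    passed: bool
    attempts: int
    log: List[str] = field(default_factory=list)


def run_agent_loop(
    propose_patch: Callable[[List[str]], str],  # hypothetical model call; sees prior observations
    apply_and_test: Callable[[str], bool],      # hypothetical: applies the patch, runs the requested tests
    max_attempts: int = 3,
) -> LoopResult:
    """Act, observe, and stop on evidence or when the retry budget is spent."""
    observations: List[str] = []
    for attempt in range(1, max_attempts + 1):
        patch = propose_patch(observations)
        if apply_and_test(patch):
            observations.append(f"attempt {attempt}: tests passed")
            return LoopResult(True, attempt, observations)
        # Feed the failure back so the next proposal can change its hypothesis.
        observations.append(f"attempt {attempt}: tests failed for {patch!r}")
    return LoopResult(False, max_attempts, observations)


# Toy stand-ins for a model and a test runner (assumptions, not real APIs).
def fake_model(observations: List[str]) -> str:
    return "patch-b" if observations else "patch-a"


def fake_tests(patch: str) -> bool:
    return patch == "patch-b"


result = run_agent_loop(fake_model, fake_tests)
print(result.passed, result.attempts)  # True 2
```

The point of the sketch is the stopping rule: the loop returns as soon as the requested tests pass, and it never exceeds the attempt budget.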

Harness-Bench isolates this execution layer across shared tasks, budgets, and protocols. Its 5,194 trajectories show substantial variation in completion, process quality, efficiency, and failure behavior across model-harness pairings.

Claw-SWE-Bench gives a concrete GLM example: a minimal direct-diff adapter scored 19.1% Pass@1, while a full adapter reached 73.4% with the same GLM-5.1 backbone. That result shows adapter/interface design can dominate outcomes. It does not measure Pi versus ZCode versus OpenCode, and it does not establish GLM-5.2 performance in any of them.
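To make the size of that gap concrete, the arithmetic can be written out as below. Only the two Pass@1 figures come from the text above; the variable names are illustrative.

```python
# Pass@1 figures quoted above for the same GLM-5.1 backbone.
minimal_adapter_pass_at_1 = 19.1  # percent, minimal direct-diff adapter
full_adapter_pass_at_1 = 73.4     # percent, full adapter

# Absolute gap in percentage points and the relative multiple.
gap_points = round(full_adapter_pass_at_1 - minimal_adapter_pass_at_1, 1)
ratio = round(full_adapter_pass_at_1 / minimal_adapter_pass_at_1, 2)

print(gap_points)  # 54.3
print(ratio)       # 3.84
```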

Pi vs ZCode vs OpenCode

| Decision point | Pi | ZCode | OpenCode |
| --- | --- | --- | --- |
| Primary interface | Terminal TUI, print/JSON, RPC, SDK | Desktop ADE with terminal, Git, tasks, remote and bot controls | Terminal TUI plus desktop, IDE and GitHub integrations |
| Provider posture | Broad provider support and custom providers | Deep GLM-5.2 integration | Broad provider support; optional OpenCode Zen |
| Long work | Persistent branching sessions and compaction | Goal Mode iterates until goal verification passes | Primary agents, subagents, sessions and configurable workflows |
| Extensibility | TypeScript extensions, custom tools, skills, prompt templates, packages | Skills, MCP, plugins, commands and custom subagents | Agents, commands, tools, MCP and provider configuration |
| Permissions | Project trust plus extension-controlled tool interception | Confirmation modes from confirm-before-changes through fuller access | Per-tool, per-command and per-agent allow/ask/deny rules |
| GLM-5.2 access | Pi-native Z.AI coding endpoint; Coding Plan authorization is not established | Z.AI product with GLM models and Coding Plan connection | OpenCode Zen PAYG, direct Z.AI PAYG, or Z.AI Coding Plan |
| Best first test | Programmable terminal workflow | Integrated GLM-5.2 desktop workflow | General-purpose multi-provider workflow |
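The allow/ask/deny idea in the Permissions row can be sketched generically. This is not the configuration format of any of the three harnesses: the `tool:argument` action strings, the rule list, and the last-match-wins precedence are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

Rule = Tuple[str, str]  # (glob pattern over "tool:argument", decision)


def decide(action: str, rules: List[Rule], default: str = "ask") -> str:
    """Return the decision of the last matching rule (assumed precedence), else the default."""
    decision = default
    for pattern, outcome in rules:
        if fnmatchcase(action, pattern):
            decision = outcome
    return decision


# Hypothetical policy: broad rules first, narrower overrides after.
rules: List[Rule] = [
    ("read:*", "allow"),
    ("bash:*", "ask"),
    ("bash:git diff*", "allow"),
    ("bash:rm *", "deny"),
    ("bash:git push*", "deny"),
]

print(decide("read:src/auth.ts", rules))           # allow
print(decide("bash:git diff --check", rules))      # allow
print(decide("bash:git push origin main", rules))  # deny
print(decide("edit:src/auth.ts", rules))           # ask (no rule matched)
```

Defaulting to "ask" when nothing matches is a conservative choice for this sketch; check each harness's own documentation for its real precedence and defaults.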

Three Different Ways To Pay

Do not treat “supports GLM-5.2” as one entitlement.

| Path | What it means | Use in |
| --- | --- | --- |
| Direct Z.AI PAYG | API usage billed to a Z.AI API account | OpenCode’s Z.AI provider; other compatible clients |
| OpenCode Zen | Optional OpenCode gateway; add credits and pay per request/model pricing | OpenCode only |
| GLM Coding Plan | Subscription quota restricted to Z.AI’s officially supported tools and products | ZCode and listed integrations such as OpenCode |

For the subscription rules, supported tools, and quotas, use the Z.AI Coding Plan guide. For model specs and PAYG pricing, use GLM-5.2.

Copyable Productive Tasks

These prompts constrain scope, define evidence, and make harness behavior easier to compare.

1. Bounded failing-test repair

Fix only the failure in tests/auth/session-expiry.test.ts.

Constraints:
- Reproduce the failure first.
- Read the smallest relevant implementation surface.
- Do not change public APIs or unrelated tests.
- Run the failing test after the patch, then the nearest auth test suite.
- Show the final diff and explain why it fixes the root cause.
- Stop and report if the test cannot be reproduced.

2. Read-only repository audit

Audit this repository for places where untrusted input reaches shell execution.

Read-only rules:
- Do not edit files, install packages, or run network commands.
- You may use repository search, git log, and existing static-analysis commands.
- Report file:line evidence, reachable data flow, severity, and uncertainty.
- Separate confirmed findings from hypotheses.
- End with the three highest-value verification steps.

3. Multi-file refactor with review

Replace the duplicated retry logic with one shared helper.

Acceptance:
- Preserve existing public behavior and error types.
- Add or update focused tests before deleting the old paths.
- Run the focused tests and the repository's standard lint/typecheck gates.
- Review git diff --check and the final diff for unrelated changes.
- List every changed file and why it changed.
- Do not commit, push, or publish.

Same-Model Harness Test Card

Run the same model, task, repository commit, instructions, budget, and timeout in each harness. Repeat enough times to expose flaky behavior.

Model: GLM-5.2
Harness/version:
Repository commit:
Task:
Budget/timeout:
Permission mode:

Success (yes/no/partial):
Required tests passed:
Files changed:
Input/output tokens or quota consumed:
Retries/tool calls:
Human interventions:
Unsafe or out-of-scope actions:
Diff quality notes:
Failure/recovery notes:

Compare successful patches per unit of cost/quota and review time, not just whether the harness eventually produced a diff.
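One way to turn filled-in cards into that comparison is sketched below. The `CardRun` fields are a reduced, hypothetical subset of the card, the harness labels are placeholders, and the numbers are invented purely to show the calculation; they are not measurements of any harness.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CardRun:
    harness: str
    success: bool
    tokens: int            # input + output tokens, or a quota unit
    review_minutes: float  # human time spent reviewing the diff


def summarize(runs: List[CardRun], harness: str) -> Optional[dict]:
    """Cost per successful patch, charging failed runs to the successes."""
    mine = [r for r in runs if r.harness == harness]
    wins = sum(1 for r in mine if r.success)
    if wins == 0:
        return None  # no successful patch, so cost per success is undefined
    return {
        "success_rate": round(wins / len(mine), 2),
        "tokens_per_success": sum(r.tokens for r in mine) // wins,
        "review_min_per_success": sum(r.review_minutes for r in mine) / wins,
    }


# Illustrative data only (not real results).
runs = [
    CardRun("harness-a", True, 120_000, 10.0),
    CardRun("harness-a", False, 200_000, 5.0),
    CardRun("harness-a", True, 100_000, 15.0),
    CardRun("harness-b", True, 300_000, 20.0),
    CardRun("harness-b", True, 260_000, 12.0),
]

print(summarize(runs, "harness-a"))
print(summarize(runs, "harness-b"))
```

Dividing total tokens and review time by the number of successes, rather than by the number of runs, means a harness pays for its failed attempts in the comparison.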

Why 1M Context Does Not Guarantee Productivity

One million tokens is capacity, not a promise that the right evidence will be selected or retained. Dumping an entire repository into context can dilute relevant instructions, increase latency, and make failures harder to diagnose. Agent loops also multiply usage: Z.AI estimates one Coding Plan prompt may invoke a model 15–20 times.
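The multiplication is easy to underestimate, so here is the arithmetic as a small sketch. The 15 to 20 range is the Z.AI estimate quoted above; the prompt count of 40 is a hypothetical workload chosen for the example.

```python
from typing import Tuple


def model_calls_range(prompts: int, low: int = 15, high: int = 20) -> Tuple[int, int]:
    """Estimated model invocations for a number of prompts, using a per-prompt range."""
    return prompts * low, prompts * high


# Hypothetical workload: 40 prompts in a working session.
low_calls, high_calls = model_calls_range(40)
print(low_calls, high_calls)  # 600 800
```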

Costs rise when a harness:

  • rereads large files instead of using targeted search;
  • retries without changing its hypothesis;
  • launches redundant subagents;
  • carries noisy tool output forward;
  • compacts away constraints or earlier test evidence;
  • keeps working after acceptance criteria already pass.

Start with the smallest sufficient context. Record quota/tokens, retries, interventions, and test evidence in the test card.
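A minimal sketch of "smallest sufficient context", assuming a plain substring search over in-memory files: it keeps only the lines around each match and stops at a character budget. `select_context`, its parameters, and the sample files are hypothetical; real harnesses use their own search and ranking.

```python
from typing import Dict, List


def select_context(
    files: Dict[str, str], query: str, radius: int = 1, budget_chars: int = 400
) -> List[str]:
    """Collect only lines near a match, stopping once the character budget is spent."""
    snippets: List[str] = []
    used = 0
    for path, text in files.items():
        lines = text.splitlines()
        for i, line in enumerate(lines):
            if query not in line:
                continue
            start, end = max(0, i - radius), min(len(lines), i + radius + 1)
            # Header uses 1-based line numbers for the selected window.
            snippet = f"{path}:{start + 1}-{end}\n" + "\n".join(lines[start:end])
            if used + len(snippet) > budget_chars:
                return snippets
            snippets.append(snippet)
            used += len(snippet)
    return snippets


# Hypothetical repository contents.
files = {
    "auth/session.py": (
        "import time\n\ndef is_expired(session):\n"
        "    return session.expires_at < time.time()\n"
    ),
    "README.md": "Project docs\n" * 50,
}

snippets = select_context(files, "is_expired")
print(snippets[0])
```

Here the large README never enters the prompt because it contains no match, which is the behavior the cost list above argues for.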

Recommendation By Workflow

  • Choose ZCode when GLM-5.2 is the primary model and you want an integrated desktop environment with explicit goals, ongoing verification, safety confirmations, and built-in collaboration features.
  • Choose Pi when you want a small terminal harness that can become your own tool through extensions, custom tools, session branching, and customizable compaction.
  • Choose OpenCode when you need a provider-neutral default, explicit permission policies, reusable agents, and a path across terminal, desktop, IDE, and GitHub workflows.
  • Keep testing when the task is high risk. None of these feature lists proves lower defect rates in your repository.

Last verified: July 2, 2026. Harness features, model routing, prices, and subscription authorization can change independently.