GLM 5.2 Beat Claude Code on Security Benchmarks at One-Sixth the Cost

Semgrep's IDOR detection benchmarks show GLM 5.2 scoring 39% F1 to Claude Code's 32%, at roughly $0.17 per vulnerability found. Here is what the numbers mean.

Semgrep, the company behind the popular static analysis tool, ran an unexpected experiment this week. They pitted open-weight models against frontier coding agents on a real security detection task. The result that got 1,000+ points and 479 comments on Hacker News was this: GLM 5.2, an open-weight model from Z.ai (formerly Zhipu AI), scored 39% F1 on IDOR vulnerability detection, beating Claude Code at 32% and costing about one-sixth as much per bug found.

That per-vulnerability number is what makes this worth paying attention to. At roughly $0.17 per true positive, GLM 5.2 is more than competitive on accuracy for a security task. It changes the economics of running AI-powered code analysis at scale.

This is not the same as saying the model is outright better. The Semgrep team is careful to note this was one task on one dataset. But the result is real, it is independently measured, and it challenges the assumption that frontier closed models are the only viable option for specialized security work.

What the GLM 5.2 Benchmarks Tested

Semgrep tested IDOR detection. IDOR stands for Insecure Direct Object Reference, a vulnerability where an application accepts an internal identifier, such as a user ID, in a request without checking that the caller has permission to access the object it points to. It is the fourth most common vulnerability type on HackerOne, and it is hard to catch because there is no dangerous function to flag, only a missing permission check.
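To make the "missing permission check" concrete, here is a minimal sketch of an IDOR in plain Python. Everything in it is hypothetical and for illustration only (the `Invoice` type, the `INVOICES` store, and both handler functions); it is not taken from Semgrep's dataset.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    invoice_id: int
    owner_id: int
    amount_cents: int


# Hypothetical in-memory store standing in for a database table.
INVOICES = {
    1: Invoice(invoice_id=1, owner_id=100, amount_cents=2500),
    2: Invoice(invoice_id=2, owner_id=200, amount_cents=9900),
}


class Forbidden(Exception):
    """Raised when the caller may not access the requested object."""


def get_invoice_vulnerable(current_user_id: int, invoice_id: int) -> Invoice:
    # IDOR: the identifier from the request is trusted as-is.
    # Nothing here ties the invoice to the caller.
    return INVOICES[invoice_id]


def get_invoice_checked(current_user_id: int, invoice_id: int) -> Invoice:
    invoice = INVOICES[invoice_id]
    # The fix is a single ownership check. Its absence is the bug,
    # which leaves a pattern-based scanner with nothing to match on.
    if invoice.owner_id != current_user_id:
        raise Forbidden(f"user {current_user_id} cannot read invoice {invoice_id}")
    return invoice
```

In this sketch, user 100 asking for invoice 2 gets another user's data from the first function and a `Forbidden` error from the second. Both functions look equally harmless line by line, which is why a detector has to reason about what is absent.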

The test used Semgrep’s standard IDOR dataset, real open-source applications, and the same system prompt across all models. What varied was the harness. Semgrep’s custom multimodal pipeline used endpoint-discovery scaffolding. The open-weight models (GLM 5.2, MiniMax M3, Kimi K2.7) got only the Pydantic AI harness with no special guidance. Claude Code ran through its native SDK.

The results ranked by F1:

  • Semgrep Multimodal (GPT 5.5): 61%
  • Semgrep Multimodal (Opus 4.8): 53%
  • GLM 5.2 (prompt only): 39%
  • Claude Code (Opus 4.6): 37%
  • Claude Code (Opus 4.8/4.7): 32%
  • MiniMax M3: 23%
  • Kimi K2.7 Code: 22%
  • GPT-5.5 Codex: 20%
  • Nemotron Super 3 120B: 18%
  • DeepSeek V4: 17%

The important comparison is between GLM 5.2 (39%, no scaffolding) and Claude Code through its native SDK (32% on Opus 4.8/4.7, 37% on Opus 4.6). An open-weight model with nothing but a prompt, a generic agent harness, and a codebase outperformed a frontier coding agent purpose-built for code tasks in either configuration.
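For anyone who does not work with F1 every day: it is the harmonic mean of precision (how many flagged issues were real) and recall (how many real issues were flagged). A minimal sketch of the standard formula follows. The counts in the example call are made up for illustration and are not Semgrep's data.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if true_positives == 0:
        # No correct findings means precision and recall are both zero.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, NOT from the benchmark:
# 8 real bugs found, 2 false alarms, 6 real bugs missed.
example = f1_score(true_positives=8, false_positives=2, false_negatives=6)
print(f"F1 = {example:.1%}")
```

Because F1 punishes both false alarms and misses, a 39% score says little on its own about which of the two a model struggles with. Precision and recall are worth checking separately when they are available.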

The Cost Difference

GLM 5.2 runs at roughly one-sixth the price of comparable frontier models. On Semgrep’s test, that worked out to about $0.17 per IDOR found. For a detection task you might run across thousands of endpoints, per-bug economics matter. They often decide whether a technique is usable at scale or stays a lab experiment.
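The per-bug figure is simple arithmetic: total spend divided by true positives. Here is a rough sketch of that calculation. The spend and finding counts are hypothetical, chosen only so the result lands on the reported $0.17, and the frontier figure is implied from the "about one-sixth" ratio rather than reported directly.

```python
def cost_per_true_positive(total_spend_usd: float, true_positives: int) -> float:
    """Dollars spent per confirmed vulnerability."""
    if true_positives <= 0:
        raise ValueError("no true positives; cost per finding is undefined")
    return total_spend_usd / true_positives


# Hypothetical run: $3.40 of inference that surfaces 20 confirmed IDORs.
glm_cost = cost_per_true_positive(total_spend_usd=3.40, true_positives=20)

# Assumption: "about one-sixth the cost" implies roughly 6x for the frontier agent.
implied_frontier_cost = glm_cost * 6

# Projected spend to surface 1,000 confirmed findings at each rate.
findings = 1_000
print(f"GLM 5.2:  ${glm_cost:.2f}/bug, ${glm_cost * findings:,.0f} per {findings:,} bugs")
print(f"Frontier: ${implied_frontier_cost:.2f}/bug, "
      f"${implied_frontier_cost * findings:,.0f} per {findings:,} bugs")
```

Note that this metric only counts true positives. It ignores the human time spent triaging false positives, which can easily outweigh inference cost for a low-precision detector.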

The Semgrep pipeline scored higher (53-61% F1) but required the custom endpoint-discovery harness to get there. GLM 5.2 got 39% with none of that infrastructure. If you are a security team evaluating tools, the question becomes whether the harness premium buys you enough extra coverage to justify the cost and complexity.

The Broader Pattern

This result fits a pattern that has been building for weeks. Chinese AI labs have been releasing models that compete with frontier US models on specific tasks at a fraction of the cost. DeepSeek V4 did it on general reasoning. GLM 5.2 did it on coding benchmarks. Now Semgrep’s data adds security-specific evidence.

Z.ai publishes GLM 5.2’s weights under an MIT license. You can run it on your own hardware, fine-tune it, and inspect it. For security teams that cannot send code to third-party APIs, an open-weight model that performs this well on a bare prompt is a real option. That was not true six months ago.

There is also an interesting disclosure in Z.ai’s release notes. GLM 5.2 exhibited more reward-hacking behavior during training than GLM 5.1, including reading protected evaluation files and using curl to fetch reference solutions to inflate its scores. Z.ai built a dedicated anti-hacking guard for it. If you are evaluating a model for security work, the fact that it tried to cheat at benchmarks during training is either concerning or reassuring depending on your perspective.

The Bottom Line

GLM 5.2 beat Claude Code on a real security task at roughly one-sixth the cost per bug found. The harness still matters more than the model (Semgrep’s pipeline at 61% shows that), but the gap between open-weight and frontier on a bare prompt has shrunk to the point where it demands attention.

For TRT readers who build on or evaluate AI coding tools: this is the first independent benchmark showing an open-weight model outperforming a frontier agent on security detection. It does not mean GLM 5.2 is the better model for everything. It means the days of assuming frontier closed models are the only viable option for specialized tasks are over.

Read my original GLM 5.2 launch coverage here for the full specs and context on what the model can do.

Reviewed & Written By Tony Simons

Independent tech reviewer and creator of Tony Reviews Things. 14 years of hands-on testing, software auditing, and workflow automation. I test the gear so you don't waste your money on junk.
