EVMBench - Testing AI on Smart Contract Security

Agentic Detection, Patching, and Exploitation on Real‑world Vulnerabilities

Feb 23, 2026

Smart contracts house over $100 billion in onchain assets, yet bugs still cause devastating losses. As LLMs increasingly write and audit code, the line between human and machine risk grows blurry.

To make AI both measurable and defensible, OpenAI and crypto‑investment firm Paradigm introduced EVMBench, a benchmark that evaluates how well AI agents can detect, patch and exploit smart‑contract vulnerabilities. By curating 120 real audit findings and testing models in sandboxed EVM environments, EVMBench turns abstract agent hype into measurable performance.

In this edition, we'll look into how EVMBench works, why the agentic economy needs transparent security benchmarks, and what the first results reveal about offensive and defensive AI.

How EVMBench Works

EVMBench is a dataset of real bugs paired with a deterministic test harness. The 120 vulnerabilities come from 40 code audits and Code4rena security contests, and all of them are real bugs that existed in production code.

OpenAI (@OpenAI)

Introducing EVMbench—a new benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities. openai.com/index/introduc…

6:46 PM · Feb 18, 2026

The benchmark tests three capabilities:

Detect: the agent audits a repository and gets scored on how many documented vulnerabilities it finds. Scoring rewards completeness; finding one bug and stopping doesn’t cut it.
Patch: the agent modifies code to remove the exploit without breaking intended behavior. Automated tests verify the fix works.
Exploit: the agent writes and executes a fund-draining attack in a sandboxed local EVM. Success is binary: did you drain the funds or not?
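To make the three scoring modes concrete, here is a minimal sketch of how they could be expressed. It is not EVMBench's actual grader: the function names, the `PatchResult` fields, and the wei threshold for a successful drain are all hypothetical, chosen only to mirror the descriptions above.

```python
from dataclasses import dataclass


@dataclass
class PatchResult:
    """Hypothetical outcome of running the automated checks on a patch."""
    exploit_blocked: bool  # the known exploit no longer succeeds
    tests_passed: bool     # intended behavior still works


def detect_score(reported: set[str], documented: set[str]) -> float:
    """Detect: share of documented vulnerabilities the agent found.

    Rewards completeness, so finding one bug out of three scores 1/3.
    """
    if not documented:
        return 0.0
    return len(reported & documented) / len(documented)


def patch_success(result: PatchResult) -> bool:
    """Patch: the flaw must be removed without breaking functionality."""
    return result.exploit_blocked and result.tests_passed


def exploit_success(attacker_gain_wei: int, required_gain_wei: int) -> bool:
    """Exploit: binary outcome. The threshold is an assumption of this sketch."""
    return attacker_gain_wei >= required_gain_wei


if __name__ == "__main__":
    documented = {"H-01", "H-02", "H-03"}  # made-up finding IDs
    print(detect_score({"H-01"}, documented))
    print(patch_success(PatchResult(exploit_blocked=True, tests_passed=False)))
    print(exploit_success(attacker_gain_wei=10**18, required_gain_wei=10**18))
```

Note that in this sketch a reported issue outside the documented set neither helps nor hurts the Detect score.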

The harness is written in Rust, deploys contracts to an isolated Anvil EVM chain, and blocks unsafe RPC methods. Everything replays deterministically, so agents can’t game the setup through timing tricks or network-state manipulation.
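Blocking unsafe RPC methods can be pictured as a small filter sitting between the agent and the node. The sketch below is an illustration in Python, not the harness's real Rust code. The allowlist contents and the `filter_rpc` helper are assumptions, since the list EVMBench actually enforces is not described here.

```python
import json

# Hypothetical allowlist of standard Ethereum JSON-RPC methods.
# Anything else, including node "cheat" methods such as anvil_setBalance,
# is rejected in this sketch.
ALLOWED_METHODS = frozenset({
    "eth_blockNumber",
    "eth_call",
    "eth_chainId",
    "eth_estimateGas",
    "eth_getBalance",
    "eth_getCode",
    "eth_getTransactionCount",
    "eth_getTransactionReceipt",
    "eth_sendRawTransaction",
})


def filter_rpc(raw_request: str) -> dict:
    """Decide whether a single JSON-RPC request may reach the node."""
    request = json.loads(raw_request)
    if not isinstance(request, dict):
        # Batch requests are simply refused here to keep the example small.
        return {"forward": False, "response": None}

    method = request.get("method", "")
    if method in ALLOWED_METHODS:
        return {"forward": True, "request": request}

    # -32601 is the standard JSON-RPC "method not found" error code.
    return {
        "forward": False,
        "response": {
            "jsonrpc": "2.0",
            "id": request.get("id"),
            "error": {"code": -32601, "message": f"method {method!r} is blocked"},
        },
    }


if __name__ == "__main__":
    ok = '{"jsonrpc": "2.0", "id": 1, "method": "eth_call", "params": []}'
    bad = '{"jsonrpc": "2.0", "id": 2, "method": "anvil_setBalance", "params": []}'
    print(filter_rpc(ok)["forward"])
    print(filter_rpc(bad)["response"]["error"]["message"])
```

An allowlist is used rather than a blocklist so that any method the sketch does not know about is denied by default.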

Why This Matters

Blockchain security has always been a cat-and-mouse game. Billions have been stolen despite audits and bug bounties. AI is now increasing the capabilities of both attackers and defenders, yet it is hard to measure by how much. EVMBench addresses this gap by grounding model performance in economically meaningful environments.

The benchmark arrives at a moment when the agentic economy is taking shape. Agentic projects such as Bankr, Venice, and Conway are building onchain products that deploy contracts. If these products touch real money, they need to be auditable. EVMBench provides a starting point for benchmarking these AI-powered DeFi protocols.

First Results

In exploit mode, GPT-5.3-Codex achieved 72.2% success. GPT-5, released six months earlier, hit 31.9%. When the project started, top models could exploit fewer than 20% of critical Code4rena bugs. The leap to above 70% shows how fast offensive capabilities are improving.
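For a sense of scale, the gap between the two reported exploit rates can be worked out directly. The snippet uses only the two figures quoted above, and the variable names are ours.

```python
# Exploit-mode success rates quoted in the article, in percent.
gpt5_rate = 31.9
codex_rate = 72.2

# Absolute gain in percentage points and relative improvement factor.
point_gain = round(codex_rate - gpt5_rate, 1)
ratio = round(codex_rate / gpt5_rate, 2)

print(f"{point_gain} percentage points, {ratio}x")
# prints "40.3 percentage points, 2.26x"
```

In other words, the success rate more than doubled in roughly six months.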

Detection and patching remain harder. Agents often find one bug and stop searching, missing the rest. Patching requires understanding what the code is supposed to do, not just what’s broken, and models struggle to maintain functionality while removing flaws.

The benchmark also has its limitations. Its tasks come from Code4rena audits, while real protocols that have been through countless security audits are potentially harder to crack. The grading also can’t distinguish between genuine undiscovered bugs and false positives.

Still, the directional signal is useful: it suggests that offensive capabilities are currently improving faster than defensive ones, which is concerning. OpenAI is expanding its Aardvark security agent and committing $10M in API credits for defensive research, which suggests it sees the same imbalance.

What to Watch

EVMBench is meant to become a continuous evaluation loop. The code and datasets are public, so researchers can add new vulnerabilities and refine the metrics over time.

For anyone evaluating AI-driven DeFi products, EVMBench could become a go-to security reference that is accessible to everyone. However, given the offensive/defensive gap in current results, agents need to improve substantially at detecting and patching flaws before they can truly protect users.

Alea Research
