Benchmark · Smart Contracts · Web3

100% High Detection on a 2,300-Line Protocol: Wildcat Benchmark Results

February 24, 2026 · 5 min read · RedVolt Team

Finding bugs in a 50-line contract is one thing. Finding them in a real-world lending protocol with 22 interconnected Solidity files and 2,332 lines of code is something else entirely. We ran our AI audit engine against the Wildcat Protocol — one of the most complex Code4rena contests ever held — and the results speak for themselves.

The Results

  • 100% – High Detection (6/6)
  • 90.3% – Overall Detection Rate
  • 11 min – Total Audit Time
  • 144 – Human Wardens in C4 Contest
The original Code4rena 2023-10-wildcat contest attracted 144 experienced security wardens. The contest ran for days. Our AI engine completed the same audit in 11 minutes — and caught every high-severity finding the human wardens found.

The Target: Wildcat Protocol

Wildcat is a credit market protocol that enables undercollateralized on-chain lending. It is architecturally complex:

Protocol Complexity

22 Solidity files

Controllers, markets, market factories, archive contracts, lens contracts, and supporting libraries — all interconnected with shared state.

2,332 lines of code

Significant codebase with complex business logic around market creation, lending, borrowing, withdrawals, and sanctions compliance.

Multiple attack surfaces

Fee calculations, CREATE2 deployment, batch withdrawal mechanics, sanctions oracle integration, and market lifecycle management.

Cross-contract interactions

Vulnerabilities span multiple contracts. You cannot find them by reading one file at a time — you need to understand the full call graph and state flow.

This is exactly the kind of protocol where traditional automated tools fail. Static analyzers flag false positives on every file. Simple pattern matchers miss the logic bugs entirely. You need deep reasoning about protocol semantics — and that is what our multi-agent AI architecture delivers.
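To make the call-graph point concrete, here is a minimal Python sketch of cross-contract reachability analysis. The contract and function names are invented for illustration; this is not the production engine, just the shape of the problem: auditing one entry point means reasoning about every function it can transitively reach, across contract boundaries.

```python
from collections import defaultdict

def build_call_graph(contracts):
    """Build a cross-contract call graph.

    `contracts` maps contract name -> {function: [callee, ...]}, where a
    callee is written "Contract.function". Purely illustrative input shape.
    """
    graph = defaultdict(set)
    for contract, funcs in contracts.items():
        for func, callees in funcs.items():
            caller = f"{contract}.{func}"
            for callee in callees:
                graph[caller].add(callee)
    return graph

def reachable(graph, start):
    """All functions transitively reachable from `start`: the set an
    auditor must reason about when reviewing that entry point."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return seen

# Toy example: a market-close path that crosses three contracts.
contracts = {
    "Controller": {"closeMarket": ["Market.close"]},
    "Market": {"close": ["FeeLib.computeFees", "Market._settleBatches"],
               "_settleBatches": []},
    "FeeLib": {"computeFees": []},
}
graph = build_call_graph(contracts)
print(sorted(reachable(graph, "Controller.closeMarket")))
```

Even in this toy, a bug in `FeeLib.computeFees` is invisible unless you follow the chain from `Controller.closeMarket`; a file-at-a-time reviewer never sees the connection.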

Every High-Severity Finding: Detected

  • H-01: Fee calculation escape during market close – DETECTED
  • H-02: CREATE2 codehash bypass in market deployment – DETECTED
  • H-03: Missing maxTotalSupply enforcement and closeMarket exposure – DETECTED
  • H-04: Zero withdrawalBatchDuration race condition – DETECTED
  • H-05: Sanctions evasion via token transfer to clean address – DETECTED
  • H-06: Borrower draining sanctioned lender funds – DETECTED
Six high-severity findings. Six detections. No misses.
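Finding H-02 involves CREATE2 deployment, and a general property of CREATE2 is worth spelling out: the deployed address commits to the hash of the *init code*, not of the code that ends up at the address. Here is a sketch of the standard EIP-1014 address derivation. Note one loud assumption: Ethereum uses Keccak-256, which is not in Python's standard library, so `hashlib.sha3_256` stands in for the hash structure only and the resulting addresses do not match real chain addresses.

```python
import hashlib

def create2_address(deployer: str, salt: bytes, init_code: bytes) -> str:
    """EIP-1014 CREATE2 derivation, structurally:
    address = hash(0xff ++ deployer ++ salt ++ hash(init_code))[12:].
    NOTE: real Ethereum uses Keccak-256; hashlib.sha3_256 is a stdlib
    stand-in here, so outputs differ from on-chain addresses.
    """
    assert len(salt) == 32
    h = lambda b: hashlib.sha3_256(b).digest()
    preimage = b"\xff" + bytes.fromhex(deployer[2:]) + salt + h(init_code)
    return "0x" + h(preimage)[12:].hex()

addr = create2_address("0x" + "11" * 20, b"\x00" * 32, b"\x60\x00")
print(addr)  # deterministic: same deployer + salt + init code, same address
```

The design consequence: because the address commits to init code rather than runtime code, any verification scheme that treats the two as interchangeable leaves room for init code that deploys something other than what was checked, which is the general hazard class behind codehash-style bypasses.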

Medium-Severity Coverage

  • 8/10 – Medium Findings Detected
  • 90.3% – Combined Detection Rate
  • 18 – Total Findings Reported
  • 53 – Raw Agent Findings

Our engine detected 8 of the 10 medium-severity findings — for a combined high+medium detection rate of over 90%. The 53 raw findings from individual agents were deduplicated and prioritized into 18 final findings by the Scribe report synthesizer.
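The deduplication step can be pictured with a simplified model: group raw findings by a shared key and keep the highest severity in each group. The field names and grouping key below are assumptions for illustration, not the Scribe agent's actual logic.

```python
def deduplicate(raw_findings):
    """Collapse raw per-agent findings that describe the same issue.

    A finding is a dict with 'contract', 'issue', and 'severity';
    duplicates share the (contract, issue) key, and the merged finding
    keeps the highest severity seen. Simplified model, not Scribe's code.
    """
    rank = {"high": 2, "medium": 1, "low": 0}
    merged = {}
    for f in raw_findings:
        key = (f["contract"], f["issue"])
        if key not in merged or rank[f["severity"]] > rank[merged[key]["severity"]]:
            merged[key] = f
    # Report highest-severity findings first.
    return sorted(merged.values(), key=lambda f: -rank[f["severity"]])

raw = [
    {"contract": "Market", "issue": "fee escape on close", "severity": "high"},
    {"contract": "Market", "issue": "fee escape on close", "severity": "medium"},
    {"contract": "Factory", "issue": "codehash check bypass", "severity": "high"},
]
final = deduplicate(raw)
print(len(final))  # 2: the two Market duplicates merged, highest severity kept
```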

Why This Benchmark Is Different

Most AI security tools benchmark against simple, well-known vulnerability patterns — reentrancy, integer overflow, access control on single functions. Wildcat is fundamentally different:

Typical AI Benchmarks

  • Single-contract targets
  • Known patterns (reentrancy, overflow)
  • 50-200 lines of code
  • Synthetic/CTF targets
  • No ground truth comparison

Wildcat Benchmark

  • 22 interconnected contracts
  • Novel logic bugs (fee escape, sanctions evasion)
  • 2,332 lines of code
  • Real protocol from Code4rena contest
  • Scored against 144-warden contest results

Multi-Agent Coordination at Scale

At this scale, a single-agent approach falls apart. You need specialized agents that each bring domain expertise:

01 – Sentinel maps 142 functions across 20 contracts

It builds a complete call graph, token-flow analysis, and role hierarchy. This structural understanding is critical for cross-contract bug detection.

02 – Viper hunts logic bugs

Viper identifies the fee calculation escape in market close (H-01) and the withdrawal-batch race condition (H-04), bugs that require understanding temporal state transitions.
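To see why a zero withdrawalBatchDuration is dangerous, consider a toy batching model. The names and mechanics below are invented for illustration, not Wildcat's actual code: the point is only that with a zero duration, every batch expires the instant it opens, so the shared-batch semantics the protocol relies on collapse.

```python
class Batch:
    """Toy withdrawal batch: requests queue into a batch that becomes
    payable once `now >= expiry`. Illustrative model only."""
    def __init__(self, created_at, duration):
        self.expiry = created_at + duration
        self.requests = []

def request_withdrawal(batches, lender, amount, now, duration):
    """Append a request to the open batch, opening a new one if the
    current batch has already expired."""
    if not batches or now >= batches[-1].expiry:
        batches.append(Batch(now, duration))
    batches[-1].requests.append((lender, amount))
    return batches[-1]

batches = []
# With duration == 0 a batch expires in the same instant it is created,
# so every request lands in its own immediately-payable batch and
# withdrawals are no longer shared (and pro-rated) across one batch.
for lender in ("alice", "bob", "carol"):
    request_withdrawal(batches, lender, 100, now=5, duration=0)
print(len(batches))  # 3: one batch per request
```

With any positive duration, all three requests at `now=5` would share a single batch; the zero case silently changes who gets paid what, and when.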

03 – Warden audits access control

Warden catches the CREATE2 codehash bypass (H-02) and the missing maxTotalSupply enforcement (H-03), architectural flaws in the deployment and governance layer.

04 – Phantom finds economic exploits

Phantom discovers the sanctions evasion via token transfer (H-05) and the borrower drain attack (H-06), multi-step exploits that require adversarial economic reasoning.
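The sanctions-evasion class of bug (H-05) comes down to *where* screening happens. Here is a toy model, with invented names and deliberately simplified logic, of the general pattern: if the oracle is consulted at withdrawal but not on token transfer, sanctioned funds can hop to a fresh address first.

```python
SANCTIONED = {"0xBadLender"}  # stand-in for a sanctions oracle lookup

def withdraw(balances, account):
    """Toy withdrawal: the sanctions check screens only the account
    calling withdraw. Illustrative pattern, not Wildcat's actual code."""
    if account in SANCTIONED:
        raise PermissionError("sanctioned account blocked")
    amount, balances[account] = balances[account], 0
    return amount

def transfer(balances, src, dst, amount):
    # The token transfer path performs no sanctions screening here,
    # which is the gap this class of finding exploits.
    balances[src] -= amount
    balances[dst] = balances.get(dst, 0) + amount

balances = {"0xBadLender": 500}
transfer(balances, "0xBadLender", "0xFreshAddr", 500)  # hop to a clean address
print(withdraw(balances, "0xFreshAddr"))  # 500: screening bypassed
```

The fix in this model is equally simple to state: screen on every balance-moving path, not just the exit. Spotting where a real protocol fails to do so requires tracing token flow across contracts.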

05 – Scribe synthesizes the report

Scribe deduplicates 53 raw agent findings into 18 final findings, assigns severity ratings, and generates the professional PDF audit report.

Each agent ran for approximately 3 minutes. The entire audit completed in 11 minutes.
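The arithmetic (five agents at roughly three minutes each, eleven minutes total) implies the analysis agents overlap in time. As an illustration only, and not the published orchestration code, here is a scaled-down sketch of that shape: the four analysis agents run concurrently, then Scribe synthesizes their output.

```python
import concurrent.futures
import time

def run_agent(name, seconds):
    """Stand-in for one specialist agent; a real agent would be doing
    model-driven analysis rather than sleeping."""
    time.sleep(seconds)
    return f"{name}: done"

# 0.2 s stands in for ~3 minutes per agent. Because the four analysis
# agents run in parallel and only Scribe waits for them, wall-clock
# time is far below the sum of individual agent runtimes.
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = [pool.submit(run_agent, n, 0.2)
               for n in ("Sentinel", "Viper", "Warden", "Phantom")]
    results = [f.result() for f in futures]
report = run_agent("Scribe", 0.2)  # synthesis runs after the others finish
elapsed = time.perf_counter() - start
print(len(results), report)  # 4 analysis results, then the Scribe report
```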

What Competing with 144 Wardens Means

⚠️ Putting It in Perspective

In the original Code4rena contest, 144 experienced security wardens competed over several days to find vulnerabilities in Wildcat Protocol. Our AI engine matched their high-severity detection rate in 11 minutes. This is not about replacing human auditors — it is about establishing a measurable performance baseline that the industry has never had before.

We believe every audit — human or AI — should be held to measurable standards. Not vague claims of "thorough review" or "comprehensive analysis." Actual detection rates against known ground truth. That is the standard we hold ourselves to, and we challenge every other auditor to do the same.

The Challenge

If your audit provider — whether human or AI — cannot tell you their detection rate against standardized benchmarks, you should ask why. We publish ours because we have nothing to hide.

  • 100% – High Severity Found
  • 90.3% – Overall Detection
  • 11 min – Completed In
  • $0 – Missed Critical Bugs

The numbers do not lie. And we will keep publishing them.

Start a Free Smart Contract Audit

Want to secure your application or smart contract?

Request an Expert Review