The Web3 security industry runs on reputation and trust. Auditors say "we found critical bugs" but never show you how they compare against a known baseline. We took a different approach: we ran our AI audit engine against a real Code4rena contest with known findings — and we are publishing every metric.
The result? A perfect score.
The Benchmark: Code4rena 2022-09-vtvl
2/2
HIGH Findings Detected
3/3
MEDIUM Findings Detected
5/5
Foundry PoC Verified
5.7m
Total Audit Time
ℹ️Scope and Ground Truth
Comparison is against the official Code4rena 2022-09-vtvl report — 2 HIGH and 3 MEDIUM findings as published. QA / Low and out-of-scope items are not part of this comparison.
VTVL is a token vesting protocol audited in a Code4rena contest in September 2022. The contest identified 2 high-severity and 3 medium-severity vulnerabilities. Our AI engine found all of them — in under 6 minutes — and every finding was verified with a passing Foundry proof-of-concept test.
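The scoring described above can be sketched in a few lines. This is a minimal illustration, not RedVolt's scoring code: the function name `detection_rate`, the placeholder IDs `M-01` to `M-03` for the three medium findings, and the out-of-scope item `QA-07` are all assumptions made for the example.

```python
from collections import defaultdict

# Ground truth: finding ID -> severity. H-01/H-02 are the IDs used in this
# article; M-01..M-03 are placeholder labels for the three medium findings.
GROUND_TRUTH = {
    "H-01": "HIGH",
    "H-02": "HIGH",
    "M-01": "MEDIUM",
    "M-02": "MEDIUM",
    "M-03": "MEDIUM",
}

def detection_rate(ground_truth, detected_ids):
    """Return {severity: (found, total)}, counting only in-scope findings."""
    found = defaultdict(int)
    total = defaultdict(int)
    for finding_id, severity in ground_truth.items():
        total[severity] += 1
        if finding_id in detected_ids:
            found[severity] += 1
    return {severity: (found[severity], total[severity]) for severity in total}

# Hypothetical tool output. "QA-07" is not in the ground truth, so it is
# ignored by the score, mirroring the QA / Low exclusion stated above.
detected = {"H-01", "H-02", "M-01", "M-02", "M-03", "QA-07"}
scores = detection_rate(GROUND_TRUTH, detected)
print(scores)
```

Keeping the ground truth as an explicit mapping makes the comparison easy to rerun against any other published contest report.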
Why This Matters
⚠️The Industry Problem
Ask any smart contract auditor to show you their detection rate against a known vulnerability catalog. Most cannot. Not because they are bad at their job — but because the industry has never demanded measurable accountability. We think that needs to change, and we are starting with ourselves.
Human auditors are brilliant — but they are inconsistent. In Code4rena contests, the median warden misses 40-60% of high-severity findings. Top wardens find more, but even the best do not catch everything. The question is not whether humans or AI are better — it is whether your audit process has measurable, reproducible quality guarantees.
Ours does. Here is the proof.
Detailed Results
High-Severity Findings (2/2 Detected)
C4 Finding
- H-01: revokeClaim ignores vested-but-unwithdrawn tokens, causing loss of user funds
- H-02: uint112 overflow in _baseVestedAmount intermediate multiplication
RedVolt Detection
- H-01: DETECTED. Severity: HIGH, Forge PoC: PASS
- H-02: DETECTED. Severity: HIGH, Forge PoC: PASS
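To make H-01 concrete, here is a simplified accounting model of the bug class named in the finding title. It is a sketch in Python, not the VTVL contract code: the `Claim` fields, `revoke_buggy`, and `revoke_fixed` are hypothetical names, and vesting is reduced to a stored number.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    total: int        # tokens allocated to the recipient
    vested: int       # tokens vested so far (simplified to a stored value)
    withdrawn: int    # tokens the recipient has already withdrawn
    active: bool = True

def revoke_buggy(claim):
    """Deactivate the claim and return (paid_to_recipient, reclaimed_by_admin).

    Everything not yet withdrawn is reclaimed, including vested tokens.
    """
    reclaimed = claim.total - claim.withdrawn
    claim.active = False
    return 0, reclaimed

def revoke_fixed(claim):
    """One possible fix: settle the vested-but-unwithdrawn part first."""
    owed = claim.vested - claim.withdrawn
    reclaimed = claim.total - claim.vested
    claim.withdrawn = claim.vested
    claim.active = False
    return owed, reclaimed

buggy = revoke_buggy(Claim(total=1000, vested=600, withdrawn=100))
fixed = revoke_fixed(Claim(total=1000, vested=600, withdrawn=100))
print(buggy)  # (0, 900): the recipient loses 500 vested tokens
print(fixed)  # (500, 400)
```

In both versions the two returned amounts sum to the 900 unwithdrawn tokens; the difference is only who receives the 500 that had already vested.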
Medium-Severity Findings (3/3 Detected)
All three medium-severity findings from the Code4rena contest were identified and correctly classified.
Forge PoC Verification
This is where RedVolt separates from every other AI auditing tool. We do not just flag potential issues — our Forge agent writes and executes Foundry test cases that prove the vulnerability exists:
ℹ️H-01 Forge Output
Ran 1 test for test/Exploit.t.sol:RevokeClaimTest
[PASS] test_revokeClaimLosesVestedButUnwithdrawnTokens() (gas: 139058)
1 tests passed, 0 failed in 4.95ms
ℹ️H-02 Forge Output
Ran 1 test for test/Exploit.t.sol:Uint112OverflowTest
[PASS] test_uint112OverflowInBaseVestedAmount() (gas: 145713)
1 tests passed, 0 failed in 5.84ms
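The arithmetic behind H-02 can be illustrated without the contract itself. The sketch below assumes a linear vesting formula of the form amount × elapsed / duration and assumes checked arithmetic (an overflow aborts the call instead of wrapping). The helper names and the example numbers are hypothetical.

```python
UINT112_MAX = 2**112 - 1  # roughly 5.19e33

def mul_uint112(a, b):
    """Multiply as a checked uint112 would: fail instead of wrapping."""
    product = a * b
    if product > UINT112_MAX:
        raise OverflowError("uint112 multiplication overflow")
    return product

def vested_narrow(amount, elapsed, duration):
    """Intermediate product is held in uint112, so it can overflow."""
    return mul_uint112(amount, elapsed) // duration

def vested_wide(amount, elapsed, duration):
    """Intermediate product is computed in a wider type, then range-checked."""
    result = (amount * elapsed) // duration
    if result > UINT112_MAX:
        raise OverflowError("result does not fit in uint112")
    return result

# 1 billion tokens with 18 decimals, halfway through a 2e7-second schedule.
amount, elapsed, duration = 10**27, 10**7, 2 * 10**7

try:
    vested_narrow(amount, elapsed, duration)
    narrow_overflowed = False
except OverflowError:
    narrow_overflowed = True

print(narrow_overflowed)                       # True: 1e34 exceeds 2**112 - 1
print(vested_wide(amount, elapsed, duration))  # 500000000000000000000000000
```

The final result fits comfortably in 112 bits; only the intermediate product does not, which is why widening the multiplication is a common remedy for this class of bug.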
Every finding is backed by executable proof. No "potential issue" hand-waving. No "we recommend further investigation." A passing test that demonstrates the exploit.
The Multi-Agent Architecture
Our audit engine is not a single LLM reading code. It is a coordinated team of specialized AI agents working in sequence:
Sentinel — Protocol Mapper
Maps every contract, function, state variable, call graph, token flow, and role. Builds the foundation that all other agents work from.
Viper — Vulnerability Hunter
Hunts for logic bugs, arithmetic overflows, reentrancy, oracle manipulation, and economic exploits. Reasons about state transitions and invariant violations.
Warden — Access Control Auditor
Analyzes role hierarchies, permission checks, proxy initialization, and privilege escalation paths. Identifies centralization risks and governance attack vectors.
Phantom — Edge Case Finder
Explores extreme scenarios, economic edge cases, and multi-transaction attack sequences that other agents miss. Thinks like a MEV searcher.
Forge — PoC Generator
Takes findings from all agents and writes Foundry test cases that prove each vulnerability. If the PoC does not compile and pass, the finding is flagged for review.
Scribe — Report Synthesizer
Deduplicates findings, assigns final severity ratings, and generates a professional PDF audit report with executive summary, detailed findings, and remediation roadmap.
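The sequence above can be sketched as a small pipeline. This is an illustrative skeleton only, not RedVolt's implementation: the agent names come from this article, but `run_pipeline`, the `Finding` fields, and the stub agents are assumptions. The PoC step is a plain callback here, where a real Forge stage would compile and run a Foundry test.

```python
from dataclasses import dataclass

SEVERITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}

@dataclass
class Finding:
    title: str
    severity: str
    agent: str
    status: str = "unverified"

def run_pipeline(source, mapper, analysts, run_poc):
    # Sentinel: build the shared protocol map once.
    protocol_map = mapper(source)
    # Viper / Warden / Phantom: each analyst reads the same map.
    findings = []
    for analyst in analysts:
        findings.extend(analyst(protocol_map))
    # Forge: a finding whose PoC does not pass is flagged for review.
    for finding in findings:
        finding.status = "verified" if run_poc(finding) else "needs_review"
    # Scribe: deduplicate by title, keeping the highest severity seen.
    best = {}
    for finding in findings:
        kept = best.get(finding.title)
        if kept is None or SEVERITY_RANK[finding.severity] > SEVERITY_RANK[kept.severity]:
            best[finding.title] = finding
    return list(best.values())

# Stub agents for demonstration (hypothetical behavior).
def sentinel(source):
    return {"functions": ["revokeClaim", "withdraw"]}

def viper(protocol_map):
    return [Finding("revokeClaim loses vested tokens", "HIGH", "Viper")]

def phantom(protocol_map):
    return [
        Finding("revokeClaim loses vested tokens", "MEDIUM", "Phantom"),
        Finding("speculative edge case", "LOW", "Phantom"),
    ]

def fake_poc(finding):
    return finding.title != "speculative edge case"

report = run_pipeline("contract source", sentinel, [viper, phantom], fake_poc)
for item in report:
    print(item.severity, item.status, item.title)
```

Gating on the PoC result before synthesis means an unproven finding reaches the report marked for review and is never counted as verified.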
Performance Metrics
5.7 min
Total Audit Time
5/5
Findings at Correct Bucket
2/2
Forge PoC Verified
3
Pipeline Stages
5.7 minutes. That is the time from contract submission to a complete audit report with verified proof-of-concept exploits. A traditional audit firm takes 1-4 weeks for the same scope.
The Transparency Standard
We are not publishing these results to say "AI replaces human auditors." We are publishing them because the industry deserves measurable standards. When you hire an auditor — human or AI — you should be able to ask: "What is your detection rate against known vulnerability catalogs?"
If they cannot answer that question, you are paying for confidence without evidence.
💡Our Commitment
Every benchmark we run is scored against real Code4rena contest findings. We do not create synthetic vulnerabilities designed to make our tool look good. We use the same ground truth that dozens of professional wardens competed against. Our results are reproducible, auditable, and published openly. That is what accountability looks like.
ℹ️Want Independent Verification?
We can pair this benchmark with a manual expert review on your specific deployment: full audit output, side-by-side comparison against the C4 report, and verifier sign-off on each PoC.
Related reading
veRWA (8/8 HIGH) · Wildcat Protocol (6/6 HIGH) · BakerFi (7/7 HIGH) · Karak Restaking · Ethernaut + DVD (7/7) · Jito Restaking on Solana. For methodology, read RedVolt benchmark results.