AI Audit vs Code4rena VTVL: 5/5 Findings + 5/5 PoCs Verified

The Web3 security industry runs on reputation and trust. Auditors say "we found critical bugs" but rarely show how their results compare against a known baseline. We took a different approach: we ran our AI audit engine against a real Code4rena contest with known findings, and we are publishing every metric.

The result? A perfect score.

The Benchmark: Code4rena 2022-09-vtvl

HIGH Findings Detected: 2/2

MEDIUM Findings Detected: 3/3

Foundry PoC Verified: 5/5

Total Audit Time: 5.7m

ℹ️ Scope and Ground Truth

Comparison is against the official Code4rena 2022-09-vtvl report — 2 HIGH and 3 MEDIUM findings as published. QA / Low and out-of-scope items are not part of this comparison.

VTVL is a token vesting protocol audited in a Code4rena contest in September 2022. The contest identified 2 high-severity and 3 medium-severity vulnerabilities. Our AI engine found all of them — in under 6 minutes — and every finding was verified with a passing Foundry proof-of-concept test.
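As a rough illustration of how a comparison like this can be scored, the sketch below counts a ground-truth finding as detected only when the tool reports the same finding at the same severity. It is a minimal sketch under stated assumptions, not RedVolt's actual scoring code: the `Finding` class and `score` function are hypothetical, the medium finding IDs are placeholders (this article does not name them), and it assumes tool findings have already been mapped to report IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    finding_id: str  # report ID such as "H-01" (mapping to IDs is assumed done)
    severity: str    # severity bucket: "HIGH" or "MEDIUM"


def score(ground_truth, detected):
    """Return {severity: (matched, total)} for the ground-truth findings.

    A finding is matched only if the tool reported the same ID in the
    same severity bucket.
    """
    detected_set = set(detected)
    totals = {}
    for gt in ground_truth:
        matched, total = totals.get(gt.severity, (0, 0))
        totals[gt.severity] = (matched + (gt in detected_set), total + 1)
    return totals


# Hypothetical data: H-01 and H-02 come from the article, the medium IDs
# are placeholders.
ground_truth = [
    Finding("H-01", "HIGH"),
    Finding("H-02", "HIGH"),
    Finding("M-a", "MEDIUM"),
    Finding("M-b", "MEDIUM"),
    Finding("M-c", "MEDIUM"),
]
detected = list(ground_truth)  # a run that reports every finding correctly

result = score(ground_truth, detected)
print(result)
```

Reporting matched and total counts per severity keeps the denominator visible, which is what makes a figure like 2/2 or 3/3 checkable against the published report.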

Why This Matters

⚠️ The Industry Problem

Ask any smart contract auditor to show you their detection rate against a known vulnerability catalog. Most cannot. Not because they are bad at their job — but because the industry has never demanded measurable accountability. We think that needs to change, and we are starting with ourselves.

Human auditors are brilliant — but they are inconsistent. In Code4rena contests, the median warden misses 40-60% of high-severity findings. Top wardens find more, but even the best do not catch everything. The question is not whether humans or AI are better — it is whether your audit process has measurable, reproducible quality guarantees.

Ours does. Here is the proof.

Detailed Results

High-Severity Findings (2/2 Detected)

C4 finding and RedVolt detection:

• H-01: revokeClaim ignores vested-but-unwithdrawn tokens, causing loss of user funds. RedVolt: DETECTED (Severity: HIGH, Forge PoC: PASS)
• H-02: uint112 overflow in _baseVestedAmount intermediate multiplication. RedVolt: DETECTED (Severity: HIGH, Forge PoC: PASS)

Medium-Severity Findings (3/3 Detected)

All three medium-severity findings from the Code4rena contest were identified and correctly classified.

Forge PoC Verification

This is where RedVolt sets itself apart from other AI auditing tools. We do not just flag potential issues. Our Forge agent writes and executes Foundry test cases that prove the vulnerability exists:

ℹ️ H-01 Forge Output

Ran 1 test for test/Exploit.t.sol:RevokeClaimTest
[PASS] test_revokeClaimLosesVestedButUnwithdrawnTokens() (gas: 139058)
1 tests passed, 0 failed in 4.95ms
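To make the bug class behind H-01 concrete, here is a toy Python model of a revoke operation that forgets about tokens the user has vested but not yet withdrawn. It is a sketch of the idea in the finding's title, not VTVL's Solidity code: the `Claim` fields, the linear vesting formula, and both revoke functions are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    total: int          # tokens allocated to the recipient
    withdrawn: int = 0  # tokens the recipient has already withdrawn
    active: bool = True


def vested(claim, elapsed, duration):
    """Linear vesting: tokens earned after `elapsed` of `duration` seconds."""
    return claim.total * min(elapsed, duration) // duration


def revoke_buggy(claim, elapsed, duration):
    """Return (to_admin, owed_to_user), ignoring vested-but-unwithdrawn tokens."""
    claim.active = False
    to_admin = claim.total - claim.withdrawn  # sweeps the vested part too
    return to_admin, 0


def revoke_fixed(claim, elapsed, duration):
    """Return (to_admin, owed_to_user), reserving what the user already earned."""
    owed_to_user = vested(claim, elapsed, duration) - claim.withdrawn
    claim.active = False
    to_admin = claim.total - claim.withdrawn - owed_to_user
    return to_admin, owed_to_user


# Halfway through vesting, 500 of 1000 tokens are vested and 100 are withdrawn.
buggy = revoke_buggy(Claim(total=1000, withdrawn=100), elapsed=50, duration=100)
fixed = revoke_fixed(Claim(total=1000, withdrawn=100), elapsed=50, duration=100)
print(buggy)  # (900, 0): the user loses 400 vested tokens
print(fixed)  # (500, 400)
```

In this model the loss equals the vested amount minus the withdrawn amount at the moment of revocation, which is the quantity a proof-of-concept test for this kind of bug would assert on.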

ℹ️ H-02 Forge Output

Ran 1 test for test/Exploit.t.sol:Uint112OverflowTest
[PASS] test_uint112OverflowInBaseVestedAmount() (gas: 145713)
1 tests passed, 0 failed in 5.84ms
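The H-02 title describes an overflow in an intermediate multiplication, which can happen even when the final result fits comfortably in the type. The Python sketch below simulates that with a checked 112-bit multiply. It is an illustration under assumptions rather than the contract's code: the function names and the formula (amount times elapsed, divided by duration) are hypothetical, the numbers are chosen to make the overflow obvious, and the raise stands in for the revert that checked arithmetic produces in Solidity 0.8 and later.

```python
UINT112_MAX = 2**112 - 1


def checked_mul_uint112(a, b):
    """Multiply like a checked 112-bit unsigned type: raise instead of wrapping."""
    product = a * b
    if product > UINT112_MAX:
        raise OverflowError("uint112 multiplication overflow")
    return product


def vested_narrow(linear_amount, elapsed, duration):
    """Intermediate product is held in 112 bits, so it can overflow."""
    return checked_mul_uint112(linear_amount, elapsed) // duration


def vested_wide(linear_amount, elapsed, duration):
    """Multiply in a wider type; only the final result must fit in 112 bits."""
    result = linear_amount * elapsed // duration
    if result > UINT112_MAX:
        raise OverflowError("result does not fit in uint112")
    return result


amount, elapsed, duration = 2**100, 2**20, 2**21  # halfway through vesting

try:
    vested_narrow(amount, elapsed, duration)
    narrow_overflowed = False
except OverflowError:
    narrow_overflowed = True

print(narrow_overflowed)                                  # True
print(vested_wide(amount, elapsed, duration) == 2**99)    # True
```

The product `2**100 * 2**20` needs 121 bits, while the quotient `2**99` needs only 100, so widening before the multiplication avoids the failure without changing the result.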

Every finding is backed by executable proof. No "potential issue" hand-waving. No "we recommend further investigation." A passing test that demonstrates the exploit.

The Multi-Agent Architecture

Our audit engine is not a single LLM reading code. It is a coordinated team of specialized AI agents working in sequence:

Sentinel — Protocol Mapper

Maps every contract, function, state variable, call graph, token flow, and role. Builds the foundation that all other agents work from.

Viper — Vulnerability Hunter

Hunts for logic bugs, arithmetic overflows, reentrancy, oracle manipulation, and economic exploits. Reasons about state transitions and invariant violations.

Warden — Access Control Auditor

Analyzes role hierarchies, permission checks, proxy initialization, and privilege escalation paths. Identifies centralization risks and governance attack vectors.

Phantom — Edge Case Finder

Explores extreme scenarios, economic edge cases, and multi-transaction attack sequences that other agents miss. Thinks like a MEV searcher.

Forge — PoC Generator

Takes findings from all agents and writes Foundry test cases that prove each vulnerability. If the PoC does not compile and pass, the finding is flagged for review.

Scribe — Report Synthesizer

Deduplicates findings, assigns final severity ratings, and generates a professional PDF audit report with executive summary, detailed findings, and remediation roadmap.
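The sequence above can be sketched as a simple pipeline in which each stage reads and extends a shared context. This is a minimal sketch of the described flow, not RedVolt's implementation: `AuditContext`, `Finding`, `run_pipeline`, and the stub callables are all hypothetical, the hunters stand in for Viper, Warden, and Phantom, and report generation and severity assignment are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Finding:
    title: str
    severity: str
    poc_passed: Optional[bool] = None  # set by the PoC stage
    needs_review: bool = False


@dataclass
class AuditContext:
    source: str
    protocol_map: dict = field(default_factory=dict)
    findings: list = field(default_factory=list)


def run_pipeline(source, mapper, hunters, run_poc):
    ctx = AuditContext(source=source)
    # Mapping stage: build the protocol map the later stages work from.
    ctx.protocol_map = mapper(source)
    # Hunting stages: each hunter returns a list of findings.
    for hunt in hunters:
        ctx.findings.extend(hunt(ctx))
    # PoC stage: a finding without a passing PoC is flagged for review.
    for finding in ctx.findings:
        finding.poc_passed = run_poc(finding)
        finding.needs_review = not finding.poc_passed
    # Synthesis stage: deduplicate by title, keeping the first occurrence.
    unique = {}
    for finding in ctx.findings:
        unique.setdefault(finding.title, finding)
    ctx.findings = list(unique.values())
    return ctx


# Stub stages, for illustration only.
ctx = run_pipeline(
    source="contract source goes here",
    mapper=lambda src: {"contracts": []},
    hunters=[
        lambda c: [Finding("overflow in vesting math", "HIGH")],
        lambda c: [
            Finding("overflow in vesting math", "HIGH"),
            Finding("unchecked role", "MEDIUM"),
        ],
    ],
    run_poc=lambda f: f.severity == "HIGH",  # pretend only HIGH PoCs pass
)
print([(f.title, f.needs_review) for f in ctx.findings])
```

Passing the stages in as callables keeps the ordering logic separate from what each agent does, so a stage can be swapped or tested in isolation.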

Performance Metrics

Total Audit Time: 5.7 min

Findings at Correct Bucket: 5/5

Forge PoC Verified: 2/2

Pipeline Stages

5.7 minutes. That is the time from contract submission to a complete audit report with verified proof-of-concept exploits. A traditional audit firm takes 1-4 weeks for the same scope.

The Transparency Standard

We are not publishing these results to say "AI replaces human auditors." We are publishing them because the industry deserves measurable standards. When you hire an auditor — human or AI — you should be able to ask: "What is your detection rate against known vulnerability catalogs?"

If they cannot answer that question, you are paying for confidence without evidence.

💡 Our Commitment

Every benchmark we run is scored against real Code4rena contest findings. We do not create synthetic vulnerabilities designed to make our tool look good. We use the same ground truth that dozens of professional wardens competed against. Our results are reproducible, auditable, and published openly. That is what accountability looks like.

ℹ️ Want Independent Verification?

We can pair this benchmark with a manual expert review on your specific deployment: full audit output, side-by-side comparison against the C4 report, and verifier sign-off on each PoC.