A comprehensive human-evaluated benchmark for assessing LLM performance in software security. VADER evaluates models across four critical dimensions: vulnerability assessment, detection, explanation, and remediation using 174 real-world vulnerabilities from open-source projects.
arXiv:2505.19395 [cs.CR]
Model performance on vulnerability assessment, detection, explanation, and remediation tasks
Performance breakdown across four evaluation dimensions with statistical analysis
| Model | Mean Score |
|---|---|
| Claude-3.7 | 52.31% |
| Gemini-2.5-Pro | 53.58% |
| GPT-4.1 | 50.00% |
| GPT-4.5 | 49.19% |
| o3 | 54.62% |
| Grok 3 Beta | 52.02% |
| Statistic | Claude | Gemini | GPT-4.1 | GPT-4.5 | o3 | Grok |
|---|---|---|---|---|---|---|
| Mean | 52.30% | 52.76% | 49.08% | 49.20% | 54.60% | 51.38% |
| Std | 47.34% | 47.22% | 46.65% | 47.92% | 48.67% | 47.37% |
| 25% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| 50% | 80.00% | 80.00% | 60.00% | 60.00% | 90.00% | 60.00% |
| 75% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Statistic | Claude | Gemini | GPT-4.1 | GPT-4.5 | o3 | Grok |
|---|---|---|---|---|---|---|
| Mean | 53.74% | 56.03% | 53.45% | 50.29% | 55.46% | 53.74% |
| Std | 48.39% | 48.75% | 49.15% | 48.83% | 49.41% | 48.98% |
| 25% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| 50% | 100.00% | 100.00% | 100.00% | 50.00% | 100.00% | 100.00% |
| 75% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Statistic | Claude | Gemini | GPT-4.1 | GPT-4.5 | o3 | Grok |
|---|---|---|---|---|---|---|
| Mean | 51.72% | 53.83% | 49.81% | 48.66% | 54.21% | 52.11% |
| Std | 47.21% | 47.76% | 47.24% | 47.76% | 48.40% | 48.00% |
| 25% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| 50% | 66.67% | 66.67% | 66.67% | 66.67% | 83.33% | 66.67% |
| 75% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
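The Mean/Std/25%/50%/75% rows above have the shape of a pandas `describe()` summary; a minimal sketch of producing such a table from per-case scores, with the score values invented purely for illustration:

```python
import pandas as pd

# Hypothetical per-case scores (0-100), one column per model.
scores = pd.DataFrame({
    "Claude": [80, 0, 100, 60, 90],
    "o3":     [90, 0, 100, 80, 70],
})

# describe() reports count/mean/std/min/25%/50%/75%/max per column;
# select only the rows used in the tables above.
print(scores.describe().loc[["mean", "std", "25%", "50%", "75%"]])
```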
Performance comparison across four evaluation dimensions
VADER assesses LLM capabilities across the complete vulnerability handling lifecycle
Identify and classify security flaws using Common Weakness Enumeration (CWE) categories
Detect real-world vulnerabilities across 15 programming languages and multi-file contexts
Provide clear technical explanations of vulnerability causes and potential exploit scenarios
Generate clean, minimal patches that eliminate vulnerabilities without breaking functionality
Coverage of critical security flaws across multiple programming languages and contexts
Database query manipulation through unsanitized user input, allowing unauthorized data access, modification, or deletion. Most prevalent in Python and SQL-based applications.
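To make the failure mode concrete, here is a minimal Python/sqlite3 sketch (not a case drawn from the benchmark; `find_user` and the `users` table are hypothetical) contrasting string-built SQL with a parameterized query:

```python
import sqlite3

def find_user(conn: sqlite3.Connection, username: str):
    # Vulnerable pattern: splicing input into the SQL string lets
    # username = "' OR '1'='1" match every row:
    #   conn.execute(f"SELECT * FROM users WHERE name = '{username}'")
    # Parameterized query: the driver treats the input as data, not SQL.
    cur = conn.execute("SELECT * FROM users WHERE name = ?", (username,))
    return cur.fetchall()
```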
Injection of malicious scripts into web applications, enabling session hijacking, user redirection, and data theft. Common in JavaScript and web frameworks.
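Although XSS is most often discussed in JavaScript contexts, the standard server-side mitigation is easy to sketch with only the Python standard library (`comment_html` is a hypothetical helper, not benchmark code):

```python
from html import escape

def comment_html(comment: str) -> str:
    # Vulnerable pattern: f"<p>{comment}</p>" renders an input like
    # "<script>steal(document.cookie)</script>" as live markup.
    # Escaping converts <, >, &, and quotes into inert entities.
    return f"<p>{escape(comment)}</p>"
```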
Execution of arbitrary operating system commands through unsanitized input. Can lead to complete system compromise and data destruction.
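A minimal Python sketch of the usual remedy (`ping_host` is hypothetical); passing an argument list to `subprocess.run` avoids invoking a shell at all:

```python
import subprocess

def ping_host(host: str) -> int:
    # Vulnerable pattern: os.system(f"ping -c 1 {host}") lets
    # host = "example.com; rm -rf ~" run a second command.
    # An argument list goes to the OS directly, with no shell parsing.
    result = subprocess.run(["ping", "-c", "1", host], check=False)
    return result.returncode
```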
Unauthorized access to files and directories outside the intended scope through manipulated file paths. Prevalent in Python applications with file I/O operations.
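A minimal sketch of a containment check, assuming a hypothetical `UPLOAD_DIR` base directory:

```python
import os

UPLOAD_DIR = "/srv/uploads"  # hypothetical base directory

def read_upload(name: str) -> bytes:
    # Resolve symlinks and ".." components, then refuse any path that
    # escapes the base directory (e.g. name = "../../etc/passwd").
    base = os.path.realpath(UPLOAD_DIR)
    path = os.path.realpath(os.path.join(base, name))
    if os.path.commonpath([base, path]) != base:
        raise ValueError("path escapes the upload directory")
    with open(path, "rb") as f:
        return f.read()
```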
VADER focuses on High and Critical severity vulnerabilities (61% of cases). Distribution by severity level:

- Level 1: 7% of cases
- Level 2: 11% of cases
- Level 3: 21% of cases
- Level 4 (High): 41% of cases
- Level 5 (Critical): 20% of cases
Real-world vulnerability case demonstrating the four-stage evaluation process
A recursive function with no proper termination condition, leading to stack overflow and potential denial of service: it continues to recurse as long as invalid input is provided.
```python
# Vulnerable version: recursion with no bound on retries.
def validd():
    code = input("Enter book code: ")
    if not is_valid(code):  # is_valid() is defined elsewhere in the project
        print("Invalid code. Try again.")
        validd()  # Dangerous recursion
```
```python
# Patched version: iteration with a bounded retry counter.
def validd():
    retries = 0
    while retries < 3:
        code = input("Enter book code: ")
        if is_valid(code):
            break
        print("Invalid code. Try again.")
        retries += 1
```
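Bounding the retries with a counter turns the unbounded call chain into a constant-depth loop, so repeated invalid input can no longer grow the stack; after three failed attempts the function simply returns instead of recursing.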
Assessment expected: CWE-674 (Uncontrolled Recursion), Severity Level 4 (High)
Explanation required: Identify that the recursive call without proper termination conditions leads to stack overflow when invalid input is repeatedly provided.
Remediation expected: Replace recursion with iteration using a retry counter to prevent stack overflow.
Even state-of-the-art LLMs achieve only moderate success on VADER: the best-performing model, o3, averages just 54.62% across the benchmark.
Rigorous expert annotation and evaluation process ensuring benchmark quality
VADER's 174 vulnerability cases were constructed through a rigorous double-annotator process: each case was authored by one security expert and independently validated by a second before inclusion.
VADER captures real-world complexity through diverse language and file coverage: its 174 cases span 15 programming languages and include multi-file contexts.
Comprehensive vulnerability benchmark with expert annotations, scoring tools, and evaluation results
VADER's complete dataset, detailed evaluation rubrics, scoring tools, and visualized results are publicly available. Perfect for researchers developing vulnerability-aware LLMs and security tools.
Our research findings are advancing foundational model capabilities through human-generated, specialized datasets.