A comprehensive human-evaluated benchmark for assessing LLM performance in software security. VADER evaluates models across four critical dimensions: vulnerability assessment, detection, explanation, and remediation using 174 real-world vulnerabilities from open-source projects.
arXiv:2505.19395 [cs.CR]
Model performance on vulnerability assessment, detection, explanation, and remediation tasks
Performance breakdown across four evaluation dimensions with statistical analysis
| Model | Mean Score |
|---|---|
| Claude-3.7 | 52.31% |
| Gemini-2.5-Pro | 53.58% |
| GPT-4.1 | 50.00% |
| GPT-4.5 | 49.19% |
| o3 | 54.62% |
| Grok 3 Beta | 52.02% |
| Statistic | Claude | Gemini | GPT-4.1 | GPT-4.5 | o3 | Grok |
|---|---|---|---|---|---|---|
| Mean | 52.30% | 52.76% | 49.08% | 49.20% | 54.60% | 51.38% |
| Std | 47.34% | 47.22% | 46.65% | 47.92% | 48.67% | 47.37% |
| 25% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| 50% | 80.00% | 80.00% | 60.00% | 60.00% | 90.00% | 60.00% |
| 75% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Statistic | Claude | Gemini | GPT-4.1 | GPT-4.5 | o3 | Grok |
|---|---|---|---|---|---|---|
| Mean | 53.74% | 56.03% | 53.45% | 50.29% | 55.46% | 53.74% |
| Std | 48.39% | 48.75% | 49.15% | 48.83% | 49.41% | 48.98% |
| 25% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| 50% | 100.00% | 100.00% | 100.00% | 50.00% | 100.00% | 100.00% |
| 75% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Statistic | Claude | Gemini | GPT-4.1 | GPT-4.5 | o3 | Grok |
|---|---|---|---|---|---|---|
| Mean | 51.72% | 53.83% | 49.81% | 48.66% | 54.21% | 52.11% |
| Std | 47.21% | 47.76% | 47.24% | 47.76% | 48.40% | 48.00% |
| 25% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| 50% | 66.67% | 66.67% | 66.67% | 66.67% | 83.33% | 66.67% |
| 75% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
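The Mean/Std/25%/50%/75% rows above have the shape of a pandas `describe()` summary; a minimal sketch of producing such a table from per-case scores, with the score values invented purely for illustration:

```python
import pandas as pd

# Hypothetical per-case scores (0-100), one column per model.
scores = pd.DataFrame({
    "Claude": [80, 0, 100, 60, 90],
    "o3":     [90, 0, 100, 80, 70],
})

# describe() reports count/mean/std/min/25%/50%/75%/max per column;
# select only the rows used in the tables above.
print(scores.describe().loc[["mean", "std", "25%", "50%", "75%"]])
```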
Performance comparison across four evaluation dimensions
VADER assesses LLM capabilities across the complete vulnerability handling lifecycle
Identify and classify security flaws using Common Weakness Enumeration (CWE) categories
Detect real-world vulnerabilities across 15 programming languages and multi-file contexts
Provide clear technical explanations of vulnerability causes and potential exploit scenarios
Generate clean, minimal patches that eliminate vulnerabilities without breaking functionality
Coverage of critical security flaws across multiple programming languages and contexts
Database query manipulation through unsanitized user input, allowing unauthorized data access, modification, or deletion. Most prevalent in Python and SQL-based applications.
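To make the failure mode concrete, here is a minimal Python/sqlite3 sketch (not a case drawn from the benchmark; `find_user` and the `users` table are hypothetical) contrasting string-built SQL with a parameterized query:

```python
import sqlite3

def find_user(conn: sqlite3.Connection, username: str):
    # Vulnerable pattern: splicing input into the SQL string lets
    # username = "' OR '1'='1" match every row:
    #   conn.execute(f"SELECT * FROM users WHERE name = '{username}'")
    # Parameterized query: the driver treats the input as data, not SQL.
    cur = conn.execute("SELECT * FROM users WHERE name = ?", (username,))
    return cur.fetchall()
```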
Injection of malicious scripts into web applications, enabling session hijacking, user redirection, and data theft. Common in JavaScript and web frameworks.
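Although XSS is most often discussed in JavaScript contexts, the standard server-side mitigation is easy to sketch with only the Python standard library (`comment_html` is a hypothetical helper, not benchmark code):

```python
from html import escape

def comment_html(comment: str) -> str:
    # Vulnerable pattern: f"<p>{comment}</p>" renders an input like
    # "<script>steal(document.cookie)</script>" as live markup.
    # Escaping converts <, >, &, and quotes into inert entities.
    return f"<p>{escape(comment)}</p>"
```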
Execution of arbitrary operating system commands through unsanitized input. Can lead to complete system compromise and data destruction.
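A minimal Python sketch of the usual remedy (`ping_host` is hypothetical); passing an argument list to `subprocess.run` avoids invoking a shell at all:

```python
import subprocess

def ping_host(host: str) -> int:
    # Vulnerable pattern: os.system(f"ping -c 1 {host}") lets
    # host = "example.com; rm -rf ~" run a second command.
    # An argument list goes to the OS directly, with no shell parsing.
    result = subprocess.run(["ping", "-c", "1", host], check=False)
    return result.returncode
```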
Unauthorized access to files and directories outside the intended scope through manipulated file paths. Prevalent in Python applications with file I/O operations.
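A minimal sketch of a containment check, assuming a hypothetical `UPLOAD_DIR` base directory:

```python
import os

UPLOAD_DIR = "/srv/uploads"  # hypothetical base directory

def read_upload(name: str) -> bytes:
    # Resolve symlinks and ".." components, then refuse any path that
    # escapes the base directory (e.g. name = "../../etc/passwd").
    base = os.path.realpath(UPLOAD_DIR)
    path = os.path.realpath(os.path.join(base, name))
    if os.path.commonpath([base, path]) != base:
        raise ValueError("path escapes the upload directory")
    with open(path, "rb") as f:
        return f.read()
```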
VADER focuses on High and Critical severity vulnerabilities (61% of cases). Distribution by severity level:

- Level 1: 7% of cases
- Level 2: 11% of cases
- Level 3: 21% of cases
- Level 4 (High): 41% of cases
- Level 5 (Critical): 20% of cases
Real-world vulnerability case demonstrating the four-stage evaluation process
A recursive function with no proper termination condition, leading to stack overflow and potential denial of service: it continues to recurse as long as invalid input is provided.
```python
# Vulnerable version: recursion with no bound on retries.
def validd():
    code = input("Enter book code: ")
    if not is_valid(code):  # is_valid() is defined elsewhere in the project
        print("Invalid code. Try again.")
        validd()  # Dangerous recursion
```
```python
# Patched version: iteration with a bounded retry counter.
def validd():
    retries = 0
    while retries < 3:
        code = input("Enter book code: ")
        if is_valid(code):
            break
        print("Invalid code. Try again.")
        retries += 1
```
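Bounding the retries with a counter turns the unbounded call chain into a constant-depth loop, so repeated invalid input can no longer grow the stack; after three failed attempts the function simply returns instead of recursing.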
Assessment expected: CWE-674 (Uncontrolled Recursion), Severity Level 4 (High)
Explanation required: Identify that the recursive call without proper termination conditions leads to stack overflow when invalid input is repeatedly provided.
Remediation expected: Replace recursion with iteration using a retry counter to prevent stack overflow.
Even state-of-the-art LLMs achieve only moderate success on VADER: the best-performing model, o3, averages just 54.62% across the benchmark.
Rigorous expert annotation and evaluation process ensuring benchmark quality
VADER's 174 vulnerability cases were constructed through a rigorous double-annotator process: each case was authored by one security expert and independently validated by a second before inclusion.
VADER captures real-world complexity through diverse language and file coverage: its 174 cases span 15 programming languages and include multi-file contexts.
Comprehensive vulnerability benchmark with expert annotations, scoring tools, and evaluation results
VADER's complete dataset, detailed evaluation rubrics, scoring tools, and visualized results are publicly available. Perfect for researchers developing vulnerability-aware LLMs and security tools.
Our research findings are advancing foundational model capabilities through human-generated, specialized datasets.