VADER: Vulnerability Assessment, Detection, Explanation, and Remediation

A comprehensive, human-evaluated benchmark for assessing LLM performance in software security. VADER evaluates models across four critical dimensions: vulnerability assessment, detection, explanation, and remediation, using 174 real-world vulnerabilities drawn from open-source projects.

Security Assessment
Vulnerability Detection
Code Remediation
15 Programming Languages
Read the paper: arXiv:2505.19395 [cs.CR]

VADER Benchmark Results

Model performance on vulnerability assessment, detection, explanation, and remediation tasks

Overall accuracy by model:

OpenAI o3: 54.6%
Gemini 2.5 Pro: 53.6%
Claude 3.7: 52.3%
Grok 3 Beta: 52.0%
GPT-4.1: 50.0%
GPT-4.5: 49.2%

Performance measured on 174 real-world vulnerability cases. Score: overall accuracy across all four evaluation dimensions.

Detailed Performance Statistics

Performance breakdown across four evaluation dimensions with statistical analysis

Overall Performance Across Models

Model            Mean Score
Claude-3.7       52.31%
Gemini-2.5-Pro   53.58%
GPT-4.1          50.00%
GPT-4.5          49.19%
o3               54.62%
Grok 3 Beta      52.02%

Remediation Performance

Statistic   Claude    Gemini    GPT-4.1   GPT-4.5   o3        Grok
Mean        52.30%    52.76%    49.08%    49.20%    54.60%    51.38%
Std         47.34%    47.22%    46.65%    47.92%    48.67%    47.37%
25%         0.00%     0.00%     0.00%     0.00%     0.00%     0.00%
50%         80.00%    80.00%    60.00%    60.00%    90.00%    60.00%
75%         100.00%   100.00%   100.00%   100.00%   100.00%   100.00%

Explanation Performance

Statistic   Claude    Gemini    GPT-4.1   GPT-4.5   o3        Grok
Mean        53.74%    56.03%    53.45%    50.29%    55.46%    53.74%
Std         48.39%    48.75%    49.15%    48.83%    49.41%    48.98%
25%         0.00%     0.00%     0.00%     0.00%     0.00%     0.00%
50%         100.00%   100.00%   100.00%   50.00%    100.00%   100.00%
75%         100.00%   100.00%   100.00%   100.00%   100.00%   100.00%

Other Performance (CWE + Severity)

Statistic   Claude    Gemini    GPT-4.1   GPT-4.5   o3        Grok
Mean        51.72%    53.83%    49.81%    48.66%    54.21%    52.11%
Std         47.21%    47.76%    47.24%    47.76%    48.40%    48.00%
25%         0.00%     0.00%     0.00%     0.00%     0.00%     0.00%
50%         66.67%    66.67%    66.67%    66.67%    83.33%    66.67%
75%         100.00%   100.00%   100.00%   100.00%   100.00%   100.00%

Performance comparison across four evaluation dimensions

Four-Stage Evaluation Framework

VADER assesses LLM capabilities across the complete vulnerability handling lifecycle

Vulnerability Assessment

Identify and classify security flaws using Common Weakness Enumeration (CWE) categories

Threat Detection

Detect real-world vulnerabilities across 15 programming languages and multi-file contexts

Root Cause Explanation

Provide clear technical explanations of vulnerability causes and potential exploit scenarios

Code Remediation

Generate clean, minimal patches that eliminate vulnerabilities without breaking functionality

Vulnerability Categories

Coverage of critical security flaws across multiple programming languages and contexts

SQL Injection (CWE-89)

High to Critical

Database query manipulation through unsanitized user input, allowing unauthorized data access, modification, or deletion. Most prevalent in Python and SQL-based applications.

Prevalence: Most Common
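
For illustration only (not a case drawn from the VADER dataset), a minimal Python sketch of the pattern using the standard-library sqlite3 module: a string-built query is injectable, while a parameterized query keeps input as data.

import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, username: str):
    # VULNERABLE (CWE-89): user input is interpolated into the SQL string,
    # so input like "x' OR '1'='1" rewrites the query itself.
    return conn.execute(
        f"SELECT id FROM users WHERE name = '{username}'").fetchone()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # FIXED: a parameterized query treats the input strictly as data.
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (username,)).fetchone()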

Cross-Site Scripting (CWE-79)

Medium to High

Injection of malicious scripts into web applications, enabling session hijacking, user redirection, and data theft. Common in JavaScript and web frameworks.

Prevalence: Very Common
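
An illustrative Python sketch of the pattern (not a dataset case): embedding user text in HTML verbatim versus escaping it with the standard-library html module.

from html import escape

def render_comment_unsafe(comment: str) -> str:
    # VULNERABLE (CWE-79): user text is embedded verbatim, so a comment
    # containing "<script>...</script>" executes in the viewer's browser.
    return f"<div class='comment'>{comment}</div>"

def render_comment_safe(comment: str) -> str:
    # FIXED: escape HTML metacharacters so the comment renders as plain text.
    return f"<div class='comment'>{escape(comment)}</div>"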

OS Command Injection (CWE-78)

Critical

Execution of arbitrary operating system commands through unsanitized input. Can lead to complete system compromise and data destruction.

Prevalence: Common
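
An illustrative Python sketch (not a dataset case): passing user input through a shell versus supplying it as a single argument.

import subprocess

def ping_host_unsafe(host: str) -> int:
    # VULNERABLE (CWE-78): shell=True hands the whole string to the shell,
    # so input like "example.com; rm -rf /" runs a second command.
    return subprocess.call(f"ping -c 1 {host}", shell=True)

def ping_host_safe(host: str) -> int:
    # FIXED: an argument list bypasses the shell; the host is one argv
    # entry and cannot introduce new commands.
    return subprocess.call(["ping", "-c", "1", host])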

Path Traversal (CWE-22)

Medium to High

Unauthorized access to files and directories outside the intended scope through manipulated file paths. Prevalent in Python applications with file I/O operations.

Prevalence: Common
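
An illustrative Python sketch (assumes Python 3.9+ for Path.is_relative_to; the upload directory is hypothetical): resolve the requested path and reject anything that escapes the intended root.

from pathlib import Path

UPLOAD_DIR = Path("/srv/app/uploads")  # hypothetical document root

def read_upload(filename: str) -> bytes:
    # Resolve symlinks and ".." segments, then check containment, so a
    # request like "../../etc/passwd" is rejected (CWE-22).
    target = (UPLOAD_DIR / filename).resolve()
    if not target.is_relative_to(UPLOAD_DIR.resolve()):
        raise PermissionError(f"path escapes upload directory: {filename}")
    return target.read_bytes()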

Severity Distribution

VADER focuses on High and Critical severity vulnerabilities (61% of cases)

Very Low:   12 cases (7%)
Low:        19 cases (11%)
Medium:     37 cases (21%)
High:       71 cases (41%)
Critical:   35 cases (20%)

Vulnerability Example

Real-world vulnerability case demonstrating the four-stage evaluation process

CWE-674: Uncontrolled Recursion

Vulnerability Description

A recursive input-validation routine with no termination bound: each invalid input triggers another recursive call, so sustained bad input exhausts the call stack and enables a denial of service.

Vulnerable Code

def validd():
    code = input("Enter book code: ")
    if not is_valid(code):
        print("Invalid code. Try again.")
        validd()  # VULNERABLE: unbounded recursion on every invalid input (CWE-674)

Remediated Code

def validd():
    retries = 0
    while retries < 3:  # bounded retry loop replaces the unbounded recursion
        code = input("Enter book code: ")
        if is_valid(code):
            break
        print("Invalid code. Try again.")
        retries += 1

VADER Evaluation Tasks

1. Classification & Assessment

Expected: CWE-674 (Uncontrolled Recursion), Severity Level 4 (High)

2. Explanation

Required: Identify that the recursive call without proper termination conditions leads to stack overflow when invalid input is repeatedly provided.

3. Remediation

Expected: Replace recursion with iteration using a retry counter to prevent stack overflow.

4. Test Plan

  • Input invalid codes repeatedly to trigger the stack overflow
  • Test with special characters, empty input, and long strings
  • Verify that the patched code enforces the retry limit
  • Confirm normal operation with valid input
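
As a sketch, two of these checks automated with unittest.mock, assuming the remediated validd() and the is_valid() helper live in a hypothetical bookcode module:

from unittest import mock

import bookcode  # hypothetical module containing validd() and is_valid()

def test_retry_limit_stops_after_three_attempts():
    # Repeated invalid input must hit the retry limit, not the stack limit.
    with mock.patch("builtins.input", side_effect=["bad"] * 3), \
         mock.patch.object(bookcode, "is_valid", return_value=False):
        bookcode.validd()  # returns normally; no RecursionError

def test_valid_input_accepted_first_try():
    # Normal operation: a valid code is accepted on the first attempt.
    with mock.patch("builtins.input", return_value="ABC-123"), \
         mock.patch.object(bookcode, "is_valid", return_value=True):
        bookcode.validd()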

Current Model Performance

Even state-of-the-art LLMs achieve only moderate success on VADER

  • OpenAI o3 (best): 54.6% overall, the top-performing model
  • Other models: 49-54% (GPT-4.5, Gemini 2.5 Pro, Claude 3.7)
  • Remediation and classification scores correlate strongly (r > 0.97)

Methodology

Rigorous expert annotation and evaluation process ensuring benchmark quality

Expert-Curated Dataset Construction

VADER's 174 vulnerability cases were constructed through a rigorous double-annotator process:

  • Security experts with 6+ years of experience curated real-world vulnerabilities
  • Each case underwent independent review by a second security engineer
  • Vulnerabilities sourced from production open-source projects
  • Comprehensive annotation including CWE classification, severity rating, and test plans

Multi-Language & Multi-File Context

VADER captures real-world complexity through diverse language and file coverage:

  • 15 programming languages: JavaScript, Python, TypeScript, PHP, Go, C/C++, and more
  • 75% of cases involve multi-language contexts
  • 23% of cases span multiple source files (up to 4 files)
  • Focus on High (41%) and Critical (20%) severity vulnerabilities

Human Evaluation Protocol

Scoring Rubric

Remediation (50%): Quality of code fix and vulnerability elimination
Explanation (20%): Accuracy of root cause analysis and impact description
Classification & Test Plan (30%): Correct CWE identification and test coverage
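
As an illustration of how these weights might combine into the overall score (the paper's exact aggregation may differ), a short Python sketch:

# Hypothetical sketch: combine per-dimension scores with the rubric weights.
WEIGHTS = {"remediation": 0.50, "explanation": 0.20, "classification_test_plan": 0.30}

def overall_score(scores: dict) -> float:
    # Each per-dimension score is assumed to lie in [0, 1].
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

print(overall_score({"remediation": 0.9,
                     "explanation": 0.8,
                     "classification_test_plan": 0.5}))  # 0.76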

Evaluation Process

  • Blind evaluation by independent security experts
  • Standardized one-shot prompting across all models
  • Comprehensive scoring with confidence intervals

Access VADER Dataset

Comprehensive vulnerability benchmark with expert annotations, scoring tools, and evaluation results

Open Source Benchmark Available

VADER's complete dataset, detailed evaluation rubrics, scoring tools, and visualized results are publicly available. Perfect for researchers developing vulnerability-aware LLMs and security tools.


Connect with our Team

Our research findings are advancing foundational model capabilities through human-generated, specialized datasets.