AI for Vulnerability Management

Vulnerability management is drowning in alerts. The traditional fix — CVSS + manual triage — doesn't scale when a mid-sized org has tens of thousands of open findings. AI-based prioritization (exploitability prediction, exposure scoring, asset-aware risk) is the response. The sources below are the research and technical write-ups that show how these systems actually work, not the vendor pages that promise they do.

Start here

  • The EPSS Model [docs] — FIRST's technical reference for the Exploit Prediction Scoring System: how the ML model collects ~1,400 vulnerability attributes (CPE vendor data, CVSS vectors, CWE, CISA KEV membership, public exploit-code presence across Exploit-DB/GitHub/Metasploit, offensive-tool signatures, honeypot and IDS/IPS telemetry from data partners), trains on a 12-month window validated against 2 months of unseen future data, and produces a daily per-CVE probability of exploitation in the next 30 days. The evaluation framework — effort, efficiency, coverage — is the right vocabulary for any practitioner trying to argue for a non-CVSS prioritization strategy. Companion paper: arxiv.org/abs/2302.14172. A minimal sketch of consuming the daily scores follows this list. (FIRST; maintained 2026)

  • Vuln.AI: Our AI-powered leap into vulnerability management at Microsoft [blog] — Keith Boyd (Microsoft Digital) describes the two-agent system Microsoft built for its own enterprise network: a Research Agent that ingests CVE feeds from supplier APIs and device metadata from an Azure Data Explorer lakehouse to identify impacted devices, and an Interactive Agent that fields follow-up questions from network engineers and orchestrates remediation workflows through Azure Durable Functions and OpenAI models. Quantified results on a real 300,000-employee, 25,000-device environment: 70% reduction in time to vulnerability insights, 50%+ reduction in triage time. The clearest primary account of what an agentic AI vulnerability-management system looks like in production at scale. (Microsoft Inside Track; Oct 2025)

  • Prompting the Priorities: A First Look at Evaluating LLMs for Vulnerability Triage and Prioritization [paper] — Al Haddad, Ikram, Ahmed, and Lee test four frontier LLMs (ChatGPT, Claude, Gemini, DeepSeek) across 12 prompting strategies against 384 real-world vulnerabilities from VulZoo, using the Stakeholder-Specific Vulnerability Categorization (SSVC) decision framework as ground truth — generating over 165,000 model queries. Key findings: Gemini performs best overall, exemplar-based prompting improves all models, and all models systematically over-predict risk. Only DeepSeek achieves fair agreement on weighted metrics. The paper is the empirical baseline for any practitioner evaluating LLM-assisted triage against a structured prioritization framework. An exemplar-prompting sketch also follows this list. (Al Haddad et al.; arXiv 2510.18508; Oct 2025)
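
A minimal sketch of consuming the daily scores the EPSS entry above describes, assuming FIRST's public endpoint at https://api.first.org/data/v1/epss and its JSON response shape; the CVE list and the 0.1 cutoff are illustrative choices, not guidance from the EPSS documentation.

```python
# Sketch: pull the latest EPSS probabilities and rank open findings by them.
import requests

def epss_scores(cve_ids):
    """Fetch the latest EPSS probability for a batch of CVEs."""
    resp = requests.get(
        "https://api.first.org/data/v1/epss",
        params={"cve": ",".join(cve_ids)},
        timeout=30,
    )
    resp.raise_for_status()
    # Each record carries the modeled probability of exploitation in the next 30 days.
    return {row["cve"]: float(row["epss"]) for row in resp.json()["data"]}

if __name__ == "__main__":
    findings = ["CVE-2021-44228", "CVE-2017-0144", "CVE-2019-0708"]
    # Rank findings by predicted exploitation probability, highest first.
    ranked = sorted(epss_scores(findings).items(), key=lambda kv: kv[1], reverse=True)
    for cve, p in ranked:
        flag = "prioritize" if p >= 0.1 else "defer"  # illustrative cutoff only
        print(f"{cve}  epss={p:.3f}  {flag}")
```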

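A minimal sketch of the exemplar-based (few-shot) prompting pattern the triage paper evaluates. The exemplars, the prompt wording, and the call_llm() stub are assumptions for illustration, not the paper's prompts; the output labels follow CISA's SSVC decision outcomes (Track, Track*, Attend, Act).

```python
# Sketch: build a few-shot SSVC triage prompt and validate the model's answer.
SSVC_DECISIONS = ["Track", "Track*", "Attend", "Act"]

# Hypothetical labeled exemplars: (vulnerability summary, ground-truth decision).
EXEMPLARS = [
    ("Remote code execution in an internet-facing VPN appliance; "
     "public exploit code available and actively scanned for.", "Act"),
    ("Local privilege escalation requiring authenticated access on an "
     "internal build server; no known exploit.", "Track"),
]

def build_prompt(description: str) -> str:
    """Assemble a few-shot prompt that asks for a single SSVC decision."""
    lines = [
        "You are a vulnerability triage assistant. Answer with exactly one "
        "SSVC decision: " + ", ".join(SSVC_DECISIONS) + ".",
        "",
    ]
    for summary, decision in EXEMPLARS:
        lines += [f"Vulnerability: {summary}", f"Decision: {decision}", ""]
    lines += [f"Vulnerability: {description}", "Decision:"]
    return "\n".join(lines)

def call_llm(prompt: str) -> str:
    """Placeholder for whichever chat-completion client is under evaluation."""
    raise NotImplementedError

def triage(description: str) -> str:
    answer = call_llm(build_prompt(description)).strip()
    # The paper finds all models over-predict risk, so treat anything that is
    # not a valid label as a parse failure rather than guessing a decision.
    return answer if answer in SSVC_DECISIONS else "NEEDS_HUMAN_REVIEW"
```
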
Go deeper

  • Enhancing Vulnerability Prioritization: Data-Driven Exploit Predictions with Community-Driven Insights [paper] — Jacobs, Romanosky, Suciu, Edwards, and Sarabi describe EPSS version 3: the full ML pipeline, the community-sourced feature set built with more than 170 industry experts, the performance evaluation framework (effort/efficiency/coverage trade-off curves), and the headline 82% improvement over v2 in distinguishing exploited from unexploited vulnerabilities. The paper that underpins the EPSS model page: read it when you want to understand why the model makes the choices it does, not just how to consume its output. A sketch of the effort/efficiency/coverage computation follows this list. (Jacobs et al.; arXiv 2302.14172; Feb 2023)

  • Conflicting Scores, Confusing Signals: An Empirical Study of Vulnerability Scoring Systems [paper] — Koscinski, Nelson, Okutan, Falso, and Mirakhorli evaluate CVSS, SSVC, EPSS, and Microsoft's Exploitability Index against 600 real-world Microsoft vulnerabilities drawn from four months of Patch Tuesday disclosures. Finding: the four systems rank the same vulnerabilities substantially differently; for example, fewer than 20% of CVEs that eventually entered CISA's KEV catalog ever had an EPSS score above 0.5 prior to inclusion. The empirical argument for why a single scoring signal is not a prioritization strategy, and why any AI-assisted system must account for the divergence. A divergence-check sketch also follows this list. (Koscinski et al.; arXiv 2508.13644; Aug 2025)

  • SoK: DARPA's AI Cyber Challenge (AIxCC): Competition Design, Architectures, and Lessons Learned [paper] — Zhang, Fleischer, Fu, and 20 co-authors from Georgia Tech, Texas A&M, Smart Information Flow Technologies, and Kudu Dynamics systematize the largest LLM-based autonomous vulnerability discovery competition to date: seven finalist teams, seven architectural philosophies (ensemble-first, agentic, simplicity-with-diversity, DSPy-based, and more), and full performance decomposition across vulnerability discovery, patching, and validation phases. Core lessons: reliability beats sophistication (the winner scored 80% higher than second place largely through sustained uptime); parallel fuzzing plus LLM-guided PoV generation outperforms either alone; patch semantic correctness remains an open problem even for the top systems. The definitive technical post-mortem for practitioners thinking about deploying autonomous vulnerability-discovery pipelines. (Zhang et al.; arXiv 2602.07666; Feb 2026)
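
A minimal sketch of the effort/efficiency/coverage framing the EPSS v3 paper uses to compare prioritization strategies, reading coverage as recall over exploited CVEs, efficiency as precision over remediated CVEs, and effort as the share of the backlog a strategy remediates; the toy scores and thresholds are illustrative.

```python
# Sketch: score a "remediate everything at or above threshold" strategy.
def evaluate_strategy(scores: dict[str, float], exploited: set[str], threshold: float) -> dict[str, float]:
    remediate = {cve for cve, s in scores.items() if s >= threshold}
    hits = len(remediate & exploited)
    return {
        # Effort: share of the whole backlog the strategy asks you to remediate.
        "effort": len(remediate) / len(scores) if scores else 0.0,
        # Coverage: share of exploited CVEs the strategy would have remediated (recall).
        "coverage": hits / len(exploited) if exploited else 0.0,
        # Efficiency: share of remediated CVEs that were actually exploited (precision).
        "efficiency": hits / len(remediate) if remediate else 0.0,
    }

# Toy data; sweeping the threshold traces the trade-off curves the paper plots.
scores = {"CVE-A": 0.92, "CVE-B": 0.40, "CVE-C": 0.07, "CVE-D": 0.02}
exploited = {"CVE-A", "CVE-C"}
for t in (0.5, 0.1, 0.05):
    print(t, evaluate_strategy(scores, exploited, t))
```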

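A minimal sketch of the kind of divergence check the scoring-systems study motivates: how much the top of each system's queue overlaps, and what share of KEV-listed CVEs cleared an EPSS cutoff. The toy scores, the 0.5 cutoff, and the top-k measure are assumptions for illustration, not the study's methodology.

```python
# Sketch: compare what two scoring systems would put at the front of the queue.
def top_k(scores: dict[str, float], k: int) -> set[str]:
    """The k CVEs a scoring system would rank highest."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def topk_overlap(a: dict[str, float], b: dict[str, float], k: int) -> float:
    """Fraction of one system's top-k that the other system also ranks top-k."""
    return len(top_k(a, k) & top_k(b, k)) / k

def kev_share_above_cutoff(kev: set[str], epss: dict[str, float], cutoff: float = 0.5) -> float:
    """Share of KEV-listed CVEs whose EPSS score exceeded the cutoff."""
    return sum(1 for cve in kev if epss.get(cve, 0.0) > cutoff) / len(kev) if kev else 0.0

# Toy data: the same CVEs scored by two systems on their own scales.
cvss = {"CVE-A": 9.8, "CVE-B": 8.1, "CVE-C": 7.5, "CVE-D": 5.3}
epss = {"CVE-A": 0.02, "CVE-B": 0.61, "CVE-C": 0.94, "CVE-D": 0.01}
print("top-2 overlap:", topk_overlap(cvss, epss, k=2))
print("KEV share above EPSS 0.5:", kev_share_above_cutoff({"CVE-A", "CVE-C"}, epss))
```
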
Tools

  • Buttercup — Trail of Bits' open-source Cyber Reasoning System (CRS) from DARPA's AIxCC: augments libFuzzer and Jazzer with LLM-generated seed inputs, integrates tree-sitter static analysis, and uses a multi-agent orchestration layer for patching. Achieved second place ($3M) at AIxCC Finals, discovering 28 vulnerabilities across 20 CWE categories with 90% accuracy at $181 per vulnerability — using exclusively non-reasoning LLMs. Released open source after the competition concluded. The most immediately usable open-source LLM-augmented fuzzing/vulnerability-discovery system available. Active (Aug 2025) · see repo for stars and license.
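
A minimal sketch of the seed-corpus pattern Buttercup's description points at: LLM-generated inputs written into a corpus directory that a libFuzzer target then consumes. The generate_seeds() stub, parser.c, and ./fuzz_parser are hypothetical placeholders; only libFuzzer's convention of taking corpus directories and -max_total_time on the command line is relied on here.

```python
# Sketch: seed a libFuzzer corpus with LLM-generated inputs, then run the target.
import hashlib
import pathlib
import subprocess

def generate_seeds(source_snippet: str) -> list[bytes]:
    """Placeholder: ask an LLM for inputs likely to reach interesting branches
    in the given target code, returned as raw byte strings."""
    raise NotImplementedError

def write_corpus(corpus_dir: pathlib.Path, seeds: list[bytes]) -> None:
    """libFuzzer corpora are just directories of files; content-hash names
    keep duplicate seeds from piling up."""
    corpus_dir.mkdir(parents=True, exist_ok=True)
    for data in seeds:
        (corpus_dir / hashlib.sha1(data).hexdigest()).write_bytes(data)

def run_fuzzer(target_binary: str, corpus_dir: pathlib.Path, seconds: int = 300) -> None:
    """A libFuzzer binary takes corpus directories as positional arguments;
    -max_total_time bounds the run."""
    subprocess.run([target_binary, str(corpus_dir), f"-max_total_time={seconds}"], check=False)

if __name__ == "__main__":
    corpus = pathlib.Path("corpus")
    write_corpus(corpus, generate_seeds(open("parser.c").read()))
    run_fuzzer("./fuzz_parser", corpus)
```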