
AI for Threat Intelligence

Threat-intelligence teams sit on top of more data than they can read. LLMs are useful for triage, summarization, and TTP clustering; they are worse at attribution, where the cost of a false positive is high. This topic catalogs both the useful applications and the honest limits — frontier-lab misuse reports, engineering write-ups from Mandiant and Microsoft, and empirical academic work on IOC enrichment and TTP extraction. The vendor tooling is real; the marketing is aggressive; the primary sources distinguish between the two.

Start here

  • Disrupting the first reported AI-orchestrated cyber espionage campaign [blog] — Anthropic's November 2025 report on a Chinese state-sponsored group (assessed with high confidence) that used Claude Code to execute reconnaissance, credential harvesting, and data exfiltration against thirty global targets — tech, finance, chemical, and government — with 80–90% automation and only 4–6 human decision points per campaign. The detection account is as important as the attack description: Anthropic deployed expanded classifiers, new methods for detecting distributed agentic attacks, and used Claude itself to analyze the attack corpus. The first documented case of a large-scale cyberattack executed without substantial human intervention, and a primary-source reference for what AI-orchestrated intrusion looks like from the defender's intelligence seat. (Anthropic Threat Intelligence; Nov 2025)

  • AI and the Five Phases of the Threat Intelligence Lifecycle [blog] — Coull and Nichols (Mandiant) describe how AI is embedded across every stage of the TI lifecycle: malware classifiers and deep learning for binary analysis in Collection; NLP pipelines for entity extraction and topic classification of cybercrime forum data in Enrichment; IOC scoring that achieves a 96% false-positive reduction at alert generation and a 97% IOC reduction at high-confidence filtering in Analysis; machine-readable threat intelligence and Mandiant Breach Analytics for automated deployment; and Sec-PaLM 2 LLM roadmapped for automated report generation and conversational intelligence access. The clearest engineering account of where AI fits across the TI workflow — and where analyst judgment remains the bottleneck. (Scott Coull, Jayce Nichols / Mandiant; Aug 2023)
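
    The IOC-scoring stage described above can be sketched as a weighted confidence model that suppresses low-value indicators before they generate alerts. The features, weights, and threshold below are hypothetical illustrations of the pattern, not Mandiant's production scorer.

    ```python
    # Hypothetical sketch of confidence-scoring IOCs to cut false positives.
    # Feature names and weights are illustrative, not Mandiant's actual model.
    from dataclasses import dataclass

    @dataclass
    class IOC:
        value: str
        prevalence: float          # 0-1: frequency in benign telemetry (high = likely benign)
        source_reliability: float  # 0-1: trust in the reporting source
        age_days: int              # older indicators decay in value

    def score_ioc(ioc: IOC) -> float:
        """Combine features into a 0-1 confidence score; weights would be tuned on labeled alerts."""
        freshness = max(0.0, 1.0 - ioc.age_days / 365)
        return 0.5 * ioc.source_reliability + 0.3 * freshness + 0.2 * (1.0 - ioc.prevalence)

    def filter_for_alerting(iocs, threshold=0.6):
        """Only high-confidence IOCs alert; the rest go to enrichment queues."""
        return [i for i in iocs if score_ioc(i) >= threshold]

    feed = [
        IOC("198.51.100.7", prevalence=0.05, source_reliability=0.9, age_days=10),
        IOC("8.8.8.8", prevalence=0.99, source_reliability=0.4, age_days=400),
    ]
    alertable = filter_for_alerting(feed)
    ```

    The point of the sketch is the shape of the filter, not the numbers: a 96% false-positive reduction comes from tuning the threshold against analyst-labeled alert outcomes.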

  • Staying ahead of threat actors in the age of AI [blog] — Microsoft Threat Intelligence and OpenAI jointly document how five state-affiliated APTs — Forest Blizzard (Russia/GRU), Emerald Sleet (North Korea), Crimson Sandstorm (Iran/IRGC), Charcoal Typhoon (China), and Salmon Typhoon (China) — used LLMs for reconnaissance, vulnerability research, social engineering content, and scripting support. The key finding: AI acted as a productivity tool on old playbooks, not a novel capability unlock, for any of the five. Microsoft introduces nine LLM-themed TTPs mapped to MITRE ATT&CK and MITRE ATLAS. The baseline paper for understanding what state actors actually do with frontier models versus what threat models predict they might do. (Microsoft Threat Intelligence / OpenAI; Feb 2024)

Go deeper

  • Detecting and countering malicious uses of Claude: March 2025 [blog] — Anthropic documents four operations disrupted in the first quarter of 2025: an influence-as-a-service network managing 100+ social media bot accounts where Claude served as the orchestration layer — deciding when bots should like, share, or comment based on political objectives — reaching tens of thousands of authentic accounts; a credential-scraping operation targeting security camera systems; a recruitment fraud campaign using Claude for real-time language sanitization; and a novice malware developer lifted beyond their actual skill level. Detection methods described: Clio (hierarchical conversation summarization at scale) and input-output classifiers. The first Anthropic threat report to detail the orchestrator role, not just the content-generation role. (Anthropic; Apr 2025)
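
    Clio-style hierarchical conversation summarization can be approximated as a two-level map-reduce: summarize batches of raw conversations, then summarize the batch summaries. The `summarize()` stub below stands in for an LLM call, and the example conversations are invented; this shows the control flow only, not Anthropic's system.

    ```python
    # Schematic two-level summarization in the spirit of hierarchical systems
    # like Clio. summarize() is a stub for an LLM call; inputs are invented.
    from collections import Counter
    import re

    def summarize(texts, top_k=3):
        """Stub 'summary': the most frequent content words across the inputs."""
        words = re.findall(r"[a-z]{4,}", " ".join(texts).lower())
        return [w for w, _ in Counter(words).most_common(top_k)]

    def hierarchical_summary(conversations, batch_size=2):
        # Level 1: summarize each batch of raw conversations.
        batches = [conversations[i:i + batch_size]
                   for i in range(0, len(conversations), batch_size)]
        level1 = [summarize(b) for b in batches]
        # Level 2: summarize the batch summaries for a corpus-level view.
        return summarize([" ".join(s) for s in level1])

    convos = [
        "user asks the bot account to share political content",
        "bot account schedules likes on political posts",
        "scraper requests camera credentials",
        "scraper retries camera credential harvesting",
    ]
    top_themes = hierarchical_summary(convos)
    ```

    At production scale the same structure lets a model surface campaign-level patterns (orchestrated bot activity, credential scraping) without an analyst reading every conversation.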

  • Introducing Google Threat Intelligence: Actionable threat intelligence at Google scale [blog] — Potti and Joyce (Google Cloud) describe the unified TI platform that fuses Mandiant frontline intelligence (1,100+ investigations/year), VirusTotal crowdsourced indicators (1 million+ contributors), Google Threat Insights (4 billion devices, 1.5 billion email accounts, 100 million phishing attempts blocked daily), and OSINT — all processed by Gemini 1.5 Pro. The AI-specific capabilities: automated OSINT crawling and entity extraction into knowledge collections, TTP clustering and actor-toolkit association, and malware analysis using Gemini's 1-million-token context window. Proof-of-concept: the entire decompiled WannaCry binary processed in 34 seconds with the killswitch identified. The architectural announcement for understanding how Google integrates Gemini into production TI at scale. (Sunil Potti, Sandra Joyce / Google Cloud; May 2024)
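
    TTP clustering and actor-toolkit association reduce, at their simplest, to grouping intrusion sets whose ATT&CK technique profiles overlap heavily. A greedy single-link sketch with Jaccard similarity follows; the technique IDs, profile names, and 0.5 threshold are illustrative, not Google's pipeline.

    ```python
    # Sketch: group intrusion sets by overlap of their ATT&CK technique sets.
    # Technique IDs and the 0.5 threshold are illustrative placeholders.

    def jaccard(a: set, b: set) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    def cluster_by_ttps(profiles: dict, threshold: float = 0.5):
        """Greedy single-link clustering over TTP sets."""
        clusters = []
        for name, ttps in profiles.items():
            for cluster in clusters:
                if any(jaccard(ttps, profiles[m]) >= threshold for m in cluster):
                    cluster.append(name)
                    break
            else:
                clusters.append([name])
        return clusters

    profiles = {
        "cluster-A": {"T1566.001", "T1059.001", "T1105"},  # phishing + PowerShell
        "cluster-B": {"T1566.001", "T1059.001", "T1041"},  # near-identical tradecraft
        "cluster-C": {"T1190", "T1505.003"},               # exploit-facing, distinct
    }
    groups = cluster_by_ttps(profiles)
    ```

    Production systems weight techniques by rarity and add toolkit and infrastructure features, but the merge decision is the same similarity-over-threshold judgment.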

  • The Use of Large Language Models (LLM) for Cyber Threat Intelligence (CTI) in Cybercrime Forums [paper] — Clairoux-Trepanier, Beauchamp, Ruellan, Paquet-Clouston, Paquette, and Clay evaluate GPT-3.5-turbo on over 700 conversations from three Russian-language cybercrime forums (XSS, Exploit_in, RAMP), extracting 10 CTI variables per conversation — including whether critical infrastructure was targeted — against a human-coded ground truth. Results: 96.23% average accuracy, 90% precision, 88.2% recall. The study also identifies the main failure mode: the model confuses first-person stories about past events with current operational indicators, suggesting verb-tense sensitivity in prompts matters for TI extraction. The empirical baseline for using LLMs to triage dark-web forum intelligence at scale. (Clairoux-Trepanier, Beauchamp, Ruellan, Paquet-Clouston, Paquette, Clay; arXiv 2408.03354; Oct 2024)
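
    The verb-tense failure mode suggests a concrete prompting pattern: state the tense rule explicitly in the extraction prompt and validate the structured output against a fixed schema. A hedged sketch, with an invented three-field schema and prompt wording rather than the paper's ten variables:

    ```python
    # Sketch of the extraction pattern: a fixed schema, an explicit verb-tense
    # instruction (the paper's main failure mode), and strict JSON validation.
    # Field names and prompt wording are illustrative, not the paper's setup.
    import json

    SCHEMA_FIELDS = ["targets_critical_infrastructure", "tool_mentioned", "is_past_event"]

    def build_prompt(conversation: str) -> str:
        return (
            "Extract CTI variables from the forum conversation below as JSON with keys "
            f"{SCHEMA_FIELDS}. Pay attention to verb tense: a first-person story told "
            "in the past tense is historical, not a current operational indicator; set "
            "is_past_event accordingly.\n\nConversation:\n" + conversation
        )

    def parse_response(raw: str) -> dict:
        """Reject any model output that is not valid JSON covering the full schema."""
        data = json.loads(raw)
        missing = [f for f in SCHEMA_FIELDS if f not in data]
        if missing:
            raise ValueError(f"model omitted fields: {missing}")
        return data

    prompt = build_prompt("Back in 2021 I got into a water utility control panel...")
    parsed = parse_response(
        '{"targets_critical_infrastructure": true, "tool_mentioned": null, "is_past_event": true}'
    )
    ```

    The validation step matters as much as the prompt: at 700+ conversations, silent schema drift is how a 96% accuracy pipeline quietly degrades.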

  • CTIArena: Benchmarking LLM Knowledge and Reasoning Across Heterogeneous Cyber Threat Intelligence [paper] — Cheng, Liu, Li, Song, and Gao introduce CTIArena, a benchmark spanning nine tasks across three categories (structured CTI analysis, unstructured CTI analysis, hybrid analysis), evaluated against ten widely-used LLMs. Core finding: all models tested struggle significantly in closed-book settings but improve measurably when augmented with domain-specific retrieval, establishing that parametric knowledge alone is insufficient for production TI work and that RAG or fine-tuning is not optional — it is the minimum viable architecture. The measurement instrument for teams evaluating whether a general-purpose LLM can handle TI tasks before deploying it in a TI workflow. (Cheng, Liu, Li, Song, Gao; arXiv 2510.11974; Oct 2025)
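
    The closed-book versus retrieval-augmented gap can be demonstrated with even a trivial retriever: score a knowledge base against the question and prepend the best hits to the prompt. This sketch uses token overlap in place of real embeddings, over an invented three-document corpus.

    ```python
    # Minimal retrieval-augmentation sketch: rank a small CTI corpus by token
    # overlap with the question and prepend the top hits to the prompt.
    # The corpus and scoring are illustrative; real systems use embeddings.
    import re

    CORPUS = [
        "T1059.001: adversaries abuse PowerShell for execution and download cradles.",
        "T1566.001: spearphishing attachment delivers the initial payload.",
        "T1105: ingress tool transfer stages additional tooling on the victim host.",
    ]

    def tokens(text: str) -> set:
        return set(re.findall(r"[a-z0-9.]+", text.lower()))

    def retrieve(question: str, k: int = 2):
        scored = sorted(CORPUS, key=lambda d: len(tokens(d) & tokens(question)),
                        reverse=True)
        return scored[:k]

    def augmented_prompt(question: str) -> str:
        context = "\n".join(retrieve(question))
        return f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context."

    p = augmented_prompt("Which technique covers PowerShell download cradles?")
    ```

    CTIArena's finding is that this augmentation step is load-bearing: the same models that fail closed-book improve measurably once grounded in retrieved domain text, so the retriever belongs in the minimum viable architecture, not the optimization backlog.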