Datasets & benchmarks
Training and evaluation datasets. We anchor promptshield’s evaluation to NVIDIA Aegis 2.0 and borrow Tattle / Uli annotation methodology; the rest are
Training and evaluation datasets. We anchor promptshield’s evaluation to NVIDIA Aegis 2.0 and borrow Tattle / Uli annotation methodology; the rest are listed for reference.
33 projects — 33 reference.
Project descriptions are adapted from awesome-safety-tools (maintained by ROOST); the verdicts and analysis are ours. Snapshot: June 2026 — a point-in-time view that complements, and does not replace, their living list.
Aegis Content Safety 2.0
Reference · by NVIDIA · pairs with promptshield
Our eval anchor for PromptShield: CC-BY-4.0 licensing plus a "Sexual (minor)" subset make it usable for CSAM-intent benchmarking.
Content-moderation and toxicity dataset spanning a broad LLM-safety taxonomy, including a dedicated "Sexual (minor)" category. Released under CC-BY-4.0.
AI Alignment (RLHF)
Reference · by Anthropic
Foundational helpful/harmless preference data for alignment research.
RLHF alignment data, explorable as a Nomic Atlas map.
AILuminate
Reference · by MLCommons
Standardized industry safety benchmark across many harm categories.
Human-created prompts spanning a standardized set of harm categories.
ALERT
Reference · by Babelscape
Pairs standard and adversarial variants for safety stress-testing.
Standard and adversarial red-team prompts organized by a safety taxonomy.
Aya Red-teaming
Reference · by Cohere
Good for multilingual red-team coverage beyond English.
Multilingual red-team prompts for probing model safety across languages.
badwords
Reference · by Richard Hughes
Quick multilingual profanity seed list, not a substitute for a real classifier.
Bad-word lists compiled across multiple locales.
CCP Sensitive Prompts
Reference · by Promptfoo
Niche set for probing political censorship behavior in models.
Prompts on topics sensitive to the Chinese Communist Party.
DarkBench
Reference · by Apart
Useful for evaluating manipulative / dark-pattern model tendencies.
Benchmark for detecting dark design patterns in LLM behavior.
DEFCON Red Teaming
Reference · by Humane Intelligence
Real crowd-sourced red-team attempts from a large public event.
Data from the DEF CON AI Village generative red-teaming event.
Do Not Answer
Reference · by LibrAI
Targeted refusal-behavior eval for questions a model should decline.
Questions designed to test whether a model correctly refuses.
Forbidden Questions
Reference · by TrustAIRLab
Policy-grounded prompts for testing disallowed-content refusals.
Questions derived from categories in the OpenAI usage policy.
HackAPrompt
Reference · by HackAPrompt
Big corpus of human-crafted prompt-injection attacks.
Large dataset of prompt-injection and jailbreaking attempts.
HarmBench
Reference · by CAIS
Standard harness for automated red-teaming comparisons.
Standardized evaluation dataset for automated red-teaming.
HiroKachi Jailbreak
Reference
Community-sourced jailbreak collection; treat provenance with caution.
Collection of adversarial prompt-attack examples.
Jailbreak Prompt Generator
Reference
Generator rather than a fixed set, for synthesizing attack prompts at scale.
A model that generates jailbreak prompts.
JailbreakBench
Reference · by JailbreakBench
Standard behaviors set for benchmarking jailbreak defenses.
Curated harmful behaviors for evaluating jailbreak robustness.
JailbreakHub
Reference · by WalledAI
Prompt+response pairs, handy for studying what jailbreaks actually elicit.
Jailbreak prompts paired with model responses.
LLM-LAT harmful
Reference · by LLM-LAT
General harmful-behavior probe set, often used in latent adversarial training.
Prompts for assessing harmful model behaviors.
MedSafetyBench
Reference · by AI4LIFE-GROUP
Domain-specific eval for medical-safety failure modes.
Medical-safety prompts for evaluating models in clinical contexts.
Multilingual Vulnerability
Reference
Probes safety gaps that appear in non-English languages.
Multilingual prompts that surface LLM vulnerabilities.
PKU-SafeRLHF
Reference · by PKU-Alignment
Good for safety-preference / RLHF training signal on response harmfulness.
Prompts paired with RLHF safety markers identifying unsafe responses.
Red Team Resistance Leaderboard
Reference · by Haize Labs
Comparative ranking of model attack-resistance rather than a raw dataset.
Leaderboard ranking models by their resistance to attacks.
Rentry Jailbreak
Reference
Informal community jailbreak dump; unversioned, verify before use.
A collected set of jailbreak prompts.
SidFeel Jailbreak
Reference
Another community jailbreak prompt collection for attack coverage.
A collection of jailbreak prompts.
SorryBench
Reference · by SorryBench
Tests refusal robustness under paraphrase and linguistic mutation.
Adversarial prompts augmented with linguistic mutations.
SOSBench
Reference · by SOSBench
Good for evaluating dangerous scientific-capability refusals (e.g. chem/bio).
Regulation-grounded hazard benchmark spanning six scientific domains.
TDC23-RedTeaming
Reference · by WalledAI
Competition-derived red-team prompts for benchmark continuity.
Prompts from the TDC23 red-teaming track.
Toxic Chat
Reference · by LMSYS
Realistic in-the-wild toxicity from live chat, good for conversational moderation eval.
Toxic conversations drawn from real user interactions with Vicuna.
Toxicity
Reference · by Jigsaw
License-clean (CC0) baseline for generic toxicity classification.
Wikipedia comments labeled for toxicity. Released under CC0.
Transphobia Awareness
Reference
Targeted set for evaluating anti-trans hate detection.
Transphobia-related queries with annotations.
Uli Dataset
Reference · by Tattle · pairs with detectkit-test
Its expert per-annotator methodology informs how we structure our CSAM-intent eval.
Gendered-abuse dataset built with an expert, per-annotator labeling methodology.
VTC
Reference · by Unitary AI
Reference for multimodal (video + comment) toxicity work.
Video-text-comments dataset and method for multimodal moderation.
XSTest
Reference
Catches over-refusal: safe prompts a model wrongly declines.
Prompts testing exaggerated-safety (over-refusal) behaviors.