FightCSAM

Datasets & benchmarks

Training and evaluation datasets. We anchor PromptShield’s evaluation to NVIDIA Aegis 2.0 and borrow the annotation methodology of Tattle’s Uli dataset; the rest are listed for reference.

33 projects — 33 reference.

Project descriptions are adapted from awesome-safety-tools (maintained by ROOST); the verdicts and analysis are ours. Snapshot: June 2026 — a point-in-time view that complements, and does not replace, their living list.

Aegis Content Safety 2.0

Reference · by NVIDIA · pairs with promptshield

Our eval anchor for PromptShield: its CC-BY-4.0 license and dedicated "Sexual (minor)" category make it usable for CSAM-intent benchmarking.

Content-moderation and toxicity dataset spanning a broad LLM-safety taxonomy, including a dedicated "Sexual (minor)" category. Released under CC-BY-4.0.

AI Alignment (RLHF)

Reference · by Anthropic

Foundational helpful/harmless preference data for alignment research.

RLHF alignment data, explorable as a Nomic Atlas map.

AILuminate

Reference · by MLCommons

Standardized industry safety benchmark across many harm categories.

Human-created prompts spanning a standardized set of harm categories.

ALERT

Reference · by Babelscape

Pairs standard and adversarial variants for safety stress-testing.

Standard and adversarial red-team prompts organized by a safety taxonomy.

Aya Red-teaming

Reference · by Cohere

Good for multilingual red-team coverage beyond English.

Multilingual red-team prompts for probing model safety across languages.

badwords

Reference · by Richard Hughes

Quick multilingual profanity seed list, not a substitute for a real classifier.

Bad-word lists compiled across multiple locales.

CCP Sensitive Prompts

Reference · by Promptfoo

Niche set for probing political censorship behavior in models.

Prompts on topics sensitive to the Chinese Communist Party.

DarkBench

Reference · by Apart

Useful for evaluating manipulative / dark-pattern model tendencies.

Benchmark for detecting dark design patterns in LLM behavior.

DEFCON Red Teaming

Reference · by Humane Intelligence

Real crowd-sourced red-team attempts from a large public event.

Data from the DEF CON AI Village generative red-teaming event.

Do Not Answer

Reference · by LibrAI

Targeted refusal-behavior eval for questions a model should decline.

Questions designed to test whether a model correctly refuses.

Forbidden Questions

Reference · by TrustAIRLab

Policy-grounded prompts for testing disallowed-content refusals.

Questions derived from categories in the OpenAI usage policy.

HackAPrompt

Reference · by HackAPrompt

Big corpus of human-crafted prompt-injection attacks.

Large dataset of prompt-injection and jailbreaking attempts.

HarmBench

Reference · by CAIS

Standard harness for automated red-teaming comparisons.

Standardized evaluation dataset for automated red-teaming.

HiroKachi Jailbreak

Reference

Community-sourced jailbreak collection; treat provenance with caution.

Collection of adversarial prompt-attack examples.

Jailbreak Prompt Generator

Reference

Generator rather than a fixed set, for synthesizing attack prompts at scale.

A model that generates jailbreak prompts.

JailbreakBench

Reference · by JailbreakBench

Standard behaviors set for benchmarking jailbreak defenses.

Curated harmful behaviors for evaluating jailbreak robustness.

JailbreakHub

Reference · by WalledAI

Prompt+response pairs, handy for studying what jailbreaks actually elicit.

Jailbreak prompts paired with model responses.

LLM-LAT harmful

Reference · by LLM-LAT

General harmful-behavior probe set, often used in latent adversarial training.

Prompts for assessing harmful model behaviors.

MedSafetyBench

Reference · by AI4LIFE-GROUP

Domain-specific eval for medical-safety failure modes.

Medical-safety prompts for evaluating models in clinical contexts.

Multilingual Vulnerability

Reference

Probes safety gaps that appear in non-English languages.

Multilingual prompts that surface LLM vulnerabilities.

PKU-SafeRLHF

Reference · by PKU-Alignment

Good for safety-preference / RLHF training signal on response harmfulness.

Prompts paired with RLHF safety markers identifying unsafe responses.

Red Team Resistance Leaderboard

Reference · by Haize Labs

Comparative ranking of model attack-resistance rather than a raw dataset.

Leaderboard ranking models by their resistance to attacks.

Rentry Jailbreak

Reference

Informal community jailbreak dump; unversioned, verify before use.

A collected set of jailbreak prompts.

SidFeel Jailbreak

Reference

Another community jailbreak prompt collection for attack coverage.

A collection of jailbreak prompts.

SorryBench

Reference · by SorryBench

Tests refusal robustness under paraphrase and linguistic mutation.

Adversarial prompts augmented with linguistic mutations.

SOSBench

Reference · by SOSBench

Good for evaluating dangerous scientific-capability refusals (e.g. chem/bio).

Regulation-grounded hazard benchmark spanning six scientific domains.

TDC23-RedTeaming

Reference · by WalledAI

Competition-derived red-team prompts for benchmark continuity.

Prompts from the TDC23 red-teaming track.

Toxic Chat

Reference · by LMSYS

Realistic in-the-wild toxicity from live chat, good for conversational moderation eval.

Toxic conversations drawn from real user interactions with Vicuna.

Toxicity

Reference · by Jigsaw

License-clean (CC0) baseline for generic toxicity classification.

Wikipedia comments labeled for toxicity. Released under CC0.

Transphobia Awareness

Reference

Targeted set for evaluating anti-trans hate detection.

Transphobia-related queries with annotations.

Uli Dataset

Reference · by Tattle · pairs with detectkit-test

Its expert per-annotator methodology informs how we structure our CSAM-intent eval.

Gendered-abuse dataset built with an expert, per-annotator labeling methodology.

VTC

Reference · by Unitary AI

Reference for multimodal (video + comment) toxicity work.

Video-text-comments dataset and method for multimodal moderation.

XSTest

Reference

Catches over-refusal: safe prompts a model wrongly declines.

Prompts testing exaggerated-safety (over-refusal) behaviors.
