Classifiers & AI-safety models

ML classifiers and LLM guardrails. FightCSAM ships no general model — csam-shield is built to wrap the best of these as swappable detector backends, and promptshield focuses narrowly on CSAM-generation intent.

22 projects — 9 use · 10 learn from · 2 reference · 1 out of scope.

Project descriptions are adapted from awesome-safety-tools (maintained by ROOST); the verdicts and analysis are ours. Snapshot: June 2026 — a point-in-time view that complements, and does not replace, their living list.

Content Safety API

Use · by Google · pairs with csam-shield

This is the one classifier on the list aimed squarely at the gap hashing cannot cover: novel, previously unseen CSAM. A FightCSAM user runs it alongside known-material hash matching so that first-seen content, which no hash list can match, still gets screened. That is exactly the role csam-shield is built to orchestrate as a detector backend. Gated and closed-weight, but free with registration and reachable through ROOST.

Google's ML classifiers for detecting CSAM, nudity, and explicit content in images and video, offered to qualifying partners free of charge but gated behind registration; accessible to members via the ROOST coop.
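
The hash-first, classifier-second layering described for this entry can be sketched roughly as below. This is a minimal illustration under stated assumptions, not csam-shield's actual interface or the Content Safety API client: `DetectorBackend`, `Verdict`, `StubClassifier`, `screen`, and the 0.8 threshold are hypothetical names and values, and SHA-256 stands in for the perceptual hashing that production hash matching typically relies on.

```python
import hashlib
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Verdict:
    flagged: bool
    source: str   # which layer produced the decision
    score: float  # confidence in [0, 1]


class DetectorBackend(Protocol):
    """Hypothetical interface a swappable classifier backend would satisfy."""

    name: str

    def score(self, media: bytes) -> float: ...


class StubClassifier:
    """Placeholder backend; a real one would call a model or a gated API."""

    name = "stub-classifier"

    def __init__(self, fixed_score: float) -> None:
        self.fixed_score = fixed_score

    def score(self, media: bytes) -> float:
        return self.fixed_score


def screen(media: bytes, known_hashes: set[str],
           backend: DetectorBackend, threshold: float = 0.8) -> Verdict:
    # Layer 1: exact match against known material (SHA-256 as a stand-in).
    digest = hashlib.sha256(media).hexdigest()
    if digest in known_hashes:
        return Verdict(True, "hash-match", 1.0)
    # Layer 2: first-seen content falls through to the classifier backend.
    s = backend.score(media)
    return Verdict(s >= threshold, backend.name, s)
```

Because `screen` depends only on the `DetectorBackend` shape, swapping one classifier for another would not touch the hash layer, which is the point of treating detectors as backends.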

CoPE

Use · by Zentropi · pairs with promptshield

Policy-as-prompt classifiers let a team encode a CSAM-specific policy without training a model, which is precisely the leverage promptshield needs for prompt-intent screening. Open-weight on Hugging Face, so it slots in as a promptshield companion or csam-shield text backend with a CSAM-tuned policy.

A small (9B) language model for steerable content classification that scores text against policies a developer writes in plain language rather than a fixed label set.
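
As a rough illustration of the policy-as-prompt idea, the sketch below passes a plain-language policy and the content to a model and parses a one-word verdict. It is not CoPE's actual prompt format or API: `PROMPT_TEMPLATE`, `build_prompt`, `classify`, and the `generate` callable are hypothetical, and a real deployment would follow the prompt template documented on the model card.

```python
from typing import Callable

# Hypothetical template; the real model defines its own expected format.
PROMPT_TEMPLATE = (
    "You are a content classifier.\n"
    "Policy:\n{policy}\n\n"
    "Content:\n{content}\n\n"
    "Answer with exactly one word: VIOLATING or SAFE."
)


def build_prompt(policy: str, content: str) -> str:
    """Combine the developer-written policy with the text to classify."""
    return PROMPT_TEMPLATE.format(policy=policy, content=content)


def classify(policy: str, content: str,
             generate: Callable[[str], str]) -> bool:
    """Return True when the model's answer starts with VIOLATING."""
    answer = generate(build_prompt(policy, content)).strip().upper()
    return answer.startswith("VIOLATING")
```

Changing what gets flagged means editing the `policy` string rather than retraining, which is the leverage the entry describes.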

gpt-oss-safeguard

Use · by OpenAI · pairs with promptshield

Bring-your-own-policy reasoning models are a strong fit for nuanced CSAM-intent calls that a keyword filter misses, and an open license means a team can run it on-prem where sending prompts to a third party is a non-starter. We wrap it as a promptshield companion or csam-shield text backend with a CSAM policy.

An open-weight reasoning model that classifies text against safety policies supplied at inference time, returning a judgment with its reasoning rather than a fixed-taxonomy label.

Granite Guardian

Use · by IBM Research · pairs with csam-shield

A permissively licensed, genuinely open guardrail that is easy to self-host, which matters when prompts cannot leave your infrastructure. We wrap it as a csam-shield detector backend or promptshield companion; its broad-harm coverage complements, rather than replaces, dedicated CSAM signals.

An Apache-2.0 family of input/output guardrail models from IBM covering general harm, RAG groundedness (hallucination), and agentic/function-calling risks.

Guardrails AI

Use · by Guardrails AI · pairs with promptshield

A FightCSAM user who has already standardized on Guardrails can add CSAM-intent screening as one validator in their existing guard, and promptshield is naturally exposed as a Hub validator. Guardrails is a harness rather than a detector itself, but it is the right place to plug our checks in.

A Python framework that validates LLM inputs and outputs against predefined risks, with a Hub of community validators that can be composed into a guard.

Kanana Safeguard

Use · by Kakao · pairs with csam-shield

Architecturally the same class of wrappable guardrail as Llama Guard or ShieldGemma, and its multilingual strength is valuable where English-centric guards underperform. A FightCSAM user can reach for it as a csam-shield text backend or promptshield companion; note it covers general harm, so a CSAM policy still does the narrowing.

An open-weight 8B harmful-content detection model from Kakao for moderating LLM inputs and outputs, with notable multilingual (including Korean) coverage.

Llama Guard

Use · by Meta · pairs with csam-shield

One of the strongest open, self-hostable text guardrails, and its built-in child-exploitation category makes it a natural CSAM-text signal. We wrap it as a csam-shield detector backend or promptshield companion rather than reimplementing classification.

Meta's open-weight content-moderation model that classifies both prompts and responses in text interactions against a safety taxonomy that includes a child-exploitation category.

Llama Prompt Guard 2

Use · by Meta · pairs with promptshield

Adversaries jailbreak a generator to coax out CSAM, so injection detection is a real layer of CSAM-prevention defense in depth. Tiny and cheap to run inline, it pairs with promptshield as the anti-jailbreak companion to CSAM-intent screening.

A small (86M) open-weight Meta classifier specialized in detecting prompt-injection and jailbreak attempts against LLMs.
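
The defense-in-depth layering described for this entry, a jailbreak check running alongside intent screening, could be arranged along these lines. This is a sketch under assumptions: `Check`, `Decision`, and `screen_prompt` are hypothetical names, each check is any callable returning a risk score in [0, 1], and no real model or promptshield API is invoked.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Check = Callable[[str], float]  # returns a risk score in [0, 1]


@dataclass
class Decision:
    allowed: bool
    blocked_by: Optional[str] = None  # name of the layer that blocked, if any


def screen_prompt(prompt: str,
                  checks: list[tuple[str, Check, float]]) -> Decision:
    """Run each (name, check, threshold) layer in order; block on the first hit."""
    for name, check, threshold in checks:
        if check(prompt) >= threshold:
            return Decision(False, name)
    return Decision(True)
```

Ordering the cheap injection classifier first keeps inline latency low, since later and heavier checks only run on prompts that pass it.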

ShieldGemma

Use · by Google DeepMind · pairs with csam-shield

Open-weight and spanning both text and image (via ShieldGemma 2), so it can back both modalities a CSAM pipeline cares about. We wrap it as a csam-shield detector backend or promptshield companion; as with other general guards, a CSAM policy narrows it to the target harm.

A Gemma-based toolkit of open-weight models from Google DeepMind for detecting and mitigating harmful LLM content across safety categories, with ShieldGemma 2 extending coverage to images.

Detoxify

Learn from · by Unitary AI

A clean, widely copied reference for packaging a text classifier as a pip-installable model, which informs how we ship detector backends. It covers general toxicity, not CSAM, so it is useful as an architectural lesson rather than something a FightCSAM pipeline reaches for to catch abuse material.

A set of pretrained models for detecting generalized toxic language in text, trained on the Jigsaw toxic-comment datasets.

NSFW Keras Model

Learn from · by Gant Laborde

A useful pattern for a self-hostable image classifier, but it detects adult NSFW content, which is a different problem from CSAM. We note it as general-purpose: it informs the detector-backend shape without being something a FightCSAM user wires in to find abuse material.

A CNN-based Keras/TensorFlow model that classifies images into five categories (porn, hentai, sexy, neutral, drawing).

OpenGuardrails

Learn from · by OpenGuardrails · pairs with promptshield

A useful reference for the proxy-gateway enforcement topology, applying safety at a network choke point rather than in app code, which is one way to deploy promptshield-style screening. It is a general LLM-security gateway rather than a CSAM tool, so we take the deployment pattern as the lesson.

A security gateway that fronts OpenAI-compatible APIs as a reverse proxy, applying safety protections to traffic passing through it.

OSmod (Moderator)

Learn from · by Jigsaw · pairs with csam-shield

A reference design for routing model scores into a human-review queue, which is the orchestration problem csam-shield solves. It is general comment moderation rather than CSAM-specific, so we treat it as an architectural lesson for the detect-then-review loop, not a drop-in detector.

An open-source moderation toolkit combining ML models, APIs, and a review UI to help platforms triage and act on user comments at scale.
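
The detect-then-review loop mentioned for this entry comes down to mapping a model score to an action and a queue. The sketch below is one hypothetical arrangement, not OSmod's or csam-shield's implementation: the `Router` class, its thresholds, and the choice to queue blocked items for human confirmation are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Router:
    review_at: float = 0.5   # illustrative: scores at or above this go to a human
    block_at: float = 0.95   # illustrative: scores at or above this are blocked
    review_queue: deque = field(default_factory=deque)

    def route(self, item_id: str, score: float) -> str:
        """Return the action for an item and enqueue it when a human should look."""
        if score >= self.block_at:
            # Blocked items are still queued so a reviewer can confirm the call.
            self.review_queue.append((item_id, score))
            return "block"
        if score >= self.review_at:
            self.review_queue.append((item_id, score))
            return "review"
        return "allow"
```

Keeping thresholds as plain fields makes them easy to tune per backend, since different detectors rarely produce scores on a comparable scale.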

Perspective API

Learn from · by Jigsaw

The canonical example of an attribute-scoring moderation API, instructive for how we expose detector confidence scores. It is general toxicity and a closed hosted service, so we note it as general-purpose rather than a CSAM detector a FightCSAM user calls.

A hosted ML API that scores text for attributes like toxicity, insult, and threat to help platforms moderate conversations.

Private Detector

Learn from · by Bumble

A strong, production-proven example of open-sourcing a lewd-image detector, useful for how we structure and document an image backend. It targets adult lewd content rather than CSAM, so we flag it as adjacent and general-purpose, not a FightCSAM detection component.

A pretrained, open-sourced model from Bumble for detecting lewd (unsolicited nude) images.

RoGuard

Learn from · by Roblox

An instructive open peer to the guardrails we wrap, showing how a large platform tunes a general output-safety model to its own policy. It is general-harm and Roblox-shaped rather than CSAM-specific, so we treat it as an architectural reference rather than a drop-in backend.

An open LLM-safeguard model from Roblox for moderating text generation against a platform safety policy.

Sentinel

Learn from · by Roblox

This is the capability we most want to emulate: behavioral, conversation-level grooming detection sits upstream of the image and text detection FightCSAM covers and catches abuse before any media exists. It is a model of how to surface rare harmful patterns from sparse signal, and a clear direction for where CSAM-safety tooling should grow.

An open-source system from Roblox that uses contrastive learning to flag rare, hard-to-spot text classes such as grooming and other harmful behavioral patterns in real time.

Toxic Prompt RoBERTa

Learn from · by Intel · pairs with promptshield

It occupies the same place in the stack as promptshield, screening prompts before they reach a model, which makes its packaging and latency profile directly instructive. It classifies general toxicity rather than CSAM intent, so we learn from the prompt-screening approach while noting the label set is general-purpose.

A RoBERTa-based classifier from Intel that detects toxic prompts and responses in LLM interactions.

Voice Safety Classifier

Learn from · by Roblox

Voice is a modality FightCSAM does not cover, and grooming-adjacent harm in real-time audio is a real vector worth learning from. We treat it as an architectural lesson for real-time, modality-specific detection rather than a CSAM tool a user wires in today.

An open-source ML model from Roblox that classifies harmful content in real-time voice chat.

Purple Llama

Reference · by Meta

The parent collection rather than a single component: the directly wrappable pieces, Llama Guard and Llama Prompt Guard 2, are catalogued on their own. We point here as the canonical home and orientation for Meta's open safety stack.

Meta's umbrella project of tools to assess and improve LLM security, bundling Llama Guard, the CyberSec Eval benchmarks, and Code Shield.

Risk Atlas Nexus

Reference · by IBM Research

Governance and taxonomy tooling rather than a runtime detector: it maps risks to controls instead of classifying content. Useful as a reference when authoring the policies that drive promptshield or CoPE and when arguing coverage to compliance, but nothing a pipeline calls at inference time.

A knowledge-graph toolkit from IBM that links AI risk taxonomies to evaluations, mitigations, and controls so teams can reason about coverage across frameworks.

NSFW Filtering

Out of scope · by nsfw-filter

This is an end-user safety extension, not developer infrastructure: it protects the viewer at the browser, with no API or pipeline integration point. It sits outside the platform-side CSAM-detection problem FightCSAM addresses.

A browser extension that blurs or blocks NSFW images in the browser for the person using it.
