OpenAI announced two safety models (gpt-oss-safeguard-120b and gpt-oss-safeguard-20b) for classifying online safety harms. Developed with ROOST, SafetyKit, and Discord, these models let developers configure safety policies for their platforms.
SafetyKit is proud to have partnered with OpenAI on the release of two new reasoning models - gpt-oss-safeguard-120b and gpt-oss-safeguard-20b - designed to help developers classify a wide range of online safety harms across digital platforms. The announcement marks a significant step toward open, collaborative safety infrastructure.
The new models are open-weight, meaning their parameters are publicly available, offering transparency and control while preserving system integrity. They accept custom policy definitions at inference time, allowing companies to adapt them to their specific moderation frameworks or evolving community standards. Because the models are reasoning-based, they “show their work” - each decision includes a rationale, giving developers and auditors insight into how conclusions were reached. This enhances accountability and interpretability, which is crucial for platforms managing complex or high-stakes safety contexts.
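To make the policy-at-inference-time idea concrete, here is a minimal sketch of how a developer might run the open-weight model behind an OpenAI-compatible endpoint and pass their own policy alongside the content to classify. The endpoint URL, the model name string, and the policy/label format are assumptions for illustration, not details from the announcement.

```python
# Minimal sketch: classifying content against a custom policy.
# Assumptions: the open-weight model is served locally via an
# OpenAI-compatible server (e.g. vLLM), the model is registered as
# "gpt-oss-safeguard-20b", and the policy wording below is our own.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# The platform's own policy, written in plain language and supplied at inference time.
POLICY = """
You are a content safety classifier. Apply the following policy:
1. SPAM: unsolicited bulk promotion or link farming.
2. HARASSMENT: targeted insults or threats against an individual.
3. NONE: content that violates neither rule.
Return a JSON object with fields "label" and "rationale".
"""

def classify(content: str) -> str:
    """Classify a piece of user content against the custom policy."""
    response = client.chat.completions.create(
        model="gpt-oss-safeguard-20b",  # hypothetical local deployment name
        messages=[
            {"role": "system", "content": POLICY},
            {"role": "user", "content": content},
        ],
    )
    # Because the model is reasoning-based, the reply carries both the
    # label and the rationale behind it.
    return response.choices[0].message.content

if __name__ == "__main__":
    print(classify("Buy followers now!! Visit totally-legit.example"))
```

Because the policy lives in the prompt rather than in the model weights, a platform can revise its rules and redeploy without retraining or fine-tuning.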
OpenAI highlighted several potential uses for the models.
SafetyKit and Discord were early partners testing the models during their research preview, demonstrating their flexibility in real-world moderation scenarios.
At SafetyKit, we’re committed to helping platforms deploy and evaluate AI safety technology responsibly. Our tools enable teams to implement adaptable safeguards and ensure trust, transparency, and compliance at scale. To start building with these models, download them here.

