OpenAI announced two safety models (gpt-oss-safeguard-120b and gpt-oss-safeguard-20b) for classifying online safety harms. Developed with ROOST, SafetyKit, and Discord, these models let developers configure safety policies for their platforms.
SafetyKit is proud to have partnered with OpenAI on the release of two new reasoning models - gpt-oss-safeguard-120b and gpt-oss-safeguard-20b - designed to help developers classify a wide range of online safety harms across digital platforms. The announcement marks a significant step toward open, collaborative safety infrastructure.
The new models are open-weight, meaning their parameters are publicly available, offering transparency and control while preserving system integrity. They accept custom policy definitions at inference time, allowing companies to adapt them to their specific moderation frameworks or evolving community standards. Because the models are reasoning-based, they “show their work” - each decision includes a rationale, giving developers and auditors insight into how conclusions were reached. This enhances accountability and interpretability, which is crucial for platforms managing complex or high-stakes safety contexts.
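To make the policy-at-inference-time idea concrete, here is a minimal sketch of how a developer might run the open-weight model behind an OpenAI-compatible endpoint and pass their own policy alongside the content to classify. The endpoint URL, the model name string, and the policy/label format are assumptions for illustration, not details from the announcement.

```python
# Minimal sketch: classifying content against a custom policy.
# Assumptions: the open-weight model is served locally via an
# OpenAI-compatible server (e.g. vLLM), the model is registered as
# "gpt-oss-safeguard-20b", and the policy wording below is our own.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# The platform's own policy, written in plain language and supplied at inference time.
POLICY = """
You are a content safety classifier. Apply the following policy:
1. SPAM: unsolicited bulk promotion or link farming.
2. HARASSMENT: targeted insults or threats against an individual.
3. NONE: content that violates neither rule.
Return a JSON object with fields "label" and "rationale".
"""

def classify(content: str) -> str:
    """Classify a piece of user content against the custom policy."""
    response = client.chat.completions.create(
        model="gpt-oss-safeguard-20b",  # hypothetical local deployment name
        messages=[
            {"role": "system", "content": POLICY},
            {"role": "user", "content": content},
        ],
    )
    # Because the model is reasoning-based, the reply carries both the
    # label and the rationale behind it.
    return response.choices[0].message.content

if __name__ == "__main__":
    print(classify("Buy followers now!! Visit totally-legit.example"))
```

Because the policy lives in the prompt rather than in the model weights, a platform can revise its rules and redeploy without retraining or fine-tuning.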
OpenAI highlighted several potential uses for the models.
SafetyKit and Discord were early partners testing the models during their research preview, demonstrating their flexibility in real-world moderation scenarios.
At SafetyKit, we’re committed to helping platforms deploy and evaluate AI safety technology responsibly. Our tools enable teams to implement adaptable safeguards and ensure trust, transparency, and compliance at scale. To start building with these models, download them here.

