OpenAI has introduced two open-weight models, gpt-oss-120b and gpt-oss-20b, under the Apache 2.0 license. These models deliver advanced reasoning capabilities and efficient deployment options, and they support a broad range of use cases, including on-device inference.
Why These Open Models Matter
OpenAI aims to broaden access to powerful AI systems by releasing models that developers can run, inspect, and customize. These models provide alternatives to proprietary systems, enabling safer deployment, research, and innovation, especially in sectors and regions with limited resources. They also support local inference, global access, and transparent AI development.
Model Specifications: gpt-oss-120b and gpt-oss-20b
Both models use a Transformer architecture with Mixture-of-Experts (MoE) layers. They support context lengths of up to 128k tokens using Rotary Positional Embeddings (RoPE), and rely on grouped multi-query attention and locally banded sparse attention for efficient processing.
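In an MoE layer, a small router selects a handful of expert feed-forward networks for each token, so only a fraction of the total parameters is exercised per step. The snippet below is a minimal, illustrative top-k router in PyTorch; it is not OpenAI's implementation, and the class and parameter names are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal illustrative Mixture-of-Experts layer with top-k routing."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 32, k: int = 4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        scores = self.router(x)                      # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)   # pick k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                   # only the selected experts run
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

# Tiny smoke test: 8 tokens, 64-dim hidden state, 32 experts, 4 active per token
layer = TopKMoE(d_model=64, d_ff=256)
print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```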
gpt-oss-120b
- Layers: 36
- Total Parameters: 117 billion
- Active Parameters Per Token: 5.1 billion
- Total Experts: 128
- Active Experts Per Token: 4
- Context Length: 128k tokens
- Minimum GPU Memory Required: 80 GB
- Benchmark Standing: Matches or exceeds OpenAI o4-mini
gpt-oss-20b
- Layers: 24
- Total Parameters: 21 billion
- Active Parameters Per Token: 3.6 billion
- Total Experts: 32
- Active Experts Per Token: 4
- Context Length: 128k tokens
- Minimum RAM Required: 16 GB
- Benchmark Standing: Similar to OpenAI o3-mini; runs efficiently on-device
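As a rough, back-of-envelope reading of the figures above (an illustration only, not an official sizing guide), the MoE design means only a small share of each model's parameters is exercised per token, which is why per-token compute stays modest even though the full weight set must remain loaded:

```python
# Rough illustration using the figures listed above.
specs = {
    "gpt-oss-120b": {"total_params_b": 117, "active_params_b": 5.1},
    "gpt-oss-20b": {"total_params_b": 21, "active_params_b": 3.6},
}

for name, s in specs.items():
    share = s["active_params_b"] / s["total_params_b"]
    print(f"{name}: ~{share:.0%} of parameters active per token "
          f"({s['active_params_b']}B of {s['total_params_b']}B)")

# gpt-oss-120b: ~4% of parameters active per token (5.1B of 117B)
# gpt-oss-20b: ~17% of parameters active per token (3.6B of 21B)
```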
Reasoning and Tool Use
Both models excel on benchmarks such as:
- TauBench (tool use and agentic reasoning)
- HealthBench (health-related queries)
- AIME (competition mathematics)
- MMLU, HLE, and Codeforces (general knowledge and coding evaluations)
The gpt-oss-120b matches OpenAI’s o4-mini and outperforms o3-mini on many reasoning benchmarks, while the smaller gpt-oss-20b meets or exceeds o3-mini in various tests.
Post-Training and Alignment
The models underwent supervised fine-tuning followed by high-compute reinforcement learning (RL) stages. They support:
- Structured outputs
- Full chain-of-thought (CoT) reasoning
- Adjustable reasoning effort (low, medium, or high), set via system instructions (see the sketch below)
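For instance, when a gpt-oss model is served behind an OpenAI-compatible endpoint (for example via a local vLLM or Ollama server), the reasoning level can be requested in the system message. The endpoint URL, model name, and exact system-message wording below are assumptions for a local setup; consult the model's prompt-format documentation for the canonical form.

```python
# Hedged sketch: request "high" reasoning effort through the system message.
# Endpoint URL and model name are assumptions for a local OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

response = client.chat.completions.create(
    model="gpt-oss:20b",  # assumed local model identifier
    messages=[
        {"role": "system", "content": "Reasoning: high"},  # low / medium / high
        {"role": "user", "content": "How many primes are there below 50?"},
    ],
)
print(response.choices[0].message.content)
```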
Safety and Evaluation
OpenAI applied several safety measures:
- Pre-training filters excluded high-risk data (e.g., Chemical, Biological, Radiological, Nuclear content)
- Post-training involved deliberative alignment and refusal training
- Adversarial fine-tuning probed misuse scenarios and indicated that the risk remains low
- The models meet OpenAI’s internal safety benchmarks and Preparedness Framework
Note: Chain-of-thought outputs are not supervised and may contain hallucinations or harmful content. Developers should avoid exposing CoTs directly to end users.
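One way to follow that guidance, sketched below under the assumption that your serving stack already returns the chain-of-thought separately from the final answer (the field names here are hypothetical), is to keep the CoT for internal logging and safety review while rendering only the final answer to the user.

```python
from dataclasses import dataclass

@dataclass
class ModelTurn:
    # Hypothetical structure: assumes the inference stack separates the
    # chain-of-thought ("analysis") from the user-facing final answer.
    analysis: str
    final: str

def render_for_user(turn: ModelTurn) -> str:
    """Return only the final answer; never surface raw CoT to end users."""
    return turn.final

def log_for_review(turn: ModelTurn) -> dict:
    """Keep the unfiltered CoT in internal logs for debugging and safety review."""
    return {"analysis": turn.analysis, "final": turn.final}
```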
To encourage further safety research, OpenAI launched a $500K Red Teaming Challenge and is releasing findings and evaluation data openly.
Availability and Deployment
- Weights: Freely downloadable from Hugging Face (see the loading sketch after this list)
- Quantization: MXFP4 format enables efficient memory use, roughly 80 GB for the 120B model and 16 GB for the 20B model
- Tokenizer: Open-sourced o200k_harmony tokenizer
- Inference tools: Ready-to-use examples for PyTorch and Apple Metal platforms.
- Prompt renderer: Available in Python and Rust
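As a quick-start illustration, the snippet below loads the smaller model with Hugging Face Transformers. The repository id openai/gpt-oss-20b and the chat-style text-generation pipeline call are assumptions based on the Hugging Face release; check the model card for the officially recommended setup and hardware requirements.

```python
# Minimal sketch: load gpt-oss-20b via Hugging Face Transformers.
# The repo id and pipeline usage are assumptions; see the model card for
# the recommended setup (and note the ~16 GB memory figure above).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",   # assumed Hugging Face repository id
    torch_dtype="auto",
    device_map="auto",            # requires `accelerate`; places weights on available GPU(s)
)

messages = [{"role": "user", "content": "Summarize what MXFP4 quantization does."}]
output = generator(messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
```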
Deployment support includes:
- Cloud/hosting providers: Azure, AWS, Vercel, Hugging Face, Databricks, OpenRouter, Together AI, LM Studio, Fireworks, Baseten
- Local and edge platforms: llama.cpp, Ollama, Qualcomm AI Hub, Foundry Local (Windows)
- Hardware vendors: NVIDIA, AMD, Apple Metal, ONNX Runtime, Groq, Cerebras
GPU-optimized builds are also available for Windows devices through Foundry Local and VS Code, using ONNX Runtime.
Snapdragon Integration and On-Device Inference
Qualcomm Technologies is supporting OpenAI’s GPT-OSS models on Snapdragon platforms, with deployment expected to begin in 2025. The gpt-oss-20b model runs entirely on Snapdragon-powered devices, enabling local AI inference without relying on cloud services.
Early integration testing with Qualcomm's AI Engine and AI Stack has demonstrated the model's capability to perform complex reasoning fully on-device. Developers will be able to access the model through platforms such as Hugging Face and Ollama; the latter provides a lightweight open-source LLM serving framework tailored for Snapdragon devices.
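For quick local experimentation, a minimal sketch using the Ollama Python client is shown below; the model tag gpt-oss:20b and its availability for your device are assumptions, so check the Ollama library listing for your platform.

```python
# Minimal on-device sketch using the ollama Python client.
# Assumes Ollama is installed and the model has been pulled first,
# e.g. with `ollama pull gpt-oss:20b` (tag name is an assumption).
import ollama

response = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Plan a 3-step study schedule for algebra."}],
)
print(response["message"]["content"])
```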
Key highlights:
- On-device chain-of-thought reasoning optimized for Snapdragon processors
- Access via Hugging Face and Ollama platforms
- Ollama supports web search and turbo mode
- Deployment on smartphones, PCs, XR headsets, and vehicles
Qualcomm expects ongoing improvements in mobile memory and software efficiency to drive growth in on-device AI, enabling more private, low-latency, and personalized AI experiences.