NVIDIA Nemotron 3 Super, a 120B open hybrid MoE model (12B active) with a context window of up to 1M tokens
To run the smallest Nemotron 3 Super variant, you need at least 83 GB of RAM.
Nemotron 3 Super models support tool use and reasoning, and are available in GGUF format.
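For local inference from a GGUF build, a minimal sketch using llama-cpp-python might look like the following. The file name, quantization level, and context size are placeholders, not official values; point them at the GGUF file and hardware you actually have.

```python
# Minimal sketch: loading a GGUF build of Nemotron 3 Super with
# llama-cpp-python. Path and settings below are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="nemotron-3-super-120b-a12b.Q4_K_M.gguf",  # placeholder file name
    n_ctx=32768,      # the model supports up to 1M tokens, but KV-cache RAM grows with n_ctx
    n_gpu_layers=-1,  # offload all layers to GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Triage this IT ticket: VPN drops every 10 minutes."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```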

Nemotron-3-Super-120B-A12B is a large language model (LLM) trained by NVIDIA, designed to deliver strong agentic, reasoning, and conversational capabilities. It is optimized for collaborative agents and high-volume workloads such as IT ticket automation. The model has 12B active parameters and 120B parameters in total.
Like other models in the Nemotron family, it responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be configured through a flag in the chat template.
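As a hedged illustration of that chat-template flag, toggling the reasoning trace with Hugging Face transformers could look like the sketch below. The keyword `enable_thinking` and the repository id are assumptions modeled on common chat-template conventions; the official model card defines the actual names.

```python
# Sketch of toggling the reasoning trace through the chat template.
# `enable_thinking` and the repo id are assumptions; check the
# official model card for the actual flag.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("nvidia/Nemotron-3-Super-120B-A12B")  # assumed repo id

messages = [{"role": "user", "content": "Why might this server be unreachable?"}]

prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # assumed flag: emit a reasoning trace before the final answer
)
print(prompt)
```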
The model employs a hybrid Latent Mixture-of-Experts (LatentMoE) architecture that interleaves Mamba-2 and MoE layers with a small number of Attention layers. Unlike the Nano model, the Super model incorporates Multi-Token Prediction (MTP) layers for faster text generation and improved quality, and it is trained using NVFP4 quantization to maximize compute efficiency.
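To make the "12B active out of 120B total" figure concrete, here is a conceptual top-k MoE routing sketch. This is not NVIDIA's implementation, and the expert counts and shapes are illustrative; it only shows why each token exercises a small fraction of the total parameters.

```python
# Conceptual sketch (not NVIDIA's code) of sparse MoE routing: each
# token is sent to its top-k experts, so only those experts' weights
# are "active" for that token. All sizes here are illustrative.
import torch
import torch.nn.functional as F

def moe_layer(x, router, experts, k=2):
    """x: (tokens, d); router: nn.Linear(d, n_experts); experts: list of modules."""
    logits = router(x)                                       # (tokens, n_experts)
    weights, idx = torch.topk(F.softmax(logits, dim=-1), k)  # top-k experts per token
    out = torch.zeros_like(x)
    for j in range(k):
        for e in idx[:, j].unique().tolist():                # batch tokens sharing expert e
            mask = idx[:, j] == e
            out[mask] += weights[mask, j:j+1] * experts[e](x[mask])
    return out

# Illustrative usage: with 8 experts and k=2, each token touches
# only a quarter of the expert parameters per layer.
d, n_experts = 64, 8
router = torch.nn.Linear(d, n_experts)
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
y = moe_layer(torch.randn(16, d), router, experts)
```

The real 12B/120B ratio follows the same principle, just with Nemotron's actual expert configuration.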
Supported languages include English, French, German, Italian, Japanese, Spanish, and Chinese.
This model is provided under the NVIDIA Open Model License.