
Nvidia’s new open-weights Nemotron 3 Super combines three architectures to beat gpt-oss and Qwen in throughput

Multi-agent systems, designed to handle long-horizon tasks like software engineering or cybersecurity triage, can generate up to 15 times the token volume of a standard chat, threatening their cost-effectiveness on enterprise tasks.

But today, Nvidia sought to help solve this problem with the release of Nemotron 3 Super, a 120-billion-parameter hybrid model, with weights posted on Hugging Face.

By merging three architectural philosophies (state-space models, transformers, and a novel “Latent” mixture-of-experts design), Nvidia aims to provide the specialized depth that agentic workflows require without the bloat typical of dense reasoning models, all available for commercial use under mostly open weights.

Triple hybrid architecture

At the core of Nemotron 3 Super is a sophisticated architectural triad that balances memory efficiency with precision reasoning. The model utilizes a Hybrid Mamba-Transformer backbone, which interleaves Mamba-2 layers with strategic Transformer attention layers.


To understand the implications for enterprise production, consider the “needle in a haystack” problem. Mamba-2 layers act like a “fast-travel” highway system, handling the vast majority of sequence processing with linear-time complexity. This allows the model to maintain a massive 1-million-token context window without the memory footprint of the KV cache exploding. However, pure state-space models often struggle with associative recall. 

To fix this, Nvidia strategically inserts Transformer attention layers as “global anchors,” ensuring the model can precisely retrieve specific facts buried deep within a codebase or a stack of financial reports.
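A rough way to see why this hybrid matters: the KV cache of attention layers grows linearly with context length, while Mamba-2 layers carry a constant-size state and store nothing per token. The sketch below estimates cache size at the 1-million-token window; the layer counts, head counts, head dimension, and dtype are illustrative assumptions for the example, not Nemotron’s actual configuration.

```python
# Illustrative sketch (assumed dimensions, not Nvidia's real config): compare
# KV-cache memory for an all-attention stack vs. a hybrid stack where only a
# few "anchor" layers are attention and the rest are Mamba-2 (no KV cache).

def kv_cache_bytes(attn_layers, context_len, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Each attention layer stores one key and one value vector per token.
    return attn_layers * context_len * kv_heads * head_dim * 2 * dtype_bytes

CONTEXT = 1_000_000  # the 1M-token window discussed above

pure = kv_cache_bytes(attn_layers=48, context_len=CONTEXT)    # all-attention
hybrid = kv_cache_bytes(attn_layers=6, context_len=CONTEXT)   # 6 attention anchors

print(f"pure transformer : {pure / 2**30:.1f} GiB")    # ~183 GiB
print(f"hybrid backbone  : {hybrid / 2**30:.1f} GiB")  # ~23 GiB
print(f"reduction        : {pure / hybrid:.0f}x")
```

Under these assumed numbers, swapping 42 of 48 attention layers for Mamba-2 cuts the cache by 8x; the handful of remaining attention layers still gives the model exact-recall anchors.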

Beyond the backbone, the model introduces Latent Mixture-of-Experts (LatentMoE). Traditional Mixture-of-Experts (MoE) designs route tokens to experts in their full hidden dimension, which creates a computational bottleneck as models scale. LatentMoE solves this by projecting tokens into a compressed space before routing them to specialists. 

This “expert compression” allows the model to consult four times as many specialists for the exact same computational cost. This granularity is vital for agents that must switch between Python syntax, SQL logic, and conversational reasoning within a single turn.
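The routing idea can be sketched in a few lines of NumPy. Every dimension, the top-k value, and the random weights below are illustrative assumptions, not Nemotron’s actual design; the point is only that both routing and the expert math happen in the smaller latent space, with a single projection back to the hidden size at the end.

```python
# Minimal latent-MoE sketch: compress the token, route and apply experts in
# the compressed space, then project back. All sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, LATENT, N_EXPERTS, TOP_K = 1024, 256, 16, 4

W_down = rng.standard_normal((HIDDEN, LATENT)) * 0.02            # compress
W_up = rng.standard_normal((LATENT, HIDDEN)) * 0.02              # decompress
router = rng.standard_normal((LATENT, N_EXPERTS)) * 0.02
experts = rng.standard_normal((N_EXPERTS, LATENT, LATENT)) * 0.02

def latent_moe(x):
    z = x @ W_down                          # token in the compressed space
    logits = z @ router                     # routing happens on the latent
    top = np.argsort(logits)[-TOP_K:]       # pick the TOP_K best experts
    w = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax weights
    out = sum(wi * (z @ experts[i]) for wi, i in zip(w, top))
    return out @ W_up                       # back to the hidden size

token = rng.standard_normal(HIDDEN)
print(latent_moe(token).shape)  # (1024,)
```

Because each expert is a LATENT x LATENT matrix instead of HIDDEN x HIDDEN, an expert costs (256/1024)^2 = 1/16 as many FLOPs here, which is what lets a model consult several times more specialists for the same budget.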


Further accelerating the model is Multi-Token Prediction (MTP). While standard models predict a single next token, MTP predicts several future tokens simultaneously. This serves as a “built-in draft model,” enabling native speculative decoding that can deliver up to 3x wall-clock speedups for structured generation tasks like code or tool calls.

The Blackwell advantage

For enterprises, the most significant technical leap in Nemotron 3 Super is its optimization for the Nvidia Blackwell GPU platform. By pre-training natively in NVFP4 (4-bit floating point), Nvidia has achieved a breakthrough in production efficiency.

On Blackwell, the model delivers 4x faster inference than 8-bit models running on the previous Hopper architecture, with no loss in accuracy.
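A rough sketch of what block-scaled 4-bit quantization looks like numerically: the magnitude grid below is the E2M1 value set commonly used for FP4, but the block size and scaling scheme here are simplified assumptions for illustration, not the NVFP4 specification.

```python
# Simplified block-scaled FP4 sketch: one shared scale per block of weights,
# values snapped to the nearest representable E2M1 magnitude.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def quantize_block(w):
    # Map the block's largest magnitude onto FP4's largest value (6.0).
    scale = np.abs(w).max() / FP4_GRID[-1]
    scale = scale if scale > 0 else 1.0
    # Snap each scaled magnitude to its nearest grid point.
    idx = np.abs(np.abs(w)[:, None] / scale - FP4_GRID).argmin(axis=1)
    return np.sign(w) * FP4_GRID[idx] * scale, scale

w = np.array([0.01, -0.30, 0.22, 0.05, -0.12, 0.30])
q, scale = quantize_block(w)
print("original :", w)
print("fp4 deq  :", q)                       # [0. -0.3  0.2  0.05 -0.1  0.3]
print("max error:", np.abs(w - q).max())     # 0.02
```

Each weight needs only 4 bits plus a small shared scale per block, which is why FP4 halves memory traffic again relative to 8-bit formats; the per-block scale keeps the snapping error bounded.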

In practice, Nemotron 3 Super performs as a specialized tool for agentic reasoning.


It currently holds the No. 1 position on the DeepResearch Bench, a benchmark measuring an AI’s ability to conduct thorough, multi-step research across large document sets.

| Benchmark | Nemotron 3 Super | Qwen3.5-122B-A10B | GPT-OSS-120B |
| --- | --- | --- | --- |
| **General Knowledge** | | | |
| MMLU-Pro | 83.73 | 86.70 | 81.00 |
| **Reasoning** | | | |
| AIME25 (no tools) | 90.21 | 90.36 | 92.50 |
| HMMT Feb25 (no tools) | 93.67 | 91.40 | 90.00 |
| HMMT Feb25 (with tools) | 94.73 | 89.55 | — |
| GPQA (no tools) | 79.23 | 86.60 | 80.10 |
| GPQA (with tools) | 82.70 | 80.09 | — |
| LiveCodeBench (v5, 2024-07 to 2024-12) | 81.19 | 78.93 | 88.00 |
| SciCode (subtask) | 42.05 | 42.00 | 39.00 |
| HLE (no tools) | 18.26 | 25.30 | 14.90 |
| HLE (with tools) | 22.82 | 19.00 | — |
| **Agentic** | | | |
| Terminal Bench (hard subset) | 25.78 | 26.80 | 24.00 |
| Terminal Bench Core 2.0 | 31.00 | 37.50 | 18.70 |
| SWE-Bench (OpenHands) | 60.47 | 66.40 | 41.90 |
| SWE-Bench (OpenCode) | 59.20 | 67.40 | — |
| SWE-Bench (Codex) | 53.73 | 61.20 | — |
| SWE-Bench Multilingual (OpenHands) | 45.78 | 30.80 | — |
| TauBench V2 (Airline) | 56.25 | 66.00 | 49.20 |
| TauBench V2 (Retail) | 62.83 | 62.60 | 67.80 |
| TauBench V2 (Telecom) | 64.36 | 95.00 | 66.00 |
| TauBench V2 (Average) | 61.15 | 74.53 | 61.00 |
| BrowseComp with Search | 31.28 | 33.89 | — |
| BIRD Bench | 41.80 | 38.25 | — |
| **Chat & Instruction Following** | | | |
| IFBench (prompt) | 72.56 | 73.77 | 68.32 |
| Scale AI Multi-Challenge | 55.23 | 61.50 | 58.29 |
| Arena-Hard-V2 | 73.88 | 75.15 | 90.26 |
| **Long Context** | | | |
| AA-LCR | 58.31 | 66.90 | 51.00 |
| RULER @ 256k | 96.30 | 96.74 | 52.30 |
| RULER @ 512k | 95.67 | 95.95 | 46.70 |
| RULER @ 1M | 91.75 | 91.33 | 22.30 |
| **Multilingual** | | | |
| MMLU-ProX (avg over langs) | 79.36 | 85.06 | 76.59 |
| WMT24++ (en→xx) | 86.67 | 87.84 | 88.89 |

(Dashes mark scores that were not reported.)

It also demonstrates significant throughput advantages, achieving up to 2.2x higher throughput than gpt-oss-120B and 7.5x higher than Qwen3.5-122B in high-volume settings.

Nvidia Nemotron 3 Super key benchmarks chart. Credit: Nvidia

Custom ‘open’ license: commercial use, but with important caveats

The release of Nemotron 3 Super under the Nvidia Open Model License Agreement (updated October 2025) provides a permissive framework for enterprise adoption, though it carries distinct “safeguard” clauses that differentiate it from pure open-source licenses like MIT or Apache 2.0.

Key Provisions for Enterprise Users:

  • Commercial Usability: The license explicitly states that models are “commercially usable” and grants a perpetual, worldwide, royalty-free license to sell and distribute products built on the model.

  • Ownership of Output: Nvidia makes no claim to the outputs generated by the model; the responsibility for those outputs—and the ownership of them—rests entirely with the user.

  • Derivative Works: Enterprises are free to create and own “Derivative Models” (fine-tuned versions), provided they include the required attribution notice: “Licensed by Nvidia Corporation under the Nvidia Open Model License.”

The “Red Lines”:

The license includes two critical termination triggers that production teams must monitor:

  1. Safety Guardrails: The license automatically terminates if a user bypasses or circumvents the model’s “Guardrails” (technical limitations or safety hyperparameters) without implementing a “substantially similar” replacement appropriate for the use case.

  2. Litigation Trigger: If a user institutes copyright or patent litigation against Nvidia alleging that the model infringes on their IP, their license to use the model terminates immediately.

This structure allows Nvidia to foster a commercial ecosystem while protecting itself from “IP trolling” and ensuring that the model isn’t stripped of its safety features for malicious use.

‘The team really cooked’

The release has generated significant buzz within the developer community. Chris Alexiuk, a Senior Product Research Engineer at Nvidia, heralded the launch on X under his handle @llm_wizard as a “SUPER DAY,” emphasizing the model’s speed and transparency. “Model is: FAST. Model is: SMART. Model is: THE MOST OPEN MODEL WE’VE DONE YET,” he posted, highlighting the release of not just weights, but 10 trillion tokens of training data and recipes.


The industry adoption reflects this enthusiasm:

  • Cloud and Hardware: The model is being deployed as an Nvidia NIM microservice, allowing it to run on-premises via the Dell AI Factory or HPE, as well as across Google Cloud, Oracle, and shortly, AWS and Azure.

  • Production Agents: Companies like CodeRabbit (software development) and Greptile are integrating the model to handle large-scale codebase analysis, while industrial leaders like Siemens and Palantir are deploying it to automate complex workflows in manufacturing and cybersecurity.
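For teams adopting the NIM route, NIM microservices expose an OpenAI-compatible chat-completions endpoint, so a client needs nothing beyond the standard library. The base URL and model id in the sketch below are placeholders, not confirmed identifiers; check your deployment for the actual values.

```python
# Hedged sketch of a minimal client for a NIM-style, OpenAI-compatible
# endpoint. Endpoint URL and model id are placeholders, not real values.
import json
import urllib.request

def build_payload(model, prompt, max_tokens=256):
    # Standard OpenAI-compatible chat-completions request body.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url, model, prompt, api_key="none"):
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Placeholder endpoint and model id -- adjust for your deployment:
# print(chat("http://localhost:8000", "nvidia/nemotron-3-super", "Hello"))
```

Because the API shape matches OpenAI’s, existing agent frameworks that speak that protocol can usually be pointed at a self-hosted NIM by changing only the base URL and model name.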

As Kari Briski, Nvidia VP of AI Software, noted: “As companies move beyond chatbots and into multi-agent applications, they encounter… context explosion.”

Nemotron 3 Super is Nvidia’s answer to that explosion—a model that provides the “brainpower” of a 120B parameter system with the operational efficiency of a much smaller specialist. For the enterprise, the message is clear: the “thinking tax” is finally coming down.
