
Nvidia’s new open-weights Nemotron 3 Super combines three architectures to beat gpt-oss and Qwen in throughput

Multi-agent systems, designed to handle long-horizon tasks like software engineering or cybersecurity triage, can generate up to 15 times the token volume of a standard chat, threatening their cost-effectiveness on enterprise tasks.

But today, Nvidia sought to help solve this problem with the release of Nemotron 3 Super, a 120-billion-parameter hybrid model, with weights posted on Hugging Face.

By merging three architectural philosophies (state-space models, transformers, and a novel “Latent” mixture-of-experts design), Nvidia aims to provide the specialized depth that agentic workflows require without the bloat typical of dense reasoning models, all available for commercial use under mostly open weights.

Triple hybrid architecture

At the core of Nemotron 3 Super is a sophisticated architectural triad that balances memory efficiency with precision reasoning. The model utilizes a Hybrid Mamba-Transformer backbone, which interleaves Mamba-2 layers with strategic Transformer attention layers.


To understand the implications for enterprise production, consider the “needle in a haystack” problem. Mamba-2 layers act like a “fast-travel” highway system, handling the vast majority of sequence processing with linear-time complexity. This allows the model to maintain a massive 1-million-token context window without the memory footprint of the KV cache exploding. However, pure state-space models often struggle with associative recall. 

To fix this, Nvidia strategically inserts Transformer attention layers as “global anchors,” ensuring the model can precisely retrieve specific facts buried deep within a codebase or a stack of financial reports.
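A rough way to see why this hybrid matters: the KV cache of attention layers grows linearly with context length, while Mamba-2 layers carry a constant-size state and store nothing per token. The sketch below estimates cache size at the 1-million-token window; the layer counts, head counts, head dimension, and dtype are illustrative assumptions for the example, not Nemotron’s actual configuration.

```python
# Illustrative sketch (assumed dimensions, not Nvidia's real config): compare
# KV-cache memory for an all-attention stack vs. a hybrid stack where only a
# few "anchor" layers are attention and the rest are Mamba-2 (no KV cache).

def kv_cache_bytes(attn_layers, context_len, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Each attention layer stores one key and one value vector per token.
    return attn_layers * context_len * kv_heads * head_dim * 2 * dtype_bytes

CONTEXT = 1_000_000  # the 1M-token window discussed above

pure = kv_cache_bytes(attn_layers=48, context_len=CONTEXT)    # all-attention
hybrid = kv_cache_bytes(attn_layers=6, context_len=CONTEXT)   # 6 attention anchors

print(f"pure transformer : {pure / 2**30:.1f} GiB")    # ~183 GiB
print(f"hybrid backbone  : {hybrid / 2**30:.1f} GiB")  # ~23 GiB
print(f"reduction        : {pure / hybrid:.0f}x")
```

Under these assumed numbers, swapping 42 of 48 attention layers for Mamba-2 cuts the cache by 8x; the handful of remaining attention layers still gives the model exact-recall anchors.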

Beyond the backbone, the model introduces Latent Mixture-of-Experts (LatentMoE). Traditional Mixture-of-Experts (MoE) designs route tokens to experts in their full hidden dimension, which creates a computational bottleneck as models scale. LatentMoE solves this by projecting tokens into a compressed space before routing them to specialists. 

This “expert compression” allows the model to consult four times as many specialists for the exact same computational cost. This granularity is vital for agents that must switch between Python syntax, SQL logic, and conversational reasoning within a single turn.
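The routing idea can be sketched in a few lines of NumPy. Every dimension, the top-k value, and the random weights below are illustrative assumptions, not Nemotron’s actual design; the point is only that both routing and the expert math happen in the smaller latent space, with a single projection back to the hidden size at the end.

```python
# Minimal latent-MoE sketch: compress the token, route and apply experts in
# the compressed space, then project back. All sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, LATENT, N_EXPERTS, TOP_K = 1024, 256, 16, 4

W_down = rng.standard_normal((HIDDEN, LATENT)) * 0.02            # compress
W_up = rng.standard_normal((LATENT, HIDDEN)) * 0.02              # decompress
router = rng.standard_normal((LATENT, N_EXPERTS)) * 0.02
experts = rng.standard_normal((N_EXPERTS, LATENT, LATENT)) * 0.02

def latent_moe(x):
    z = x @ W_down                          # token in the compressed space
    logits = z @ router                     # routing happens on the latent
    top = np.argsort(logits)[-TOP_K:]       # pick the TOP_K best experts
    w = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax weights
    out = sum(wi * (z @ experts[i]) for wi, i in zip(w, top))
    return out @ W_up                       # back to the hidden size

token = rng.standard_normal(HIDDEN)
print(latent_moe(token).shape)  # (1024,)
```

Because each expert is a LATENT x LATENT matrix instead of HIDDEN x HIDDEN, an expert costs (256/1024)^2 = 1/16 as many FLOPs here, which is what lets a model consult several times more specialists for the same budget.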


Further accelerating the model is Multi-Token Prediction (MTP). While standard models predict a single next token, MTP predicts several future tokens simultaneously. This serves as a “built-in draft model,” enabling native speculative decoding that can deliver up to 3x wall-clock speedups for structured generation tasks like code or tool calls.

The Blackwell advantage

For enterprises, the most significant technical leap in Nemotron 3 Super is its optimization for the Nvidia Blackwell GPU platform. By pre-training natively in NVFP4 (4-bit floating point), Nvidia has achieved a breakthrough in production efficiency.

On Blackwell, the model delivers 4x faster inference than 8-bit models running on the previous Hopper architecture, with no loss in accuracy.
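A rough sketch of what block-scaled 4-bit quantization looks like numerically: the magnitude grid below is the E2M1 value set commonly used for FP4, but the block size and scaling scheme here are simplified assumptions for illustration, not the NVFP4 specification.

```python
# Simplified block-scaled FP4 sketch: one shared scale per block of weights,
# values snapped to the nearest representable E2M1 magnitude.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def quantize_block(w):
    # Map the block's largest magnitude onto FP4's largest value (6.0).
    scale = np.abs(w).max() / FP4_GRID[-1]
    scale = scale if scale > 0 else 1.0
    # Snap each scaled magnitude to its nearest grid point.
    idx = np.abs(np.abs(w)[:, None] / scale - FP4_GRID).argmin(axis=1)
    return np.sign(w) * FP4_GRID[idx] * scale, scale

w = np.array([0.01, -0.30, 0.22, 0.05, -0.12, 0.30])
q, scale = quantize_block(w)
print("original :", w)
print("fp4 deq  :", q)                       # [0. -0.3  0.2  0.05 -0.1  0.3]
print("max error:", np.abs(w - q).max())     # 0.02
```

Each weight needs only 4 bits plus a small shared scale per block, which is why FP4 halves memory traffic again relative to 8-bit formats; the per-block scale keeps the snapping error bounded.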

In practice, Nemotron 3 Super performs as a specialized tool for agentic reasoning.


It currently holds the No. 1 position on the DeepResearch Bench, a benchmark measuring an AI’s ability to conduct thorough, multi-step research across large document sets.

| Benchmark | Nemotron 3 Super | Qwen3.5-122B-A10B | GPT-OSS-120B |
| --- | --- | --- | --- |
| **General Knowledge** | | | |
| MMLU-Pro | 83.73 | 86.70 | 81.00 |
| **Reasoning** | | | |
| AIME25 (no tools) | 90.21 | 90.36 | 92.50 |
| HMMT Feb25 (no tools) | 93.67 | 91.40 | 90.00 |
| HMMT Feb25 (with tools) | 94.73 | 89.55 | — |
| GPQA (no tools) | 79.23 | 86.60 | 80.10 |
| GPQA (with tools) | 82.70 | 80.09 | — |
| LiveCodeBench (v5, 2024-07 to 2024-12) | 81.19 | 78.93 | 88.00 |
| SciCode (subtask) | 42.05 | 42.00 | 39.00 |
| HLE (no tools) | 18.26 | 25.30 | 14.90 |
| HLE (with tools) | 22.82 | 19.00 | — |
| **Agentic** | | | |
| Terminal Bench (hard subset) | 25.78 | 26.80 | 24.00 |
| Terminal Bench Core 2.0 | 31.00 | 37.50 | 18.70 |
| SWE-Bench (OpenHands) | 60.47 | 66.40 | 41.90 |
| SWE-Bench (OpenCode) | 59.20 | 67.40 | — |
| SWE-Bench (Codex) | 53.73 | 61.20 | — |
| SWE-Bench Multilingual (OpenHands) | 45.78 | 30.80 | — |
| TauBench V2 (Airline) | 56.25 | 66.00 | 49.20 |
| TauBench V2 (Retail) | 62.83 | 62.60 | 67.80 |
| TauBench V2 (Telecom) | 64.36 | 95.00 | 66.00 |
| TauBench V2 (Average) | 61.15 | 74.53 | 61.00 |
| BrowseComp with Search | 31.28 | 33.89 | — |
| BIRD Bench | 41.80 | 38.25 | — |
| **Chat & Instruction Following** | | | |
| IFBench (prompt) | 72.56 | 73.77 | 68.32 |
| Scale AI Multi-Challenge | 55.23 | 61.50 | 58.29 |
| Arena-Hard-V2 | 73.88 | 75.15 | 90.26 |
| **Long Context** | | | |
| AA-LCR | 58.31 | 66.90 | 51.00 |
| RULER @ 256k | 96.30 | 96.74 | 52.30 |
| RULER @ 512k | 95.67 | 95.95 | 46.70 |
| RULER @ 1M | 91.75 | 91.33 | 22.30 |
| **Multilingual** | | | |
| MMLU-ProX (avg over langs) | 79.36 | 85.06 | 76.59 |
| WMT24++ (en→xx) | 86.67 | 87.84 | 88.89 |

(Dashes mark scores that were not reported.)

It also demonstrates significant throughput advantages, achieving up to 2.2x higher throughput than gpt-oss-120B and 7.5x higher than Qwen3.5-122B in high-volume settings.

Nvidia Nemotron 3 Super key benchmarks chart. Credit: Nvidia

Custom ‘open’ license: commercial use, but with important caveats

The release of Nemotron 3 Super under the Nvidia Open Model License Agreement (updated October 2025) provides a permissive framework for enterprise adoption, though it carries distinct “safeguard” clauses that differentiate it from pure open-source licenses like MIT or Apache 2.0.

Key Provisions for Enterprise Users:

  • Commercial Usability: The license explicitly states that models are “commercially usable” and grants a perpetual, worldwide, royalty-free license to sell and distribute products built on the model.

  • Ownership of Output: Nvidia makes no claim to the outputs generated by the model; the responsibility for those outputs—and the ownership of them—rests entirely with the user.

  • Derivative Works: Enterprises are free to create and own “Derivative Models” (fine-tuned versions), provided they include the required attribution notice: “Licensed by Nvidia Corporation under the Nvidia Open Model License.”

The “Red Lines”:

The license includes two critical termination triggers that production teams must monitor:

  1. Safety Guardrails: The license automatically terminates if a user bypasses or circumvents the model’s “Guardrails” (technical limitations or safety hyperparameters) without implementing a “substantially similar” replacement appropriate for the use case.

  2. Litigation Trigger: If a user institutes copyright or patent litigation against Nvidia alleging that the model infringes on their IP, their license to use the model terminates immediately.

This structure allows Nvidia to foster a commercial ecosystem while protecting itself from “IP trolling” and ensuring that the model isn’t stripped of its safety features for malicious use.

‘The team really cooked’

The release has generated significant buzz within the developer community. Chris Alexiuk, a Senior Product Research Engineer at Nvidia, heralded the launch on X under his handle @llm_wizard as a “SUPER DAY,” emphasizing the model’s speed and transparency. “Model is: FAST. Model is: SMART. Model is: THE MOST OPEN MODEL WE’VE DONE YET,” he posted, highlighting the release of not just weights, but 10 trillion tokens of training data and recipes.


The industry adoption reflects this enthusiasm:

  • Cloud and Hardware: The model is being deployed as an Nvidia NIM microservice, allowing it to run on-premises via the Dell AI Factory or HPE, as well as across Google Cloud, Oracle, and shortly, AWS and Azure.

  • Production Agents: Companies like CodeRabbit (software development) and Greptile are integrating the model to handle large-scale codebase analysis, while industrial leaders like Siemens and Palantir are deploying it to automate complex workflows in manufacturing and cybersecurity.
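For teams adopting the NIM route, NIM microservices expose an OpenAI-compatible chat-completions endpoint, so a client needs nothing beyond the standard library. The base URL and model id in the sketch below are placeholders, not confirmed identifiers; check your deployment for the actual values.

```python
# Hedged sketch of a minimal client for a NIM-style, OpenAI-compatible
# endpoint. Endpoint URL and model id are placeholders, not real values.
import json
import urllib.request

def build_payload(model, prompt, max_tokens=256):
    # Standard OpenAI-compatible chat-completions request body.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url, model, prompt, api_key="none"):
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Placeholder endpoint and model id -- adjust for your deployment:
# print(chat("http://localhost:8000", "nvidia/nemotron-3-super", "Hello"))
```

Because the API shape matches OpenAI’s, existing agent frameworks that speak that protocol can usually be pointed at a self-hosted NIM by changing only the base URL and model name.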

As Kari Briski, Nvidia VP of AI Software, noted: “As companies move beyond chatbots and into multi-agent applications, they encounter… context explosion.”

Nemotron 3 Super is Nvidia’s answer to that explosion—a model that provides the “brainpower” of a 120B parameter system with the operational efficiency of a much smaller specialist. For the enterprise, the message is clear: the “thinking tax” is finally coming down.
