How to Reduce Non-Determinism and Hallucinations in Large Language Models (LLMs)

In recent months, two separate pieces of research have shed light on two of the most pressing issues in large language models (LLMs): their non-deterministic nature and their tendency to hallucinate. Both phenomena have a direct impact on the reliability, reproducibility, and practical usefulness of these technologies.

On the one hand, Thinking Machines, led by former OpenAI CTO Mira Murati, has published a paper proposing ways to make LLMs return the exact same answer to the exact same prompt every time, effectively defeating non-determinism. On the other hand, OpenAI has released research identifying the root cause of hallucinations and suggesting how they could be significantly reduced.

Let’s break down both findings and why they matter for the future of AI.

The problem of non-determinism in LLMs

Anyone who has used ChatGPT, Claude, or Gemini will have noticed that when you type in the exact same question multiple times, you don’t always get the same response. This is what’s known as non-determinism: the same input does not consistently lead to the same output.

In some areas, such as creative writing, this variability can actually be a feature; it helps generate fresh ideas. But in domains where consistency, auditability, and reproducibility are critical, such as healthcare, education, or scientific research, it becomes a serious limitation.

Why does non-determinism happen?

The most common explanation so far has been a mix of two technical issues:

  1. Floating-point arithmetic: computers store numbers with finite precision, so results are rounded and addition is not associative; the order in which values are added can introduce tiny variations.
  2. Concurrent execution on GPUs: calculations run in parallel, and the order in which they finish can vary from run to run, changing the result.

However, Thinking Machines argues that this doesn’t tell the whole story. According to their research, the real culprit is batch size.

When a model processes multiple prompts at once, it groups them into batches (or “carpools”). If the system is busy, the batch is large; if it’s quiet, the batch is small. These variations in batch size subtly change the order of operations inside the model, which can ultimately influence which word is predicted next. In other words, tiny shifts in the order of addition can completely alter the final response.
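
A minimal sketch of the underlying numerical effect (my own illustration, not Thinking Machines’ kernels): summing the same values grouped into different “batch” sizes changes the order of floating-point additions, and because floating-point addition is not associative, the totals can differ in the last few digits.

    # Illustration only: floating-point addition is not associative, so the same
    # values summed under different "batch" groupings can yield slightly different totals.
    import random

    random.seed(0)
    values = [random.uniform(-1e6, 1e6) for _ in range(10_000)]

    def chunked_sum(xs, chunk_size):
        # Sum within chunks first, then sum the partial results, mimicking a
        # reduction whose grouping depends on how many requests share the batch.
        partials = [sum(xs[i:i + chunk_size]) for i in range(0, len(xs), chunk_size)]
        return sum(partials)

    print(chunked_sum(values, 8))     # one grouping
    print(chunked_sum(values, 1024))  # another grouping: typically differs in the last digits

For a single scalar sum the discrepancy is negligible, but inside a model the same effect can flip which token narrowly wins the final probability comparison.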

Thinking Machines’ solution

The key, they suggest, is to keep internal processes consistent regardless of batch size. Their paper outlines three core fixes (a toy sketch of the first follows the list):

  1. Batch-invariant kernels: ensure operations are processed in the same order, even at the cost of some speed.
  2. Consistent mixing: use one stable method of combining operations, independent of workload.
  3. Ordered attention: slice input text uniformly so the attention mechanism processes sequences in the same order each time.
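
To make the first fix concrete, here is a toy sketch (a deliberate simplification, not the paper’s GPU kernels) of a batch-invariant reduction: every row is reduced in fixed-size chunks, in a fixed left-to-right order, so the sequence of floating-point operations never depends on how many prompts share the batch.

    # Toy sketch of batch invariance (a simplification, not the paper's kernels):
    # reduce each row in fixed-size chunks, in a fixed order, so the arithmetic
    # is identical whether the batch holds one row or a thousand.
    FIXED_CHUNK = 256

    def batch_invariant_row_sum(row, chunk=FIXED_CHUNK):
        partials = [sum(row[i:i + chunk]) for i in range(0, len(row), chunk)]
        total = 0.0
        for p in partials:  # fixed left-to-right accumulation
            total += p
        return total

    def batch_invariant_sums(batch):
        # Per-row results do not change when more rows are added to the batch.
        return [batch_invariant_row_sum(row) for row in batch]

The trade-off noted above is speed: the kernel can no longer pick whichever reduction strategy happens to be fastest for the current batch shape.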

The results are striking: in an experiment with the Qwen 235B model, applying these methods produced 1,000 identical completions to the same prompt, rather than dozens of unique variations.

This matters because determinism makes it possible to audit, debug, and above all, trust model outputs. It also enables stable benchmarks and easier verification, paving the way for reliable applications in mission-critical fields.
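
A determinism check along these lines is easy to sketch. In the snippet below, `generate` is a hypothetical stand-in for whatever inference call you actually use; the point is simply to count distinct completions for an identical prompt.

    # Hypothetical determinism check: `generate` is a stand-in for your own
    # inference call. A fully deterministic stack should report 1 unique completion.
    from collections import Counter

    def count_unique_completions(generate, prompt, n=1000):
        completions = [generate(prompt, temperature=0.0) for _ in range(n)]
        counts = Counter(completions)
        return len(counts), counts.most_common(3)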


The problem of hallucinations in LLMs

The second major limitation of today’s LLMs is hallucination: confidently producing false or misleading answers. For example, inventing a historical date or attributing a theory to the wrong scientist.

Why do models hallucinate?

According to OpenAI’s paper, hallucinations aren’t simply bugs; they are baked into the way we train LLMs. There are two key phases where this happens:

  1. Pre-training: even with a flawless dataset (which is impossible), the objective of predicting the next word naturally produces errors. Generating the right answer is harder than checking whether an answer is right.
  2. Post-training (reinforcement learning): models are fine-tuned to be more “helpful” and “decisive”. But current metrics reward correct answers while penalising both mistakes and admissions of ignorance. The result? Models learn that it’s better to bluff with a confident but wrong answer than to say “I don’t know”.

This is much like a student taking a multiple-choice exam: leaving a question blank guarantees zero, while guessing gives at least a chance of scoring. LLMs are currently trained with the same incentive structure.
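
The exam analogy can be made concrete with a few lines of arithmetic: under binary grading (1 point if correct, 0 otherwise), guessing always has a higher expected score than abstaining, no matter how low the model’s confidence is.

    # Expected score under binary grading (1 if correct, 0 otherwise):
    # guessing earns p_correct on average, abstaining always earns 0,
    # so bluffing is rewarded even at very low confidence.
    def expected_score_binary(p_correct, abstain=False):
        return 0.0 if abstain else p_correct

    for p in (0.9, 0.5, 0.1):
        print(p, expected_score_binary(p), expected_score_binary(p, abstain=True))
    # Even at 10% confidence, guessing (0.1) beats "I don't know" (0.0).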

OpenAI’s solution: behavioural calibration

The proposed solution is surprisingly simple yet powerful: teach models when not to answer. Instead of forcing a response to every question, set a confidence threshold.

  • If the model is, for instance, more than 75% confident, it answers.
  • If not, it responds: “I don’t know.”

This technique is known as behavioural calibration. It aligns the model’s stated confidence with its actual accuracy.
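
A minimal sketch of the idea as a wrapper, assuming a hypothetical `answer_with_confidence` call that returns an answer together with an estimated probability of being correct (the 75% cut-off is simply the illustrative figure used above):

    # Sketch of behavioural calibration as a thresholded wrapper.
    # `answer_with_confidence` is a hypothetical call returning (answer, confidence).
    CONFIDENCE_THRESHOLD = 0.75  # illustrative cut-off from the example above

    def calibrated_answer(answer_with_confidence, prompt):
        answer, confidence = answer_with_confidence(prompt)
        if confidence >= CONFIDENCE_THRESHOLD:
            return answer
        return "I don't know."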

Crucially, this requires rethinking benchmarks. Today’s most popular evaluations only score right and wrong answers. OpenAI suggests a three-tier scoring system:

  • +1 for a correct answer
  • 0 for “I don’t know”
  • –1 for an incorrect answer

This way, honesty is rewarded and overconfident hallucinations are discouraged.
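
A short sketch of the proposed grading and the incentive it creates: with +1/0/-1 scoring, answering only pays off in expectation when the model’s chance of being right exceeds 50%, so abstaining becomes the rational choice below that point.

    # Proposed three-tier grading: +1 correct, 0 for "I don't know", -1 incorrect.
    # Expected score of answering = p*(+1) + (1 - p)*(-1) = 2p - 1, which is only
    # positive when p > 0.5, so abstaining is rational below 50% confidence.
    def score(is_correct, abstained=False):
        if abstained:
            return 0
        return 1 if is_correct else -1

    def expected_score(p_correct):
        return 2 * p_correct - 1

    for p in (0.9, 0.6, 0.4):
        print(p, round(expected_score(p), 2))  # positive above 0.5, negative below

Raising the penalty for wrong answers pushes that break-even confidence higher, which is one way to encode a stricter threshold such as the 75% example above.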

Signs of progress

Some early users report that GPT-5 already shows signs of this approach: instead of fabricating answers, it sometimes replies, “I don’t know, and I can’t reliably find out.” Even Elon Musk praised this behaviour as an impressive step forward.

The change may seem small, but it has profound implications: a model that admits uncertainty is far more trustworthy than one that invents details.


Two sides of the same coin: reliability and trust

What makes these two breakthroughs especially interesting is how complementary they are:

  • Thinking Machines is tackling non-determinism, making outputs consistent and reproducible.
  • OpenAI is addressing hallucinations, making outputs more honest and trustworthy.

Together, they target the biggest barrier to wider LLM adoption: confidence. If users — whether researchers, doctors, teachers, or policymakers — can trust that an LLM will both give reproducible answers and know when to admit ignorance, the technology can be deployed with far greater safety.


Conclusion

Large language models have transformed how we work, research, and communicate. But for them to move beyond experimentation and novelty, they need more than just raw power or creativity: they need trustworthiness.

Thinking Machines has shown that non-determinism is not inevitable; with the right adjustments, models can behave consistently. OpenAI has demonstrated that hallucinations are not just random flaws but the direct result of how we train and evaluate models, and that they can be mitigated with behavioural calibration.

Taken together, these advances point towards a future of AI that is more transparent, reproducible, and reliable. If implemented at scale, they could usher in a new era where LLMs become dependable partners in science, education, law, and beyond.

Pumpfun Unveils Investment Arm and $3 Million Hackathon

PUMP rallied as much as 10% but erased its gains as crypto markets dipped.

Spot Bitcoin ETF AUM Hits Lowest Level Since April 2025

Assets in spot Bitcoin (BTC) ETFs slipped below $100 billion on Tuesday following a fresh $272 million in outflows.

According to data from SoSoValue, the move marked the first time spot Bitcoin ETF assets under management have fallen below that level since April 2025, after peaking at about $168 billion in October.

The drop came amid a broader crypto market sell-off, with Bitcoin sliding below $74,000 on Tuesday. The global cryptocurrency market capitalization fell from $3.11 trillion to $2.64 trillion over the past week, according to CoinGecko.

Altcoin funds secure modest inflows

The latest outflows from spot Bitcoin ETFs followed a brief rebound in flows on Monday, when the products attracted $562 million in net inflows.

Still, Bitcoin funds resumed outflows on Tuesday, pushing year-to-date net outflows to almost $1.3 billion amid ongoing market volatility.

Spot Bitcoin ETF flows since Jan. 26, 2026. Source: SoSoValue

By contrast, ETFs tracking altcoins such as Ether (ETH), XRP (XRP) and Solana (SOL) recorded modest inflows of $14 million, $19.6 million and $1.2 million, respectively.

Is institutional adoption moving beyond ETFs?

The ongoing sell-off in Bitcoin ETFs comes as BTC trades below the ETF creation cost basis of $84,000, suggesting new ETF shares are being issued at a loss and placing pressure on fund flows.

Market observers say that the slump is unlikely to trigger further mass sell-offs in ETFs.

“My guess is vast majority of assets in spot BTC ETFs stay put regardless,” ETF analyst Nate Geraci wrote on X on Monday.

Thomas Restout, CEO of institutional liquidity provider B2C2, echoed the sentiment, noting that institutional ETF investors are generally resilient. Still, he hinted that a shift toward onchain trading may be underway.

“The benefit of institutions coming in and buying ETFs is they’re far more resilient. They will sit on their views and positions for longer,” Restout said in a Rulematch Spot On podcast on Monday.

“I think the next level of transformation is institutions actually trading crypto, rather than just using securitized ETFs. We’re expecting the next wave of institutions to be the ones trading the underlying assets directly,” he noted.