
How to Reduce Non-Determinism and Hallucinations in Large Language Models (LLMs)


In recent months, two separate pieces of research have shed light on two of the most pressing issues in large language models (LLMs): their non-deterministic nature and their tendency to hallucinate. Both phenomena have a direct impact on the reliability, reproducibility, and practical usefulness of these technologies.

On the one hand, Thinking Machines, led by former OpenAI CTO Mira Murati, has published a paper proposing ways to make LLMs return the exact same answer to the exact same prompt every time, effectively defeating non-determinism. On the other hand, OpenAI has released research identifying the root cause of hallucinations and suggesting how they could be significantly reduced.

Let’s break down both findings and why they matter for the future of AI.

The problem of non-determinism in LLMs

Anyone who has used ChatGPT, Claude, or Gemini will have noticed that when you type in the exact same question multiple times, you don’t always get the same response. This is what’s known as non-determinism: the same input does not consistently lead to the same output.


In some areas, such as creative writing, this variability can actually be a feature; it helps generate fresh ideas. But in domains where consistency, auditability, and reproducibility are critical, such as healthcare, education, or scientific research, it becomes a serious limitation.

Why does non-determinism happen?

The most common explanation so far has been a mix of two technical issues:

  1. Floating-point arithmetic: computers round decimal numbers, and because rounded additions are not associative, performing them in a different order can introduce tiny variations.
  2. Concurrent execution on GPUs: calculations are performed in parallel, and the order in which they finish can vary, changing the result.

However, Thinking Machines argues that this doesn’t tell the whole story. According to their research, the real culprit is batch size.

When a model processes multiple prompts at once, it groups them into batches (or “carpools”). If the system is busy, the batch is large; if it’s quiet, the batch is small. These variations in batch size subtly change the order of operations inside the model, which can ultimately influence which word is predicted next. In other words, tiny shifts in the order of addition can completely alter the final response.
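As a quick illustration, here is a toy Python example (not taken from the paper): summing the same values in differently sized chunks, which is roughly what a changing batch size does to a reduction inside the model, produces results that can differ in the last digits.

```python
# Toy example (not from either paper): floating-point addition is not
# associative, so summing the same values in differently sized chunks can
# give results that differ in the last digits.
import random

random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

def chunked_sum(xs, chunk_size):
    """Sum xs by first summing fixed-size chunks, then summing the partials."""
    partials = [sum(xs[i:i + chunk_size]) for i in range(0, len(xs), chunk_size)]
    return sum(partials)

# The identical data, split differently, is rounded in a different order.
print(chunked_sum(values, 32))
print(chunked_sum(values, 1024))  # typically differs in the last few digits
```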

Thinking Machines’ solution

The key, they suggest, is to keep internal processes consistent regardless of batch size. Their paper outlines three core fixes:

  1. Batch-invariant kernels: ensure operations are processed in the same order regardless of batch size, even at the cost of some speed (see the sketch after this list).
  2. Consistent mixing: use one stable method of combining operations, independent of workload.
  3. Ordered attention: slice input text uniformly so the attention mechanism processes sequences in the same order each time.
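To make the batch-invariance idea concrete, here is a rough Python sketch; it assumes nothing about Thinking Machines’ actual kernels. The point is that every sequence is reduced with a fixed chunk size chosen in advance, so the order of floating-point additions for one prompt never depends on how many other prompts share its batch.

```python
# Conceptual sketch only; this does not reflect Thinking Machines' real kernels.
# The idea: reduce every sequence with one fixed, hard-coded chunk size, so the
# order of floating-point additions for a given prompt never depends on how
# many other prompts happen to share its batch.
FIXED_CHUNK = 256  # chosen once, independent of current server load

def batch_invariant_sum(sequence):
    """Sum one sequence with a fixed chunking, regardless of batch size."""
    partials = [sum(sequence[i:i + FIXED_CHUNK])
                for i in range(0, len(sequence), FIXED_CHUNK)]
    return sum(partials)

def process_batch(batch):
    # Each item is reduced independently, in the same fixed order of operations,
    # so adding or removing neighbours in the batch cannot change its result.
    return [batch_invariant_sum(seq) for seq in batch]
```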

The results are striking: in an experiment with the Qwen 235B model, applying these methods produced 1,000 identical completions to the same prompt, rather than dozens of unique variations.

This matters because determinism makes it possible to audit, debug, and above all, trust model outputs. It also enables stable benchmarks and easier verification, paving the way for reliable applications in mission-critical fields.


The problem of hallucinations in LLMs

The second major limitation of today’s LLMs is hallucination: confidently producing false or misleading answers. For example, inventing a historical date or attributing a theory to the wrong scientist.

Why do models hallucinate?

According to OpenAI’s paper, hallucinations aren’t simply bugs; they are baked into the way we train LLMs. There are two key phases where this happens:

  1. Pre-training: even with a flawless dataset (which is impossible), the objective of predicting the next word naturally produces errors. Generating the right answer is harder than checking whether an answer is right.
  2. Post-training (reinforcement learning): models are fine-tuned to be more “helpful” and “decisive”. But current metrics reward correct answers while penalising both mistakes and admissions of ignorance. The result? Models learn that it’s better to bluff with a confident but wrong answer than to say “I don’t know”.

This is much like a student taking a multiple-choice exam: leaving a question blank guarantees zero, while guessing gives at least a chance of scoring. LLMs are currently trained with the same incentive structure.

OpenAI’s solution: behavioural calibration

The proposed solution is surprisingly simple yet powerful: teach models when not to answer. Instead of forcing a response to every question, set a confidence threshold.

  • If the model is, for instance, more than 75% confident, it answers.
  • If not, it responds: “I don’t know.” (see the sketch below)
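Here is a rough Python sketch of this thresholding rule. The generate_candidate helper is hypothetical, not a real OpenAI API; it is assumed to return a candidate answer plus an estimated probability that the answer is correct.

```python
# Hypothetical sketch of behavioural calibration at inference time.
# `generate_candidate` is an assumed stand-in for the model, not a real
# OpenAI API: it returns (answer_text, estimated_probability_correct).
CONFIDENCE_THRESHOLD = 0.75  # the illustrative 75% cut-off from the article

def calibrated_answer(question, generate_candidate):
    answer, confidence = generate_candidate(question)
    if confidence > CONFIDENCE_THRESHOLD:
        return answer
    return "I don't know."

# Usage with a stubbed-out model that is only 62% sure of its guess:
def stub_model(question):
    return "Canberra", 0.62

print(calibrated_answer("What is the capital of Australia?", stub_model))
# -> "I don't know." because 0.62 does not clear the 0.75 threshold
```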

This technique is known as behavioural calibration. It aligns the model’s stated confidence with its actual accuracy.

Crucially, this requires rethinking benchmarks. Today’s most popular evaluations only score right and wrong answers. OpenAI suggests a three-tier scoring system:

  • +1 for a correct answer
  • 0 for “I don’t know”
  • –1 for an incorrect answer

This way, honesty is rewarded and overconfident hallucinations are discouraged.
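As a toy illustration, a benchmark scorer under this scheme could look something like the following sketch; the weights are simply the example values above, and the exact-match comparison is a simplification.

```python
# Toy scorer for the three-tier scheme described above: +1 for a correct
# answer, 0 for an explicit "I don't know", -1 for an incorrect answer.
def score_response(predicted, reference):
    if predicted.strip().lower() == "i don't know":
        return 0
    return 1 if predicted.strip().lower() == reference.strip().lower() else -1

def benchmark_score(pairs):
    """Average score over (prediction, reference) pairs."""
    scores = [score_response(p, r) for p, r in pairs]
    return sum(scores) / len(scores)

# Under this metric a model that abstains on hard questions outscores one
# that bluffs and gets them wrong, which is the incentive change OpenAI
# argues for.
print(benchmark_score([("Paris", "Paris"), ("I don't know", "Canberra")]))  # 0.5
print(benchmark_score([("Paris", "Paris"), ("Sydney", "Canberra")]))        # 0.0
```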

Signs of progress

Some early users report that GPT-5 already shows signs of this approach: instead of fabricating answers, it sometimes replies, “I don’t know, and I can’t reliably find out.” Even Elon Musk praised this behaviour as an impressive step forward.

The change may seem small, but it has profound implications: a model that admits uncertainty is far more trustworthy than one that invents details.


Two sides of the same coin: reliability and trust

What makes these two breakthroughs especially interesting is how complementary they are:

  • Thinking Machines is tackling non-determinism, making outputs consistent and reproducible.
  • OpenAI is addressing hallucinations, making outputs more honest and trustworthy.

Together, they target the biggest barrier to wider LLM adoption: confidence. If users — whether researchers, doctors, teachers, or policymakers — can trust that an LLM will both give reproducible answers and know when to admit ignorance, the technology can be deployed with far greater safety.


Conclusion

Large language models have transformed how we work, research, and communicate. But for them to move beyond experimentation and novelty, they need more than just raw power or creativity: they need trustworthiness.

Thinking Machines has shown that non-determinism is not inevitable; with the right adjustments, models can behave consistently. OpenAI has demonstrated that hallucinations are not just random flaws but the direct result of how we train and evaluate models, and that they can be mitigated with behavioural calibration.

Taken together, these advances point towards a future of AI that is more transparent, reproducible, and reliable. If implemented at scale, they could usher in a new era where LLMs become dependable partners in science, education, law, and beyond.



Ethereum Dust Attacks Have Increased Post-Fusaka


Stablecoin-fueled dusting attacks are now estimated to make up 11% of all Ethereum transactions and 26% of active addresses on an average day, after the Fusaka upgrade made transactions cheaper, according to Coin Metrics. 

Ethereum is now seeing more than 2 million average daily transactions, spiking to almost 2.9 million in mid-January, along with 1.4 million daily active addresses — a 60% increase over prior averages.

The Fusaka upgrade in December made using the network cheaper and easier by improving onchain data handling, reducing the cost of posting information from layer-2 networks back to Ethereum.

Digging through the dust on Ethereum

Coin Metrics said it analyzed over 227 million balance updates for USDC (USDC) and USDt (USDT) on Ethereum from November 2025 through January 2026.


It found that 43% were involved in transfers of less than $1 and 38% were under a single penny — “amounts with insignificant economic purpose other than wallet seeding.”

“The number of addresses holding small ‘dust’ balances, greater than zero but less than 1 native unit, has grown sharply, consistent with millions of wallets receiving tiny poisoning deposits.”

Pre-Fusaka, stablecoin dust accounted for roughly 3 to 5% of Ethereum transactions and 15 to 20% of active addresses, it said. 

“Post-Fusaka, these figures jumped to 10-15% of transactions and 25-35% of active addresses on a typical day, a 2-3x increase.”

However, the remaining 57% of balance updates involved transfers above $1, “suggesting the majority of stablecoin activity remains organic,” Coin Metrics stated.

Median Ethereum transaction size fell sharply after Fusaka. Source: Coin Metrics

Users need to be wary of address poisoning

In January, security researcher Andrey Sergeenkov pointed to a 170% increase in new wallet addresses in the week starting Jan. 12, and suggested it was linked to a wave of address poisoning attacks taking advantage of low gas fees.

These “dusting” attacks typically involve malicious actors sending fractions of a cent worth of a stablecoin from wallet addresses that resemble legitimate ones, duping users into copying the wrong address when making a transaction.
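To see why the lookalike trick works, consider this illustrative Python sketch (both addresses are made up): many wallet interfaces shorten addresses to their first and last few characters, so a poisoned address can appear identical to a legitimate one at a glance.

```python
# Illustrative only; both addresses are made up. Many wallet UIs shorten an
# address to its first and last few characters, so an attacker generates a
# "poisoned" address that matches a frequent counterparty on exactly those
# visible characters and then sends a tiny dust transfer from it.
legit_address = "0x1a2b3c4d5e6f7081928374655647382910abcdef"
poisoned_address = legit_address[:6] + "9" * (len(legit_address) - 10) + legit_address[-4:]

def shortened(addr, head=6, tail=4):
    """Mimic a wallet UI that displays only the ends of an address."""
    return f"{addr[:head]}...{addr[-tail:]}"

print(shortened(legit_address))     # 0x1a2b...cdef
print(shortened(poisoned_address))  # 0x1a2b...cdef, indistinguishable at a glance

# A user who copies the most recent sender from their transaction history,
# instead of the real counterparty, sends funds straight to the attacker.
```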



Sergeenkov said $740,000 had already been lost to address poisoning attacks. The top attacker sent nearly 3 million dust transfers for just $5,175 in stablecoin costs, according to Coin Metrics.
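A quick back-of-the-envelope check of those figures (rounded, purely illustrative) shows how cheap each poisoning attempt is:

```python
# Rough arithmetic on the reported numbers: ~3,000,000 dust transfers funded
# by about $5,175 in stablecoins works out to a fraction of a cent per transfer.
transfers = 3_000_000
total_stablecoin_cost_usd = 5_175
cost_per_transfer = total_stablecoin_cost_usd / transfers
print(f"${cost_per_transfer:.4f} per dust transfer")  # roughly $0.0017 each
```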

Dust does not represent genuine economic usage

Coin Metrics reported that approximately 250,000 to 350,000 daily Ethereum addresses are involved in stablecoin dust activity, but the majority of network growth has been genuine.  

“The majority of post-Fusaka growth reflects genuine usage, though dust activity is a factor worth noting when interpreting headline metrics.”

