New KV cache compaction technique cuts LLM memory 50x without accuracy loss

Enterprise AI applications that handle large documents or long-horizon tasks face a severe memory bottleneck. As the context grows longer, so does the KV cache, the data structure that holds the model’s working memory.

A new technique developed by researchers at MIT addresses this challenge with a fast compression method for the KV cache. The technique, called Attention Matching, manages to compact the context by up to 50x with very little loss in quality.

While it is not the only memory compaction technique available, Attention Matching stands out for its execution speed and impressive information-preserving capabilities.

The memory bottleneck of the KV cache

Large language models generate their responses sequentially, one token at a time. To avoid recalculating the entire conversation history from scratch for every predicted word, the model stores a mathematical representation of every previous token it has processed, also known as the key and value pairs. This critical working memory is known as the KV cache.
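To put that growth in concrete terms, the per-token cost can be estimated from the model's shape. The sketch below uses illustrative Llama-3.1-8B-style numbers (32 layers, 8 KV heads, head dimension 128, fp16); the configuration figures are assumptions for illustration, not values from the paper:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV cache size: one key and one value vector per token,
    per layer, per KV head, at `dtype_bytes` per element."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

# Each token costs ~128 KB here, so a 128k-token context needs:
print(f"{kv_cache_bytes(128_000) / 1e9:.1f} GB")  # 16.8 GB for one request
```

At that rate, a handful of long-context users can consume most of a datacenter GPU's memory, which is exactly the concurrency cap described below.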

The KV cache scales with conversation length because the model is forced to retain these keys and values for all previous tokens in a given interaction. This consumes expensive hardware resources. “In practice, KV cache memory is the biggest bottleneck to serving models at ultra-long context,” Adam Zweiger, co-author of the paper, told VentureBeat. “It caps concurrency, forces smaller batches, and/or requires more aggressive offloading.”

In modern enterprise use cases, such as analyzing massive legal contracts, maintaining multi-session customer dialogues, or running autonomous coding agents, the KV cache can balloon to many gigabytes of memory for a single user request.

To solve this massive bottleneck, the AI industry has tried several strategies, but these methods fall short when deployed in enterprise environments where extreme compression is necessary. A class of technical fixes includes optimizing the KV cache by either evicting tokens the model deems less important or merging similar tokens into a single representation. These techniques work for mild compression but “degrade rapidly at high reduction ratios,” according to the authors.
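A minimal sketch of the eviction family of fixes: score each cached token (for example, by the attention it has accumulated) and keep only the top scorers. The scoring scheme here is a generic stand-in, not any specific paper's method:

```python
import numpy as np

def evict_by_score(keys, values, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    idx = np.sort(np.argsort(scores)[-keep:])
    return keys[idx], values[idx]

rng = np.random.default_rng(0)
K = rng.standard_normal((8, 4))      # 8 cached tokens, head dim 4
V = rng.standard_normal((8, 4))
scores = np.array([0.9, 0.1, 0.5, 0.8, 0.2, 0.7, 0.3, 0.6])
K2, V2 = evict_by_score(K, V, scores, keep=4)   # retains tokens 0, 3, 5, 7
```

At 2x compression this often works; the degradation the authors describe shows up when `keep` is a small fraction of the original cache.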

Real-world applications often rely on simpler techniques, with the most common approach being to simply drop the older context once the memory limit is reached. But this approach causes the model to lose older information as the context grows long. Another alternative is context summarization, where the system pauses, writes a short text summary of the older context, and replaces the original memory with that summary. While this is an industry standard, summarization is highly lossy and heavily damages downstream performance because it might remove pertinent information from the context.
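The "drop the older context" baseline amounts to a sliding window over the cache. A toy version, with plain lists standing in for per-token key and value tensors:

```python
def truncate_kv(keys, values, max_tokens):
    """Sliding-window baseline: once the cache exceeds the budget,
    keep only the most recent `max_tokens` entries."""
    return keys[-max_tokens:], values[-max_tokens:]

keys, vals = list(range(10)), list(range(10))
keys, vals = truncate_kv(keys, vals, 4)
print(keys)  # [6, 7, 8, 9] -- everything older is simply gone
```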

Recent research has proven that it is technically possible to highly compress this memory using a method called Cartridges. However, this approach requires training latent KV cache models through slow, end-to-end mathematical optimization. This gradient-based training can take several hours on expensive GPUs just to compress a single context, making it completely unviable for real-time enterprise applications.

How attention matching compresses without the cost

Attention Matching achieves high compaction ratios and quality while being orders of magnitude faster than gradient-based optimization. Instead of slowly training a compressed cache, it computes one directly by solving a set of closed-form equations.

The researchers realized that to perfectly mimic how an AI interacts with its memory, they need to preserve two mathematical properties when compressing the original key and value vectors into a smaller footprint. The first is the “attention output,” which is the actual information the AI extracts when it queries its memory. The second is the “attention mass,” which acts as the mathematical weight that a token has relative to everything else in the model’s working memory. If the compressed memory can match these two properties, it will behave exactly like the massive, original memory, even when new, unpredictable user prompts are added later. 
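Both quantities are easy to write down for standard softmax attention. In the sketch below (notation mine, not the paper's), for a query q over cached keys K and values V, the per-token softmax weights are the attention mass, and their weighted sum over the values is the attention output:

```python
import numpy as np

def attention_stats(q, K, V):
    """Return the attention output (what the model reads from memory) and
    each cached token's attention mass (its softmax weight)."""
    logits = K @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    mass = w / w.sum()          # attention mass: one weight per cached token
    return mass @ V, mass       # attention output: mass-weighted sum of values

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((100, 64))   # 100 cached tokens
V = rng.standard_normal((100, 64))
out, mass = attention_stats(q, K, V)
# A good compacted cache reproduces `out` (and accounts for the total mass
# of every token it removed) for queries like `q`.
```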

“Attention Matching is, in some ways, the ‘correct’ objective for doing latent context compaction in that it directly targets preserving the behavior of each attention head after compaction,” Zweiger said. While token-dropping and related heuristics can work, explicitly matching attention behavior simply leads to better results.


Before compressing the memory, the system generates a small set of “reference queries” that act as a proxy for the types of internal searches the model is likely to perform when reasoning about the specific context. If the compressed memory can accurately answer these reference queries, it will very likely succeed at answering the user’s actual questions later. The authors suggest various methods for generating these reference queries, including appending a hidden prompt to the document telling the model to repeat the previous context, known as the “repeat-prefill” technique. They also suggest a “self-study” approach where the model is prompted to perform a few quick synthetic tasks on the document, such as aggregating all key facts or structuring dates and numbers into a JSON format.

With these queries in hand, the system picks a set of keys to preserve in the compacted KV cache based on signals like the highest attention value. It then uses the keys and reference queries to calculate the matching values along with a scalar bias term. This bias ensures that pertinent information is preserved, allowing each retained key to represent the mass of many removed keys.

This formulation makes it possible to fit the values with simple algebraic techniques, such as ordinary least squares and nonnegative least squares, entirely avoiding compute-heavy gradient-based optimization. This is what makes Attention Matching super fast in comparison to optimization-heavy compaction methods. The researchers also apply chunked compaction, processing contiguous chunks of the input independently and concatenating them, to further improve performance on long contexts.
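Under those definitions, the value-fitting step reduces to linear regression. The sketch below is a simplified reading of the approach: it renormalizes attention over the retained keys and solves ordinary least squares so the small cache reproduces the full cache's attention outputs on the reference queries. It deliberately omits the scalar bias term and the nonnegative variant the authors describe:

```python
import numpy as np

def fit_compacted_values(Q_ref, K_full, V_full, keep_idx):
    """Fit values for the retained keys so that, for the reference queries,
    attention over the small cache matches attention over the full cache.
    Simplified sketch: plain OLS, no scalar bias term."""
    d = Q_ref.shape[1]

    def attn_weights(K):
        logits = Q_ref @ K.T / np.sqrt(d)
        w = np.exp(logits - logits.max(axis=1, keepdims=True))
        return w / w.sum(axis=1, keepdims=True)

    target = attn_weights(K_full) @ V_full     # original attention outputs
    W_small = attn_weights(K_full[keep_idx])   # weights over retained keys only
    V_small, *_ = np.linalg.lstsq(W_small, target, rcond=None)
    return K_full[keep_idx], V_small

rng = np.random.default_rng(1)
Q = rng.standard_normal((200, 32))             # reference queries
K = rng.standard_normal((500, 32))
V = rng.standard_normal((500, 32))
K_c, V_c = fit_compacted_values(Q, K, V, np.arange(0, 500, 10))  # 10x smaller
```

Because `lstsq` is a single closed-form solve rather than an iterative training loop, this step takes milliseconds even for large caches, which is the source of the speed advantage over gradient-based methods.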

Attention matching in action

To understand how this method performs in the real world, the researchers ran a series of stress tests using popular open-source models like Llama 3.1 and Qwen-3 on two distinct types of enterprise datasets. The first was QuALITY, a standard reading comprehension benchmark using 5,000 to 8,000-word documents. The second, representing a true enterprise challenge, was LongHealth, a highly dense, 60,000-token dataset containing the complex medical records of multiple patients.

The key finding was the ability of Attention Matching to compact the model’s KV cache by 50x without reducing the accuracy, while taking only seconds to process the documents. To achieve that same level of quality previously, Cartridges required hours of intensive GPU computation per context.


Attention Matching with Qwen-3 (source: arXiv)

When dealing with the dense medical records, standard industry workarounds completely collapsed. The researchers noted that when they tried to use standard text summarization on these patient records, the model’s accuracy dropped so low that it matched the “no-context” baseline, meaning the AI performed as if it had not read the document at all. 

Attention Matching drastically outperforms summarization, but enterprise architects will need to dial down the compression ratio for dense tasks compared to simpler reading comprehension tests. As Zweiger explains, “The main practical tradeoff is that if you are trying to preserve nearly everything in-context on highly information-dense tasks, you generally need a milder compaction ratio to retain strong accuracy.”

The researchers also explored what happens in cases where absolute precision isn’t necessary but extreme memory savings are. They ran Attention Matching on top of a standard text summary. This combined approach achieved 200x compression. It successfully matched the accuracy of standard summarization alone, but with a very small memory footprint.

One of the interesting experiments for enterprise workflows was testing online compaction, though they note that this is a proof of concept and has not been tested rigorously in production environments. The researchers tested the model on the advanced AIME math reasoning test. They forced the AI to solve a problem with a strictly capped physical memory limit. Whenever the model’s memory filled up, the system paused, instantly compressed its working memory by 50 percent using Attention Matching, and let it continue thinking. Even after hitting the memory wall and having its KV cache shrunk up to six consecutive times mid-thought, the model successfully solved the math problems. Its performance matched a model that had been given massive, unlimited memory.
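The online-compaction control loop is simple to picture. A toy version, where integers stand in for tokens and `halve` stands in for a real 50 percent Attention Matching pass:

```python
def run_with_memory_cap(stream, cap, compact):
    """Toy model of online compaction: append incoming 'tokens' to a cache;
    whenever the cache hits the cap, compress it before continuing."""
    cache, n_compactions = [], 0
    for tok in stream:
        if len(cache) >= cap:
            cache = compact(cache)
            n_compactions += 1
        cache.append(tok)
    return cache, n_compactions

# Placeholder for a real latent compaction pass: keep every other entry.
halve = lambda c: c[::2]
cache, n = run_with_memory_cap(range(100), cap=32, compact=halve)
print(n, len(cache))  # 5 compactions; final cache holds 20 entries
```

With a cap of 32 and 100 incoming tokens, the cache is compacted five times and never exceeds the budget, mirroring the repeated mid-thought shrinks in the AIME experiment.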

There are caveats to consider. At a 50x compression ratio, Attention Matching is the clear winner in balancing speed and quality. However, if an enterprise attempts to push compression to extreme 100x limits on highly complex data, the slower, gradient-based Cartridges method actually outperforms it.

The researchers have released the code for Attention Matching. However, they note that this is not currently a simple plug-and-play software update. “I think latent compaction is best considered a model-layer technique,” Zweiger notes. “While it can be applied on top of any existing model, it requires access to model weights.” This means enterprises relying entirely on closed APIs cannot implement this themselves; they need open-weight models. 

The authors note that integrating this latent-space KV compaction into existing, highly optimized commercial inference engines still requires significant effort. Modern AI infrastructure uses complex tricks like prefix caching and variable-length memory packing to keep servers running efficiently, and seamlessly weaving this new compaction technique into those existing systems will take dedicated engineering work. However, there are immediate enterprise applications. “We believe compaction after ingestion is a promising use case, where large tool call outputs or long documents are compacted right after being processed,” Zweiger said.

Ultimately, the shift toward mechanical, latent-space compaction aligns with the future product roadmaps of major AI players, Zweiger argues. “We are seeing compaction to shift from something enterprises implement themselves into something model providers ship,” Zweiger said. “This is even more true for latent compaction, where access to model weights is needed. For example, OpenAI now exposes a black-box compaction endpoint that returns an opaque object rather than a plain-text summary.”

Netflix’s version of Overcooked lets you play as Huntr/x

Netflix’s library of streamable party games is expanding today with a custom version of Overcooked! All You Can Eat. Netflix launched its cloud gaming program with games like Lego Party and Tetris Time Warp, but Overcooked stands out because it features a roster of Netflix-affiliated characters from KPop Demon Hunters and Stranger Things.

For the uninitiated, Overcooked plays like a more manic version of Diner Dash, where teams attempt to prepare food together in increasingly elaborate kitchens filled with obstacles. The original version of Overcooked! All You Can Eat was released in 2020, and includes DLC and stages from previous versions of the game. Netflix’s version bundles in the same content, and “10 Netflix celebrity chefs” including “Dustin, Eleven, Lucas, and the Demogorgon from Stranger Things,” and “half-dozen faces from KPop Demon Hunters,” like “Mira, Rumi, Zoey, Jinu, Derpy and Sussie.” Like Netflix’s other streaming games, playing Overcooked also requires you to use a connected smartphone as a controller.

Offering a growing library of streaming games is part of Netflix’s new strategy under Alan Tascan, a former executive from Epic Games. Tascan took over as Netflix’s President of Games in 2024, and appeared to start revamping the company’s plans not long after, cancelling the release of several mobile games and reportedly shutting down its AAA game studio. Netflix is also continuing to adapt video games into content for its platform. For example, A24 is reportedly developing a game show based on Overcooked for the streaming service.

Valve doesn’t sound confident the Steam Machine will ship in 2026

As part of a Year in Review blog detailing changes Valve made to Steam in 2025, the company shared a minor update on its hardware plans that doesn’t sound good for anyone hoping to buy a Steam Machine, Steam Controller or Steam Frame in 2026. Specifically, the company is now opening up the possibility its new hardware won’t ship this year at all.

In February, when Valve acknowledged the ongoing memory and storage shortage had delayed the launch of its hardware and could lead to higher prices, the company was still committing to a (fairly wide) window of when its hardware would ship:

“Our goal of shipping all three products in the first half of the year has not changed. But we have work to do to land on concrete pricing and launch dates that we can confidently announce, being mindful of how quickly the circumstances around both of those things can change.”

As of the company’s latest post, however, things somehow sound even less certain. “We hope to ship in 2026, but as we shared recently, memory and storage shortages have created challenges for us,” Valve wrote in its Year in Review post. “We’ll share updates publicly when we finalize our plans!”

While Valve’s air of secrecy can make it easy to read too much into the limited information the company does share, moving from “the first half of the year” to “[hoping] to ship in 2026” certainly gives it wiggle room to not release new hardware this year. And considering the difficulties other companies are facing sourcing memory and storage, it wouldn’t be all that surprising.

HP said in February that RAM accounts for a third of its PC costs, and industry analysts warn the RAM shortage could radically alter the PC landscape as companies are forced to raise prices. Valve is already struggling to keep the Steam Deck in stock due to its issues securing RAM, so it stands to reason that sourcing components for even more devices wouldn’t make that process any easier. Then again, the company hasn’t updated its launch timing FAQ, so there’s still reason to hope the Steam Machine ships in 2026.


One Sailing Pulley To Rule Them All

When thinking of humanity’s ability to harness wind energy, many people will conjure images of windmills from places like The Netherlands or Persia. But people have been using wind energy for far longer than that in the form of sailing ships. Using the wind for transportation goes back another four thousand years or so, but despite our vast experience navigating the seas with wind alone there is still some room for improvement. Many modern sailboats use a number of different pulleys to manage all of the rigging, but this new, open-source pulley can replace many of them.

The pulley, or “block” as it is sometimes called, is built with a polymer roller made out of a type of nylon, which has the benefit of being extremely durable and self-lubricating but is a bit expensive. Durability and a lack of squeakiness are important in sailing applications, though. The body is made from CNC-machined aluminum and is composed of two parts, which pivot around the pulley’s axis to allow various ropes (or “lines”) to be inserted without freeing one end of the rope. In testing, this design outperformed some proprietary stainless steel pulleys of similar size.

Another perk of this design is that it can be set up to work in many different applications on a sailboat, whether that’s for hoisting a mainsail or pulling in a jib or any other task a pulley could be used for. It can also be stacked with others in many different configurations to build custom pulleys of almost any type, and can support up to 14 mm lines. For a sailor this could be extremely valuable, because as it stands each pulley on a ship tends to be used in only certain applications, and might also be proprietary from a specific company. This pulley is being released into the open-source world, allowing anyone to create them who wants one.

Thanks to [Keith] for the tip!


Seagate is now shipping HAMR disk drives holding up to 44TB of data


Seagate introduced the Mozaic 3+ platform in 2024, turning the heat-assisted magnetic recording (HAMR) dream into a real product for customers in need of massive storage capacities. The HDD maker is now introducing the next-generation Mozaic 4+ drives, which offer capacities up to 44TB.

Apple thinks it can lure in the 'Apple curious' for $599

Apple has made it pretty clear that it wants to siphon off Android and Windows users, and it’s doing it by adopting an aggressive, “budget-friendlier” model across nearly its entire ecosystem.

Apple is using $599 devices to grow its ecosystem

When I first entered the Apple ecosystem, it was when I bought an iPhone 4 in 2011 — I got it right after the 4s made its debut. I don’t remember exactly what I paid, but I know it was less than the initial $199 price tag.
And back then, I thought that was a completely asinine amount of money to pay for a phone. Fortunately, or maybe unfortunately, I had more money in my pocket than brains in my head, so I bought it just the same.

Anthropic will fight US ‘supply chain risk’ designation in court

Published

on

Anthropic confirmed it has been designated a ‘supply chain risk’ by the US administration, and said it has no choice but to challenge the move in court.

Despite ongoing talks between Anthropic and the US Department of Defense, Anthropic confirmed last night it had received a letter from defense secretary Pete Hegseth confirming the ‘supply chain risk’ designation that had been threatened.

“Yesterday (March 4) Anthropic received a letter from the Department of [Defense] confirming that we have been designated as a supply chain risk to America’s national security,” wrote co-founder and CEO Dario Amodei last night in an official statement. “We do not believe this action is legally sound, and we see no choice but to challenge it in court.”

Amodei was quick to point out that “even supposing it was legally sound”, the limited application of the designation means the “vast majority” of its customers will be unaffected by the move. He said the restriction clearly only applied to the use of Claude by customers as a direct part of contracts with the US defense department, “not all use of Claude by customers who have such contracts”.

“The Department’s letter has a narrow scope, and this is because the relevant statute is narrow, too,” wrote Amodei. “It exists to protect the government rather than to punish a supplier.”

As with previous statements, Amodei strikes a conciliatory tone, saying Anthropic is committed to US national security and will offer continuing support from its engineers to ensure a smooth transition from Claude “for as long as we are permitted to do so”.

Anthropic drew the ire of the US administration after a standoff with the Pentagon, where Anthropic refused to change its safeguards related to using its AI for fully autonomous weapons, or for mass surveillance of US citizens.

Many in Silicon Valley have supported its relatively principled stand, and general users have sent it to the top of the US Apple charts for free downloads in recent days – beating OpenAI’s ChatGPT for the first time. Amid that surge, its flagship Claude.ai and Claude Code apps went down for around three hours on 2 March due to “unprecedented demand”.

Claude Cowork in particular was already becoming the darling of AI enthusiasts in the professional world, and Bloomberg reported on Tuesday that Anthropic was on track to generate annual revenue of almost $20bn, more than double its run rate from late 2025, signalling the rapid growth at the AI company which is today valued at around $380bn.


Tinder settles age discrimination lawsuit for $60 million, see if you qualify for a payout


According to the plaintiff, Tinder charged users aged 29 and older more for premium subscriptions such as Tinder Plus and Tinder Gold, while offering cheaper rates for the same services to users in their teens and 20s. The lawsuit claimed the tiered pricing model violated multiple California laws, including the…

Cognizant TriZetto breach exposes health data of 3.4 million patients

TriZetto Provider Solutions, a healthcare IT company that develops software and services used by health insurers and healthcare providers, has suffered a data breach that exposed the sensitive information of over 3.4 million people.

The firm, which has been operating under the Cognizant umbrella since 2014, disclosed that it detected suspicious activity on a web portal on October 2, 2025, and launched an investigation with the help of external cybersecurity experts.

The investigation revealed that unauthorized access began nearly a year before, on November 19, 2024.

During the exposure period, the threat actors accessed records relating to insurance eligibility verification transactions, which are part of the process providers use to confirm a patient’s insurance coverage before treatment.

Advertisement

The types of data that have been exposed vary per individual, and may include one or more of the following:

  • Full name
  • Physical address
  • Date of birth
  • Social Security number
  • Health insurance member number
  • Medicare beneficiary identifier
  • Provider name
  • Health insurer name
  • Demographic, health, and insurance information

Affected providers were alerted on December 9, 2025, but customer notification started in early February 2026. According to a filing submitted to Maine’s Attorney General today, the number of exposed individuals is 3,433,965.

TriZetto says that payment card, bank account, or other financial information was not exposed in this incident.

Also, the company is not aware of any cases where cybercriminals have attempted to misuse this information.

TriZetto says it has taken steps to strengthen cybersecurity on its systems and informed law enforcement authorities of the incident.

Notification recipients are offered free 12-month coverage of credit monitoring and identity protection services from Kroll to help mitigate risks arising from compromised data.

BleepingComputer has contacted TriZetto to learn more about the nature of the security breach and why the firm delayed notifications to consumers for several months, but we have not received a response by publication time.

No ransomware groups have taken responsibility for the attack yet, and no data leaks linked to TriZetto have appeared on underground forums.

Cognizant itself was rumored to have suffered a Maze ransomware breach in 2020. In June 2025, Clorox sued the IT firm for gross negligence after it allegedly let Scattered Spider operatives into its network following a social engineering attack in September 2023.


The remake of one of the best Assassin’s Creed games is actually happening

Ubisoft has finally confirmed what Assassin’s Creed fans have suspected for years: a remake of Assassin’s Creed IV: Black Flag is officially in the works.

The company revealed the project, titled Assassin’s Creed: Black Flag Resynced, in a new blog post outlining the future of the long-running series.

We don’t know much about the game yet, but initial reports suggest that Resynced will be a full remake rather than a simple remaster, with upgraded visuals and gameplay improvements, bringing one of the best AC games into the modern age.

It’s also suggested that new story content will be added to flesh out the world around Edward Kenway’s life – at the expense of the modern-day gameplay, which has apparently been removed from the remake altogether. It’ll be interesting to see how this all works, given how the original game wove parts of both storylines into the ending.

We’ve known for quite some time that Ubisoft has been thinking about breathing life into the 2013 game, but this was more or less confirmed when the name surfaced on a European ratings board listing late last year.

We don’t yet have a release date for the game, but we know that an unannounced game was due to arrive before the end of the current financial year. Of course, Ubisoft delayed seven games earlier this year – and Black Flag is expected to be one of them.

Whether or not we see the game before the end of 2026 remains to be seen, but for now we’ll keep our “spyglass on the horizon”.


Fully charged: Meet the local leader energizing the Pacific Northwest battery boom

Grayson Shor, far right, at a recent Pacific Northwest Battery Collaborative meetup at a Seattle brewery on Capitol Hill. Shor launched the organization to help the sector build connections. (PNWBC Photo)

Grayson Shor, founder and executive director of the Pacific Northwest Battery Collaborative, is the driving force that’s uniting and energizing the region’s battery community.

The collaborative’s launch in October 2024 was so popular it ran out of chairs and the group now caps RSVPs because venues keep maxing out. The nonprofit has hosted 1,400 attendees at 17 different events in Washington, Oregon and online. Shor’s latest project is helping create a battery-focused mini-series he describes as a hybrid between Anthony Bourdain’s “Parts Unknown” and “Cosmos.”

Who knew that energy storage devices could generate so much enthusiasm?

“Batteries are sexy right now,” Shor said.

Batteries are making electric vehicle adoption more attractive as they’ve become increasingly powerful and quicker to recharge. They’re ubiquitous given the pervasive use of phones and consumer electronics. And as electricity demand is spiking thanks to data centers and other energy users, they’re a relatively quick, affordable way to add more power to the grid.

“We are installing more grid batteries in 2025 than the total amount that existed globally just two years ago,” Shor said. “This isn’t just growth, it’s a total reimagining of how our economy is powered.”

A battery ecosystem emerges

Part of the crowd at the Pacific Northwest Battery Collaborative launch party, with founder Grayson Shor in the front row in a tie. (PNWBC Photo)

Shor has spent nearly a decade working on sustainability, circular economy and battery-related issues for organizations ranging from the U.S. Department of State to Amazon to startups. When the former diplomat landed in Seattle from the other Washington more than two years ago, he was impressed by the region’s battery sector.

That included startups in electric aviation, alternative chemistries such as sodium batteries, and next-generation silicon battery materials, plus R&D resources and support at the University of Washington’s Clean Energy Institute.

But he realized the industry lacked the connections to bring together companies, academics, entrepreneurs and investors, and set out to address it. The sector welcomes his efforts.

“I’ve paid attention to folks trying to knit together community, and for the Northwest battery innovation and application ecosystem, Grayson Shor has been an unrelenting force seeking to build and amplify our unique strengths,” said Dan Schwartz, founding director of the Clean Energy Institute.

Tom Gurski, founder of the plug-in hybrid vehicle startup Blue Dot Motorworks, has attended the group’s functions. “In a region famous for introverted personalities, their events and happy hours are invaluable for breaking down silos and getting people to connect,” Gurski said.

Beyond building community, Shor is lobbying for support for local and state policies that promote the industry and get more batteries deployed in the state. The energy storage devices have important societal benefits, he said, including better electrical grid performance and helping meet power needs during peak demand.

‘The Battery Life’

Shor speaking at a Pacific Northwest Battery Collaborative event in Seattle during 2025 PNW Climate Week. (PNWBC Photo)

Shor is also the co-founder and chief product officer for Buckstop, an “urban mining” startup helping recover critical minerals from waste electronics. He also volunteers as the policy and government affairs director for the Volta Foundation, the world’s largest battery industry association.

And there’s the TV series, called “The Battery Life.” Crews recently spent three days in the Seattle area filming the first episode, visiting the battery materials company Group14 Technologies and interviewing startups at the UW’s Clean Energy Test Beds.

“We’re doing walks through factories. We’re meeting with the CEOs and the inventors, diving deep into their technology,” Shor said. But the series also has “the ‘Carl Sagan vibe,’” he added, explaining “how does this technology actually impact humanity, and why does it matter to the average person?”

Additional episodes will be shot in Portland and Vancouver, B.C. The plan is to air the series later this year at energy events in Oregon and Las Vegas, plus other area venues.

Future Pacific Northwest Battery Collaborative plans include a job fair and fundraising gala. Shor also envisions a convention where the entrepreneurs and innovators could set up booths to show off their technologies. The ideas keep coming.

“This is playing my little role in trying to tackle climate change, to try to advance the energy transition,” he said. “It helps with equity, it helps with economic opportunity … It makes me happy.”
 
