New KV cache compaction technique cuts LLM memory 50x without accuracy loss

Enterprise AI applications that handle large documents or long-horizon tasks face a severe memory bottleneck. As the context grows longer, so does the KV cache, the data structure that holds the model’s working memory.

A new technique developed by researchers at MIT addresses this challenge with a fast compression method for the KV cache. The technique, called Attention Matching, manages to compact the context by up to 50x with very little loss in quality.

While it is not the only memory compaction technique available, Attention Matching stands out for its execution speed and impressive information-preserving capabilities.

The memory bottleneck of the KV cache

Large language models generate their responses sequentially, one token at a time. To avoid recalculating the entire conversation history from scratch for every predicted word, the model stores a mathematical representation of every previous token it has processed, also known as the key and value pairs. This critical working memory is known as the KV cache.
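To put that growth in concrete terms, the per-token cost can be estimated from the model's shape. The sketch below uses illustrative Llama-3.1-8B-style numbers (32 layers, 8 KV heads, head dimension 128, fp16); the configuration figures are assumptions for illustration, not values from the paper:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV cache size: one key and one value vector per token,
    per layer, per KV head, at `dtype_bytes` per element."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

# Each token costs ~128 KB here, so a 128k-token context needs:
print(f"{kv_cache_bytes(128_000) / 1e9:.1f} GB")  # 16.8 GB for one request
```

At that rate, a handful of long-context users can consume most of a datacenter GPU's memory, which is exactly the concurrency cap described below.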

The KV cache scales with conversation length because the model is forced to retain these keys and values for all previous tokens in a given interaction. This consumes expensive hardware resources. “In practice, KV cache memory is the biggest bottleneck to serving models at ultra-long context,” Adam Zweiger, co-author of the paper, told VentureBeat. “It caps concurrency, forces smaller batches, and/or requires more aggressive offloading.”

In modern enterprise use cases, such as analyzing massive legal contracts, maintaining multi-session customer dialogues, or running autonomous coding agents, the KV cache can balloon to many gigabytes of memory for a single user request.

To solve this massive bottleneck, the AI industry has tried several strategies, but these methods fall short when deployed in enterprise environments where extreme compression is necessary. A class of technical fixes includes optimizing the KV cache by either evicting tokens the model deems less important or merging similar tokens into a single representation. These techniques work for mild compression but “degrade rapidly at high reduction ratios,” according to the authors.
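A minimal sketch of the eviction family of fixes: score each cached token (for example, by the attention it has accumulated) and keep only the top scorers. The scoring scheme here is a generic stand-in, not any specific paper's method:

```python
import numpy as np

def evict_by_score(keys, values, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    idx = np.sort(np.argsort(scores)[-keep:])
    return keys[idx], values[idx]

rng = np.random.default_rng(0)
K = rng.standard_normal((8, 4))      # 8 cached tokens, head dim 4
V = rng.standard_normal((8, 4))
scores = np.array([0.9, 0.1, 0.5, 0.8, 0.2, 0.7, 0.3, 0.6])
K2, V2 = evict_by_score(K, V, scores, keep=4)   # retains tokens 0, 3, 5, 7
```

At 2x compression this often works; the degradation the authors describe shows up when `keep` is a small fraction of the original cache.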

Real-world applications often rely on simpler techniques, with the most common approach being to simply drop the older context once the memory limit is reached. But this approach causes the model to lose older information as the context grows long. Another alternative is context summarization, where the system pauses, writes a short text summary of the older context, and replaces the original memory with that summary. While this is an industry standard, summarization is highly lossy and heavily damages downstream performance because it might remove pertinent information from the context.
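The "drop the older context" baseline amounts to a sliding window over the cache. A toy version, with plain lists standing in for per-token key and value tensors:

```python
def truncate_kv(keys, values, max_tokens):
    """Sliding-window baseline: once the cache exceeds the budget,
    keep only the most recent `max_tokens` entries."""
    return keys[-max_tokens:], values[-max_tokens:]

keys, vals = list(range(10)), list(range(10))
keys, vals = truncate_kv(keys, vals, 4)
print(keys)  # [6, 7, 8, 9] -- everything older is simply gone
```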

Recent research has proven that it is technically possible to highly compress this memory using a method called Cartridges. However, this approach requires training latent KV cache models through slow, end-to-end mathematical optimization. This gradient-based training can take several hours on expensive GPUs just to compress a single context, making it completely unviable for real-time enterprise applications.

How attention matching compresses without the cost

Attention Matching achieves high compaction ratios and quality while being orders of magnitude faster than gradient-based optimization. Instead of slowly training a compressed cache, it computes one directly by solving a set of closed-form equations.

The researchers realized that to perfectly mimic how an AI interacts with its memory, they need to preserve two mathematical properties when compressing the original key and value vectors into a smaller footprint. The first is the “attention output,” which is the actual information the AI extracts when it queries its memory. The second is the “attention mass,” which acts as the mathematical weight that a token has relative to everything else in the model’s working memory. If the compressed memory can match these two properties, it will behave exactly like the massive, original memory, even when new, unpredictable user prompts are added later. 
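Both quantities are easy to write down for standard softmax attention. In the sketch below (notation mine, not the paper's), for a query q over cached keys K and values V, the per-token softmax weights are the attention mass, and their weighted sum over the values is the attention output:

```python
import numpy as np

def attention_stats(q, K, V):
    """Return the attention output (what the model reads from memory) and
    each cached token's attention mass (its softmax weight)."""
    logits = K @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    mass = w / w.sum()          # attention mass: one weight per cached token
    return mass @ V, mass       # attention output: mass-weighted sum of values

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((100, 64))   # 100 cached tokens
V = rng.standard_normal((100, 64))
out, mass = attention_stats(q, K, V)
# A good compacted cache reproduces `out` (and accounts for the total mass
# of every token it removed) for queries like `q`.
```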

“Attention Matching is, in some ways, the ‘correct’ objective for doing latent context compaction in that it directly targets preserving the behavior of each attention head after compaction,” Zweiger said. While token-dropping and related heuristics can work, explicitly matching attention behavior simply leads to better results.


Before compressing the memory, the system generates a small set of “reference queries” that act as a proxy for the types of internal searches the model is likely to perform when reasoning about the specific context. If the compressed memory can accurately answer these reference queries, it will very likely succeed at answering the user’s actual questions later. The authors suggest various methods for generating these reference queries, including appending a hidden prompt to the document telling the model to repeat the previous context, known as the “repeat-prefill” technique. They also suggest a “self-study” approach where the model is prompted to perform a few quick synthetic tasks on the document, such as aggregating all key facts or structuring dates and numbers into a JSON format.

With these queries in hand, the system picks a set of keys to preserve in the compacted KV cache based on signals like the highest attention value. It then uses the keys and reference queries to calculate the matching values along with a scalar bias term. This bias ensures that pertinent information is preserved, allowing each retained key to represent the mass of many removed keys.

This formulation makes it possible to fit the values with simple algebraic techniques, such as ordinary least squares and nonnegative least squares, entirely avoiding compute-heavy gradient-based optimization. This is what makes Attention Matching super fast in comparison to optimization-heavy compaction methods. The researchers also apply chunked compaction, processing contiguous chunks of the input independently and concatenating them, to further improve performance on long contexts.
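Under those definitions, the value-fitting step reduces to linear regression. The sketch below is a simplified reading of the approach: it renormalizes attention over the retained keys and solves ordinary least squares so the small cache reproduces the full cache's attention outputs on the reference queries. It deliberately omits the scalar bias term and the nonnegative variant the authors describe:

```python
import numpy as np

def fit_compacted_values(Q_ref, K_full, V_full, keep_idx):
    """Fit values for the retained keys so that, for the reference queries,
    attention over the small cache matches attention over the full cache.
    Simplified sketch: plain OLS, no scalar bias term."""
    d = Q_ref.shape[1]

    def attn_weights(K):
        logits = Q_ref @ K.T / np.sqrt(d)
        w = np.exp(logits - logits.max(axis=1, keepdims=True))
        return w / w.sum(axis=1, keepdims=True)

    target = attn_weights(K_full) @ V_full     # original attention outputs
    W_small = attn_weights(K_full[keep_idx])   # weights over retained keys only
    V_small, *_ = np.linalg.lstsq(W_small, target, rcond=None)
    return K_full[keep_idx], V_small

rng = np.random.default_rng(1)
Q = rng.standard_normal((200, 32))             # reference queries
K = rng.standard_normal((500, 32))
V = rng.standard_normal((500, 32))
K_c, V_c = fit_compacted_values(Q, K, V, np.arange(0, 500, 10))  # 10x smaller
```

Because `lstsq` is a single closed-form solve rather than an iterative training loop, this step takes milliseconds even for large caches, which is the source of the speed advantage over gradient-based methods.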

Attention matching in action

To understand how this method performs in the real world, the researchers ran a series of stress tests using popular open-source models like Llama 3.1 and Qwen-3 on two distinct types of enterprise datasets. The first was QuALITY, a standard reading comprehension benchmark using 5,000 to 8,000-word documents. The second, representing a true enterprise challenge, was LongHealth, a highly dense, 60,000-token dataset containing the complex medical records of multiple patients.

The key finding was the ability of Attention Matching to compact the model’s KV cache by 50x without reducing the accuracy, while taking only seconds to process the documents. To achieve that same level of quality previously, Cartridges required hours of intensive GPU computation per context.


Attention Matching with Qwen-3 (source: arXiv)

When dealing with the dense medical records, standard industry workarounds completely collapsed. The researchers noted that when they tried to use standard text summarization on these patient records, the model’s accuracy dropped so low that it matched the “no-context” baseline, meaning the AI performed as if it had not read the document at all. 

Attention Matching drastically outperforms summarization, but enterprise architects will need to dial down the compression ratio for dense tasks compared to simpler reading comprehension tests. As Zweiger explains, “The main practical tradeoff is that if you are trying to preserve nearly everything in-context on highly information-dense tasks, you generally need a milder compaction ratio to retain strong accuracy.”

The researchers also explored what happens in cases where absolute precision isn’t necessary but extreme memory savings are. They ran Attention Matching on top of a standard text summary. This combined approach achieved 200x compression. It successfully matched the accuracy of standard summarization alone, but with a very small memory footprint.

One of the interesting experiments for enterprise workflows was testing online compaction, though they note that this is a proof of concept and has not been tested rigorously in production environments. The researchers tested the model on the advanced AIME math reasoning test. They forced the AI to solve a problem with a strictly capped physical memory limit. Whenever the model’s memory filled up, the system paused, instantly compressed its working memory by 50 percent using Attention Matching, and let it continue thinking. Even after hitting the memory wall and having its KV cache shrunk up to six consecutive times mid-thought, the model successfully solved the math problems. Its performance matched a model that had been given massive, unlimited memory.
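The online-compaction control loop is simple to picture. A toy version, where integers stand in for tokens and `halve` stands in for a real 50 percent Attention Matching pass:

```python
def run_with_memory_cap(stream, cap, compact):
    """Toy model of online compaction: append incoming 'tokens' to a cache;
    whenever the cache hits the cap, compress it before continuing."""
    cache, n_compactions = [], 0
    for tok in stream:
        if len(cache) >= cap:
            cache = compact(cache)
            n_compactions += 1
        cache.append(tok)
    return cache, n_compactions

# Placeholder for a real latent compaction pass: keep every other entry.
halve = lambda c: c[::2]
cache, n = run_with_memory_cap(range(100), cap=32, compact=halve)
print(n, len(cache))  # 5 compactions; final cache holds 20 entries
```

With a cap of 32 and 100 incoming tokens, the cache is compacted five times and never exceeds the budget, mirroring the repeated mid-thought shrinks in the AIME experiment.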

There are caveats to consider. At a 50x compression ratio, Attention Matching is the clear winner in balancing speed and quality. However, if an enterprise attempts to push compression to extreme 100x limits on highly complex data, the slower, gradient-based Cartridges method actually outperforms it.

The researchers have released the code for Attention Matching. However, they note that this is not currently a simple plug-and-play software update. “I think latent compaction is best considered a model-layer technique,” Zweiger notes. “While it can be applied on top of any existing model, it requires access to model weights.” This means enterprises relying entirely on closed APIs cannot implement this themselves; they need open-weight models. 

The authors note that integrating this latent-space KV compaction into existing, highly optimized commercial inference engines still requires significant effort. Modern AI infrastructure uses complex tricks like prefix caching and variable-length memory packing to keep servers running efficiently, and seamlessly weaving this new compaction technique into those existing systems will take dedicated engineering work. However, there are immediate enterprise applications. “We believe compaction after ingestion is a promising use case, where large tool call outputs or long documents are compacted right after being processed,” Zweiger said.

Ultimately, the shift toward mechanical, latent-space compaction aligns with the future product roadmaps of major AI players, Zweiger argues. “We are seeing compaction to shift from something enterprises implement themselves into something model providers ship,” Zweiger said. “This is even more true for latent compaction, where access to model weights is needed. For example, OpenAI now exposes a black-box compaction endpoint that returns an opaque object rather than a plain-text summary.”

Netflix’s version of Overcooked lets you play as Huntr/x

Netflix’s library of streamable party games is expanding today with a custom version of Overcooked! All You Can Eat. Netflix launched its cloud gaming program with games like Lego Party and Tetris Time Warp, but Overcooked stands out because it features a roster of Netflix-affiliated characters from KPop Demon Hunters and Stranger Things.

For the uninitiated, Overcooked plays like a more manic version of Diner Dash, where teams attempt to prepare food together in increasingly elaborate kitchens filled with obstacles. The original version of Overcooked! All You Can Eat was released in 2020, and includes DLC and stages from previous versions of the game. Netflix’s version bundles in the same content, and “10 Netflix celebrity chefs” including “Dustin, Eleven, Lucas, and the Demogorgon from Stranger Things,” and “half-dozen faces from KPop Demon Hunters,” like “Mira, Rumi, Zoey, Jinu, Derpy and Sussie.” Like Netflix’s other streaming games, playing Overcooked also requires you to use a connected smartphone as a controller.

Offering a growing library of streaming games is part of Netflix’s new strategy under Alan Tascan, a former executive from Epic Games. Tascan took over as Netflix’s President of Games in 2024, and appeared to start revamping the company’s plans not long after, cancelling the release of several mobile games and reportedly shutting down its AAA game studio. Netflix is also continuing to adapt video games into content for its platform. For example, A24 is reportedly developing a game show based on Overcooked for the streaming service.

Valve doesn’t sound confident the Steam Machine will ship in 2026

As part of a Year in Review blog detailing changes Valve made to Steam in 2025, the company shared a minor update on its hardware plans that doesn’t sound good for anyone hoping to buy a Steam Machine, Steam Controller or Steam Frame in 2026. Specifically, the company is now opening up the possibility its new hardware won’t ship this year at all.

In February, when Valve acknowledged the ongoing memory and storage shortage had delayed the launch of its hardware and could lead to higher prices, the company was still committing to a (fairly wide) window of when its hardware would ship:

“Our goal of shipping all three products in the first half of the year has not changed. But we have work to do to land on concrete pricing and launch dates that we can confidently announce, being mindful of how quickly the circumstances around both of those things can change.”

As of the company’s latest post, however, things somehow sound even less certain. “We hope to ship in 2026, but as we shared recently, memory and storage shortages have created challenges for us,” Valve wrote in its Year in Review post. “We’ll share updates publicly when we finalize our plans!”

While Valve’s air of secrecy can make it easy to read too much into the limited information the company does share, moving from “the first half of the year” to “[hoping] to ship in 2026” certainly gives it wiggle room to not release new hardware this year. And considering the difficulties other companies are facing sourcing memory and storage, it wouldn’t be all that surprising.

HP said in February that RAM accounts for a third of its PC costs, and industry analysts warn the RAM shortage could radically alter the PC landscape as companies are forced to raise prices. Valve is already struggling to keep the Steam Deck in stock due to its issues securing RAM, so it stands to reason that sourcing components for even more devices wouldn’t make that process any easier. Then again, the company hasn’t updated its launch timing FAQ, so there’s still reason to hope the Steam Machine ships in 2026.


One Sailing Pulley To Rule Them All

When thinking of humanity’s ability to harness wind energy, many people will conjure images of windmills from places like The Netherlands or Persia. But people have been using wind energy for far longer than that in the form of sailing ships. Using the wind for transportation goes back another four thousand years or so, but despite our vast experience navigating the seas with wind alone there is still some room for improvement. Many modern sailboats use a number of different pulleys to manage all of the rigging, but this new, open-source pulley can replace many of them.

The pulley, or “block” as it is sometimes called, is built with a polymer roller made out of a type of nylon, which has the benefit of being extremely durable and self-lubricating but is a bit expensive. Durability and a lack of squeakiness are important in sailing applications, though. The body is made from CNC-machined aluminum and is composed of two parts, which pivot around the pulley’s axis to allow various ropes (or “lines”) to be inserted without freeing one end of the rope. In testing, this design outperformed some proprietary stainless steel pulleys of similar size.

Another perk of this design is that it can be set up to work in many different applications on a sailboat, whether that’s for hoisting a mainsail or pulling in a jib or any other task a pulley could be used for. It can also be stacked with others in many different configurations to build custom pulleys of almost any type, and can support up to 14 mm lines. For a sailor this could be extremely valuable, because as it stands each pulley on a ship tends to be used in only certain applications, and might also be proprietary from a specific company. This pulley is being released into the open-source world, allowing anyone to create them who wants one.

Thanks to [Keith] for the tip!


Seagate is now shipping HAMR disk drives holding up to 44TB of data


Seagate introduced the Mozaic 3+ platform in 2024, turning the heat-assisted magnetic recording (HAMR) dream into a real product for customers in need of massive storage capacities. The HDD maker is now introducing the next-generation Mozaic 4+ drives, which offer capacities up to 44TB.

Apple thinks it can lure in the 'Apple curious' for $599

Apple has made it pretty clear that it wants to siphon off Android and Windows users, and it’s doing it by adopting an aggressive, “budget-friendlier” model across nearly its entire ecosystem.

Apple is using $599 devices to grow its ecosystem

When I first entered the Apple ecosystem, it was when I bought an iPhone 4 in 2011 — I got it right after the 4s made its debut. I don’t remember exactly what I paid, but I know it was less than the initial $199 price tag.
And back then, I thought that was a completely asinine amount of money to pay for a phone. Fortunately, or maybe unfortunately, I had more money in my pocket than brains in my head, so I bought it just the same.

Anthropic will fight US ‘supply chain risk’ designation in court

Published

on

Anthropic confirmed it has been designated a ‘supply chain risk’ by the US administration, and said it has no choice but to challenge the move in court.

Despite ongoing talks between Anthropic and the US Department of Defense, Anthropic confirmed last night it had received a letter from defense secretary Pete Hegseth confirming the ‘supply chain risk’ designation that had been threatened.

“Yesterday (March 4) Anthropic received a letter from the Department of [Defense] confirming that we have been designated as a supply chain risk to America’s national security,” wrote co-founder and CEO Dario Amodei last night in an official statement. “We do not believe this action is legally sound, and we see no choice but to challenge it in court.”

Amodei was quick to point out that “even supposing it was legally sound”, the limited application of the designation means the “vast majority” of its customers will be unaffected by the move. He said the restriction clearly only applied to the use of Claude by customers as a direct part of contracts with the US defense department, “not all use of Claude by customers who have such contracts”.

“The Department’s letter has a narrow scope, and this is because the relevant statute is narrow, too,” wrote Amodei. “It exists to protect the government rather than to punish a supplier.”

As with previous statements, Amodei strikes a conciliatory tone, saying Anthropic is committed to US national security and will offer continuing support from its engineers to ensure a smooth transition from Claude “for as long as we are permitted to do so”.

Anthropic drew the ire of the US administration after a standoff with the Pentagon, where Anthropic refused to change its safeguards related to using its AI for fully autonomous weapons, or for mass surveillance of US citizens.

Many in Silicon Valley have supported its relatively principled stand, and general users have sent it to the top of the US Apple charts for free downloads in recent days – beating OpenAI’s ChatGPT for the first time. Amid that surge, its flagship Claude.ai and Claude Code apps went down for around three hours on 2 March due to “unprecedented demand”.

Claude Cowork in particular was already becoming the darling of AI enthusiasts in the professional world, and Bloomberg reported on Tuesday that Anthropic was on track to generate annual revenue of almost $20bn, more than double its run rate from late 2025, signalling the rapid growth at the AI company which is today valued at around $380bn.


Tinder settles age discrimination lawsuit for $60 million, see if you qualify for a payout


According to the plaintiff, Tinder charged users aged 29 and older more for premium subscriptions such as Tinder Plus and Tinder Gold, while offering cheaper rates for the same services to users in their teens and 20s. The lawsuit claimed the tiered pricing model violated multiple California laws, including the…

Cognizant TriZetto breach exposes health data of 3.4 million patients

TriZetto Provider Solutions, a healthcare IT company that develops software and services used by health insurers and healthcare providers, has suffered a data breach that exposed the sensitive information of over 3.4 million people.

The firm, which has been operating under the Cognizant umbrella since 2014, disclosed that it detected suspicious activity on a web portal on October 2, 2025, and launched an investigation with the help of external cybersecurity experts.

The investigation revealed that unauthorized access began nearly a year before, on November 19, 2024.

During the exposure period, the threat actors accessed records relating to insurance eligibility verification transactions, which are part of the process providers use to confirm a patient’s insurance coverage before treatment.

Advertisement

The types of data that have been exposed vary per individual, and may include one or more of the following:

  • Full name
  • Physical address
  • Date of birth
  • Social Security number
  • Health insurance member number
  • Medicare beneficiary identifier
  • Provider name
  • Health insurer name
  • Demographic, health, and insurance information

Affected providers were alerted on December 9, 2025, but customer notification started in early February 2026. According to a filing submitted to Maine’s Attorney General today, the number of exposed individuals is 3,433,965.

TriZetto says that payment card, bank account, or other financial information was not exposed in this incident.

Also, the company is not aware of any cases where cybercriminals have attempted to misuse this information.

TriZetto says it has taken steps to strengthen cybersecurity on its systems and informed law enforcement authorities of the incident.

Notification recipients are offered free 12-month coverage of credit monitoring and identity protection services from Kroll to help mitigate risks arising from compromised data.

BleepingComputer has contacted TriZetto to learn more about the nature of the security breach and why the firm delayed notifications to consumers for several months, but we have not received a response by publication time.

No ransomware groups have taken responsibility for the attack yet, and no data leaks linked to TriZetto have appeared on underground forums.

Cognizant itself was rumored to have suffered a Maze ransomware breach in 2020. In June 2025, Clorox sued the IT firm for gross negligence after it allegedly let Scattered Spider operatives into its network following a social engineering attack in September 2023.


The remake of one of the best Assassin’s Creed games is actually happening

Ubisoft has finally confirmed what Assassin’s Creed fans have suspected for years: a remake of Assassin’s Creed IV: Black Flag is officially in the works.

The company revealed the project, titled Assassin’s Creed: Black Flag Resynced, in a new blog post outlining the future of the long-running series.

We don’t know much about the game yet, but initial reports suggest that Resynced will be a full remake rather than a simple remaster, with upgraded visuals and gameplay improvements, bringing one of the best AC games into the modern age.

It’s also suggested that new story content will be added to flesh out the world around Edward Kenway’s life – at the expense of the modern-day gameplay, which has apparently been removed from the remake altogether. It’ll be interesting to see how this all works, given how the original game wove parts of both storylines into the ending.

We’ve known for quite some time that Ubisoft has been thinking about breathing life into the 2013 game, but this was more or less confirmed when the name surfaced on a European ratings board listing late last year.

We don’t yet have a release date for the game, but we know that an unannounced game was due to arrive before the end of the current financial year. Of course, Ubisoft delayed seven games earlier this year – and Black Flag is expected to be one of them.

Whether or not we see the game before the end of 2026 remains to be seen, but for now we’ll keep our “spyglass on the horizon”.


Fully charged: Meet the local leader energizing the Pacific Northwest battery boom

Grayson Shor, far right, at a recent Pacific Northwest Battery Collaborative meetup at a Seattle brewery on Capitol Hill. Shor launched the organization to help the sector build connections. (PNWBC Photo)

Grayson Shor, founder and executive director of the Pacific Northwest Battery Collaborative, is the driving force that’s uniting and energizing the region’s battery community.

The collaborative’s launch in October 2024 was so popular it ran out of chairs and the group now caps RSVPs because venues keep maxing out. The nonprofit has hosted 1,400 attendees at 17 different events in Washington, Oregon and online. Shor’s latest project is helping create a battery-focused mini-series he describes as a hybrid between Anthony Bourdain’s “Parts Unknown” and “Cosmos.”

Who knew that energy storage devices could generate so much enthusiasm?

“Batteries are sexy right now,” Shor said.

Batteries are making electric vehicle adoption more attractive as they’ve become increasingly powerful and quicker to recharge. They’re ubiquitous given the pervasive use of phones and consumer electronics. And as electricity demand is spiking thanks to data centers and other energy users, they’re a relatively quick, affordable way to add more power to the grid.

“We are installing more grid batteries in 2025 than the total amount that existed globally just two years ago,” Shor said. “This isn’t just growth, it’s a total reimagining of how our economy is powered.”

A battery ecosystem emerges

Part of the crowd at the Pacific Northwest Battery Collaborative launch party, with founder Grayson Shor in the front row in a tie. (PNWBC Photo)

Shor has spent nearly a decade working on sustainability, circular economy and battery-related issues for organizations ranging from the U.S. Department of State to Amazon to startups. When the former diplomat landed in Seattle from the other Washington more than two years ago, he was impressed by the region’s battery sector.

That included startups in electric aviation, alternative chemistries such as sodium batteries, and next-generation silicon battery materials, plus R&D resources and support at the University of Washington’s Clean Energy Institute.

But he realized the industry lacked the connections to bring together companies, academics, entrepreneurs and investors, and set out to address it. The sector welcomes his efforts.

“I’ve paid attention to folks trying to knit together community, and for the Northwest battery innovation and application ecosystem, Grayson Shor has been an unrelenting force seeking to build and amplify our unique strengths,” said Dan Schwartz, founding director of the Clean Energy Institute.

Tom Gurski, founder of the plug-in hybrid vehicle startup Blue Dot Motorworks, has attended the group’s functions. “In a region famous for introverted personalities, their events and happy hours are invaluable for breaking down silos and getting people to connect,” Gurski said.

Beyond building community, Shor is lobbying for support for local and state policies that promote the industry and get more batteries deployed in the state. The energy storage devices have important societal benefits, he said, including better electrical grid performance and helping meet power needs during peak demand.

‘The Battery Life’

Shor speaking at a Pacific Northwest Battery Collaborative event in Seattle during 2025 PNW Climate Week. (PNWBC Photo)

Shor is also the co-founder and chief product officer for Buckstop, an “urban mining” startup helping recover critical minerals from waste electronics. He also volunteers as the policy and government affairs director for the Volta Foundation, the world’s largest battery industry association.

And there’s the TV series, called “The Battery Life.” Crews recently spent three days in the Seattle area filming the first episode, visiting the battery materials company Group14 Technologies and interviewing startups at the UW’s Clean Energy Test Beds.

“We’re doing walks through factories. We’re meeting with the CEOs and the inventors, diving deep into their technology,” Shor said. But the series also has “the ‘Carl Sagan vibe,’” he added, explaining “how does this technology actually impact humanity, and why does it matter to the average person?”

Additional episodes will be shot in Portland and Vancouver, B.C. The plan is to air the series later this year at energy events in Oregon and Las Vegas, plus other area venues.

Future Pacific Northwest Battery Collaborative plans include a job fair and fundraising gala. Shor also envisions a convention where the entrepreneurs and innovators could set up booths to show off their technologies. The ideas keep coming.

“This is playing my little role in trying to tackle climate change, to try to advance the energy transition,” he said. “It helps with equity, it helps with economic opportunity … It makes me happy.”
 
