Processing 200,000 tokens through a large language model is expensive and slow: the longer the context, the faster the costs spiral. Researchers at Tsinghua University and Z.ai have built a technique called IndexCache that cuts up to 75% of the redundant computation in sparse attention models, delivering up to 1.82x faster time-to-first-token and 1.48x faster generation throughput at that context length.
The technique applies to models using the DeepSeek Sparse Attention architecture, including the latest DeepSeek and GLM families. It can help enterprises provide faster user experiences for production-scale, long-context models, a capability already proven in preliminary tests on the 744-billion-parameter GLM-5 model.
The DSA bottleneck
Large language models rely on the self-attention mechanism, a process where the model computes the relationship between every token in its context and all the preceding ones to predict the next token.
However, self-attention has a severe limitation. Its computational complexity scales quadratically with sequence length. For applications requiring extended context windows (e.g., large document processing, multi-step agentic workflows, or long chain-of-thought reasoning), this quadratic scaling leads to sluggish inference speeds and significant compute and memory costs.
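The quadratic blow-up is easy to see with a little arithmetic (a toy illustration of the scaling, not a measurement from the paper):

```python
def attention_pair_count(n_tokens: int) -> int:
    """Full self-attention relates each token to itself and all preceding
    tokens, so the number of query-key pairs is n*(n+1)/2, i.e. O(n^2)."""
    return n_tokens * (n_tokens + 1) // 2

for n in (1_000, 10_000, 100_000):
    print(n, attention_pair_count(n))
# a 10x longer context means roughly 100x more pairs to score
```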
Sparse attention offers a principled solution to this scaling problem. Instead of calculating the relationship between every token and all preceding ones, sparse attention optimizes the process by having each query select and attend to only the most relevant subset of tokens.
DeepSeek Sparse Attention (DSA) is a highly efficient implementation of this concept, first introduced in DeepSeek-V3.2. To determine which tokens matter most, DSA introduces a lightweight “lightning indexer module” at every layer of the model. This indexer scores all preceding tokens and selects a small batch for the main core attention mechanism to process. By doing this, DSA slashes the heavy core attention computation from quadratic to linear, dramatically speeding up the model while preserving output quality.
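As a rough sketch, the selection step amounts to a top-k over indexer scores. The scoring itself is the lightning indexer's job and is not reproduced here; the function below is an illustrative assumption, not the authors' code:

```python
import numpy as np

def sparse_attention_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Given indexer scores for all preceding tokens, keep only the
    positions of the k highest-scoring ones for core attention."""
    if scores.size <= k:
        return np.arange(scores.size)
    # argpartition finds the k largest scores in O(n) time
    top = np.argpartition(scores, -k)[-k:]
    return np.sort(top)  # attention kernels expect sorted positions

scores = np.array([0.1, 0.9, 0.3, 0.8, 0.05, 0.7])
print(sparse_attention_indices(scores, k=3))  # positions 1, 3 and 5
```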
But the researchers identified a lingering flaw: the DSA indexer itself still operates at a quadratic complexity at every single layer. Even though the indexer is computationally cheaper than the main attention process, as context lengths grow, the time the model spends running these indexers skyrockets. This severely slows down the model, especially during the initial “prefill” stage where the prompt is first processed.
The DSA indexing tax increases with context length (source: arXiv)
Caching attention with IndexCache
To solve the indexer bottleneck, the research team discovered a crucial characteristic of how DSA models process data. The subset of important tokens an indexer selects remains remarkably stable as data moves through consecutive transformer layers. Empirical tests on DSA models revealed that adjacent layers share between 70% and 100% of their selected tokens.
To capitalize on this cross-layer redundancy, the researchers developed IndexCache. The technique partitions the model’s layers into two categories. A small number of full (F) layers retain their indexers, actively scoring the tokens and choosing the most important ones to cache. The rest of the layers become shared (S), performing no indexing and reusing the cached indices from the nearest preceding F layer.
IndexCache splits layers into full and shared layers
During inference, the model simply checks the layer type. If it reaches an F layer, it calculates and caches fresh indices. If it is an S layer, it skips the math and copies the cached data.
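A minimal Python sketch of that routing, with `compute_indices` standing in for the real indexer (the names and cache shape here are illustrative assumptions, not the authors' code):

```python
def get_indices(layer_id, layer_type, index_cache, compute_indices):
    """F layers compute and cache fresh token indices; S layers skip the
    indexer and reuse the cache from the nearest preceding F layer."""
    if layer_type[layer_id] == "F":
        index_cache["latest"] = compute_indices(layer_id)
    return index_cache["latest"]

layer_type = {0: "F", 1: "S", 2: "S", 3: "F"}
cache = {}
fake_indexer = lambda lid: [lid, lid + 1]  # stand-in for the real indexer
trace = [get_indices(i, layer_type, cache, fake_indexer) for i in range(4)]
print(trace)  # layers 1 and 2 reuse layer 0's cached indices
```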
There is a wide range of optimization techniques that try to address the attention bottleneck by compressing the KV cache, where the computed attention values are stored. Instead of shrinking the memory footprint like standard KV cache compression, IndexCache attacks the compute bottleneck.
“IndexCache is not a traditional KV cache compression or sharing technique,” Yushi Bai, co-author of the paper, told VentureBeat. “It eliminates this redundancy by reusing indices across layers, thereby reducing computation rather than just memory footprint. It is complementary to existing approaches and can be combined with them.”
The researchers developed two deployment approaches for IndexCache. (It is worth noting that IndexCache only applies to models that use the DSA architecture, such as the latest DeepSeek models and the latest family of GLM models.)
For developers working with off-the-shelf DSA models where retraining is unfeasible or too expensive, they created a training-free method relying on a “greedy layer selection” algorithm. By running a small calibration dataset through the model, this algorithm automatically determines the optimal placement of F and S layers without any weight updates. Empirical evidence shows that the greedy algorithm can safely remove 75% of the indexers while matching the downstream performance of the original model.
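A sketch of how such a greedy search could work, assuming the calibration run can be summarized as a scalar quality metric (the toy metric and function names below are illustrative, not from the paper):

```python
def greedy_select_full_layers(num_layers, keep, quality_of):
    """Start with every layer as a full (F) layer, then repeatedly demote
    to shared (S) the layer whose demotion hurts the calibration metric
    least, until only `keep` indexers remain."""
    full = set(range(num_layers))
    while len(full) > keep:
        victim = max(full, key=lambda layer: quality_of(full - {layer}))
        full.remove(victim)
    return sorted(full)

# Toy metric: pretend even-numbered indexers contribute twice as much
toy_quality = lambda layers: sum(2 if l % 2 == 0 else 1 for l in layers)
print(greedy_select_full_layers(8, keep=2, quality_of=toy_quality))
```

In a real run, `quality_of` would mean re-scoring the calibration set with that F/S assignment, which is why a small, representative calibration dataset matters.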
For teams pre-training or heavily fine-tuning their own foundation models, the researchers propose a training-aware version that optimizes the network parameters to natively support cross-layer sharing. This approach introduces a “multi-layer distillation loss” during training. It forces each retained indexer to learn how to select a consensus subset of tokens that will be highly relevant for all the subsequent layers it serves.
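The paper's exact loss is not spelled out here, but one plausible shape for it, sketched with NumPy, is a cross-entropy pulling the retained indexer's score distribution toward those of the layers it serves (every name and the averaging scheme below are assumptions for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_layer_distill_loss(f_scores, served_scores_list):
    """Cross-entropy from each served layer's score distribution (teacher)
    to the retained F-layer's distribution (student), averaged over all
    layers the F layer serves."""
    p = softmax(f_scores)
    losses = []
    for t_scores in served_scores_list:
        q = softmax(t_scores)
        losses.append(float(-(q * np.log(p + 1e-9)).sum()))
    return sum(losses) / len(losses)

f_layer = np.array([2.0, 0.5, -1.0])
served = [np.array([1.8, 0.4, -0.9]), np.array([2.2, 0.6, -1.1])]
print(multi_layer_distill_loss(f_layer, served))
```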
Real-world speedups on production models
To test the impact of IndexCache, the researchers applied it to the 30-billion-parameter GLM-4.7 Flash model and compared it against the standard baseline.
At a 200K context length, removing 75% of the indexers slashed the prefill latency from 19.5 seconds down to just 10.7 seconds, delivering a 1.82x speedup. The researchers note these speedups are expected to be even greater in longer contexts.
During the decoding phase, where the model generates its response, IndexCache boosted per-request throughput from 58 tokens per second to 86 tokens per second at the 200K context mark, yielding a 1.48x speedup. When the server’s memory is fully saturated with requests, total decode throughput jumped by up to 51%.
IndexCache speeds up the prefill and decode stages significantly (source: arXiv)
For enterprise teams, these efficiency gains translate directly into cost savings. “In terms of ROI, IndexCache provides consistent benefits across scenarios, but the gains are most noticeable in long-context workloads such as RAG, document analysis, and agentic pipelines,” Bai said. “In these cases, we observe at least an approximate 20% reduction in deployment cost and similar improvements in user-perceived latency.” He added that for very short-context tasks, the benefits hover around 5%.
Remarkably, these efficiency gains did not compromise reasoning capabilities. Using the training-free approach to eliminate 75% of indexers, the 30B model matched the original baseline’s average score on long-context benchmarks, scoring 49.9 against the original 50.2. On the highly complex AIME 2025 math reasoning benchmark, the optimized model actually outperformed the original baseline, scoring 92.6 compared to 91.0.
The team also ran preliminary experiments on the production-scale 744-billion-parameter GLM-5 model. They found that eliminating 75% of its indexers with the training-free method yielded at least a 1.3x speedup on contexts over 100K tokens. At the same time, the model maintained a nearly identical quality average on long-context tasks.
IndexCache increases the speed of GLM-5 by 20% while maintaining the accuracy (source: arXiv)
Getting IndexCache into production
For development teams wanting to implement the training-free approach today, the process is straightforward but requires careful setup. While the greedy search algorithm automatically finds the optimal layer configuration, the quality of that configuration depends on the data it processes.
“We recommend using domain-specific data as a calibration set so that the discovered layer-sharing pattern aligns with real workloads,” Bai said.
Once calibrated, the optimization is highly accessible for production environments. Open-source patches are already available on GitHub for major serving engines. “Integration is relatively straightforward — developers can apply the patch to existing inference stacks, such as vLLM or SGLang, and enable IndexCache with minimal configuration changes,” Bai said.
While IndexCache provides an immediate fix for today’s compute bottlenecks, its underlying philosophy points to a broader shift in how the AI industry will approach model design.
“Future foundation models will likely be architected with downstream inference constraints in mind from the beginning,” Bai concluded. “This means designs that are not only scalable in terms of model size, but also optimized for real-world throughput and latency, rather than treating these as post-hoc concerns.”
European Commission President Ursula von der Leyen says the app is technically ready and will be available to citizens soon.
The European Commission yesterday (15 April) unveiled a digital age verification app aimed at shielding children from harmful content online, with European Commission president Ursula von der Leyen declaring there are “no more excuses” for platforms that fail to act.
Announcing the tool in Brussels on Wednesday (15 April), von der Leyen painted a stark picture of the risks children face in the digital world. “One child in six is bullied online. One child in eight is bullying another child online,” she said, warning that social media platforms use “highly addictive designs” that damage young minds and leave children vulnerable to predators.
Users set up the app using a passport or ID card, after which they can confirm their age anonymously. The free app, which the Commission says is technically ready and will soon be available to citizens, allows users to verify their age when accessing online platforms “without revealing any other personal data”, according to von der Leyen. “Users cannot be tracked,” von der Leyen stressed, adding that the app is fully open source and compatible with any device.
Drawing a comparison with the EU’s Covid certificate – adopted in record time and used across 78 countries – von der Leyen said the age verification tool follows “the same principles, the same model.” Seven member states, including France, Italy, Spain and Ireland, are already planning to integrate the app into their national digital wallets.
The announcement comes ahead of the second meeting of the Commission’s Special Panel on Children’s Safety Online, which is due to deliver its recommendations by summer. Von der Leyen was unambiguous about the Commission’s direction of travel on enforcement. “Children’s rights in the European Union come before commercial interest. And we will make sure they do.”
Platforms were put on notice that voluntary compliance alone will not suffice. “We will have zero tolerance for companies that do not respect our children’s rights,” she said, adding that the Commission is “moving ahead with full speed and determination on the enforcement of our European rules”.
Consumer Intelligence Research Partners estimates the Mac Mini accounted for roughly 3% of Apple’s US Mac unit sales last year. That position has shifted quickly.
Jeff Bezos’ space company Blue Origin successfully re-used one of its New Glenn rockets for the first time ever on Sunday, but the company failed at its primary mission: delivering a communications satellite to orbit for customer AST SpaceMobile.
AST SpaceMobile issued a statement Sunday afternoon that the upper stage of the New Glenn rocket placed the BlueBird 7 satellite into an orbit that was “lower than planned.” The satellite successfully separated from the rocket and powered on, the company said, but the altitude is too low “to sustain operations” and it will now have to be de-orbited — left to burn up in Earth’s atmosphere.
The cost of the lost satellite is covered by AST SpaceMobile’s insurance policy, according to the company, and successor BlueBird satellites will be completed in around a month. AST SpaceMobile has launch contracts beyond Blue Origin, and the company said it expects to be able to launch 45 more satellites by the end of 2026.
But this represents the first major failure for Blue Origin’s New Glenn program, which only made its first flight in January 2025 after more than a decade in development. This was the second mission where New Glenn carried a customer payload to space, after launching twin spacecraft bound for Mars on behalf of NASA last November. The company did not immediately respond to a request for comment.
The apparent failure of New Glenn’s second stage could have wider implications beyond Blue Origin’s near-term commercial ambitions. The company is pushing hard to become one of the main launch providers for NASA’s Artemis missions to the moon and beyond. The space agency — and the Trump administration — has put pressure on Blue Origin and SpaceX to be able to put landers on the moon by the end of President Donald Trump’s second term, before advancing to returning humans to the lunar surface.
Blue Origin CEO Dave Limp has even said his company “will move heaven and Earth” to help NASA get back to the moon faster.
Blue Origin recently completed testing its first version of its own lunar lander, which the company is expected to try and launch at some point this year (without any crew). Blue Origin had suggested last year that it was considering launching this lander on New Glenn’s third mission, but ultimately decided to launch the AST SpaceMobile satellite instead.
The third New Glenn launch seemed to start just fine on Sunday, with the mega-rocket lifting off at 7:35 a.m. local time from Cape Canaveral, Florida. It was the first time Blue Origin re-used a previously flown New Glenn booster — the same one that flew during New Glenn’s second mission. Roughly 10 minutes after liftoff, the booster came back down and landed on a drone ship in the ocean, just as it had last November. Jeff Bezos even shared drone footage of the booster’s landing on X, the social media site owned by his rival Elon Musk. (Musk offered congratulations.)
Roughly two hours after the launch, though, Blue Origin announced in its own post that the New Glenn upper stage placed the AST SpaceMobile satellite in an “off-nominal orbit.” The company has not released any more information since that post.
Blue Origin spent a long time developing New Glenn, and it has been taken as a sign of confidence in that process that the company decided to start launching commercial payloads during these early missions. By comparison, SpaceX has spent the last few years flying test versions of its massive Starship, but has stuck with using dummy payloads as it works out the rocket’s kinks.
SpaceX did lose payloads deeper into its Falcon 9 program. In 2015, on the 19th Falcon 9 mission, the rocket blew up mid-flight and lost an entire International Space Station cargo spacecraft. In 2016, a Falcon 9 exploded on the launch pad during testing, causing the loss of an internet satellite for Meta.
A new NYT Connections puzzle appears at midnight each day for your time zone – which means that some people are always playing ‘today’s game’ while others are playing ‘yesterday’s’. If you’re looking for Sunday’s puzzle instead then click here: NYT Connections hints and answers for Sunday, April 19 (game #1043).
Good morning! Let’s play Connections, the NYT’s clever word game that challenges you to group answers in various categories. It can be tough, so read on if you need Connections hints.
What should you do once you’ve finished? Why, play some more word games of course. I’ve also got daily Strands hints and answers and Quordle hints and answers articles if you need help for those too, while Marc’s Wordle today page covers the original viral word game.
SPOILER WARNING: Information about NYT Connections today is below, so don’t read on if you don’t want to know the answers.
NYT Connections today (game #1044) – today’s words
(Image credit: New York Times)
Today’s NYT Connections words are…
CYBER
HOURGLASS
BLUE
CLOUD
ROD
WEB
MANIC
NET
PUFF
VENOM
HOOK
BILLOW
BAIT
PLUME
MEATLESS
CANNIBALISM
NYT Connections today (game #1044) – hint #1 – group hints
What are some clues for today’s NYT Connections groups?
YELLOW: A bunch of fumes
GREEN: Used for angling
BLUE: Linked to an infamous arachnid
PURPLE: Start the week
Need more clues?
We’re firmly in spoiler territory now, but read on if you want to know what the four theme answers are for today’s NYT Connections puzzles…
NYT Connections today (game #1044) – hint #2 – group answers
What are the answers for today’s NYT Connections groups?
YELLOW: MASS OF SMOKE
GREEN: FISHING GEAR
BLUE: ASSOCIATED WITH BLACK WIDOW SPIDERS
PURPLE: _____ MONDAY
Right, the answers are below, so DO NOT SCROLL ANY FURTHER IF YOU DON’T WANT TO SEE THEM.
NYT Connections today (game #1044) – the answers
(Image credit: New York Times)
The answers to today’s Connections, game #1044, are…
YELLOW (MASS OF SMOKE): BILLOW, CLOUD, PLUME, PUFF
GREEN (FISHING GEAR): BAIT, HOOK, NET, ROD
BLUE (ASSOCIATED WITH BLACK WIDOW SPIDERS): CANNIBALISM, HOURGLASS, VENOM, WEB
PURPLE (_____ MONDAY): BLUE, CYBER, MANIC, MEATLESS
My rating: Easy
My score: Perfect
A bit of music knowledge got me to my 33rd “Purple First”, thanks to Blue Monday by New Order and Prince’s Manic Monday, made famous by The Bangles. CYBER I was confident about, but MEATLESS I went with purely because of the alliteration.
This actually seemed the easiest group, not that I’m complaining.
Elsewhere, nature knowledge may have helped me get ASSOCIATED WITH BLACK WIDOW SPIDERS, but I spotted the more obvious yellow and green groups first.
Yesterday’s NYT Connections answers (Sunday, April 19, game #1043)
BLUE (CARDS IN TEXAS HOLD ‘EM): FLOP, HOLE, RIVER, TURN
PURPLE (LAST WORDS OF CANDY BRANDS IN THE SINGULAR): CAP, DUD, KID, MINT
What is NYT Connections?
NYT Connections is one of several increasingly popular word games made by the New York Times. It challenges you to find groups of four items that share something in common, and each group has a different difficulty level: yellow is easy, green a little harder, blue often quite tough and purple usually very difficult.
On the plus side, you don’t technically need to solve the final one, as you’ll be able to answer that one by a process of elimination. What’s more, you can make up to four mistakes, which gives you a little bit of breathing room.
It’s a little more involved than something like Wordle, however, and there are plenty of opportunities for the game to trip you up. For instance, watch out for homophones and other wordplay that could disguise the answers.
It’s playable for free via the NYT Games site on desktop or mobile.
There are many drivers who often bemoan the very existence of traffic lights. Despite incurring the daily ire of commuters who are running late for work, even those haters have to acknowledge the traffic signal’s invaluable function in helping to keep our roadways safe.
Traffic signals have, of course, evolved considerably since they were first pressed into use in the late-1860s, with the first electric lights coming into play sometime around 1912. It wasn’t long until those signals started using colored lights, and have since evolved into the red, yellow, and green modes we are all too familiar with today. Even as safety remains the primary purpose of the hundreds of thousands of traffic lights currently employed throughout the United States, some theorize that the life-saving devices may one day cease to exist.
Until that fateful day, getting stuck at red lights when you’re in a rush will remain a constant source of commuter frustration. On some occasions, however, a stream of greens opens up on the road ahead like the parting of the Red Sea. That stream of green has a name, with researchers dubbing it the “Green Wave.” While they may seem rare, the “Green Wave” is a common occurrence in certain parts of the world, and it serves a very important purpose.
What is the purpose of a traffic light Green Wave?
Brasil2/Getty Images
While it might seem like a weird sort of karmic intervention, that “Green Wave” of traffic lights was actually programmed for a specific purpose by whatever government organization is in charge of maintaining the traffic signals in your city, state or township. They are, however, far more commonly utilized on high-volume roads in urban areas. The purpose of a “Green Wave” is to improve the flow of traffic in those areas, particularly during times with increased traffic volume.
At its core, the concept is simple: keep traffic flowing during peak volume by reducing the number of stops at consecutive traffic signals. To enact a “Green Wave,” planners and engineers synchronize the traffic lights along a congested corridor so that each turns green as a car traveling at the design speed arrives, sustaining a steady flow in one direction. The method is, naturally, easier to manage on one-way streets with no turning lanes, though some cities have attempted to aid traffic flow further by outlawing left turns in metropolitan areas. Some have even taken to banning right turns too.
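The timing itself is travel-time arithmetic: each downstream signal's green phase is offset by the time a car needs to cover the distance at the corridor's design speed. A minimal sketch:

```python
def green_wave_offsets(distances_m, speed_kmh):
    """Green-wave timing sketch: each signal's green phase starts later
    than the first by the travel time at the design speed, so a car
    holding that speed meets every light on green."""
    speed_ms = speed_kmh / 3.6  # km/h -> m/s
    return [round(d / speed_ms, 1) for d in distances_m]

# Signals at 0 m, 300 m and 600 m along a corridor, timed for 50 km/h
print(green_wave_offsets([0, 300, 600], 50))  # offsets in seconds
```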
In any case, on top of aiding the flow of traffic in congested areas, “Green Wave” traffic patterns are also believed to have a positive effect on the environment. After all, the reduction in stop-and-go traffic also reduces a vehicle’s idling time, which, in turn, leads to reduced greenhouse gas emissions.
Digit is seen performing deadlifts with a 65-pound weight in the center of a lab. Agility Robotics shared the video a few days ago, and the robot maintains a fairly steady balance and completes the task from beginning to end. One commenter mentions that the new version can lift significantly more weight than the previous one, while another jokes that it can run all day without stopping.
The engineers designed the test so that Digit had to work harder than usual. Every additional pound forces the robot to adjust its entire body simultaneously: arms, legs, torso, everything. The system must keep the weight centered and avoid tipping over, so the legs, arms, and the rest of the robot must all work together, and the actuators and joints must withstand repeated loads without breaking down. The video simply shows the robot grasping the weight, standing up, then smoothly setting it down again, over and over, in a standard indoor space built for people.
Simulation is where all of the training takes place: before the robot touches a real weight, engineers create a digital copy of the task in a virtual world and anticipate what will happen when the weight shifts. The grip pressure remains constant, with no slipping or lowering, and any change to the robot’s equilibrium is registered almost instantly. The policy learns the lift in the simulated environment, free of real-world complications, before being transferred directly to the real robot. When you watch the real robot perform it, the motion looks natural because the policy has already handled every potential variation thousands of times in simulation.
Engineers chose deadlifts for the test because the movement requires complete body control. A simple arm raise would not put the hardware under the same level of stress. By incorporating weight into the simulation loop, the team is able to handle balancing changes that a pre-programmed script cannot handle alone. As a result, Digit lifts consistently, with no wobbling or resets. This method is easily adaptable to other objects or larger loads in future tests.
Digit was built by Agility to manage long, repetitive jobs that wear people out, such as working in factories or warehouses where you must squeeze into tight spaces, pick up oddly shaped goods, and continue without taking a break. This deadlift test demonstrates Digit’s ability to lift weight on ordinary floors while remaining steady, which is ideal for picking up boxes, carrying tools, and stacking things in human-designed places.
It also illustrates how far they’ve come in teaching robots to perform physical tasks. Whole-body coordination was originally a nightmare, with hand-tuned code for each joint angle. But now they can simply train a policy in simulation that adapts on the go. Digit detects the weight using its sensors, corrects itself in real time, and completes the lift without assistance, while the hardware keeps up because training has already pushed the actuators and joints to handle the load.
In October and through November, America’s EV sales reached their lowest point since 2022 after government subsidies expired, remembers Time. “But first-quarter data for 2026 shows that used EV sales were 12% higher than the same time last year and 17% higher than the previous quarter.
“One factor likely helping push buyers toward these cars is high gas prices, which recently topped $4.00 a gallon for the first time in four years,” they write — but it’s not just in the U.S. Instead, they argue the conflict “is driving a global surge of interest in electric vehicles…”
In the U.K., electric car sales reached a record high, with 86,120 vehicles sold in March… The French online used-car retailer Aramisauto reported its share of EV sales nearly doubled from February 16 to March 9, rising to 12.7% from 6.5%, while sales of fueled models dropped to 28% of sales from 34%, and sales of diesel models dropped to 10% from 14%. Germany’s largest online car market, mobile.de, told Reuters that the share of EV searches on its website has tripled since the start of March — from 12% to 36%, with car dealers receiving 66% more enquiries for used EVs than in February.
South Korea reported that registrations for electric vehicles more than doubled in March compared to the prior year, due in part to rising fuel prices and government subsidies… In New Zealand, more than 1,000 EVs were registered in the week that ended on March 22, close to double the week before, making it the country’s biggest week for electric vehicle registrations since the end of 2023, according to the country’s Transport Minister, Chris Bishop.
In America, Bloomberg also reports 605 high-speed EV charging stations switched on in just the first three months of 2025, “a 34% increase over the year-earlier period,” according to their analysis of federal data. A data platform focused on EV infrastructure tells Bloomberg that speedier and more reliable chargers are convincing more drivers to go electric and use public plugs.
Most loudspeaker designers don’t spend much time debating open versus closed the way headphone enthusiasts do. Cabinets are part of the equation for a reason, offering control, efficiency, and predictable performance. That’s the accepted playbook. But like any good rule in audio, someone is always trying to break it.
At AXPONA 2026, La Dolce Audio showed what happens when you ignore that playbook and lean into experimentation. Founder Terry Gesualdo isn’t approaching amplification or speaker design from a traditional standpoint; he’s part of a growing group of builders exploring open designs and current-drive amplification as an alternative to the usual voltage-driven norm.
I met Gesualdo on the shuttle ride over to the show, which feels about right. This isn’t a polished, corporate origin story; it’s the familiar path of someone who started by modifying gear, then building his own tube amps for himself, then for friends and family. The difference here is that he didn’t stop at tweaking circuits. He kept pushing until the results looked and sounded like something entirely his own.
Current Drive Tube Amplification: Why La Dolce Audio Isn’t Following the Script
Having built a few tube amps, I’m always curious to see what others are doing, and Terry Gesualdo is not following the usual path. Most of his designs are single-ended pentode circuits, not triodes, and not push-pull designs chasing more voltage swing. That choice alone puts him in a different lane than a lot of tube builders.
Where things really diverge is the move to current drive. Most amplifiers are voltage driven. That’s the standard approach across both solid state and tube designs. Current drive shows up more often inside DACs where signal levels are extremely small, and occasionally in headphone amplifiers, but rarely in loudspeaker systems where current demands are far higher.
The idea behind current drive is fairly straightforward. By controlling current instead of voltage, the amplifier reduces the impact of back EMF from the driver. That back EMF is the voice coil behaving like a generator as it moves through the magnetic field, feeding energy back into the amplifier. Reduce that interaction and, in theory, you reduce distortion and improve control over the driver.
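The size of that feedback follows from the motor equation: a moving voice coil generates a voltage equal to the motor strength Bl times the cone velocity. A quick illustration with typical, assumed woofer numbers:

```python
def back_emf_volts(bl_tm: float, cone_velocity_ms: float) -> float:
    """Back EMF sketch: V = Bl * v, where Bl is the motor strength in
    tesla-meters and v is the cone velocity in m/s. A voltage-drive amp
    must fight this voltage; a current-drive amp regulates current
    directly, reducing the interaction."""
    return bl_tm * cone_velocity_ms

# Assumed figures: Bl ~ 7 T*m, peak cone velocity ~ 0.5 m/s
print(back_emf_volts(7.0, 0.5))  # -> 3.5 volts fed back toward the amp
```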
It’s not a new concept, but it’s one that almost nobody is applying to loudspeakers in this way, especially with tube amplification. That’s what makes what La Dolce Audio is doing worth paying attention to.
Control Over Harmonics Instead of Chasing Purity
Circling back to that idea of ignoring the usual playbook, another aspect that reinforces La Dolce Audio’s different path is the near-exclusive use of pentode tubes instead of the more common triodes. Triodes are the simplest form of amplification, with three active elements: anode, cathode, and grid. Fewer parts in the signal path is why many listeners and designers gravitate toward them; the assumption is that less complexity means lower distortion and fewer unwanted artifacts.
But that’s only part of the story. Harmonic distortion doesn’t disappear just because the circuit is simpler. It just changes character. And not all harmonics are a problem. A lot of what people describe as tube warmth comes from second and third order harmonics, which many listeners actually prefer.
Terry Gesualdo leans into that reality rather than trying to avoid it. By using pentodes, which add additional control elements beyond what a triode offers, he can shape those harmonic structures instead of accepting whatever the circuit gives him. That includes adjusting the balance between second and third order harmonics and even their phase relationships.
It’s a different mindset. Instead of chasing the lowest possible distortion number, the goal is control over how that distortion presents itself, and giving the listener a way to fine tune the result.
Some will find that approach a bit sacrilegious. There’s a large part of the hobby focused on removing as much of this behavior as possible, chasing lower distortion numbers and cleaner measurements. That’s not the goal here.
La Dolce Audio leans into a different philosophy. “If it sounds good, do it” is more than a slogan. It reflects the idea that listening is subjective and that not every system needs to be locked into a single interpretation of neutrality. By giving users control over harmonic structure, the design puts some of that decision making back in the listener’s hands.
UA2.5 and UA2.5M: Modular Power and User Tunability
La Dolce Audio UA2.5M monoblock
La Dolce Audio offers two amplifier paths built around the same core ideas but with different roles. The UA2.5 is a dual channel amplifier rated at roughly 3 to 5 watts depending on tube selection, and it’s where most of the flexibility lives. With 24 possible sound signatures, it gives the user direct control over how the amplifier presents harmonic content and overall character.
The UA2.5M monoblocks step things up in output, delivering around 9 watts per channel, but they take a more focused approach. They are designed to be paired with the UA2.5, which handles preamp duties and sound shaping. As a result, the monoblocks do not include the same tuning controls, focusing instead on providing additional power while maintaining the same underlying design philosophy.
HPA2.3 Headphone Adapter
La Dolce Audio UA2.5 Tube Amplifier (top) with HPA2.3 Headphone Adapter (bottom)
Alongside its amplifiers, La Dolce Audio offers the HPA2.3 headphone “amplifier,” although that label needs a bit of clarification. It’s not an amplifier in the traditional sense. The HPA2.3 is a passive device designed to work with the UA2.5, relying on it for signal processing and gain. In practice, it converts the UA2.5 into a headphone amplifier rather than operating as one on its own.
That means the HPA2.3 can drive a wide range of headphones depending on how the UA2.5 is configured, but it cannot function independently. No preamp, no sound.
Pricing reflects that modular approach. The UA2.5, which serves as the foundation of the system, runs between $1,799 and $2,499 depending on configuration and tube selection. The UA2.5M monoblocks are $1,999 each, and the HPA2.3 adds another $599. A full system lands in the $3,500 range, depending on how far you go down the rabbit hole.
The Bottom Line
La Dolce Audio isn’t trying to fit into the usual mold, and that’s the point. In a category where a lot of designs feel like small variations on the same theme, this is a reminder that there are still different ways to approach amplification and system building.
Beyond the amplifiers, the partnership with ABX Audiophiles on Discord to offer open baffle speaker kits adds another layer. It invites listeners to get involved, not just as buyers but as participants, with a community that shares ideas, solves problems, and pushes designs forward together. We’ll have more on that ABX side of things in a forthcoming article.
It won’t be for everyone. If you want plug and play simplicity, this isn’t it. But if you’re the type who likes to understand what your system is doing and shape it to your preferences, La Dolce offers something most companies don’t. A system you can actually interact with, not just listen to.
Unlike previous years in what TV nerds like me call the “brightness wars,” the U7SG doesn’t outblast its predecessor, but it’s not a problem. It gets around three times as bright as anything you can stream (which is naturally capped due to compression), and has enough firepower for all but the flashiest 4K HDR Blu-rays. Its color processing shows a little more restraint than in previous models. It’s not quite what I’d call “accurate to the director’s intent,” like the best TVs I test, but it does keep itself from blasting your eyeballs most of the time.
The high brightness is matched by deep black levels, without much of the “blooming” or “haloing” around bright objects that can dilute the contrast of many budget-friendly TVs. It’s not as striking as OLED TVs, which can control each of their millions of pixels on demand, but it’ll wow you in deep space scenes just the same. I was pleased that the TV’s odd local dimming issue didn’t crop up in real-world content, but the picture does tend to flatten shadows in dark scenes more than expected, even as the matte-like screen does a good job keeping reflections at bay.
Photograph: Ryan Waniata
There are some other notable flaws. Moving off to the TV’s side in my easy chair led to dimmer colors, washed-out contrast between the brightest and darkest images, and uneven backlighting, aka the “dirty-screen effect.” That stood out most in the green backdrop of the Masters on Sunday as Rory McIlroy held on for the win. It wasn’t an issue when viewing head-on, but even then, I noticed some dingy yellow lines along the screen’s left and right sides with light backgrounds. (I may not have noticed them much if I hadn’t been bombarding this TV with test content first.)
The U7SG still doesn’t feel quite like a premium model. But it’s a very clear, bright TV, and will feel more like it’s worth the money once RGB shows up on other Hisense models and the price on this one drops. If you want something brighter than a similarly priced OLED like the LG B5, the U7SG is a great buy and has a few good upgrades over last year’s U75QG.
We’ll know more about the 2026 TV landscape once the new RGB TVs have landed, but if you need a powerful, classy-looking TV before then, the U7SG should be on your list.
Feroze Motafram is an operations consultant based in Sammamish, Wash., and founder of Avestan LLC. This piece is adapted from a LinkedIn post.
Someone asked me recently what made me think about writing this. The trigger, I told them, was simpler than you might expect.
I live in Sammamish, in the shadow of Microsoft’s looming presence. Microsoft employees are my neighbors, my social circle, the people I run into at weekend gatherings. Over time I noticed that conversations with them had a distinctive gravitational pull — always inward, toward reorgs, internal politics, who reports to whom now, who’s ascendant, who’s out. Customers were rarely part of the conversation. This usually means navigating the organization has become more consuming than building anything within it.
Microsoft’s stock decline and the softening of real estate in this corridor (both affecting me personally) were the prompts to write it down. The material was already sitting in front of me.
I should be clear about what I am and am not. My formal training is in electrical engineering. The primary instruments of my early career were set squares and slide rules, which will tell you something about both my vintage and my domain. I have spent the intervening decades as a senior executive at Fortune 100 companies and, more recently, as an operations and supply chain consultant. I build and fix things: supply chains, organizations that have lost their way. What I can offer is not insider knowledge. It is 30 years of pattern recognition, applied to what is visible from where I stand.
This is the lens I am bringing. Take it for what it is worth.
The market is asking a question
Microsoft stock declined roughly 25% in Q1 2026, representing its worst quarterly performance since the 2008 financial crisis despite blockbuster results. The market may overreact, but it is not stupid. When the stock of a company of this scale underperforms that of its peer group by double digits, the question worth asking is not “is this a buying opportunity?” The question is: what does the market understand about this organization that the headlines don’t capture?
Part of the answer is visible in the financials. A striking portion of Microsoft’s forward revenue backlog is tied to a single counterparty, OpenAI, an unprofitable startup that has since signed a landmark cloud agreement with Amazon, directly challenging the Azure exclusivity Microsoft had treated as a cornerstone of its AI strategy. Meanwhile, Microsoft is building its own internal AI model as a hedge, an expensive bet layered on top of an already expensive bet.
But the part that does not show up in an earnings report may be the more consequential story. That is what I want to offer here.
The monopoly dividend, and its hidden cost
For the better part of three decades, Microsoft enjoyed something very few companies in history have had: a captive market. Enterprise customers did not use Office because they loved it. They used it because leaving was more painful than staying. That distinction between loyalty and lock-in matters enormously, and it is one that organizations rarely make honestly about themselves.
When your customers cannot leave, the feedback loops that drive genuine innovation go silent. The tendency is to stop asking “what does the customer need?” and start asking “what can we get away with?” Processes multiply. Committees proliferate. Bureaucracy thrives. The organization optimizes for defending territory rather than creating it.
This is not a character failing. It occurs insidiously and unconsciously. It is an entirely rational organizational response to a monopolistic competitive environment. But it leaves a mark. And that mark does not disappear simply because the competitive environment changes.
Satya Nadella earned his laurels, but the work isn’t finished
The Azure pivot was a genuine strategic achievement, and Microsoft CEO Satya Nadella’s cultural reset from “know-it-all” to “learn-it-all,” as he framed it, was real and necessary. The stack-ranking era that preceded him did generational damage to Microsoft’s ability to collaborate, retain talent, and take meaningful risks. He arrested that decline and deserves full credit for it.
But here one must tread carefully. Stack ranking was formally abolished in the final months of Steve Ballmer’s tenure. The announcement was celebrated, the headlines were laudatory. What is rather more interesting is what one hears in conversations since. Ask Microsoft employees about the performance review system that replaced it, and the response is rarely enthusiastic. Whether the underlying mechanics genuinely changed, or whether the organization simply learned to dress the same instincts in more palatable language, is a question I cannot answer from the outside. What I can observe is that the people doing the work don’t appear to believe the answer is reassuring.
Cultural transformation in a 220,000-person organization moves at a glacial pace. You can change the language in a decade. Changing the instincts takes considerably longer. One has to wonder how many of the engineers and managers who learned to survive the Ballmer years by navigating politics rather than building products have since moved on, and how many remain, in leadership positions, still oriented by instinct toward self-protection over bold action.
What I can observe is the output. Copilot (inarguably Microsoft’s most strategically critical product) has converted just 15 million paid subscribers from a captive base of 450 million Microsoft 365 users. That is 3.3%. When your own customers will not buy what you are selling at scale, it is worth asking whether the product is genuinely solving a problem or simply a feature in search of a use case.
Microsoft’s internal preoccupations do not stay inside the building. I have observed versions of this dynamic before, most vividly when I lived in Brookfield, Wis., in the orbit of GE Healthcare’s then-headquarters. But what I observe in this corridor is of a different magnitude. It is not just politics that dominates the conversation. It is the organization itself — its structure, its hierarchies, its shifting priorities — that has become the primary subject of intellectual energy.
The campus, in a very real sense, has become the product. When navigating the organization becomes more consuming than building anything within it, that is not a criticism of the individuals. It is a diagnosis of the system they are operating inside.
The human capital story no one is writing
There is a dimension to this that the financial press has largely missed, and I raise it because I see it in my community every day… including, in ways I did not anticipate, in my own backyard.
A significant proportion of Microsoft’s engineering talent (and the engineering talent of the broader Seattle tech corridor) consists of H-1B visa holders. These are exceptional professionals: highly educated, deeply skilled, often carrying decade-long career investments in the United States. They have built lives here. Many have children born here. They have been, in many cases, the intellectual engine of the products Microsoft is depending on to compete in the AI era.
That population is operating under a level of personal anxiety that is, in my observation, without modern precedent. Travel advisories from their own employers. A $100,000 petition fee for new visa applications. Proposed rule changes touching birthright citizenship. A policy environment that sends a clear and unambiguous message: your presence here is conditional, negotiable, and subject to revision without notice.
The behavioral consequence of that anxiety is not visible in a quarterly earnings report. But it is real and consequential. People operating under existential personal uncertainty do not take professional risks. They do not champion the bold new initiative. They do not volunteer for the high-visibility project that could fail. They execute reliably on what already exists and protect their position. In an organization that already has a cultural predisposition toward risk aversion, this compounds the pathology in ways that will show up — perhaps not this quarter, but in the product decisions made over the next eighteen months.
The effects are visible beyond the campus walls. Conversations with real estate professionals in this corridor tell a consistent story: demand from this community, which has historically been among the most financially capable buyers in the region, has softened measurably. Not because the finances have changed, but because the horizon has. When you are uncertain whether your visa will be renewed, or whether your children’s citizenship status may be revisited, you do not buy a house.
The softening of demand is not merely an abstraction for those of us who live here. But the more significant consequence is not measured in property values. It is measured in the quality of risk-taking inside those campuses. And risk-taking is precisely what Microsoft needs most right now.
The case for optimism, and why it requires more than patience
None of this is to suggest Microsoft is broken beyond repair. Betting against Microsoft has historically been an enterprise for the foolhardy. The balance sheet remains stellar. The enterprise relationships are genuinely extraordinary. Ripping out Azure, Teams, and the M365 stack is not a decision any CIO makes lightly. The installed-base moat is real, and should not be underestimated by anyone, least of all an operations consultant from the suburbs.
What I would offer, more modestly, is this: the bull case requires more than a great balance sheet and sticky products. It requires an organization capable of genuine innovation at speed. Which in turn requires a culture that rewards risk, retains its most creative talent, and executes with urgency. Whether Microsoft can summon those qualities at this particular moment is a question I cannot answer with conviction.
What I can say is that the market, which is considerably more qualified than I am, appears to be asking the same question. The valuation has compressed to levels not seen in a decade, briefly falling below the S&P 500 for the first time in a generation. That is not the posture of a market betting with conviction that the answer is yes.
Perhaps it should be. I honestly don’t know. What I do know is that the signals visible from outside the building — from the neighborhood, from weekend gatherings, from the casual conversations — are worth paying attention to. They usually are.