GMKtec has, however, made significant changes to the chassis, abandoning the flat square box typical of most mini PCs entirely.
A tower-style redesign built to fix old complaints
The EVO-X3 trades the EVO-X2’s flat footprint for a tall, triple-fan tower that resembles a steel-wrapped graphics card more than a conventional mini PC.
Despite the added height, the footprint remains compact, comparable in size to a PS4 console sitting upright, with GMKtec saying the redesign balances performance, efficiency, and thermal stability across continuous professional workloads.
Reviewers had criticized the EVO-X2 mainly for build quality issues, citing a cheap-feeling case, difficult internal access, and persistent fan noise under load.
This probably informed the design changes on the EVO-X3, though whether the new chassis actually resolves those issues remains to be seen.
GMKtec dashed enthusiasts’ expectations by passing over AMD’s newer Ryzen AI Max+ 495 chip in favor of the Ryzen AI Max+ 395 silicon.
The processor combines CPU, GPU, and a large NPU rated at 50 TOPS, comfortably above the 40 TOPS threshold required for Microsoft’s Copilot+ designation.
The EVO-X3 will be available in two storage configurations — 2 TB or 4 TB — and both versions carry the same 128 GB of LPDDR5X-8000 memory.
The device will also feature two M.2 2280 PCIe Gen4x4 slots, allowing total storage to scale up to 8 TB on either configuration.
GMKtec bundles its proprietary Claw+Wrangler suite directly onto the EVO-X3, a local-inference toolkit built for one-click setup and round-the-clock AI agents.
The company claims the 128 GB memory configuration can run models as large as 235 billion parameters entirely on-device, and none of that inference relies on cloud servers, which means no per-token fees and no user data ever leaving the machine.
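As a rough sanity check on the 235-billion-parameter claim, the weights-only footprint can be estimated from the parameter count and the number of bits stored per weight. The article does not say which precision GMKtec assumes, so the bit widths below are illustrative assumptions, and the sketch ignores the KV cache, activations, and memory reserved for the operating system.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


MEMORY_GB = 128   # memory capacity quoted in the article
PARAMS_B = 235    # model size quoted in the article, in billions of parameters

# Bit widths are assumptions for illustration, not GMKtec's stated settings.
for bits in (16, 8, 4):
    size = weights_gb(PARAMS_B, bits)
    verdict = "fits" if size < MEMORY_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {size:6.1f} GB -> {verdict} in {MEMORY_GB} GB")
```

Under these assumptions only the 4-bit case, at about 117.5 GB of weights, comes in under 128 GB, which suggests the claim depends on aggressive quantization and leaves limited headroom for everything else.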
A steep price jump for a familiar chip
GMKtec lists pre-launch pricing at $3,600 for the 128 GB and 2 TB configuration, rising to $3,849 for the 4 TB version, both described as discounted early figures.
Early access registration opened on June 22, offering a further $20 discount, with the global launch and shipping date both set for July 6.
For comparison, the EVO-X2 launched at $1,999 with 64 GB of memory and a 1 TB drive, making the jump considerable even accounting for the EVO-X3’s larger memory and storage allowances.
The jump is even steeper from the EVO-X1, the model that began GMKtec’s mini PC lineage in late 2024, priced near $900 with a Ryzen AI 9 HX 370 processor.
This means GMKtec has roughly quadrupled its mini PC pricing within two years, a jump of close to 300% from the EVO-X1’s original $900 price point.
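The arithmetic behind those comparisons can be checked in a few lines. The prices are the ones quoted in this article (the EVO-X1 figure is approximate), and the three machines differ in memory and storage, so this is a comparison of sticker prices only.

```python
def pct_increase(old: float, new: float) -> float:
    """Percentage increase going from the old price to the new price."""
    return (new - old) / old * 100


# Prices as quoted in the article; the EVO-X1 price is "near $900".
prices = {"EVO-X1": 900, "EVO-X2": 1999, "EVO-X3": 3600}

for older in ("EVO-X1", "EVO-X2"):
    multiple = prices["EVO-X3"] / prices[older]
    increase = pct_increase(prices[older], prices["EVO-X3"])
    print(f"{older} -> EVO-X3: {multiple:.2f}x, +{increase:.0f}%")
```

A fourfold price is the same thing as a 300% increase, which is why both figures describe the EVO-X1 comparison; against the EVO-X2 the base-model increase works out to roughly 80%.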
The EVO-X3 will face direct competition from other Strix Halo devices carrying the same 128 GB memory ceiling, including the MINIX ER939-AI Pro and the ONEXStation.
A visit by iFixit to one of China’s large battery production sites offers a rare look at how replacement batteries for iPhones actually get finished and tested. The team captured the work on video, showing lead teardown technician Shahram Mokhtari walking through the final assembly steps that turn a bare lithium-polymer cell into a complete, safe pack ready for installation.
The facility operates on a massive scale, manufacturing approximately 13 million battery cells per month. These cells begin life as a stack of dozens of ultra-thin layers that are sealed to extremely tight tolerances, ensuring that the chemistry inside remains stable and efficient throughout years of continuous use. Quality control tests are performed at each stage to detect any potential problems that could affect capacity, heat buildup, or long-term reliability, down to the smallest details that can make a significant difference.
When a finished cell reaches the assembly area, the real integration begins. Rows of blank battery management system (BMS) boards are waiting to be programmed. A machine places a contact pin into each board and loads the firmware that protects the cell from damage. That firmware guards the battery against overcharging and overdischarging, monitors the temperature, and reports accurate health data to the phone. Without it, a raw cell cannot be trusted to operate safely inside an iPhone.
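To make the protective role of that firmware concrete, here is a minimal sketch of the kind of limit checks a BMS performs. It is not the firmware shown in the video: the `Limits` values and the `protection_faults` helper are hypothetical, chosen only to illustrate the overcharge, overdischarge, and temperature checks described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    # Hypothetical thresholds for illustration; real values depend on the cell.
    max_cell_v: float = 4.45
    min_cell_v: float = 3.00
    max_temp_c: float = 60.0


def protection_faults(voltage_v: float, temp_c: float,
                      limits: Limits = Limits()) -> list[str]:
    """Return the protection conditions that the given reading would trip."""
    faults = []
    if voltage_v > limits.max_cell_v:
        faults.append("overcharge")
    if voltage_v < limits.min_cell_v:
        faults.append("overdischarge")
    if temp_c > limits.max_temp_c:
        faults.append("overtemperature")
    return faults


print(protection_faults(3.9, 25.0))   # healthy reading
print(protection_faults(4.6, 70.0))   # too much voltage and too hot
```

A real BMS runs checks like these continuously and can disconnect the cell when one trips, which is the behavior a bare cell lacks.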
The next step is attachment, in which a machine presses a programmed BMS board and its flexible cable precisely onto the bare cell. The connection has to be solid but compact, as any misalignment at this step could cause problems later when the battery needs to fit into an iPhone. Folding follows, with workers or machines folding the BMS board down twice so it sits snugly against the cell. The edges are wrapped with Kapton tape to prevent any exposed contacts from touching and causing a short, and a sticker machine applies a small label that folds back on itself to keep the board in place during handling and installation.
Now it’s time to remove the protective films that were applied to both sides of the cell early in manufacturing. Those films have kept the surfaces pristine up until now. Removing them prepares the battery for the adhesive strips that will hold it securely in place within the iPhone case. Quality control has to be close to flawless at this stage. A testing machine takes the battery through a variety of checks, including impedance, capacity, and overcurrent tests, and returns a simple pass or fail result. A pass indicates that the battery is in good working order and will behave as expected in a real device, whereas failed batteries are removed.
Mokhtari then plugs the finished battery into a diagnostic tool. The screen displays live data read directly from the BMS, such as the current charge level, state of health, temperature, design capacity, and actual maximum capacity. It’s all the proof you need that the battery will function correctly, just like a brand-new pack in a phone. The final step in preparation is to apply the adhesive pull strips that Apple uses to secure batteries inside iPhones. Those strips allow technicians to cleanly remove the old battery during a repair and secure the new one without adding excessive bulk. To confirm that everything works, the completed battery is installed in an actual iPhone, which switches on without a hitch, demonstrating that the pack works from start to finish. Every step up to that point has been taken to ensure that this last one happens as planned.
Anthropic says Claude Fable 5 won’t be accessible via Claude subscriptions after July 7, but it’s not a permanent change, and the company expects the model to return outside the usage-based plan soon.
Fable 5 was recently restored after the US government lifted export controls on Anthropic’s most powerful models, Fable 5 and Mythos 5.
As part of the redeployment, Anthropic said Fable 5 would be available globally on Claude.ai, Claude Code, Claude Cowork, and the Claude Platform.
However, Anthropic has restricted Claude Fable usage due to high demand, and plans to move the model to usage-based billing next week.
“For Pro, Max, Team, and select Enterprise plans, Fable 5 will be included for up to 50% of weekly usage limits through July 7, after which it will be available via usage credits,” Anthropic said in its original blog post.
That line led to concerns that Fable 5, Anthropic’s most powerful model, was becoming a permanent pay-to-play upgrade for regular Claude users.
However, a Claude Code lead engineer has now clarified that Fable is expected to return to subscriptions once Anthropic has enough capacity.
“I’ve heard a lot of questions about Fable’s availability on subscription plans,” the engineer wrote in a post on X. “While it will come off subscriptions after July 7th, we aim to restore Fable as a standard part of our subscriptions as soon as capacity allows, as we mentioned in our original blog post.”
Anthropic says Fable 5 demand is difficult to predict
In its announcement, Anthropic said it expects demand for Fable 5 to be “very high, and difficult to predict.”
The company said Fable 5 is fully available today on the Claude API and consumption-based Enterprise plans, but access on subscription plans is being handled more conservatively.
“For subscription plans, we’d rather give access sooner than later, so we’re rolling out more conservatively, in stages,” Anthropic said.
Anthropic also said that after the included subscription window ends, it aims to restore Fable 5 as a standard part of subscription plans “when sufficient capacity allows us to do so.”
For now, Claude users who rely on Fable 5 should expect usage-credit billing after the deadline, and there’s nothing you can do about it.
Meta appears to have soft-launched a new app called Pocket that’s aimed at getting people to vibe-code their own minigames. Mobile developer and reverse engineer Alessandro Paluzzi spotted Pocket and posted about it to X today, but reporting platform AppFigures told TechCrunch that the app has been available on both iOS and Android since June 29. Though the app is listed publicly, it’s not available in the US on any of the half dozen phone models associated with our Google accounts, and a help page on Meta’s site says “the Pocket app is not yet available everywhere.”
The company has not made any public announcement yet about the launch or where the app is being trialed. We’ve reached out for comment and will update this post if we receive a response.
From cosmetic tweaks to a standalone app for AI slop, Meta has been going gangbusters on getting artificial intelligence into its services in the past year. TechCrunch suggested that Pocket may be the result of the company wholesale hiring the team behind Gizmo, an app that used AI to create interactive experiences based on prompts from users, earlier this year. Pocket uses that exact same nomenclature, dubbing itself “a creative platform for making and sharing gizmos” in the app listing, and the Play Store shortcode of “com.facebook.gizmo” does little to dispel the notion either.
Claude Fable, Anthropic’s most powerful model, is now available to all users, but early impressions are disappointing, as it appears to be nowhere near as capable as the original release.
When the Department of Commerce announced that it was lifting the ban on Claude Fable, I was holding my breath and counting seconds for the model to show up on Claude Code. I had also loaded up my usage-based credit wallet, just in case the model debuted as strictly usage-based.
To our surprise, Claude Fable shipped for everyone, including those with a $100 Max subscription, but there are multiple restrictions.
According to Anthropic, while Fable 5 is included in Max, Pro, and Team plans, it is heavily capped.
For example, you can use Fable for up to 50% of your weekly usage limits, which is not significant for such a powerful model. But it’ll get worse after July 7, as the model will transition entirely to a pay-to-play system via usage credits.
However, the real gut punch is the degraded performance, or, as the AI community likes to put it, the “nerfed” performance.
On Reddit, users are reporting that the restored Fable 5 feels weaker, or is simply being routed through stricter safety systems more often than before.
“The new guardrails are kicking in on way too many tasks and falling back to Opus 4.8,” one user wrote in a Reddit post. “This is not the model that got banned.”
The problem is not just limited to Claude desktop, as Claude Code is also struggling with similar issues.
One user said Fable “didn’t even let me search for dead code without switching to Opus,” while another said it was “very very obvious” when the fallback triggers because Claude tells the user and visibly shifts to Opus.
Another developer claimed the model was unusable for some systems-level coding work, saying that C, C++, Rust, Win32 API references, memory-related work, and files mentioning words like “security,” “vulnerable,” “unsafe,” or “hook” appeared to trigger a fallback or block.
Fable 5 may still be powerful when it actually handles the task, but the restored version appears to be far more sensitive to prompts, project files, and security-adjacent language.
However, BleepingComputer understands that the model itself has not been nerfed. Instead, it is likely that Anthropic is being extra careful with the safety guardrails, which is negatively affecting Fable’s daily use cases.
In fact, we observed that Fable is sometimes routed to Opus 4.8 even when the task does not appear to be a safety risk.
Anthropic has said that its updated safeguards rely on a large “safety margin,” which could explain the subpar experience some users are seeing with Fable.
Anthropic hasn’t acknowledged the reports of false positives yet, but it’s likely the company is aware of the problem and will address it in a future update.
Earlier this year we wrote about the ridiculously thin-skinned executives at Palantir suing a small independent Swiss online magazine, Republik, that had reported on the great lengths the company had gone to, trying to get the Swiss government to purchase Palantir’s surveillance technology. Palantir knew they couldn’t sue for defamation because, you know, everything Republik reported was true. Instead, they sued, trying to invoke a Swiss “right of reply” law, claiming that because Republik refused to publish the press release Palantir wanted to run in response to the reporting, the magazine had violated the law.
As we said at the time, this is the height of entitlement. Palantir doesn’t get to tell Republik how and what it must publish.
And, thankfully, a court has agreed. Zurich’s commercial court rejected 22 of 23 claims that Palantir made.
The data analytics company lost on 22 out of 23 counts of the suit. In a ruling on Friday, Zurich’s commercial court dismissed the majority of counterstatement requests filed by the company and its Swiss subsidiary finding that only a single passage in one article warranted a published response from the company.
While the court agrees that there is a “right of reply” law in Switzerland, it has limitations:
While Swiss media law allows the subjects of a story to request a right of reply, this has caveats: the right of reply has to be concise and stick to the facts of the story.
The one count that stuck: the court found that a single passage in just one article warranted a limited published reply from Palantir.
Also, the court told Palantir to pay Republik for its legal expenses wasted on this SLAPP suit:
The court on Friday ordered Palantir to bear 95% of the 9,000 Swiss francs ($11,300; £8,400) court costs and to pay Republik 9,900 francs in legal expenses.
Of course, this case was always less about the ‘right of reply’ than about making it clear to anyone who reports critically on Palantir that the company will go to war with them, seeking any legal theory, no matter how ridiculous, to tie them up in court — the textbook logic of a SLAPP suit. Republik has said that defending the case cost the small organization quite a lot in time and resources:
Balz Oertli, a journalist with WAV research collective, said: “We invested a great deal of effort into this case, and we are very pleased with the outcome.”
Anyway, given that Palantir seems really upset about Republik’s reporting, it sure would be a shame if you decided to go read this critical reporting of Palantir’s relentless attempts to win business from the Swiss government.
Many of us remember back in our school days taking tests and filling out answers on a Scantron sheet, those long rows of A, B, C, D, and E that had to be filled in with a #2 pencil. Ever wonder why it needed a #2 pencil, or what the point of using a Scantron was at all? That question is answered in the latest video from [SimonRetro], where he takes a look at the Scantron and how it works.
One of the more interesting things about the Scantron is that it’s such a standalone device. No software needed, no keypad to mess with, just two rocker switches. The on/off switch is also the way you tell it to forget the last answer sheet and allow you to program in a new test. Upon booting, you feed in a Scantron sheet with some specific boxes filled in, and then it’s programmed and ready to take in and grade all the students’ answers. Opening up the Scantron reveals it’s pretty interesting inside: one control board with early-’90s-era chips. There’s also a lightbulb (no LEDs) shining through the six reading sections of the card, as well as an arrangement of belts and motors to move the card through the machine. The printer is a seven-pin printer used in conjunction with a pair of ink rollers to print out the results on the cards.
[SimonRetro] also went ahead and tried different ways to mark the sheets including pens, Sharpies, colored pencils, and different thicknesses of pencils besides the #2 to see which would and wouldn’t work in the Scantron. Thanks [SimonRetro] for exploring this machine from many of our childhoods and sharing its inner workings. Be sure to check out some of our other reverse engineering articles that explore how classic devices work.
Researchers have found a never-before-seen piece of macOS malware that combines several clever pieces of tradecraft to infect Macs with stealthy, custom-developed credential-stealing code.
The malware is delivered in two stages. The first is distributed in a disk image that masquerades as Maccy, a clipboard manager for Macs. It’s compiled as AppleScript that is notable for the way it delivers the second stage. The malware is named PamStealer because the Rust-written infostealer uses the Pluggable Authentication Modules interface built into macOS to validate the target’s login password before sending it to an attacker-controlled server.
A quieter execution chain
The use of both disk image and AppleScript is common in malware for Macs. More unusual is the way PamStealer combines them to gain stealth. When the AppleScript is double-clicked, it’s opened in the macOS Script Editor, where the malicious functionality is buried deep within the file.
“Rather than relying on shell commands such as curl or zsh, the AppleScript executes a self-contained JavaScript for Automation (JXA) downloader that retrieves and stages the payload using native Objective-C APIs,” researchers from Jamf, a security firm for macOS users, wrote. “Combined with a Rust-based second stage and a password capture workflow that validates credentials locally through PAM, the result is a quieter execution chain than we typically observe in commodity macOS stealers.”
When a user, expecting to install a trustworthy clipboard manager, encounters the disk image, they’re prompted to press Command-R immediately after double-clicking it. This command executes malicious code inside the AppleScript directly. It also allows the execution to bypass com.apple.quarantine, a macOS attribute that provides warnings and restrictions when executable files have been downloaded from the Internet.
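For readers who want to see what the quarantine attribute looks like, the sketch below parses the value that `xattr -p com.apple.quarantine <file>` typically prints on a Mac. The semicolon-separated layout (hex flags, hex timestamp, downloading agent, event ID) is an assumption based on commonly observed values rather than anything in Jamf’s write-up, and the sample string and `parse_quarantine` helper are made up for illustration.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class Quarantine(NamedTuple):
    flags: int
    downloaded_at: Optional[datetime]
    agent: str
    event_id: str


def parse_quarantine(value: str) -> Quarantine:
    """Split a quarantine attribute value into its assumed four fields."""
    parts = value.strip().split(";")
    parts += [""] * (4 - len(parts))          # pad if trailing fields are missing
    flags_hex, ts_hex, agent, event_id = parts[:4]
    flags = int(flags_hex, 16) if flags_hex else 0
    when = (datetime.fromtimestamp(int(ts_hex, 16), tz=timezone.utc)
            if ts_hex else None)
    return Quarantine(flags, when, agent, event_id)


# Made-up sample in the assumed format.
sample = "0083;5f1e3c2a;Safari;F643CD5F-6071-46AB-83AB-390BA944DEC5"
info = parse_quarantine(sample)
print(hex(info.flags), info.agent, info.downloaded_at)
```

A file that never received the attribute has nothing to parse, which is the situation the bypass described above is meant to exploit.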
As Jamf explained:
PamStealer combines a recently emerging delivery surface with a less familiar payload. While the clickable .scpt and Script Editor lure build on tradecraft that is already gaining adoption across the macOS threat landscape, the malware distinguishes itself through a self-contained JXA dropper, a Rust-based second stage, and a password capture workflow that validates credentials locally through PAM before harvesting them. That second stage puts considerable effort into staying hidden, masquerading as Finder, encrypting its command-and-control traffic, and holding back prompts like the Full Disk Access request for as long as forty minutes so its activity does not line up with launch. Together, these behaviors illustrate how commodity macOS stealers continue to evolve, adopting quieter execution chains and native implementations that reduce traditional detection opportunities while remaining compatible with standard macOS features.
The first stage puts its payload inside an app bundle that impersonates real components built into macOS. The component changes from sample to sample of the malware. Finder.app under com.apple.finder.core or com.apple.finder.monitor, and a Software Update.app under com.apple.security.daemon, are two examples. In either case, they run hidden. They also display macOS’s genuine Finder.icns as their icon.
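As a minimal sketch of how those names could be used defensively, the snippet below reads the `CFBundleIdentifier` from an app bundle’s `Info.plist` and compares it with the three identifiers mentioned above. It assumes those strings are bundle identifiers, which the text implies but does not state outright, and because the impersonated component changes between samples, a fixed list like this is illustrative rather than a reliable detection.

```python
import plistlib

# Identifiers named in the article; treated here as bundle identifiers (assumption).
REPORTED_IDS = {
    "com.apple.finder.core",
    "com.apple.finder.monitor",
    "com.apple.security.daemon",
}


def bundle_id_from_plist(data: bytes) -> str:
    """Extract CFBundleIdentifier from the raw bytes of an Info.plist."""
    info = plistlib.loads(data)
    return info.get("CFBundleIdentifier", "")


def matches_reported_id(data: bytes) -> bool:
    """True if the bundle identifier is one of those listed in the article."""
    return bundle_id_from_plist(data) in REPORTED_IDS


# Synthetic Info.plist contents built in memory for the demonstration.
fake = plistlib.dumps({"CFBundleIdentifier": "com.apple.finder.core"})
real = plistlib.dumps({"CFBundleIdentifier": "com.apple.finder"})
print(matches_reported_id(fake), matches_reported_id(real))
```

On a real system the bytes would come from reading `Contents/Info.plist` inside the suspect `.app` bundle.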
The idea of an AI-powered device that’s not a smartphone is weird, but not unheard of. According to a report from The Wall Street Journal on Wednesday, SpaceX has already shown investors an early prototype of one.
The report says that Elon Musk’s SpaceX — which includes the social media platform X and the artificial intelligence startup xAI — has developed a handset-like device that’s sleeker and slimmer than an iPhone and runs a proprietary operating system that integrates xAI’s own technologies. The device reportedly runs on a Qualcomm Snapdragon chip, a common feature in many Android phones today.
On Thursday, Musk publicly denied the existence of such a device, calling the claims “utterly false” in a post on X.
In February, Musk publicly stated that a phone was not being developed. Earlier, during an event last October, Musk said, “the idea of making a phone makes me want to die,” while adding, “if we have to make a phone, we will.” However, there’s enough rumored evidence to believe that such a device may exist, even if Musk refuses to call it a phone.
SpaceX began trading publicly earlier this month. Whether a device with its branding ever ships remains to be seen, but it wouldn’t be too much of a surprise.
SpaceX did not immediately respond to a request for comment.
Artificial intelligence is already everywhere on our smartphones, but tech companies are racing to build entirely new AI gadgets. OpenAI and Jony Ive are said to be working on a screenless AI device that might be worn on your ear as an always-on assistant.
In a world saturated with “smart” and AI technologies, creating a new device running a different operating system would free Musk from the potential restrictions imposed by Apple and Google’s ecosystems. It could allow SpaceX and xAI to rely on their own technology rather than the big players.
And given Apple and Google’s stranglehold on the smartphone industry, breaking away from the phone format would also let SpaceX’s new device escape strict app store rules.
When shown to institutional investors, SpaceX reportedly said the device was in the early stages of development and that the design could change over time. Although it’s not called a “phone,” it’s logical to assume the device could connect to SpaceX’s Starlink satellite network for connectivity.
In fact, while a physical smartphone has been denied, a branded consumer mobile service is likely. Last week, The Financial Times reported that SpaceX is actively weighing a Starlink-branded retail mobile plan, directly competing with T-Mobile, AT&T and Verizon.
Qualified immunity — crafted out of thin air by the US Supreme Court — has rarely been anything but an easy way for government employees to duck out of lawsuits before they’re actually asked to defend themselves against allegations of rights violations.
The Supreme Court has continually narrowed this doctrine, pretty much ensuring that if every single fact of an allegation doesn’t perfectly align with precedential rulings, qualified immunity will be awarded. The Supreme Court has ensured no further movement will take place by continually refusing to establish rights violations, even when it (very rarely!) disagrees with a lower court’s granting of qualified immunity.
The doctrine has been memorably pilloried more than once by appellate judges. Most famously, Judge Don Willett of the Fifth Circuit Appeals Court had this to say about the qualified immunity doctrine, something that tends to reward rights violators just because they happened to find a slightly different way to violate someone’s rights.
To some observers, qualified immunity smacks of unqualified impunity, letting public officials duck consequences for bad behavior—no matter how palpably unreasonable—as long as they were the first to behave badly.
That was the wind-up. Here’s the pitch:
Section 1983 meets Catch-22. Plaintiffs must produce precedent even as fewer courts are producing precedent. Important constitutional questions go unanswered precisely because those questions are yet unanswered. Courts then rely on that judicial silence to conclude there’s no equivalent case on the books. No precedent = no clearly established law = no liability. An Escherian Stairwell. Heads defendants win, tails plaintiffs lose.
Justice Sotomayor’s dissent [PDF] isn’t as immediately quotable, but it still delivers a stinging indictment of the qualified immunity doctrine. The facts of the case are unpleasant, as they almost always are when government defendants start invoking qualified immunity.
Green Bay, Wisconsin jail staff responded to prisoner Antonio Smith’s refusal to submit to a wellness check (on day 46 of his hunger strike) by pepper spraying him in the face, ordering him to strip naked, and taking him to the health unit. When Smith refused the wellness check, he was dumped clothed in nothing but a small towel into an unheated, unfurnished “control cell” for the next 23 hours. The temperature in the cell ranged from “25 to 57 degrees Fahrenheit,” according to uncontested testimony.
When Smith was first placed in the cell around noon, Van Lanen told Smith that Smith could request a shower any time and that he would come back to discuss “‘clothing and stuff,’” but he never returned. Ibid. Three and a half hours later, Smith requested clothing, bedding, and a mattress from Lieutenant Timothy Retzlaff and asked to be moved to a warmer cell given the cold. Retzlaff said he would check with Van Lanen. Twelve additional hours went by with no word from Van Lanen or Retzlaff. Then, around 3 o’clock in the morning, a different officer told Smith that if he submitted to future wellness checks, he could have a smock, but that otherwise, “he would remain naked and cold.” Ibid. Smith declined. Another eight hours came and went without any word from Van Lanen or Retzlaff. Smith remained naked and frigid overnight as the temperature dropped below freezing to 25 degrees. After 23 hours, prison staff removed Smith from the cell. Smith later stated that he stayed on his feet for most of those 23 hours because it was too painful to sit, lie down, or sleep.
The Seventh Circuit Appeals Court actually said exactly this in its ruling granting qualified immunity to the defendants.
The Seventh Circuit held that the officers violated Smith’s Eighth Amendment right to be free from cruel and unusual punishment but nevertheless granted them qualified immunity, reasoning that the Circuit “had never held it unconstitutional on closely analogous facts to house an inmate in a cell that ranged in temperature from 25 to 57 degrees over a 23-hour period without clothes or a way to keep warm.”
Yep, that’s how fucking insane this doctrine is. The court even said this was a rights violation, but since it hadn’t said the same thing earlier about a nearly exactly matching set of circumstances, the defendants apparently had no way of knowing tossing someone naked in a freezing cell for nearly 24 hours would violate the prisoner’s rights.
As Sotomayor points out, the Seventh Circuit appeared to willfully disregard its own precedent when handing down this ruling.
As Judge Hamilton explained in dissent, the Seventh Circuit has itself held that intentionally subjecting prisoners to extreme cold conditions without any way to stay warm violates the Eighth Amendment. In Gillis v. Litscher (2006), for example, the Circuit held that a reasonable jury could find that prison officials violated a prisoner’s Eighth Amendment right when they deliberately left him naked in a cell blowing cool air for five days as part of an effort to “conform [his conduct] to the rules.” [S]ee Del Raine v. Williford (1994) (officers deliberately strip-searched prisoner in cell for 15 to 30 minutes when windchill was 40 to 50 degrees below zero). The Seventh Circuit has also held that, when cold conditions are the product of heating-system failures, officers violate the Eighth Amendment if they are aware of such conditions and fail to take corrective measures such as providing an alternative way to keep warm.
That should have been enough for SCOTUS to review this one and, hopefully, send it back with a reminder that QI readings need to be narrow, but perhaps not so narrow they provoke gasps of disbelief.
But that’s not how this Supreme Court majority operates. Sotomayor calls them out for only reviewing certain QI cases. You know the ones.
This Term… the Court has exercised its discretion to summarily reverse supposed errors that were far less clear than the one here. See, e.g., McCarthy v. Hernandez, 607 U. S. ___ (2026) (per curiam); Zorn v. Linton, 607 U. S. ___ (2026) (per curiam); see also Smith v. Scott, 608 U. S. ___ (2026) (summarily vacating and remanding denial of qualified-immunity in light of Zorn). If those cases were clear enough for summary action, the Court here should have readily concluded, based on precedent and basic human decency, that it is beyond debate that it is cruel and unusual to lock someone intentionally in a freezing prison cell completely naked for 23 hours.
The Court’s decision not to do so today exacerbates its asymmetrical trend of declining to intervene when courts wrongly afford officers the benefit of qualified immunity, but unflinchingly summarily reversing when it believes courts have wrongly denied officers the protection of qualified immunity.
This would be hypocrisy if it were being carried out by people who actually maintained a pretense of judicial fairness. But it’s being carried out by people who actively believe in the message they’re sending to the public, as well as to the administration they are so clearly devoted to pleasing.
Reversing only denials of qualified immunity sends the regrettable message that, when choosing between shielding government officials from liability and vindicating individuals’ constitutional rights, this Court will almost always choose the former.
Sotomayor is right. The message being sent is “regrettable.” Unfortunately for America, the people sending it have no regrets at all.
As enterprise AI systems scale to handle complex workflows, practitioners face the challenge of routing subtasks to the right tools and skills. Agents can have hundreds of tools and skills and get confused about which one to use for each step of a workflow.
To address this challenge, researchers at Alibaba developed SkillWeaver, a framework that creates an execution graph for a given task and chooses the right skills for each of the nodes. They also introduce Skill-Aware Decomposition (SAD), a novel technique that uses a feedback loop to enable the agent to fetch and vet relevant tool candidates iteratively. This compositional approach and feedback loop mechanism distinguishes SkillWeaver from other tool-routing frameworks that choose tools in a one-shot fashion.
SkillWeaver targets real-world AI applications in which agents autonomously orchestrate multi-tool ecosystems, such as those built on the Model Context Protocol (MCP), to execute multi-step business operations like downloading datasets, transforming information, and creating visual reports.
In practice, the researchers’ experiments with SkillWeaver show that implementing this retrieve-and-route approach significantly increases accuracy while reducing token consumption by over 99% compared to naively exposing agents to an entire tool library.
For practitioners building AI agents, the main takeaway is that the granularity of task decomposition is the biggest bottleneck to accurate tool retrieval.
The challenge of skill routing
Skills are a key pattern in modern LLM agent architectures. A skill is a modular, reusable tool specification that uses structured natural language documentation.
As enterprise agents integrate with massive tool ecosystems, accurately routing user queries to the right skills becomes a difficult task. Exposing an entire library to an LLM to find the right tool is highly inefficient, quickly overwhelms context limits, and consumes hundreds of thousands of tokens.
Most current tool-use frameworks attempt to solve this through API retrieval, documentation matching, or hierarchical structures that treat routing strictly as a single-skill selection or per-step problem.
However, this single-skill paradigm is insufficient for enterprise environments because real-world queries are inherently compositional. A standard business request such as “Download the dataset, transform it, and create visual reports” cannot be fulfilled by one tool. It requires breaking the prompt down and sequencing an API client, a data processor, and a visualization tool into a cohesive, multi-step execution plan.
How SkillWeaver and SAD work
To tackle this, the researchers frame the problem of handling complex tasks that require multiple skills as “compositional skill routing.” Given a complex user prompt and a vast library of tools, an agent must simultaneously figure out how to break the request into a sequence of atomic sub-tasks, how to map each sub-task to the single best available skill, and how to compose those skills into an executable plan.
SkillWeaver orchestrates this process through three distinct stages: Decompose, Retrieve, and Compose. In the first stage, an LLM acts as a task decomposer, breaking the user’s complex query down into a sequence of sub-tasks that each require one skill. Once the sub-tasks are clearly defined, the system uses an embedding model to compare each subtask against the skill library to pull a shortlist of the top candidate tools for each step.
In the final stage, a planner evaluates the retrieved candidates based on how well they work together. It checks for inter-skill compatibility to ensure the outputs of one tool naturally flow into the inputs of the next. It then creates a final execution plan as a Directed Acyclic Graph (DAG) that maps out dependencies so independent tasks can potentially execute in parallel.
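The idea of a dependency graph that exposes parallelism can be sketched with Python's standard-library graphlib module. This is only an illustration, not the researchers' planner: the step names (including the extra fetch_schema step) and the dependency map are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each key is a step, each value is the set of steps it depends on.
plan = {
    "download": set(),
    "fetch_schema": set(),
    "transform": {"download", "fetch_schema"},
    "report": {"transform"},
}

def parallel_batches(dependencies):
    """Group steps into batches; steps inside one batch do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches

batches = parallel_batches(plan)
print(batches)
```

Here the two steps without dependencies land in the first batch and could run concurrently, while transform and report each wait for the batch before them.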
For example, consider a user asking an AI agent to “Download the dataset, transform it, and create visual reports.” In the decompose stage, the decomposer LLM breaks this into three distinct sub-tasks: downloading the dataset, transforming the data, and creating the reports.
In the retrieve stage, the system searches the library and finds candidates like “api-client” or “http-fetch” for task one, “csv-parser” or “etl-pipeline” for task two, and so on. Finally, the compose stage evaluates these options, selects the specific combination of “api-client,” “csv-parser,” and “chart-gen” that are most compatible, and wires them together into a final, ready-to-execute workflow.
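The retrieve and compose stages of this example can be approximated in a few lines of Python. The sketch below makes several assumptions that are not from the paper: the sub-tasks are hard-coded (standing in for the decomposer LLM), word overlap stands in for embedding similarity, and each skill carries made-up input and output types used as a crude compatibility check. The skill descriptions are also invented for illustration.

```python
import itertools
import re

# Hypothetical skill library: name -> (description, input type, output type).
SKILLS = {
    "api-client":   ("download a dataset from a remote api", "url", "csv"),
    "http-fetch":   ("fetch a raw web page over http", "url", "html"),
    "csv-parser":   ("parse and transform csv data into a table", "csv", "table"),
    "etl-pipeline": ("extract transform and load data", "sql", "table"),
    "chart-gen":    ("create visual reports and charts from a table", "table", "png"),
}

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(subtask, k=2):
    """Rank skills by Jaccard word overlap (a stand-in for embedding similarity)."""
    query = tokens(subtask)

    def score(name):
        desc = tokens(SKILLS[name][0])
        return len(query & desc) / len(query | desc)

    return sorted(SKILLS, key=score, reverse=True)[:k]

def compose(candidates_per_step):
    """Return the first combination where each output type feeds the next input type."""
    for combo in itertools.product(*candidates_per_step):
        if all(SKILLS[a][2] == SKILLS[b][1] for a, b in zip(combo, combo[1:])):
            return list(combo)
    return None

# Hard-coded sub-tasks, standing in for the output of the decomposer LLM.
subtasks = ["download the dataset", "transform the data", "create visual reports"]
shortlists = [retrieve(s) for s in subtasks]
plan = compose(shortlists)
print(shortlists)
print(plan)
```

In this toy library, "etl-pipeline" is the top retrieval hit for the second sub-task, but compose rejects it because its assumed input type does not match the output of "api-client", and the final plan is api-client, csv-parser, chart-gen. That is the kind of cross-step check a per-step router would miss.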
A key challenge of this pipeline is that LLMs often produce generic step descriptions that fail to match the specific, technical vocabulary of the actual skills available in the library. To fix this, SkillWeaver uses Skill-Aware Decomposition (SAD), an iterative feedback loop. SAD has the LLM draft an initial plan, runs a preliminary search to find loosely matching skills, and then feeds those retrieved skills back into the LLM as hints. This allows the LLM to rewrite its decomposition so that its granularity and vocabulary align with the tools that actually exist.
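The shape of that loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: toy_decompose is a hard-coded stand-in for an LLM call, the skill descriptions are invented, search uses word overlap instead of embeddings, and the round limit and stop-when-stable rule are choices made here for the example.

```python
import re

# Hypothetical skill library: name -> description.
SKILLS = {
    "api-client": "download dataset via api",
    "csv-parser": "transform csv data",
    "chart-gen": "create visual reports",
}

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def search(step, k=1):
    """Preliminary retrieval: the k skills sharing the most words with the step."""
    scored = [(len(tokens(step) & tokens(desc)), name) for name, desc in SKILLS.items()]
    matches = [(s, name) for s, name in scored if s > 0]
    return [name for s, name in sorted(matches, reverse=True)[:k]]

def toy_decompose(query, hints):
    """Stand-in for an LLM call; a real system would prompt a model with the hints."""
    if not hints:
        return ["get the dataset ready", "create reports"]  # generic first draft
    # With hints, the "model" rewrites the plan in the vocabulary of real skills.
    return ["download dataset via api", "transform csv data", "create visual reports"]

def skill_aware_decompose(query, decompose, search, rounds=2):
    """Draft a plan, retrieve loosely matching skills, redraft with them as hints."""
    steps = decompose(query, hints=[])
    for _ in range(rounds):
        hints = sorted({name for step in steps for name in search(step)})
        revised = decompose(query, hints=hints)
        if revised == steps:  # the plan has stopped changing
            break
        steps = revised
    return steps

query = "Download the dataset, transform it, and create visual reports"
steps = skill_aware_decompose(query, toy_decompose, search)
print(steps)
```

The generic two-step draft becomes a three-step plan once the retrieved skill names are fed back, which mirrors the granularity correction described above.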
SkillWeaver in action
To evaluate how SkillWeaver performs in realistic enterprise scenarios, the researchers created a custom benchmark called CompSkillBench. It consists of 300 multi-step queries of different difficulty levels. To mirror real-world environments, they used a library of 2,209 real-world skills sourced from the public MCP ecosystem, covering 24 functional categories like cloud infrastructure, finance, and databases.
For the core engine, the researchers primarily used a lightweight 7-billion parameter model (Qwen2.5-7B-Instruct) for task decomposition, paired with a standard semantic search retriever (MiniLM with a FAISS index) to find the tools. SkillWeaver was evaluated against three main setups: a brute-force “LLM-Direct” method where they stuffed all the tool names into the prompt of a large model, a vanilla LLM-based decomposition without SAD, and a ReAct-style agent loop.
The experiments indicate that task decomposition is the main bottleneck. Standard LLM behavior falls short when dealing with large tool libraries, but the SAD feedback loop improves it substantially. In the vanilla setup, the 7B model achieved a decomposition accuracy (i.e., predicting the correct number of steps) of only 51.0%. With the SAD feedback loop activated, accuracy rose to 67.7% (with the larger Qwen-Max model, accuracy reached 92%). On “hard” tasks requiring four to five distinct skills, SAD improved accuracy by 50%.
In comparison to the naive approach, SkillWeaver reduces token consumption by more than 99% (source: arXiv)
One fascinating finding was that larger models can actually perform worse when unguided. When tested in the vanilla setup, a larger 14-billion parameter model saw its accuracy plummet below the 7B model’s accuracy because it tended to over-decompose tasks into microscopic, unnecessary steps. Once SAD was introduced, the retrieved tool hints anchored the model back to reality and increased its accuracy. This suggests that aligning an agent with the vocabulary of specific tools is often more impactful than paying for a larger, more expensive LLM.
Another important takeaway is token savings. The LLM-Direct baseline, which used the very large Qwen-Max model, showed that feeding all tools into the prompt of a large model fails. Despite near-perfect task breakdown capabilities, the massive model only retrieved the right tool category 21.1% of the time when flooded with tool options. SkillWeaver’s targeted retrieve-and-route approach vastly outperformed this in accuracy while slashing context window consumption from an estimated 884,000 tokens down to roughly 1,160 tokens per query, a 99.9% reduction. For practitioners, this translates directly to drastically lower API costs and faster response times.
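The reduction figure follows directly from the two token counts quoted above, as this short check shows. Both inputs are the article's estimates, not measurements made here.

```python
baseline_tokens = 884_000  # estimated tokens per query for LLM-Direct (from the article)
routed_tokens = 1_160      # approximate tokens per query for SkillWeaver (from the article)

reduction = 1 - routed_tokens / baseline_tokens
ratio = baseline_tokens / routed_tokens

print(f"reduction: {reduction:.1%}")  # reduction: 99.9%
print(f"ratio: {ratio:.0f}x")         # ratio: 762x
```

Put differently, the brute-force prompt is roughly 762 times larger than the routed one.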
Finally, the traditional ReAct baseline completely failed, achieving 0% decomposition accuracy. Its loop naturally collapses multi-step plans into isolated actions rather than explicitly mapping out a cohesive, multi-tool sequence.
Considerations for developers
While the researchers have not yet released the source code for SkillWeaver, their work was built on off-the-shelf tools that can easily be reproduced.
Skill-Aware Decomposition (SAD), which is the key innovation at the heart of the framework, is a clever prompt-engineering and retrieval loop. The authors have shared the prompt templates in their paper, and developers can implement it themselves quite easily using standard orchestration libraries like LangChain, LlamaIndex, or even raw Python scripts.
As for the retrieval component, the authors built the core framework using all-MiniLM-L6-v2, an open-source embedding model. They found that swapping in a slightly stronger off-the-shelf encoder (BGE-base-en-v1.5) immediately boosted accuracy without any fine-tuning. While an off-the-shelf bi-encoder is great at getting a relevant tool into the top 10 candidates nearly 70% of the time, it struggles to consistently rank the perfect tool at exactly number one, achieving that only about 37% of the time. To bridge this gap, teams will likely need to implement a secondary cross-encoder or LLM-based reranker to re-order those top 10 candidates.
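The gap between top-10 and top-1 accuracy, and the role of a reranker, can be illustrated with a small sketch. Everything in it is hypothetical: the queries, the first-stage rankings, the "pdf-export" skill, and the TOY_SCORES table, which stands in for a cross-encoder or LLM-based scorer that a real system would call.

```python
def hit_at_k(ranked_lists, gold, k):
    """Fraction of queries whose correct skill appears in the top-k candidates."""
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if g in ranked[:k])
    return hits / len(gold)

def rerank(query, candidates, score):
    """Re-order a shortlist using a more expensive, more precise scorer."""
    return sorted(candidates, key=lambda name: score(query, name), reverse=True)

queries = ["download the dataset", "transform the csv file", "create visual reports"]
gold = ["api-client", "csv-parser", "chart-gen"]

# Hypothetical first-stage (bi-encoder) output: right skill retrieved, wrong order.
first_stage = [
    ["http-fetch", "api-client"],
    ["etl-pipeline", "csv-parser"],
    ["chart-gen", "pdf-export"],
]

# Made-up relevance scores standing in for a cross-encoder or LLM judge.
TOY_SCORES = {
    ("download the dataset", "api-client"): 0.9,
    ("download the dataset", "http-fetch"): 0.4,
    ("transform the csv file", "csv-parser"): 0.8,
    ("transform the csv file", "etl-pipeline"): 0.5,
    ("create visual reports", "chart-gen"): 0.9,
    ("create visual reports", "pdf-export"): 0.2,
}

def toy_score(query, name):
    return TOY_SCORES.get((query, name), 0.0)

before = hit_at_k(first_stage, gold, k=1)
reranked = [rerank(q, c, toy_score) for q, c in zip(queries, first_stage)]
after = hit_at_k(reranked, gold, k=1)
print(before, after)
```

The reranker can only promote candidates the first stage already retrieved, so its ceiling is the top-k hit rate of the bi-encoder.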
One upfront preparation requirement is vectorizing the tool library and building a FAISS index in advance. In practice, this is a negligible hurdle. Embedding and indexing all 2,209 skills in the benchmark took a mere 15 seconds. Once built, retrieving tools from the index adds less than 15 milliseconds of latency per query. For enterprise environments, syncing the tool index is a trivial background job.
A current limitation in SkillWeaver is the lack of error recovery. While SkillWeaver successfully maps out a compatible DAG for execution, the authors’ pilot study revealed the challenges of multi-step tool chains. For example, if an API call fails in step two, the entire chain breaks. The paper’s core contribution is limited to the routing and planning phase. For a true production deployment, practitioners must build their own error recovery, fallback, and retry mechanisms on top of the compose stage to handle real-world API timeouts or malformed outputs.
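A minimal sketch of what such a wrapper might look like is shown below. It is not part of SkillWeaver; run_step, its parameters, and the simulated failing tools are all hypothetical, and a production version would likely add logging, per-error handling, and output validation.

```python
import time

def run_step(primary, fallbacks=(), retries=2, delay=0.0):
    """Try primary up to retries + 1 times, then each fallback once; raise if all fail."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception as err:
            last_error = err
            if attempt < retries:
                time.sleep(delay * (2 ** attempt))  # exponential backoff between tries
    for fallback in fallbacks:
        try:
            return fallback()
        except Exception as err:
            last_error = err
    raise RuntimeError("step failed after retries and fallbacks") from last_error

# Simulated tools for the demonstration.
calls = {"n": 0}

def flaky_download():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated API timeout")
    return "dataset.csv"

def broken_parser():
    raise ValueError("simulated malformed output")

def backup_parser():
    return "table"

first = run_step(flaky_download)  # succeeds on the third attempt
second = run_step(broken_parser, fallbacks=[backup_parser], retries=1)
print(first, second, calls["n"])
```

Wrapping each node of the execution graph this way keeps a single timeout in step two from discarding the work already done in step one.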