
Tech

Threat actor uses Microsoft Teams to deploy new “Snow” malware


A threat group tracked as UNC6692 uses social engineering to deploy a new, custom malware suite named “Snow,” which includes a browser extension, a tunneler, and a backdoor.

The group’s goal is to steal sensitive data after gaining deep network access through credential theft and domain takeover.

According to Google’s Mandiant researchers, the attacker uses “email bombing” tactics to create urgency, then contacts targets via Microsoft Teams, posing as IT helpdesk agents.


A recent Microsoft report highlighted the growing popularity of this tactic in the cybercrime space, tricking users into granting attackers remote access via Quick Assist or other remote access tools.

In the case of UNC6692, the victim is prompted to click a link to install a patch that would supposedly block email spam. In reality, the link delivers a dropper that executes AutoHotkey scripts loading “SnowBelt,” a malicious Chrome extension.

Malicious page used in the attacks
Source: Google

The extension runs in a headless Microsoft Edge instance, so the victim doesn’t notice anything, while scheduled tasks and a startup folder shortcut are also created for persistence.

SnowBelt serves as both a persistence mechanism and a relay for commands the operator sends to a Python-based backdoor named SnowBasin.

Commands are delivered through a WebSocket tunnel established by a tunneler tool called SnowGlaze, to mask communications between the host and the command-and-control (C2) infrastructure.

SnowGlaze also facilitates SOCKS proxy operations, allowing arbitrary TCP traffic to be routed through the infected host.

SnowBasin runs a local HTTP server and executes attacker-supplied CMD or PowerShell commands on the infected system, relaying the results back to the operator through the same pipeline.


The malware supports remote shell access, data exfiltration, file download, screenshot capturing, and basic file management operations.

The operator can also issue a self-termination command to shut down the backdoor on the host.

SnowBasin capabilities
Source: Google

Mandiant has found that, post-compromise, the attackers performed internal reconnaissance, scanning for services such as SMB and RDP to identify additional targets, and then moved laterally on the network.

The attackers dumped LSASS memory to extract credential material and used pass-the-hash techniques to authenticate to additional hosts, eventually reaching domain controllers.

At the final stage of the attack, the threat actor deployed FTK Imager to extract the Active Directory database, along with SYSTEM, SAM, and SECURITY registry hives.


These files were exfiltrated from the network using LimeWire, giving the attackers access to sensitive credential data across the domain.

Attack lifecycle
Source: Google

The report provides extensive indicators of compromise (IoCs) and also YARA rules to help detect the “Snow” toolset.



Tech

Speed vs. Depth: How Does Using AI for Work Affect Our Confidence?


Be careful delegating your work to that chatbot: A new peer-reviewed study published this month by the American Psychological Association found that people who heavily rely on AI tools for work tasks reported feeling less confident in their abilities and had less ownership over their work.

There has been growing research on how our brains function when we use AI tools. A landmark study from MIT in 2025 found that our brains don’t retain as much information or employ necessary critical thinking skills when writing tasks are outsourced to AI chatbots. 

This new study aimed to understand how our human behavior, specifically executive functions — like strategic planning and decision making — can change when AI is part of the process. 


Sarah Baldeo, the study’s author and a Ph.D. candidate in AI and neuroscience at Middlesex University in England, noted in the paper that these findings do not show that AI is harming or causing cognitive decline. Rather, they “highlight variability in how users distribute effort between themselves and AI systems under conditions of convenience and competence.” Meaning, people who use AI are making conscious trade-offs, and their confidence fluctuates as a result.

The study encouraged nearly 2,000 adults to use AI for a variety of workplace tasks, like prioritizing projects based on deadlines, explaining a strategy and developing plans with incomplete information. It then asked them to self-report their levels of confidence, ownership and AI reliance, including whether they significantly altered the AI-generated outputs. 

Overall, confidence varied with AI use. Greater reliance on AI was associated with lower confidence in participants’ ability to reason independently. Participants also reported relatively few modifications, meaning they often did not tweak or put their own stamp on what the AI spit out. But those who modified the AI’s work reported feeling more confident and more like the author. Men reported higher reliance on AI than women.

The trade-off between speed and depth was one of the main themes participants reported.


“I got an answer faster, but I don’t think I thought as deeply as I normally would,” one of the participants said.


This reflects one of the biggest caveats of using AI tools. Chatbots, for example, can produce text quickly, but it doesn’t always have the same level of subject matter expertise you need. AI tools can also hallucinate, or make up facts, so AI-generated output needs to be verified before it’s used. 

The office is one of the main places where people use AI tools. We’re moving beyond just chatbots, with agents that can autonomously handle tasks that would’ve otherwise required a human. 

But these tools aren’t necessarily making our work lives better; one study found they made workdays longer and more unpleasant. As AI becomes increasingly embedded in our work lives, it’s important to understand how it’s shaping our mental attitudes. Qualities like confidence and ownership of our work are important factors in determining the quality of our work life. 


Tech

Samsung Messages is Shutting Down: Here’s How to Rescue All Your Messages Before It’s Gone


The era of the Samsung Messages app is officially coming to an end. After years of preinstalling Google’s alternative on its newest Galaxy devices, Samsung is finally moving to deactivate its legacy texting platform for good this July. For those who have avoided the switch until now, the transition is no longer optional: failing to migrate means risking a major disruption in how you send and receive daily chats.

On a page with information about the switch, Samsung points to instructions on how to swap over to Google’s Messages app, including for phones that are still on Android 12 and Android 13. Samsung has historically preinstalled its own Messages app on Galaxy phones, but began transitioning toward Google Messages as early as 2021.

To encourage people to switch to Google Messages, Samsung’s instructions list new features offered by Google Messages, like RCS-enabled texting for features like typing indicators, easier group chats and sending higher-quality images. Google’s Messages app also has AI-powered spam detection and spam filters, multi-device access to messages and some built-in Gemini AI features. It’s also the app that most Android phones use as their default texting app, including Samsung’s more recent Galaxy S26. There are other SMS texting app alternatives in the Google Play Store if you don’t want to use the one made by Google.


Samsung has not said when exactly in July messaging will no longer work in the app. A Samsung representative didn’t immediately respond to a request for comment. Once the app is deactivated, only messaging to emergency services will work on Samsung Messages. 

While Samsung did stop including it as the default texting app in 2021, it wasn’t until 2024 that Samsung stopped preinstalling the texting app alongside Google Messages. The Galaxy S26 can’t download the Samsung Messages app, and other phones won’t be able to download it after the app’s July sunset.

Samsung said users of Android 11 or lower aren’t affected by the end of service, but would also likely benefit from switching to a supported texting app like Google Messages. To switch to Google Messages, the company asks users to download the app if it’s not already installed and to set it as the default SMS app when prompted after launching it. 

The post also notes that anyone using an older Galaxy Watch that runs on Samsung’s Tizen operating system will no longer have access to their full conversation history since these watches cannot use Google Messages. Samsung said that they will still be able to read and send text messages, but the company’s newer watches (Galaxy Watch 4 and later) that run WearOS will still have access to full conversations.


Tech

Discord Sleuths Gained Unauthorized Access to Anthropic’s Mythos


As researchers and practitioners debate the impact that new AI models will have on cybersecurity, Mozilla said on Tuesday it used early access to Anthropic’s Mythos Preview to find and fix 271 vulnerabilities in its new Firefox 150 browser release. Meanwhile, researchers identified a group of moderately successful North Korean hackers using AI for everything from vibe coding malware to creating fake company websites—stealing up to $12 million in three months.

Researchers have finally cracked disruptive malware known as Fast16 that predates Stuxnet and may have been used to target Iran’s nuclear program. It was created in 2005 and was likely deployed by the US or an ally.

Meta is being sued by the Consumer Federation of America, a nonprofit, over scam ads on Facebook and Instagram and allegedly misleading consumers about the company’s efforts to combat them. A United States surveillance program that lets the FBI view Americans’ communications without a warrant is up for renewal, but lawmakers are deadlocked on next steps. A new bill aims to address mounting lawmaker concerns, but lacks substance.

And if you’re looking for a deep dive, WIRED investigated the yearslong feud behind the prominent privacy and security conscious mobile operating system GrapheneOS. Plus we looked at the strange tale of how China spied on US figure skater Alysa Liu and her dad.


And there’s more. Each week, we round up the security and privacy news we didn’t cover in depth ourselves. Click the headlines to read the full stories. And stay safe out there.

Anthropic’s Mythos Preview AI model has been touted as a dangerously capable tool for finding security vulnerabilities in software and networks, so powerful that its creator has carefully restricted its release. But one group of amateur sleuths on Discord found their own, relatively simple ways—no AI hacking required—to gain unauthorized access to a coveted digital prize: Mythos itself.

Despite Anthropic’s efforts to control who can use Mythos Preview, a group of Discord users gained access to the tool through some relatively straightforward detective work: They examined data from a recent breach of Mercor, an AI training startup that works with developers, and “made an educated guess about the model’s online location based on knowledge about the format Anthropic has used for other models”—a phrase that many observers have speculated refers to a web URL—according to Bloomberg, which broke the story.

One member of the group also reportedly took advantage of permissions they already possessed to access other Anthropic models, thanks to their work for an Anthropic contracting firm. As a result of their probing, however, they allegedly gained access to not only Mythos but other unreleased Anthropic AI models, too. Thankfully, according to Bloomberg, the group that accessed Mythos has only used it so far to build simple websites—a decision designed to prevent its detection by Anthropic—rather than hack the planet.


Security researchers have long warned that the telecom protocols known as Signaling System 7, or SS7, which govern how phone networks connect to one another and route calls and texts, are vulnerable to abuse that would allow surreptitious surveillance. This week researchers at the digital rights organization Citizen Lab revealed that at least two for-profit surveillance vendors have actually used those vulnerabilities—or similar ones in the next generation of telecom protocols—to spy on real victims. Citizen Lab found that two surveillance firms had essentially acted as rogue phone carriers, exploiting access to three small telecom firms—Israeli carrier 019Mobile, British cell provider Tango Mobile, and Airtel Jersey, based on the island of Jersey in the English Channel—to track the location of targets’ phones. Citizen Lab’s researchers say that “high-profile” people were tracked by the two surveillance firms, though it declined to name either the firms or their targets. Researchers warn, too, that the two companies they discovered abusing the protocols are likely not alone, and that the vulnerability of global telecom protocols remains a very real vector for phone spying worldwide.

In a sign of a growing—if belated—crackdown by US law enforcement on the sprawling criminal industry of human-trafficking-fueled scam compounds across Southeast Asia, the Department of Justice this week announced charges against two Chinese men for allegedly helping to manage a scam compound in Myanmar and seeking to open a second compound in Cambodia. Jiang Wen Jie and Huang Xingshan were both arrested in Thailand earlier this year on immigration charges, according to prosecutors, and now face charges for allegedly running a vast scamming operation that lured human trafficking victims to their compound with fake job offers and then forced them to scam victims, including Americans, for millions of dollars with cryptocurrency fraudulent investments. The DOJ says it also “restrained” $700 million in funds belonging to the operation—essentially freezing the funds in preparation for seizure—and also seized a channel on the messaging app Telegram prosecutors say was used to bait and enslave trafficking victims. The Justice Department’s statement claims that Huang personally took part in the physical punishment of workers in one compound, and that Jiang at one point oversaw the theft of $3 million from a single US scam victim.

Three scientific research institutions have been found selling British citizens’ health information on Alibaba, the British government and the nonprofit UK Biobank revealed this week. Over the last two decades, more than 500,000 people have shared their health data—including medical images, genetic information, and health care records—with UK Biobank, which allows scientists around the world to access the information to conduct medical research. However, the charity said the data leak involved a “breach of the contract” signed by three organizations, with one of the datasets for sale believed to have included data on all half-million research subjects. It did not detail the full types of data that were listed for sale but said it has suspended the Biobank accounts of those allegedly selling the information. The ads for the data have also been removed.

Earlier this month, 404 Media reported that the FBI was able to get copies of Signal messages from a defendant’s iPhone as the content of the messages, which are encrypted within Signal, were saved in an iOS push notification database. In this instance, the copies of the messages were still accessible even though Signal had been removed from the phone—though the issue affected all apps that send push notifications.


This week, in response to the issue, Apple released an iOS and iPadOS security update to fix the flaw. “Notifications marked for deletion could be unexpectedly retained on the device,” Apple’s security update for iOS 26.4.2 says. “A logging issue was addressed with improved data redaction.”

While the issue has been fixed, it is still worth changing what appears in notifications on your device. For Signal you can open the app, go to Settings, Notifications, and toggle notifications to show Name Only or No Name or Content. It is another reminder that while apps such as Signal are end-to-end encrypted, this applies to the content as it moves between devices: If someone can physically access and unlock your phone, there is the potential they can access everything on your device.


Tech

2026 Green Powered Challenge: Ventilate Your Way To Power!


Have you ever looked out across the rooftops of a city and idly gazed at the infrastructure that remains unseen from the street? It seems [varunsontakke80] has, because here’s their project, harvesting energy from the rotation of a rooftop ventilator.

The build is a relatively straightforward one, with a pair of disks with magnets attached being mounted on the ventilator shaft inside its dome. A third disk sits between them and is stationary, with a set of coils in which the magnets induce current as they move. A rectifier and charge circuit completes the picture.

This appears to be part of a college project, but despite searching, we can’t find any measure of how much power this thing generates. We’d be concerned that it might reduce the efficiency of the ventilator somewhat. There will be an inevitable tradeoff as power is harvested. Still, it’s a neat use of a ubiquitous piece of hardware, and we like it for that.


This hack is part of our 2026 Green Powered Challenge. You’ve got time to get your own entry in, so get a move on!


Tech

From the ‘scurfy’ mouse to the Nobel Prize: How a Seattle biotech pioneer’s long game paid off


The biotech industry is increasingly shaped by computer-designed drugs and investor pressure to move fast and show commercial traction. Nobel laureate Fred Ramsdell took a different path — one built on cell-based therapies, philanthropic funding and patient investing.

That path began at Darwin Molecular, a biotech startup in Bothell, Wash., that launched in 1992 with backing from Bill Gates and Paul Allen. The Microsoft co-founders weren’t chasing quick returns, Ramsdell said, and that freedom attracted dedicated researchers.

“People bought into that because you’re trying to do something that would make a difference,” he said. “It wasn’t a one-drug company. It wasn’t hyper-focused on something very specific. It was trying to figure out how we can affect change in patients.”

That mission-driven culture proved fertile ground. Ramsdell’s work at Darwin ultimately led to a Nobel Prize in Physiology or Medicine, awarded in October and shared with former Darwin colleague Mary Brunkow and Shimon Sakaguchi of Osaka University in Japan. The trio was recognized for foundational work in regulatory T cells, or Tregs — dubbed the “immune system’s security guards.”


The discovery of Tregs changed therapeutics by showing that the immune system has a built-in braking mechanism that can be enhanced to treat autoimmune disease, transplant rejection and graft-versus-host disease, or blocked to improve cancer immunotherapy.

Ramsdell recounted his journey at Life Science Washington’s annual conference in Seattle on Tuesday, tracing the unlikely origins of the discovery back to the Cold War.

The Darwin team studied a line of mice descended from post-Manhattan Project research into the effects of radiation on living organisms. In 1949, the program produced a mouse from a naturally occurring, non-radiation-induced mutation, later named “scurfy.”

A fraction of the male mice were riddled with illness and lived for only a few weeks. “They had every autoimmune disease in one animal,” Ramsdell said — diabetes, Crohn’s disease, psoriasis, myocarditis and more.


That suffering pointed to something important. The scurfy mice carried a mutation the Darwin scientists identified and named Foxp3 — a gene essential to keeping the immune system from attacking the body’s own healthy cells. The mouse gene has a human counterpart, FOXP3.

“We recognized the potential of these cells,” Ramsdell said. Introducing healthy Tregs into people with autoimmune disease could treat the condition — but the scientific tools to make that a reality didn’t yet exist.

Darwin was acquired in 1996 by London-based Chiroscience Group, which merged with the British company Celltech. When the company shut down its Washington R&D operations in 2004, Ramsdell and Brunkow moved on.

Ramsdell eventually landed at the Parker Institute for Cancer Immunotherapy, which he helped launch in 2016. The nonprofit research institute presented another unique opportunity. Founded with a $250 million grant from tech entrepreneur Sean Parker, it operates as a collaborative network across seven major U.S. cancer centers, applying immunotherapy to cancer in ways that siloed institutions couldn’t.


The secret ingredient, Ramsdell said, was trust — built deliberately through Parker Institute retreats that included scientists and their families.

“The ability to build trust and collaboration, true collaboration, and combine [research] that wouldn’t otherwise be combined, was incredibly appealing to me,” he said.

Today, Ramsdell serves as a scientific advisor for the Parker Institute and for Sonoma Biotherapeutics, a Seattle- and South San Francisco-based startup he co-founded that is focused on Treg cells. The company has a partnership with Regeneron to co-develop cell therapies for Crohn’s disease, ulcerative colitis and other conditions — a direct line from the scurfy mice of the 1940s to the clinic.

Even in advisory roles, Ramsdell keeps returning to big-picture biological questions. He’s currently intrigued by people who carry genetic predispositions for diseases that never materialize — and what that might reveal about the hidden coding in their DNA that holds illness at bay.


Looking at this phenomenon across populations, scientists can explore these genetic factors, he said, “and that will open up a lot of your doors.”


Tech

Monitoring LLM behavior: Drift, retries, and refusal patterns


The stochastic challenge

Traditional software is predictable: Input A plus function B always equals output C. This determinism allows engineers to develop robust tests. On the other hand, generative AI is stochastic and unpredictable. The exact same prompt often yields different results on Monday versus Tuesday, breaking the traditional unit testing that engineers know and love.

To ship enterprise-ready AI, engineers cannot rely on mere “vibe checks” that pass today but fail when customers use the product. Product builders need to adopt a new infrastructure layer: The AI Evaluation Stack.

This framework is informed by my extensive experience shipping AI products for Fortune 500 enterprise customers in high-stakes industries, where “hallucination” is not funny — it’s a huge compliance risk.

Defining the AI evaluation paradigm

Traditional software tests are binary assertions (pass/fail). While some AI evals use binary asserts, many evaluate on a gradient. An eval is not a single script; it is a structured pipeline of assertions — ranging from strict code syntax to nuanced semantic checks — that verify the AI system’s intended function.


The taxonomy of evaluation checks

To build a robust, cost-effective pipeline, asserts must be separated into two distinct architectural layers:

Layer 1: Deterministic assertions

A surprisingly large share of production AI failures aren’t semantic “hallucinations” — they are basic syntax and routing failures. Deterministic assertions serve as the pipeline’s first gate, using traditional code and regex to validate structural integrity.

Instead of asking if a response is “helpful,” these assertions ask strict, binary questions:

  • Did the model generate the correct JSON key/value schema?

  • Did it invoke the correct tool call with the required arguments?

  • Did it successfully slot-fill a valid GUID or email address?

// Example: Layer 1 Deterministic Tool Call Assertion
{
  "test_scenario": "User asks to look up an account",
  "assertion_type": "schema_validation",
  "expected_action": "Call API: get_customer_record",
  "actual_ai_output": "I found the customer.",
  "eval_result": "FAIL – AI hallucinated conversational text instead of generating the required API payload."
}

In the example above, the test failed instantly because the model generated conversational text instead of the required tool call payload.


Architecturally, deterministic assertions must be the first layer of the stack, operating on a computationally inexpensive “fail-fast” principle. If a downstream API requires a specific schema, a malformed JSON string is a fatal error. By failing the evaluation immediately at this layer, engineering teams prevent the pipeline from triggering expensive semantic checks (Layer 2) or wasting valuable human review time (Layer 3).
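The fail-fast gate described above can be sketched in a few lines of Python. The expected schema (a `tool` name plus an `arguments` payload for the `get_customer_record` call from the earlier example) and the email slot check are illustrative assumptions, not details prescribed by the article:

```python
import json
import re

REQUIRED_KEYS = {"tool", "arguments"}  # illustrative tool-call schema
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def layer1_gate(raw_output: str) -> tuple[bool, str]:
    """Fail-fast structural checks; returns (passed, reason)."""
    try:
        payload = json.loads(raw_output)          # 1. must be valid JSON
    except json.JSONDecodeError:
        return False, "malformed JSON"
    missing = REQUIRED_KEYS - payload.keys()      # 2. must match the schema
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    if payload["tool"] != "get_customer_record":  # 3. must invoke the right tool
        return False, f"wrong tool: {payload['tool']}"
    email = payload["arguments"].get("email", "")
    if not EMAIL_RE.match(email):                 # 4. slot-fill must be valid
        return False, "invalid email slot"
    return True, "ok"

# The conversational reply from the example fails at the very first check:
print(layer1_gate("I found the customer."))   # (False, 'malformed JSON')
print(layer1_gate(json.dumps({
    "tool": "get_customer_record",
    "arguments": {"email": "jane@example.com"},
})))                                          # (True, 'ok')
```

Because the gate returns on the first failed check, no later (more expensive) evaluation work is triggered for structurally broken outputs.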

Layer 2: Model-based assertions

When deterministic assertions pass, the pipeline must evaluate semantic quality. Because natural language is fluid, traditional code cannot easily assert if a response is “helpful” or “empathetic.” This introduces model-based evaluation, commonly referred to as “LLM-as-a-Judge” or “LLM-Judge.”

While using one non-deterministic system to evaluate another seems counterintuitive, it is an exceptionally powerful architectural pattern for use cases requiring nuance. It is virtually impossible to write a reliable regex to verify if a response is “actionable” or “polite.” While human reviewers excel at this nuance, they cannot scale to evaluate tens of thousands of CI/CD test cases. Thus, the LLM-as-a-Judge becomes the scalable proxy for human discernment.

3 critical inputs for model-based assertions

However, model-based assertions only yield reliable data when the LLM-as-a-Judge is provisioned with three critical inputs:

  1. A state-of-the-art reasoning model: The Judge must possess superior reasoning capabilities compared to the production model. If your app runs on a smaller, faster model for latency, the judge must be a frontier reasoning model to approximate human-level discernment.

  2. A strict assessment rubric: Vague evaluation prompts (“Rate how good this answer is”) yield noisy, stochastic evaluations. A robust rubric explicitly defines the gradients of failure and success. (For example, a “Helpfulness” rubric should define Score 1 as an irrelevant refusal, Score 2 as addressing the prompt but lacking actionable steps, and Score 3 as providing actionable next steps strictly within context.)

  3. Ground truth (golden outputs): While the rubric provides the rules, a human-vetted “expected answer” acts as the answer key. When the LLM-Judge can compare the production model’s output against a verified Golden Output, its scoring reliability increases dramatically.
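Plumbing for these three inputs might look like the sketch below. The rubric wording follows the helpfulness example above; the prompt layout and function names are illustrative, and the actual call to a frontier reasoning model is omitted:

```python
import re

# Rubric text mirrors the 3-point helpfulness gradient described above.
HELPFULNESS_RUBRIC = """Score the RESPONSE on helpfulness:
1 = irrelevant refusal
2 = addresses the prompt but lacks actionable steps
3 = actionable next steps, strictly within context
Reply with only the integer score."""

def build_judge_prompt(user_input: str, response: str, golden: str) -> str:
    """Combine rubric, ground truth, and the production model's output."""
    return (
        f"{HELPFULNESS_RUBRIC}\n\n"
        f"USER INPUT:\n{user_input}\n\n"
        f"GOLDEN (human-vetted) ANSWER:\n{golden}\n\n"
        f"RESPONSE TO SCORE:\n{response}"
    )

def parse_judge_score(judge_reply: str) -> int:
    """Extract the integer score; noisy judge replies fail loudly."""
    m = re.search(r"\b([123])\b", judge_reply)
    if not m:
        raise ValueError(f"unscorable judge reply: {judge_reply!r}")
    return int(m.group(1))
```

In a real pipeline, the string from `build_judge_prompt` would be sent to a stronger reasoning model than the production one; only the plumbing around that call is shown here.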

Architecture: The offline vs online pipeline

A robust evaluation architecture requires two complementary pipelines. The online pipeline monitors post-deployment telemetry, while the offline pipeline provides the foundational baseline and deterministic constraints required to evaluate stochastic models safely.

The offline evaluation pipeline

The offline pipeline’s primary objective is regression testing — identifying failures, drift, and latency before production. Deploying an enterprise LLM feature without a gating offline evaluation suite is an architectural anti-pattern; it is the equivalent of merging uncompiled code into a main branch.

Process

1. Curating the golden dataset

The offline lifecycle begins by curating a “golden dataset” — a static, version-controlled repository of 200 to 500 test cases representing the AI’s full operational envelope. Each case pairs an exact input payload with an expected “golden output” (ground truth).

Crucially, this dataset must reflect expected real-world traffic distributions. While most cases cover standard “happy-path” interactions, engineers must systematically incorporate edge cases, jailbreaks, and adversarial inputs. Evaluating “refusal capabilities” under stress remains a strict compliance requirement.


Example test case payload (standard tool use):

  • Input: “Schedule a 30-minute follow-up meeting with the client for next Tuesday at 10 a.m.”

  • Expected output (golden): The system successfully invokes the schedule_meeting tool with the correct JSON payload: {"duration_minutes": 30, "day": "Tuesday", "time": "10 AM", "attendee": "client_email"}.

While manually curating hundreds of edge cases is tedious, the process can be accelerated with synthetic data generation pipelines that use a specialized LLM to produce diverse TSV/CSV test payloads. However, relying entirely on AI-generated test cases introduces the risk of data contamination and bias. A human-in-the-loop (HITL) architecture is mandatory at this stage; domain experts must manually review, edit, and validate the synthetic dataset to ensure it accurately reflects real-world user intent and enterprise policy before it is committed to the repository.
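A golden-dataset entry mirroring the scheduling example above might be stored as below. The field names, category tags, and review flag are illustrative conventions, not a format prescribed by the article:

```python
from collections import Counter

# One version-controlled golden-dataset entry for the scheduling case.
GOLDEN_DATASET = [
    {
        "id": "meeting-001",
        "category": "happy_path",  # vs "edge_case", "jailbreak", "adversarial"
        "input": "Schedule a 30-minute follow-up meeting with the client "
                 "for next Tuesday at 10 a.m.",
        "expected_tool": "schedule_meeting",
        "expected_payload": {
            "duration_minutes": 30,
            "day": "Tuesday",
            "time": "10 AM",
            "attendee": "client_email",
        },
        "human_reviewed": True,  # HITL sign-off before commit
    },
]

def distribution(dataset):
    """Report the category mix so the set can track real-world traffic."""
    return Counter(case["category"] for case in dataset)
```

Tracking the category distribution makes it easy to verify that happy-path cases dominate while edge cases and adversarial inputs are still systematically represented.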

2. Defining the evaluation criteria

Once the dataset is curated, engineers must design the evaluation criteria to compute a composite score for each model output. A robust architecture achieves this by assigning weighted points across a hybrid of Layer 1 (deterministic) and Layer 2 (model-based) asserts.

Consider an AI agent executing a “send email” tool. An evaluation framework might utilize a 10-point scoring system:

  • Layer 1: Deterministic asserts (6 points): Did the agent invoke the correct tool? (2 pts). Did it produce a valid JSON object? (2 pts). Does the JSON strictly adhere to the expected schema? (2 pts).

  • Layer 2: Model-based asserts (4 points): (Note: Semantic rubrics must be highly use-case specific). Does the subject line reflect user intent? (1 pt). Does the email body match expected outputs without hallucination? (1 pt). Were CC/BCC fields leveraged accurately? (1 pt). Was the appropriate priority flag inferred? (1 pt).

To understand why the LLM-Judge awarded these points, the engineer must prompt the judge to supply its reasoning for each score. This is crucial for debugging failures.

The passing threshold and short-circuit logic 

In this example, a test case passes only if it earns at least 8 of the 10 points. Crucially, the evaluation pipeline must enforce strict short-circuit evaluation (fail-fast logic). If the model fails any deterministic assertion — such as generating a malformed JSON schema — the system must instantly fail the entire test case (0/10). There is zero architectural value in invoking an expensive LLM-Judge to assess the semantic “politeness” of an email if the underlying API call is structurally broken.
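The weighted scheme and short-circuit rule can be expressed as a small scoring function. Point values follow the 10-point email example above; the function shape itself is an illustrative sketch:

```python
def composite_score(det_results, judge_points):
    """10-point hybrid score with short-circuit (fail-fast) logic.

    det_results: list of (passed, points) for the Layer 1 asserts
                 (correct tool, valid JSON, schema match: 2 pts each).
    judge_points: total awarded by the LLM-judge across the four 1-pt
                  Layer 2 criteria; only counted if Layer 1 fully passes.
    """
    if not all(passed for passed, _ in det_results):
        return 0  # any structural failure voids the whole case
    layer1 = sum(points for _, points in det_results)  # up to 6
    return layer1 + min(judge_points, 4)               # up to 10

PASS_THRESHOLD = 8

# A malformed JSON assert short-circuits the case to 0/10:
print(composite_score([(True, 2), (False, 2), (True, 2)], judge_points=4))  # 0
# All Layer 1 asserts pass, judge awards 3 of 4 semantic points:
print(composite_score([(True, 2), (True, 2), (True, 2)], judge_points=3))   # 9
```

In a full pipeline, the judge would simply never be called when Layer 1 fails, saving the cost of the semantic evaluation entirely.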

3. Executing the pipeline and aggregating signals

Using an evaluation infrastructure of choice, the system executes the offline pipeline — typically integrated as a blocking CI/CD step during a pull request. The infrastructure iterates through the golden dataset, injecting each test payload into the production model, capturing the output, and executing defined assertions against it.

Each output is scored against the passing threshold. Once batch execution is complete, results are aggregated into an overall pass rate. For enterprise-grade applications, the baseline pass rate must typically exceed 95%, scaling to 99%-plus for strict compliance or high-risk domains.
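The batch loop described above can be sketched as a blocking CI gate. The model call and per-case scorer are stand-ins, and the dataset shape and helper names are assumptions; only the 8-point case threshold and 95% aggregate gate come from the text.

```python
PASS_THRESHOLD = 8          # points out of 10 for a single case to pass
REQUIRED_PASS_RATE = 0.95   # enterprise baseline pass rate from the text

def run_offline_eval(golden_dataset, run_model, score_case):
    """Execute every golden case; return (aggregate pass rate, failing IDs)."""
    failures = []
    for case in golden_dataset:
        output = run_model(case["input"])
        points = score_case(output, case["expected"])
        if points < PASS_THRESHOLD:
            failures.append(case["id"])
    passed = len(golden_dataset) - len(failures)
    return passed / len(golden_dataset), failures

# Stub model and scorer so the sketch runs end to end: case 3 fails.
dataset = [{"id": i, "input": f"case-{i}", "expected": None} for i in range(20)]
rate, failed = run_offline_eval(
    dataset,
    run_model=lambda x: x,
    score_case=lambda out, exp: 10 if not out.endswith("-3") else 0,
)
print(f"pass rate: {rate:.0%}, failing cases: {failed}")  # pass rate: 95%, failing cases: [3]

# In CI/CD this becomes the blocking pre-merge gate:
assert rate >= REQUIRED_PASS_RATE, "offline eval below threshold - block merge"
```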

4. Assessment, iteration, and alignment

Based on aggregated failure data, engineering teams conduct a root-cause analysis of failing test cases. This assessment drives iterative updates to core components: refining system prompts, modifying tool descriptions, augmenting knowledge sources, or adjusting hyperparameters (like temperature or top-p). Continuous optimization remains best practice even after achieving a 95% pass rate.

Crucially, any system modification necessitates a full regression test. Because LLMs are inherently non-deterministic, an update intended to fix one specific edge case can easily cause unforeseen degradations in other areas. The entire offline pipeline must be rerun to validate that the update improved quality without introducing regressions.

The online evaluation pipeline

While the offline pipeline acts as a strict pre-deployment gatekeeper, the online pipeline is the post-deployment telemetry system. Its objective is to monitor real-world behavior, capture emergent edge cases, and quantify model drift. Architects must instrument applications to capture four distinct categories of telemetry:

1. Explicit user signals

Direct, deterministic feedback indicating model performance:

  • Thumbs up/down: Disproportionate negative feedback is the most immediate leading indicator of system degradation and should trigger immediate engineering investigation.

  • Verbatim in-app feedback: Systematically parsing written comments identifies novel failure modes to integrate back into the offline “golden dataset.”

2. Implicit behavioral signals

Behavioral telemetry reveals silent failures where users give up without explicit feedback:

  • Regeneration and retry rates: High frequencies of retries indicate the initial output failed to resolve user intent.

  • Apology rate: Programmatically scanning for heuristic triggers (“I’m sorry”) detects degraded capabilities or broken tool routing.

  • Refusal rate: Artificially high refusal rates (“I can’t do that”) indicate over-calibrated safety filters rejecting benign user queries.
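The apology and refusal rates above reduce to simple pattern scans over logged replies. A minimal sketch follows; the trigger phrases are illustrative assumptions that a real deployment would tune per product.

```python
import re

# Illustrative heuristic triggers (assumed, not exhaustive).
APOLOGY_RE = re.compile(r"\bI'?m sorry\b|\bI apologize\b", re.IGNORECASE)
REFUSAL_RE = re.compile(r"\bI can'?t (?:do|help with) that\b", re.IGNORECASE)

def implicit_signal_rates(session_outputs):
    """Return apology and refusal rates over a batch of model replies."""
    n = len(session_outputs)
    apologies = sum(bool(APOLOGY_RE.search(s)) for s in session_outputs)
    refusals = sum(bool(REFUSAL_RE.search(s)) for s in session_outputs)
    return {"apology_rate": apologies / n, "refusal_rate": refusals / n}

replies = [
    "Here is your report.",
    "I'm sorry, the calendar tool returned an error.",
    "I can't do that because it violates policy.",
    "Done - email sent.",
]
print(implicit_signal_rates(replies))
# {'apology_rate': 0.25, 'refusal_rate': 0.25}
```

Tracked over time, a sustained rise in either rate flags degraded tool routing or over-strict safety filters before any explicit user complaint arrives.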

3. Production deterministic asserts (synchronous)

Because deterministic code checks execute in milliseconds, teams can seamlessly reuse Layer 1 offline asserts (schema conformity, tool validity) to synchronously evaluate 100% of production traffic. Logging these pass/fail rates instantly detects anomalous spikes in malformed outputs — the earliest warning sign of silent model drift or provider-side API changes.
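Reusing a Layer 1 assert in the production hot path might look like the sketch below. The required field names and logger wiring are assumptions; a real system would emit counters to a metrics backend rather than log lines.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("layer1")

# Assumed schema: every tool-calling response must carry these fields.
REQUIRED_FIELDS = {"tool", "args"}

def assert_and_log(raw_output: str) -> bool:
    """Synchronously validate one production response and log pass/fail.

    Runs in microseconds, so it can cover 100% of traffic.
    """
    try:
        parsed = json.loads(raw_output)
        ok = isinstance(parsed, dict) and REQUIRED_FIELDS <= parsed.keys()
    except json.JSONDecodeError:
        ok = False
    # A spike in FAIL entries is the early-warning signal of silent
    # model drift or a provider-side API change.
    log.info("layer1_schema_check %s", "pass" if ok else "FAIL")
    return ok

print(assert_and_log('{"tool": "send_email", "args": {}}'))  # True
print(assert_and_log('{"tool": "send_email"'))               # False (malformed JSON)
```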

4. Production LLM-as-a-Judge (asynchronous)

If strict data privacy agreements (DPAs) permit logging user inputs, teams can deploy model-based asserts. Architecturally, production LLM-Judges must never execute synchronously on the critical path, which doubles latency and compute costs. Instead, a background LLM-Judge asynchronously samples a fraction (5%) of daily sessions, grading outputs against the offline rubric to generate a continuous quality dashboard.
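The sampling step can be sketched in a few lines. The 5% rate comes from the text; the session-ID shape and the use of random sampling (rather than deterministic hashing of session IDs, another common choice) are illustrative assumptions.

```python
import random

SAMPLE_RATE = 0.05  # grade roughly 5% of daily sessions, per the text

def sample_for_judging(session_ids, rate=SAMPLE_RATE, seed=None):
    """Pick the subset of sessions to enqueue for the background LLM-Judge."""
    rng = random.Random(seed)
    return [sid for sid in session_ids if rng.random() < rate]

sessions = [f"sess-{i}" for i in range(1000)]
sampled = sample_for_judging(sessions, seed=42)
print(len(sampled))  # roughly 50 of 1000

# Each sampled session would then be graded asynchronously against the
# offline rubric, off the request critical path, feeding a quality dashboard.
```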

Engineering the feedback loop (the “flywheel”)

Evaluation pipelines are not “set-it-and-forget-it” infrastructure. Without continuous updates, static datasets suffer from “rot” (concept drift) as user behavior evolves and customers discover novel use cases.

For example, an HR chatbot might boast a pristine 99% offline pass rate for standard payroll questions. However, if the company suddenly announces a new equity plan, users will immediately begin prompting the AI about vesting schedules — a domain entirely missing from the offline evaluations.

To make the system smarter over time, engineers must architect a closed feedback loop that mines production telemetry for continuous improvement.

The continuous improvement workflow:

  1. Capture: A user triggers an explicit negative signal (a “thumbs down”) or an implicit behavioral flag in production.

  2. Triage: The specific session log is automatically flagged and routed for human review.

  3. Root-cause analysis: A domain expert investigates the failure, identifies the gap, and updates the AI system to successfully handle similar requests.

  4. Dataset augmentation: The novel user input, paired with the newly corrected expected output, is appended to the offline Golden Dataset alongside several synthetic variations.

  5. Regression testing: The model is continuously re-evaluated against this newly discovered edge case in all future runs.
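Step 4 of the workflow, appending a triaged production failure to the golden dataset, might look like this sketch. The JSONL file format and record fields are assumptions; a real pipeline would commit the record to the dataset repository for expert review first.

```python
import json
import os
import tempfile

def append_to_golden_dataset(path, user_input, corrected_output, source="production"):
    """Append a triaged failure (with provenance metadata) to the golden
    dataset so every future offline run regresses this edge case."""
    record = {"input": user_input, "expected": corrected_output, "source": source}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Demo against a throwaway file.
path = os.path.join(tempfile.mkdtemp(), "golden.jsonl")
append_to_golden_dataset(
    path,
    "When does my new equity grant vest?",
    "Your grant vests quarterly over four years, starting at your grant date.",
)
with open(path, encoding="utf-8") as f:
    print(sum(1 for _ in f))  # 1 record appended
```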

Building an evaluation pipeline without monitoring production logs and updating datasets is fundamentally insufficient. Users are unpredictable. Evaluating on stale data creates a dangerous illusion: high offline pass rates that mask a rapidly degrading real-world experience.

Conclusion: The new “definition of done”

In the era of generative AI, a feature or product is no longer “done” simply because the code compiles and the prompt returns a coherent response. It is only done when a rigorous, automated evaluation pipeline is deployed and stable — and when the model consistently passes against both a curated golden dataset and newly discovered production edge cases.

This guide has equipped you with a comprehensive blueprint for building that reality. From architecting offline regression pipelines and online telemetry to the continuous feedback flywheel and navigating enterprise anti-patterns, you now have the structural foundation required to deploy AI systems with greater confidence.

Now, it is your turn. Share this framework with your engineering, product, and legal teams to establish a unified, cross-functional standard for AI quality in your organization. Stop guessing whether your models are degrading in production, and start measuring.

Derah Onuorah is a Microsoft senior product manager.

Welcome to the VentureBeat community!

Our guest posting program is where technical experts share insights and provide neutral, non-vested deep dives on AI, data infrastructure, cybersecurity and other cutting-edge technologies shaping the future of enterprise.

Read more from our guest post program — and check out our guidelines if you’re interested in contributing an article of your own!


Tech

PenPal, A Robotic Drawing Assistant

Published

on

Emergent properties include examples like murmurations of starlings, which can't be predicted from looking at a single bird; weather, which can't be predicted by looking at a few air molecules; and consciousness, which can't be predicted by looking at a neuron. Likewise, when adding a new tool to a workflow, emergent properties can show up. A group at the University of Chicago developed a robotic drawing tool, and a few artists developed some unique drawing methods using it.

The robotic pen uses a pair of tendons to extend the working end out by a set amount. From there, a set of servos can be programmed to revolve the tip along a defined path, making repeating movements while the artist makes larger movements over the paper. Originally meant for shading, the tool comes preset with small circles and simple back-and-forth movements, but with full control over the pen's behavior, the artist can shift focus to other tasks within the creative process. A study with ten participants showed some artists coming up with novel ways of using a tool like this, and others reporting that it's almost like drawing together with another person.

Looking for novel ways that humans can interact with computers and robots can often lead to surprising outcomes like this. Members of this group aren't new to novel human interface devices, either; they've also built a squishy dynamic button.


Tech

Microsoft rolls out revamped Windows Insider Program

Published

on

Windows 11

Microsoft says it’s rolling out a revamped Windows Insider Program experience as part of the broader plans to address reliability concerns in Windows 11.

For those unaware, the Windows Insider Program is a beta testing program that allows you to test early Windows releases and provide your feedback to Microsoft.

Until now, Microsoft has not really listened to feedback from testers, and that has added up to a poor Windows experience.


To address this, Microsoft is now making the Windows Insider Program simpler and more transparent in the hope that it will help with the development of Windows 11.

In a blog post, Microsoft admitted that the current channel structure is confusing.

The Insider Program was simpler when Microsoft replaced Insider Rings with Channels similar to Chromium's (Beta, Dev, and Canary), but over time the structure has become more and more confusing.

There’s no clarity on what channel you should pick if you want to be on the edge and test new features as they develop internally at Microsoft. In fact, most testers never get access to experimental features, thanks to Microsoft’s Controlled Feature Rollout (CFR).

Microsoft has acknowledged that the experience is frustrating: you read about a new feature on the internet, update your PC, hoping to test and provide feedback, and then find out it’s not there.

“That experience, where features are announced but only some of you receive them due to how we gradually roll things out, is the single biggest frustration we hear,” writes Alec Oot, who is responsible for the Windows Update experience at Microsoft.

While you can use third-party tools like ViveTool to enable experimental features, it’s not the ideal experience and isn’t what you signed up for.

Microsoft says the Windows Insider Program is now simpler and more transparent

Microsoft says it’s listening to feedback, making all channels simpler, and moving the Insider Program to just two channels.

The first new channel is ‘Experimental,’ which replaces the Dev and Canary channels. The name makes it obvious that it’s the channel you should sign up for if all you want to do is test experimental features, which may never ship in production.

The second new channel is still called ‘Beta,’ which is an updated version of the original Beta Channel.

Windows Insider Program now has only two channels (Source: Microsoft)

In the Beta Channel, Microsoft is ending gradual feature rollouts, which means all new features mentioned in the release notes will be immediately available.

In the Experimental channel, you’ll be given access to some features out of the box, but others will be locked behind a flag.

Feature flags to turn on features that are gradually rolling out (Source: Microsoft)

The good news is you can manually toggle experimental features from Windows Settings.

For example, if you want to try out new haptic features for the mouse but the feature isn't showing due to a gradual rollout, you can open Windows Insider Program Settings > Feature flags, then turn on the feature.

Microsoft explains how it’s rolling out the new channels to Windows Insiders

Microsoft says it is moving Insiders to the new channels in phases, starting with Dev Channel users, who will now move to Experimental.

If you are in Dev and do not see the new Experimental channel UI yet, Microsoft says you can manually turn it on by going to Settings > Windows Update > Windows Insider Program > Feature flags and enabling the new experience.

Over the next few weeks, Microsoft will also move Canary users to specific versions of Experimental.

Those on the Canary 28000 series will move to Experimental (26H1), while users who installed the optional 29500 series update will move to Experimental (Future Platforms).

Advanced Insider Program controls to test future platform releases (Source: Microsoft)

Beta Channel users will move to the new Beta experience, but Microsoft says some minor feature changes may happen during the transition.

If you want to keep access to all existing experimental features, Microsoft recommends moving from Beta to Dev before the transition, as Dev is being moved to Experimental.

Microsoft is also changing how it shares build details.

As part of today’s rollout, Microsoft is shipping Build 26220.8283 for Beta, Build 26300.8289 for Experimental, Build 28020.1873 for Experimental 26H1, and Build 29576.1000 for Experimental Future Platforms.

Today’s update includes early access to a new Windows Update experience where you can pause updates as you desire, avoid forced reboots, and more.



Tech

BMW brings color changing tech closer to production with the iX3 Flow Edition

Published

on


Unveiled at the 2026 Beijing Auto Show, the BMW iX3 Flow Edition integrates E Ink’s Prism technology directly into the vehicle’s hood, bringing the concept closer to real-world application. Unlike earlier efforts that relied on external layers of segmented panels, this version embeds the electrophoretic system into the structure of…


Tech

What’s The Difference Between Kelly And Goodyear Tires?

Published

on





If you’re shopping for Kelly tires, you might be surprised to find yourself on the Goodyear site. No, this isn’t a fluke: Goodyear and Kelly have been sister brands since 1935. Today, Goodyear is the Tire & Rubber Company’s premium flagship brand. It’s the more high-end of the two, offering more durability across a wider range of different driving conditions than Kelly. Rain, snow, or rugged terrain, Goodyear probably has a tire for you.

Kelly Tires is more straightforward. Of the two, it's definitely the more budget-friendly option. The Kelly brand is technically older than Goodyear itself, but it has existed under the Goodyear corporate umbrella since the 1930s. It might not be on the cutting edge of innovation, and it might not advertise the same top-tier performance specs, but Kelly does do one thing better than Goodyear: give you fine-enough tires at a lower price point. You still get all-season traction and year-round reliability, just at a much more accessible cost per tire. Beyond pricing, the product lines are pretty different. Goodyear has six different tire types for over half a dozen kinds of vehicles, but Kelly's lineup is much simpler.

Differences in warranty and product lineup

Goodyear’s full lineup covers snow, sport, heavy-duty, and all-season tires for cars, trucks, SUVs, trailers, and more. Kelly’s selection is much smaller and more streamlined than that; just five tire models, and all five of them are all-season, no winter tires or summer tires. Not a lot of variety there compared to Goodyear, but that’s okay. It’s not trying to be Goodyear.

Then there’s the respective warranties. Goodyear has one of the best tire warranties around; a 60-day satisfaction guarantee that basically gives drivers two whole months to think about their purchase. Kelly also has a satisfaction guarantee, but it’s a little more limited than Goodyear’s; 45 days compared to Goodyear’s 60, or about a month and a half. Still, both Goodyear and Kelly give you price matching and access to post-purchase customer support. When it comes down to it, the difference is less about quality versus inferiority, and more about intended use and budget. Goodyear’s more premium, while Kelly’s more affordable.



Copyright © 2025