Connect with us
DAPA Banner

Tech

HBO Max: The 26 Absolute Best Movies to Watch

Published

on

Continue Reading
Click to comment

You must be logged in to post a comment Login

Leave a Reply

Tech

Masimo's Apple Watch ban complaint dismissed by U.S. District Court

Published

on

Masimo’s long-time lawsuit over Apple Watch patent infringement has encountered another setback, as a U.S. District Court filing reveals the complaint against the USITC will be dismissed with prejudice.

Silver Apple Watch lying face down on a dark surface, showing circular heart-rate sensor array, side button, and red-ringed digital crown without the watchband attached
Sensor on Apple Watch Series 9

The USITC’s (United States International Trade Commission) decision to deny a reinstatement of a ban on the Apple Watch has turned into a hurdle for Masimo in its long-running blood oxygen patent lawsuit against Apple. In an April 24 filing by the U.S. District Court for the District of Columbia, a complaint against the USITC about a ban has been dealt with.
The filing, detailing a complaint between Masimo and the ITC, as well as U.S. Customs and Border Protection, explains a brief history of the court’s dealings with the ITC. While it doesn’t mention the Apple Watch directly, it’s all about the patent infringement suit with Apple and the implementation and dismissal of a ban.
Continue Reading on AppleInsider | Discuss on our Forums

Source link

Continue Reading

Tech

iVanky FusionDock Max 2 review: Hugely better value the second time around

Published

on

The iVanky FusionDock Max 2 adds a massive 23 ports, can support up to four displays if your Mac can, and is a formidable Thunderbolt 5 dock that improves on the original in utility as well as value for money.

Black iVANKY docking station with multiple USB ports and SD card slots on a white desk, next to a small blue spherical smart speaker against a light textured wall
iVanky FusionDock Max 2

In March 2024, AppleInsider reviewed the iVanky FusionDock Max 1. While we enjoyed the massive amount of ports, but its nosebleed pricing at the time and very limited selection of ideal host computers made it not the best option.
A lot has happened since then, not the least of which is Thunderbolt 5 on Mac.
Continue Reading on AppleInsider | Discuss on our Forums

Source link

Continue Reading

Tech

Maine’s governor vetoes data center moratorium

Published

on

Maine Governor Janet Mills has vetoed a bill that would have temporarily brought permits for new data centers to a halt.

If it had become law, L.D. 307 would have imposed the country’s first statewide moratorium on new data centers — lasting, in this case, until November 1, 2027. The bill also called for the creation of a 13-person council to study and make recommendations on data center construction.

With public opposition to data centers rising, other states including New York have considered similar moratoriums.

In a letter to the state legislature, Mills — a Democrat currently running for the U.S. Senate — said that pausing new data centers would be “appropriate given the impacts of massive data centers in other states on the environment and on electricity rates” and that she “would have signed this bill” if it had included an exemption for a data center project in the Town of Jay.

Advertisement

That project, Mills said, “enjoys strong local support from its host community and region.”

Melanie Sachs, a Democratic state representative who sponsored the bill, said Mills’ veto “poses significant potential consequences for all ratepayers, our electric grid, our environment, and our shared energy future.”

Source link

Advertisement
Continue Reading

Tech

Speed vs. Depth: How Does Using AI for Work Affect Our Confidence?

Published

on

Be careful delegating your work to that chatbot: A new peer-reviewed study published this month by the American Psychological Association found that people who heavily rely on AI tools for work tasks reported feeling less confident in their abilities and had less ownership over their work.

There has been growing research on how our brains function when we use AI tools. A landmark study from MIT in 2025 found that our brains don’t retain as much information or employ necessary critical thinking skills when writing tasks are outsourced to AI chatbots. 

This new study aimed to understand how our human behavior, specifically executive functions — like strategic planning and decision making — can change when AI is part of the process. 

Advertisement

Sarah Baldeo, the study’s author and a Ph.D. candidate in AI and neuroscience at Middlesex University in England, noted in the paper that these findings do not show that AI is harming or causing cognitive decline. Rather, they “highlight variability in how users distribute effort between themselves and AI systems under conditions of convenience and competence.” Meaning, people who use AI are making conscious trade-offs, and their confidence fluctuates as a result.

The study encouraged nearly 2,000 adults to use AI for a variety of workplace tasks, like prioritizing projects based on deadlines, explaining a strategy and developing plans with incomplete information. It then asked them to self-report their levels of confidence, ownership and AI reliance, including whether they significantly altered the AI-generated outputs. 

Overall, confidence varied with AI use. A greater reliance on AI was associated with lower confidence in their ability to reason independently. Participants also reported relatively few modifications, meaning they often did not tweak or put their own stamp on what the AI spit out. But those who modified the AI’s work reported feeling more confident and more like the author. Men reported higher reliance on AI than women.

The trade-off between speed and depth was one of the main themes participants reported.

Advertisement

“I got an answer faster, but I don’t think I thought as deeply as I normally would,” one of the participants said.

AI Atlas

This reflects one of the biggest caveats of using AI tools. Chatbots, for example, can produce text quickly, but it doesn’t always have the same level of subject matter expertise you need. AI tools can also hallucinate, or make up facts, so AI-generated output needs to be verified before it’s used. 

The office is one of the main places where people use AI tools. We’re moving beyond just chatbots, with agents that can autonomously handle tasks that would’ve otherwise required a human. 

But these tools aren’t necessarily making our work lives better; one study found they made workdays longer and more unpleasant. As AI becomes increasingly embedded in our work lives, it’s important to understand how it’s shaping our mental attitudes. Qualities like confidence and ownership of our work are important factors in determining the quality of our work life. 

Advertisement

Source link

Continue Reading

Tech

Samsung Messages is Shutting Down: Here’s How to Rescue All Your Messages Before It’s Gone

Published

on

The era of the Samsung Messages app is officially coming to an end. After years of preinstalling Google’s alternative on its newest Galaxy devices, Samsung is finally moving to deactivate its legacy texting platform for good this July. For those who have avoided the switch until now, the transition is no longer optional-failing to migrate means risking a major disruption in how you send and receive daily chats.

On a page with information about the switch, Samsung points to instructions on how to swap over to Google’s Messages app, including for phones that are still on Android 12 and Android 13. Samsung has historically preinstalled its own Messages app on Galaxy phones, but began transitioning toward Google Messages as early as 2021.

To encourage people to switch to Google Messages, Samsung’s instructions list new features offered by Google Messages, like RCS-enabled texting for features like typing indicators, easier group chats and sending higher-quality images. Google’s Messages app also has AI-powered spam detection and spam filters, multi-device access to messages and some built-in Gemini AI features. It’s also the app that most Android phones use as their default texting app, including Samsung’s more recent Galaxy S26. There are other SMS texting app alternatives in the Google Play Store if you don’t want to use the one made by Google.

Advertisement

Samsung has not said when exactly in July messaging will no longer work in the app. A Samsung representative didn’t immediately respond to a request for comment. Once the app is deactivated, only messaging to emergency services will work on Samsung Messages. 

While Samsung did stop including it as the default texting app in 2021, it wasn’t until 2024 that Samsung stopped preinstalling the texting app alongside Google Messages. The Galaxy S26 can’t download the Samsung Messages app, and other phones won’t be able to download it after the app’s July sunset.

Samsung said users of Android 11 or lower aren’t affected by the end of service, but would also likely benefit from switching to a supported texting app like Google Messages. To switch to Google Messages, the company asks users to download the app if it’s not already installed and to set it as the default SMS app when prompted after launching it. 

The post also notes that anyone using an older Galaxy Watch that runs on Samsung’s Tizen operating system will no longer have access to their full conversation history since these watches cannot use Google Messages. Samsung said that they will still be able to read and send text messages, but the company’s newer watches (Galaxy Watch 4 and later) that run WearOS will still have access to full conversations.

Advertisement

Source link

Continue Reading

Tech

Discord Sleuths Gained Unauthorized Access to Anthropic’s Mythos

Published

on

As researchers and practitioners debate the impact that new AI models will have on cybersecurity, Mozilla said on Tuesday it used early access to Anthropic’s Mythos Preview to find and fix 271 vulnerabilities in its new Firefox 150 browser release. Meanwhile, researchers identified a group of moderately successful North Korean hackers using AI for everything from vibe coding malware to creating fake company websites—stealing up to $12 million in three months.

Researchers have finally cracked disruptive malware known as Fast16 that predates Stuxnet and may have been used to target Iran’s nuclear program. It was created in 2005 and was likely deployed by the US or an ally.

Meta is being sued by the Consumer Federation of America, a nonprofit, over scam ads on Facebook and Instagram and allegedly misleading consumers about the company’s efforts to combat them. A United States surveillance program that lets the FBI view Americans’ communications without a warrant is up for renewal, but lawmakers are deadlocked on next steps. A new bill aims to address mounting lawmaker concerns, but lacks substance.

And if you’re looking for a deep dive, WIRED investigated the yearslong feud behind the prominent privacy and security conscious mobile operating system GrapheneOS. Plus we looked at the strange tale of how China spied on US figure skater Alysa Liu and her dad.

Advertisement

And there’s more. Each week, we round up the security and privacy news we didn’t cover in depth ourselves. Click the headlines to read the full stories. And stay safe out there.

Anthropic’s Mythos Preview AI model has been touted as a dangerously capable tool for finding security vulnerabilities in software and networks, so powerful that its creator has carefully restricted its release. But one group of amateur sleuths on Discord found their own, relatively simple ways—no AI hacking required—to gain unauthorized access to a coveted digital prize: Mythos itself.

Despite Anthropic’s efforts to control who can use Mythos Preview, a group of Discord users gained access to the tool through some straightforward relatively detective work: They examined data from a recent breach of Mercor, an AI training startup that works with developers, and “made an educated guess about the model’s online location based on knowledge about the format Anthropic has used for other models”—a phrase that many observers have speculated refers to a web URL—according to Bloomberg, which broke the story.

The person also reportedly took advantage of permissions they already possessed to access other Anthropic models, thanks to their work for an Anthropic contracting firm. As a result of their probing, however, they allegedly gained access to not only Mythos but other unreleased Anthropic AI models, too. Thankfully, according to Bloomberg, the group that accessed Mythos has only used it so far to build simple websites—a decision designed to prevent its detection by Anthropic—rather than hack the planet.

Advertisement

Security researchers have long warned that the telecom protocols known as Signaling System 7, or SS7, which govern how phone networks connect to one another and route calls and texts, are vulnerable to abuse that would allow surreptitious surveillance. This week researchers at the digital rights organization Citizen Lab revealed that at least two for-profit surveillance vendors have actually used those vulnerabilities—or similar ones in the next generation of telecom protocols—to spy on real victims. Citizen Lab found that two surveillance firms had essentially acted as rogue phone carriers, exploiting access to three small telecom firms—Israeli carrier 019Mobile, British cell provider Tango Mobile, and Airtel Jersey, based on the island of Jersey in the English Channel—to track the location of targets’ phones. Citizen Lab’s researchers say that “high-profile” people were tracked by the two surveillance firms, though it declined to name either the firms or their targets. Researchers warn, too, that the two companies they discovered abusing the protocols are likely not alone, and that the vulnerability of global telecom protocols remains a very real vector for phone spying worldwide.

In a sign of a growing—if belated—crackdown by US law enforcement on the sprawling criminal industry of human-trafficking-fueled scam compounds across Southeast Asia, the Department of Justice this week announced charges against two Chinese men for allegedly helping to manage a scam compound in Myanmar and seeking to open a second compound in Cambodia. Jiang Wen Jie and Huang Xingshan were both arrested in Thailand earlier this year on immigration charges, according to prosecutors, and now face charges for allegedly running a vast scamming operation that lured human trafficking victims to their compound with fake job offers and then forced them to scam victims, including Americans, for millions of dollars with cryptocurrency fraudulent investments. The DOJ says it also “restrained” $700 million in funds belonging to the operation—essentially freezing the funds in preparation for seizure—and also seized a channel on the messaging app Telegram prosecutors say was used to bait and enslave trafficking victims. The Justice Department’s statement claims that Huang personally took part in the physical punishment of workers in one compound, and that Jiang at one point oversaw the theft of $3 million from a single US scam victim.

Three scientific research institutions have been found selling British citizens’ health information on Alibaba, the British government and the nonprofit UK Biobank revealed this week. Over the last two decades, more than 500,000 people have shared their health data—including medical images, genetic information, and health care records—with UK Biobank, which allows scientists around the world to access the information to conduct medical research. However, the charity said the data leak involved a “breach of the contract” signed by three organizations, with one of the datasets for sale believed to have included data on all half-million research subjects. It did not detail the full types of data that were listed for sale but said it has suspended the Biobank accounts of those allegedly selling the information. The ads for the data have also been removed.

Earlier this month, 404 Media reported that the FBI was able to get copies of Signal messages from a defendant’s iPhone as the content of the messages, which are encrypted within Signal, were saved in an iOS push notification database. In this instance, the copies of the messages were still accessible even though Signal had been removed from the phone—though the issue affected all apps that send push notifications.

Advertisement

This week, in response to the issue, Apple released an iOS and iPadOS security update to fix the flaw. “Notifications marked for deletion could be unexpectedly retained on the device,” Apple’s security update for iOS 26.4.2 says. “A logging issue was addressed with improved data redaction.”

While the issue has been fixed, it is still worth changing what appears in notifications on your device. For Signal you can open the app, go to Settings, Notifications, and toggle notifications to show Name Only or No Name or Content. It is another reminder that while apps such as Signal are end-to-end encrypted, this applies to the content as it moves between devices: If someone can physically access and unlock your phone, there is the potential they can access everything on your device.

Source link

Advertisement
Continue Reading

Tech

2026 Green Powered Challenge: Ventilate Your Way To Power!

Published

on

Have you ever looked out across the rooftops of a city and idly gazed at the infrastructure that remains unseen from the street? It seems [varunsontakke80] has, because here’s their project, harvesting energy from the rotation of a rooftop ventilator.

The build is a relatively straightforward one, with a pair of disks with magnets attached being mounted on the ventilator shaft inside its dome. A third disk sits between them and is stationary, with a set of coils in which the magnets induce current as they move. A rectifier and charge circuit completes the picture.

This appears to be part of a college project, but despite searching, we can’t find any measure of how much power this thing generates. We’d be concerned that it might reduce the efficiency of the ventilator somewhat. There will be an inevitable tradeoff as power is harvested. Still, it’s a neat use of a ubiquitous piece of hardware, and we like it for that.

Advertisement

This hack is part of our 2026 Green Powered Challenge. You’ve got time to get your own entry in, so get a move on!

Source link

Advertisement
Continue Reading

Tech

From the ‘scurfy’ mouse to the Nobel Prize: How a Seattle biotech pioneer’s long game paid off

Published

on

The biotech industry is increasingly shaped by computer-designed drugs and investor pressure to move fast and show commercial traction. Nobel laureate Fred Ramsdell took a different path — one built on cell-based therapies, philanthropic funding and patient investing.

That path began at Darwin Molecular, a biotech startup in Bothell, Wash., that launched in 1992 with backing from Bill Gates and Paul Allen. The Microsoft co-founders weren’t chasing quick returns, Ramsdell said, and that freedom attracted dedicated researchers.

“People bought into that because you’re trying to do something that would make a difference,” he said. “It wasn’t a one-drug company. It wasn’t hyper-focused on something very specific. It was trying to figure out how we can affect change in patients.”

That mission-drive culture proved fertile ground. Ramsdell’s work at Darwin ultimately led to a Nobel Prize in Physiology or Medicine, awarded in October and shared with former Darwin colleague Mary Brunkow and Shimon Sakaguchi of Osaka University in Japan. The trio was recognized for foundational work in regulatory T cells, or Tregs — dubbed the “immune system’s security guards.”

Advertisement

The discovery of Tregs changed therapeutics by showing that the immune system has a built-in braking mechanism that can be enhanced to treat autoimmune disease, transplant rejection and graft-versus-host disease, or blocked to improve cancer immunotherapy.

Ramsdell recounted his journey at Life Science Washington’s annual conference in Seattle on Tuesday, tracing the unlikely origins of the discovery back to the Cold War.

The Darwin team studied a line of mice descended from post-Manhattan Project research into the effects of radiation on living organisms. In 1949, the program produced a mouse from a naturally occurring, non-radiation-induced mutation, later named “scurfy.”

A fraction of the male mice were riddled with illness and lived for only a few weeks. “They had every autoimmune disease in one animal,” Ramsdell said — diabetes, Crohn’s disease, psoriasis, myocarditis and more.

Advertisement

That suffering pointed to something important. The scurfy mice carried a mutation the Darwin scientists identified and named Foxp3 — a gene essential to keeping the immune system from attacking the body’s own healthy cells. The mouse gene has a human counterpart, FOXP3.

“We recognized the potential of these cells,” Ramsdell said. Introducing healthy Tregs into people with autoimmune disease could treat the condition — but the scientific tools to make that a reality didn’t yet exist.

Darwin was acquired in 1996 by London-based Chiroscience Group, which merged with the British company Celltech. When the company shut down its Washington R&D operations in 2004, Ramsdell and Brunkow moved on.

Ramsdell eventually landed at the Parker Institute for Cancer Immunotherapy, which he helped launch in 2016. The nonprofit research institute presented another unique opportunity. Founded with a $250 million grant from tech entrepreneur Sean Parker, it operates as a collaborative network across seven major U.S. cancer centers, applying immunotherapy to cancer in ways that siloed institutions couldn’t.

Advertisement

The secret ingredient, Ramsdell said, was trust — built deliberately through Parker Institute retreats that included scientists and their families.

“The ability to build trust and collaboration, true collaboration, and combine [research] that wouldn’t otherwise be combined, was incredibly appealing to me,” he said.

Today, Ramsdell serves as a scientific advisor for the Parker Institute and for Sonoma Biotherapeutics, a Seattle- and South San Francisco-based startup he co-founded that is focused on Treg cells. The company has a partnership with Regeneron to co-develop cell therapies for Crohn’s disease, ulcerative colitis and other conditions — a direct line from the scurfy mice of the 1940s to the clinic.

Even in advisory roles, Ramsdell keeps returning to big-picture biological questions. He’s currently intrigued by people who carry genetic predispositions for diseases that never materialize — and what that might reveal about the hidden coding in their DNA that hold illness at bay.

Advertisement

Looking at this phenomenon across populations, scientists can explore these genetic factors, he said, “and that will open up a lot of your doors.”

Source link

Continue Reading

Tech

Monitoring LLM behavior: Drift, retries, and refusal patterns

Published

on

The stochastic challenge

Traditional software is predictable: Input A plus function B always equals output C. This determinism allows engineers to develop robust tests. On the other hand, generative AI is stochastic and unpredictable. The exact same prompt often yields different results on Monday versus Tuesday, breaking the traditional unit testing that engineers know and love.

To ship enterprise-ready AI, engineers cannot rely on mere “vibe checks” that pass today but fail when customers use the product. Product builders need to adopt a new infrastructure layer: The AI Evaluation Stack.

This framework is informed by my extensive experience shipping AI products for Fortune 500 enterprise customers in high-stakes industries, where “hallucination” is not funny — it’s a huge compliance risk.

Defining the AI evaluation paradigm

Traditional software tests are binary assertions (pass/fail). While some AI evals use binary asserts, many evaluate on a gradient. An eval is not a single script; it is a structured pipeline of assertions — ranging from strict code syntax to nuanced semantic checks — that verify the AI system’s intended function.

Advertisement

The taxonomy of evaluation checks

To build a robust, cost-effective pipeline, asserts must be separated into two distinct architectural layers:

Layer 1: Deterministic assertions

A surprisingly large share of production AI failures aren’t semantic “hallucinations” — they are basic syntax and routing failures. Deterministic assertions serve as the pipeline’s first gate, using traditional code and regex to validate structural integrity.

Instead of asking if a response is “helpful,” these assertions ask strict, binary questions:

  • Did the model generate the correct JSON key/value schema?

  • Did it invoke the correct tool call with the required arguments?

  • Did it successfully slot-fill a valid GUID or email address?

// Example: Layer 1 Deterministic Tool Call Assertion

Advertisement

{

  “test_scenario”: “User asks to look up an account”,

  “assertion_type”: “schema_validation”,

  “expected_action”: “Call API: get_customer_record”,

Advertisement

  “actual_ai_output”: “I found the customer.”,

  “eval_result”: “FAIL – AI hallucinated conversational text instead of generating the required API payload.”

}

In the example above, the test failed instantly because the model generated conversational text instead of the required tool call payload.

Advertisement

Architecturally, deterministic assertions must be the first layer of the stack, operating on a computationally inexpensive “fail-fast” principle. If a downstream API requires a specific schema, a malformed JSON string is a fatal error. By failing the evaluation immediately at this layer, engineering teams prevent the pipeline from triggering expensive semantic checks (Layer 2) or wasting valuable human review time (Layer 3).

Layer 2: Model-based assertions

When deterministic assertions pass, the pipeline must evaluate semantic quality. Because natural language is fluid, traditional code cannot easily assert if a response is “helpful” or “empathetic.” This introduces model-based evaluation, commonly referred to as “LLM-as-a-Judge” or “LLM-Judge.”

While using one non-deterministic system to evaluate another seems counterintuitive, it is an exceptionally powerful architectural pattern for use cases requiring nuance. It is virtually impossible to write a reliable regex to verify if a response is “actionable” or “polite.” While human reviewers excel at this nuance, they cannot scale to evaluate tens of thousands of CI/CD test cases. Thus, the LLM-as-a-Judge becomes the scalable proxy for human discernment.

3 critical inputs for model-based assertions

However, model-based assertions only yield reliable data when the LLM-as-a-Judge is provisioned with three critical inputs:

Advertisement
  1. A state-of-the-art reasoning model: The Judge must possess superior reasoning capabilities compared to the production model. If your app runs on a smaller, faster model for latency, the judge must be a frontier reasoning model to approximate human-level discernment.

  2. A strict assessment rubric: Vague evaluation prompts (“Rate how good this answer is”) yield noisy, stochastic evaluations. A robust rubric explicitly defines the gradients of failure and success. (For example, a “Helpfulness” rubric should define Score 1 as an irrelevant refusal, Score 2 as addressing the prompt but lacking actionable steps, and Score 3 as providing actionable next steps strictly within context.)

  3. Ground truth (golden outputs): While the rubric provides the rules, a human-vetted “expected answer” acts as the answer key. When the LLM-Judge can compare the production model’s output against a verified Golden Output, its scoring reliability increases dramatically.

Architecture: The offline vs online pipeline

A robust evaluation architecture requires two complementary pipelines. The online pipeline monitors post-deployment telemetry, while the offline pipeline provides the foundational baseline and deterministic constraints required to evaluate stochastic models safely.

The offline evaluation pipeline

The offline pipeline’s primary objective is regression testing — identifying failures, drift, and latency before production. Deploying an enterprise LLM feature without a gating offline evaluation suite is an architectural anti-pattern; it is the equivalent of merging uncompiled code into a main branch.

Process

1. Curating the golden dataset

The offline lifecycle begins by curating a “golden dataset” — a static, version-controlled repository of 200 to 500 test cases representing the AI’s full operational envelope. Each case pairs an exact input payload with an expected “golden output” (ground truth).

Crucially, this dataset must reflect expected real-world traffic distributions. While most cases cover standard “happy-path” interactions, engineers must systematically incorporate edge cases, jailbreaks, and adversarial inputs. Evaluating “refusal capabilities” under stress remains a strict compliance requirement.

Advertisement

Example test case payload (standard tool use):

  • Input: “Schedule a 30-minute follow-up meeting with the client for next Tuesday at 10 a.m.”

  • Expected output (golden): The system successfully invokes the schedule_meeting tool with the correct JSON payload: {“duration_minutes”: 30, “day”: “Tuesday”, “time”: “10 AM”, “attendee”: “client_email”}.

While manually curating hundreds of edge cases is tedious, the process can be accelerated with synthetic data generation pipelines that use a specialized LLM to produce diverse TSV/CSV test payloads. However, relying entirely on AI-generated test cases introduces the risk of data contamination and bias. A human-in-the-loop (HITL) architecture is mandatory at this stage; domain experts must manually review, edit, and validate the synthetic dataset to ensure it accurately reflects real-world user intent and enterprise policy before it is committed to the repository.

2. Defining the evaluation criteria

Once the dataset is curated, engineers must design the evaluation criteria to compute a composite score for each model output. A robust architecture achieves this by assigning weighted points across a hybrid of Layer 1 (deterministic) and Layer 2 (model-based) asserts.

Consider an AI agent executing a “send email” tool. An evaluation framework might utilize a 10-point scoring system:

Advertisement
  • Layer 1: Deterministic asserts (6 points): Did the agent invoke the correct tool? (2 pts). Did it produce a valid JSON object? (2 pts). Does the JSON strictly adhere to the expected schema? (2 pts).

  • Layer 2: Model-based asserts (4 points): (Note: Semantic rubrics must be highly use-case specific). Does the subject line reflect user intent? (1 pt). Does the email body match expected outputs without hallucination? (1 pt). Were CC/BCC fields leveraged accurately? (1 pt). Was the appropriate priority flag inferred? (1 pt).

To understand why the LLM-Judge awarded these points, the engineer must prompt the judge to supply its reasoning for each score. This is crucial for debugging failures.

The passing threshold and short-circuit logic 

In this example, an 8/10 passing threshold requires 8 points for success. Crucially, the evaluation pipeline must enforce strict short-circuit evaluation (fail-fast logic). If the model fails any deterministic assertion — such as generating a malformed JSON schema — the system must instantly fail the entire test case (0/10). There is zero architectural value in invoking an expensive LLM-Judge to assess the semantic “politeness” of an email if the underlying API call is structurally broken.

3. Executing the pipeline and aggregating signals

Using an evaluation infrastructure of choice, the system executes the offline pipeline — typically integrated as a blocking CI/CD step during a pull request. The infrastructure iterates through the golden dataset, injecting each test payload into the production model, capturing the output, and executing defined assertions against it.

Advertisement

Each output is scored against the passing threshold. Once batch execution is complete, results are aggregated into an overall pass rate. For enterprise-grade applications, the baseline pass rate must typically exceed 95%, scaling to 99%-plus for strict compliance or high-risk domains.

4. Assessment, iteration, and alignment

Based on aggregated failure data, engineering teams conduct a root-cause analysis of failing test cases. This assessment drives iterative updates to core components: refining system prompts, modifying tool descriptions, augmenting knowledge sources, or adjusting hyperparameters (like temperature or top-p). Continuous optimization remains best practice even after achieving a 95% pass rate.

Crucially, any system modification necessitates a full regression test. Because LLMs are inherently non-deterministic, an update intended to fix one specific edge case can easily cause unforeseen degradations in other areas. The entire offline pipeline must be rerun to validate that the update improved quality without introducing regressions.

The online evaluation pipeline

While the offline pipeline acts as a strict pre-deployment gatekeeper, the online pipeline is the post-deployment telemetry system. Its objective is to monitor real-world behavior, capturing emergent edge cases, and quantifying model drift. Architects must instrument applications to capture five distinct categories of telemetry:

Advertisement

1. Explicit user signals

Direct, deterministic feedback indicating model performance:

  • Thumbs up/down: Disproportionate negative feedback is the most immediate leading indicator of system degradation, directing immediate engineering investigation.

  • Verbatim in-app feedback: Systematically parsing written comments identifies novel failure modes to integrate back into the offline “golden dataset.”

2. Implicit behavioral signals

Behavioral telemetry reveals silent failures where users give up without explicit feedback:

  • Regeneration and retry rates: High frequencies of retries indicate the initial output failed to resolve user intent.

  • Apology rate: Programmatically scanning for heuristic triggers (“I’m sorry”) detects degraded capabilities or broken tool routing.

  • Refusal rate: Artificially high refusal rates (“I can’t do that”) indicate over-calibrated safety filters rejecting benign user queries.

3. Production deterministic asserts (synchronous)

Because deterministic code checks execute in milliseconds, teams can seamlessly reuse Layer 1 offline asserts (schema conformity, tool validity) to synchronously evaluate 100% of production traffic. Logging these pass/fail rates instantly detects anomalous spikes in malformed outputs — the earliest warning sign of silent model drift or provider-side API changes.

4. Production LLM-as-a-Judge (asynchronous)

If strict data privacy agreements (DPAs) permit logging user inputs, teams can deploy model-based asserts. Architecturally, production LLM-Judges must never execute synchronously on the critical path, which doubles latency and compute costs. Instead, a background LLM-Judge asynchronously samples a fraction (5%) of daily sessions, grading outputs against the offline rubric to generate a continuous quality dashboard.

Advertisement

Engineering the feedback loop (the “flywheel”)

Evaluation pipelines are not “set-it-and-forget-it” infrastructure. Without continuous updates, static datasets suffer from “rot” (concept drift) as user behavior evolves and customers discover novel use cases.

For example, an HR chatbot might boast a pristine 99% offline pass rate for standard payroll questions. However, if the company suddenly announces a new equity plan, users will immediately begin prompting the AI about vesting schedules — a domain entirely missing from the offline evaluations.

To make the system smarter over time, engineers must architect a closed feedback loop that mines production telemetry for continuous improvement.

The continuous improvement workflow:

Advertisement
  1. Capture: A user triggers an explicit negative signal (a “thumbs down”) or an implicit behavioral flag in production.

  2. Triage: The specific session log is automatically flagged and routed for human review.

  3. Root-cause analysis: A domain expert investigates the failure, identifies the gap, and updates the AI system to successfully handle similar requests.

  4. Dataset augmentation: The novel user input, paired with the newly corrected expected output, is appended to the offline Golden Dataset alongside several synthetic variations.

  5. Regression testing: The model is continuously re-evaluated against this newly discovered edge case in all future runs.

Building an evaluation pipeline without monitoring production logs and updating datasets is fundamentally insufficient. Users are unpredictable. Evaluating on stale data creates a dangerous illusion: High offline pass rates masking a rapidly degrading real-world experience.

Conclusion: The new “definition of done”

In the era of generative AI, a feature or product is no longer “done” simply because the code compiles and the prompt returns a coherent response. It is only done when a rigorous, automated evaluation pipeline is deployed and stable — and when the model consistently passes against both a curated golden dataset and newly discovered production edge cases.

This guide has equipped you with a comprehensive blueprint for building that reality. From architecting offline regression pipelines and online telemetry to the continuous feedback flywheel and navigating enterprise anti-patterns, you now have the structural foundation required to deploy AI systems with greater confidence.

Now, it is your turn. Share this framework with your engineering, product, and legal teams to establish a unified, cross-functional standard for AI quality in your organization. Stop guessing whether your models are degrading in production, and start measuring.

Advertisement

Derah Onuorah is a Microsoft senior product manager.

Welcome to the VentureBeat community!

Our guest posting program is where technical experts share insights and provide neutral, non-vested deep dives on AI, data infrastructure, cybersecurity and other cutting-edge technologies shaping the future of enterprise.

Read more from our guest post program — and check out our guidelines if you’re interested in contributing an article of your own!

Advertisement

Source link

Continue Reading

Tech

PenPal, A Robotic Drawing Assistant

Published

on

Emergent properties include examples like murmurations of starlings which can’t be predicted from looking at a single bird, weather which can’t be predicted by looking at a few air molecules, and consciousness which can’t be predicted by looking at a neuron. Likewise, when adding a new tool to a workflow, emergent properties can show up as well. A group at Chicago University developed a robotic drawing tool and a few artists developed some unique drawing methods using it.

The robotic pen uses a pair of tendons to extend the working end out a certain amount. From there it uses a set of servos to can be programmed to revolve around in a defined path, making repeating movements while the artist makes larger movements over the paper. Originally meant for shading, small circles or simpler back-and-forth movements were preset, but with full control over the pen’s behavior the artist can shift focus away to other tasks within the creative process. A study with ten participants was done which showed artists coming up with novel ways of using a tool like this, and others reporting that it’s almost like drawing together with another person.

Looking for novel ways that humans can interact with computers and robots can often lead to surprising outcomes like this. Members of this group aren’t new to novel human interface devices either; they’ve also built a squishy dynamic button as well.

Advertisement

Source link

Advertisement
Continue Reading

Trending

Copyright © 2025