Tech

DOGE Goes Nuclear: How Trump Invited Silicon Valley Into America’s Nuclear Power Regulator

Published


3 April 2026

NewsAdmin

from the move-fast-and-nuke-things dept

This story was originally published by ProPublica. Republished under a CC BY-NC-ND 3.0 license.

Last summer, a group of officials from the Department of Energy gathered at the Idaho National Laboratory, a sprawling 890-square-mile complex in the eastern desert of Idaho where the U.S. government built its first rudimentary nuclear power plant in 1951 and continues to test cutting-edge technology.

On the agenda that day: the future of nuclear energy in the Trump era. The meeting was convened by 31-year-old lawyer Seth Cohen. Just five years out of law school, Cohen brought no significant experience in nuclear law or policy; he had just entered government through Elon Musk’s Department of Government Efficiency team.

As Cohen led the group through a technical conversation about licensing nuclear reactor designs, he repeatedly downplayed health and safety concerns. When staff brought up the topic of radiation exposure from nuclear test sites, Cohen broke in.

“They are testing in Utah. … I don’t know, like 70 people live there,” he said.

“But … there’s lots of babies,” one staffer pushed back. Babies, pregnant women and other vulnerable groups are thought to be potentially more susceptible to cancers brought on by low-level radiation exposure, and they are usually afforded greater protections.

“They’ve been downwind before,” another staffer joked.

“This is why we don’t use AI transcription in meetings,” another added.

ProPublica reviewed records of that meeting, providing a rare look at a dramatic shift underway in one of the most sensitive domains of public policy. The Trump administration is upending the way nuclear energy is regulated, driven by a desire to dramatically increase the amount of energy available to power artificial intelligence.

Career experts have been forced out and thousands of pages of regulations are being rewritten at a sprint. A new generation of nuclear energy companies — flush with Silicon Valley cash and boasting strong political connections — wield increasing influence over policy. Figures like Cohen are forcing a “move fast and break things” Silicon Valley ethos on one of the country’s most important regulators.

The Trump administration has been particularly aggressive in its attacks on the Nuclear Regulatory Commission, the bipartisan independent regulator that approves commercial nuclear power plants and monitors their safety. The agency is not a household name. But it’s considered the international gold standard, often influencing safety rules around the world.

The NRC has critics, especially in Silicon Valley, where the often-cautious commission is portrayed as an impediment to innovation. In an early salvo, President Donald Trump fired NRC Commissioner Christopher Hanson last June after Hanson spoke out about the importance of agency independence. It was the first time an NRC commissioner had been fired.

During that Idaho meeting, Cohen shot down any notion of NRC independence in the new era.

“Assume the NRC is going to do whatever we tell the NRC to do,” he said, records reviewed by ProPublica show. In November, Cohen was made chief counsel for nuclear policy at the Department of Energy, where he oversees a broad nuclear portfolio.

The aggressive moves have sent shock waves through the nuclear energy world. Many longtime promoters of the industry say they worry recklessness from the Trump administration could discredit responsible nuclear energy initiatives.

“The regulator is no longer an independent regulator — we do not know whose interests it is serving,” warned Allison Macfarlane, who served as NRC chair during the Obama administration. “The safety culture is under threat.”

A ProPublica analysis of staffing data from the NRC and the Office of Personnel Management shows a rush to the exits: Over 400 people have left the agency since Trump took office. The losses are particularly pronounced in the teams that handle reactor and nuclear materials safety and among veteran staffers with 10 or more years of experience. Meanwhile, hiring of new staff has proceeded at a snail’s pace, with nearly 60 new arrivals in the first year of the Trump administration compared with nearly 350 in the last year of the Biden administration.

Some nuclear power supporters say the administration is providing a needed level of urgency given the energy demands of AI. They also contend the sweeping changes underway aren’t as dangerous or dire as some experts suggest.

“I think the NRC has been frozen in time,” said Brett Rampal, the senior director of nuclear and power strategy at the investment and strategy consultancy Veriten. “It’s a great time to get unfrozen and aim to work quickly.”

The White House referred most of ProPublica’s questions to the Department of Energy, where spokesperson Olivia Tinari said the agency is committed to helping build more safe, high-quality nuclear energy facilities.

“Thanks to President Trump’s leadership, America’s nuclear industry is entering a new era that will provide reliable, abundant power for generations to come,” she wrote. The DOE is “committed to the highest standards of safety for American workers and communities.”

Cohen did not respond to multiple requests for comment. The NRC declined to comment.

Blindsided by DOGE

The U.S. has not had a serious nuclear incident since the Three Mile Island partial meltdown in 1979, a track record many experts attribute to a rigorous regulatory environment and an intense safety culture.

Major nuclear incidents around the world have only strengthened the resolve of past regulators to stay independent from industry and from political winds. A chief cause of Japan’s Fukushima accident, investigators found, was the cozy relationship between the country’s industry and oversight body, which opened the door for thin safety assessments and inaccurate projections overlooking the possible impact of a major tsunami.

“We knew regulatory capture led directly to Fukushima and to Chernobyl,” said Kathryn Huff, who was assistant secretary for the Office of Nuclear Energy during the Biden administration.

The U.S. has barely built any nuclear power plants in recent decades. Only three new reactors have been completed in the last 25 years, and since 1990 the U.S. has barely added any net new nuclear electricity to its grid. Though about 20% of U.S. electricity is supplied by nuclear power plants, the fleet is aging. Some experts blame the slow build-out on the challenging economics of financing a multibillion-dollar project and the uncertainty of accessing and disposing of nuclear fuels.

But an increasingly vocal group of industry voices and deregulation advocates have blamed the slow build-out on overly cautious and inefficient regulators. Among the most powerful exponents of this view are billionaires Peter Thiel and Marc Andreessen; both venture capitalists have their own investments in the nuclear energy sector and are influential Trump supporters.

Andreessen camped out at Mar-a-Lago, Trump’s private club in Florida, after Trump won the 2024 election, helping pick staff for the new administration. In late 2024, Thiel personally vetted at least one candidate for the Office of Nuclear Energy, according to people familiar with the conversations. Neither responded to requests for comment.

Four months into his second term, Trump signed a series of executive orders designed to supercharge nuclear power build-out. “It’s a hot industry, it’s a brilliant industry,” said Trump, flanked by nuclear energy CEOs in the Oval Office. He added: “And it’s become very safe.”

Under those orders, the NRC was directed to reduce its workforce, speed up the timeline for approving nuclear reactors and rewrite many of its safety rules. The DOE — which has a vast nuclear portfolio, including waste cleanup sites and government research labs — was tasked with creating a pathway for so-called advanced nuclear companies to test their designs.

The goal, Trump said, was to quadruple nuclear energy output and provide new power to the data centers behind the AI boom.

As DOGE gutted agencies, departures mounted in the nuclear sector. Career experts in nuclear regulations and safety departed or were forced out. When Trump fired Hanson, a Democratic NRC commissioner, the president’s team explained the move by saying, “All organizations are more effective when leaders are rowing in the same direction.”

In an unsigned email to ProPublica, the White House press office wrote: “All commissioners are presidential appointees and can be fired just like any other appointee.”

In August, the NRC’s top attorney resigned and was replaced by oil and gas lawyer David Taggart, who had been working on DOGE cuts at the DOE. In all, the nuclear office at the DOE had lost about a third of its staff, according to a January 2026 count by the Federation of American Scientists, a nonprofit focused on science and technology policy.

That summer, Cohen and a team of DOGE operatives touched down at the NRC offices, a series of nondescript towers across from a Dunkin’ in suburban Maryland. He was joined by Adam Blake, an investor who had recently founded an AI medical startup and has a background in real estate and solar energy, and Ankur Bansal, president of a company that created software for real estate agents. Neither would comment for this story.

Many career officials who spoke with ProPublica were blindsided: The new Trump officials at the NRC seemed to have no experience with the intricacies of nuclear energy policy or law, they said. One NRC lawyer who briefed some of the new arrivals decided to resign. “They were talking about quickly approving all these new reactors, and they didn’t seem to care that much about the rules — they wanted to carry out the wishes of the White House,” the official said.

At one point, Cohen began passing out hats from nuclear energy startup Valar Atomics, one of the companies vying to build a new reactor, according to sources familiar with the matter and records seen by ProPublica. NRC staffers balked; they were supposed to monitor companies like Valar for safety violations, not wear its swag.

NRC ethics officials warned Cohen that the hat handout was a likely violation of conflict rules. It betrayed a misunderstanding of the safety regulator’s role, said a former official familiar with the exchange. “Imagine you live near a nuclear power plant, and you find out a supposedly independent safety regulator — the watchdog — is going around wearing the power plant’s branded hats,” the official said. “Would that make you feel safe?” The NRC and Cohen did not respond to requests for comment about the hat incident.

Valar counts Trump’s Silicon Valley allies as angel investors. They include Palmer Luckey, a technology executive and founder of the defense contractor Anduril, and Shyam Sankar, chief technology officer of Palantir, the software company helping power Immigration and Customs Enforcement’s deportation raids.

It was among three nuclear reactor companies that sued the NRC last year in an attempt to strip it of its authority to regulate its reactors and replace it with a state-level regulator. Before the Trump administration came into office, lawyers watching the case were confident the courts would quickly dismiss the suit, as the NRC’s authority to regulate reactors is widely acknowledged. But new Trump appointees pushed for a compromise settlement — which is still being negotiated. The career NRC lawyer working on the case quietly left the agency.

Valar and its executives did not reply to requests for comment.

“Going So Fast”

The deregulatory push is the culmination of mounting pressure — both political and economic — to make it easier to build nuclear power in the U.S. Over the years, a bipartisan coalition supporting nuclear expansion brought together environmentalists who favor zero-carbon power and defense hawks focused on abundant domestic energy production.

Anti-nuclear activists still argue that renewable sources like wind and solar are safer and more economical. But streamlining the NRC has been a bipartisan priority as well. The latest major reform came in 2024, when President Joe Biden signed into law the ADVANCE Act, which went as far as changing the mission statement of the NRC to ensure it “does not unnecessarily limit” nuclear energy development.

Some nuclear power supporters say the Trump administration is merely accelerating these changes. They cite instances in which the current regulations appear out of sync with the times. The NRC’s byzantine rules are designed for so-called large light-water reactors — massive facilities that can power entire cities — and not the increasingly in vogue smaller advanced reactor designs popular among Silicon Valley-backed firms.

Rules that require fences of certain heights might make little sense for new reactors buried in the earth; and rules that require a certain number of operators per reactor could be a bad fit for a cluster of smaller reactors with modern controls. Advances in sensors, modeling and safety technologies, they say, should be taken into account across the board.

The NRC has said it expects over two dozen new license requests from small modular and advanced reactor companies in coming years. Many of those requests are likely to come from new, Silicon Valley-based nuclear firms.

“There was a missing link in the innovation cycle, and it was very difficult to build something and test it in the U.S. because of mostly licensing and site availability constraints in the past,” said Adam Stein of the pro-nuclear nonprofit Breakthrough Institute.

The regulatory changes are in flux: This spring, the NRC is starting to release thousands of pages of new rules governing everything from the safety and emergency preparedness plans reactor companies are required to submit to the procedures for objecting to a reactor license.

“It’s hard to know if they are getting rid of unnecessary processes or if it’s actually reducing public safety,” said one official working on reactor licensing, who, like others, spoke on the condition of anonymity for fear of retaliation from the Trump administration. “And that’s just the problem with going so fast — everything just kind of gets lost in a mush.”

Lawyers from the Executive Office of the President have been sent to the NRC to keep an eye on the new rules, a move that further raised alarms about the agency’s independence.

Nicholas Gallagher — a relatively recent New York University law school graduate and conservative writer whom ProPublica previously identified as a DOGE operative at the General Services Administration — has been involved in conversations about overhauling environmental rules.

He’s working alongside Sydney Volanski, a 30-year-old recent law school graduate who rose to national attention while she was in high school for her campaign against the Girl Scouts of the USA, which she accused of promoting “Marxists, socialists and advocates of same-sex lifestyle.”

NRC lawyers working on the rules were told last October that Gallagher and Volanski would be joining them, and they both appear on the regular NRC rulemaking calendar invite.

The White House maintains, however, that “zero lawyers from the Executive Office of the President have been dispatched to work on rulemaking.” Neither Gallagher nor Volanski replied to requests for comment.

The administration is routing the new rules through an office overseen by Trump’s cost-cutting guru Russell Vought, a move that was previously unheard of for an independent regulator like the NRC. The White House spokesperson noted that, under a recent executive order, this process is now required for all agencies.

Political operatives have been “inserted into the senior leadership team to the point where they could significantly influence decision-making,” said Scott Morris, who worked at the NRC for more than 32 years, most recently as the No. 2 career operations official. “I just think that would be a dangerous proposition.”

Morris voted for Trump twice and broadly supports the goals of deregulating and expanding nuclear energy, but he has begun speaking out against the administration’s interference at the NRC. He retired in May 2025 as part of a wave of retirements and firings.

At a recent hearing before the Atomic Safety and Licensing Board — an independent body that helps adjudicate nuclear licensing — NRC lawyers withdrew from the proceedings, citing “limited resources.” The judge remarked that it was the first time in over 20 years the NRC had done so.

Meanwhile, some staff members, other career officials say, are afraid to voice dissenting views for fear of being fired. “It feels like being a lobster in a slowly boiling pot,” one NRC official who has been working on the rule changes told ProPublica, describing the erosion of independence.

The official was one of three who compared their recent experience at NRC to being in a pot of slowly boiling water. “If somebody is raising something that they think that the industry or the White House would have a problem with, they think twice,” the official said.

Inside the NRC, the steering committee overseeing the changes includes Cohen, Taggart and Mike King, a career NRC official who is the newly installed executive director for operations. The former director, Mirela Gavrilas, a 21-year veteran of the agency, retired after getting boxed out of decision-making, according to a person familiar with her departure. Gavrilas did not respond to a request for comment.

Any final changes will be approved by the NRC’s five commissioners, three of whom are Republicans. In September, the two Democratic commissioners told a Senate committee they might be fired at any time if they get crosswise with Trump — including over revisions to safety rules.

Draft rules being circulated inside the NRC propose drastic rollbacks of security and safety inspections at nuclear facilities. Those include a proposed 56% cut in emergency preparedness inspection time, CNN reported in March.

Even some pro-nuclear groups are troubled by the emerging order. Some have tried to backchannel to their contacts in the Trump administration to explain the importance of an independent regulator to help maintain public support for nuclear power. Without it, they risk losing credibility.

“You have to make sure you don’t throw out the baby with the bathwater,” said Judi Greenwald, president and CEO of the Nuclear Innovation Alliance, a nonprofit that promotes nuclear energy and supports many of the regulatory changes being proposed by the Trump administration.

Greenwald’s group favors faster timelines for approving nuclear reactors, but she worries that the agency’s fundamental independence has been undermined. “We would prefer that they yield back more of NRC independence,” she said.

“Nuke Bros” in Silicon Valley

One Trump administration priority has been making it easier for so-called advanced reactor companies to navigate the regulatory process. These firms, mostly backed by Silicon Valley tech and venture money, are often working on designs for much smaller reactors that they hope to mass produce in factories.

“There are two nuclear industries,” said Macfarlane, the former NRC chair. “There are the actual people who use nuclear reactors to produce power and put it on the grid … and then there are the ‘nuke bros’” in Silicon Valley.

Trump’s Silicon Valley allies have loomed large over his nuclear policy. One prospective political appointee for a top DOE nuclear job got a Christmas Eve call from Thiel, the rare Silicon Valley leader to back Trump in 2016. Thiel, whose Founders Fund invested in a nuclear fuel startup and an advanced reactor company, quizzed the would-be official about deregulation and how to rapidly build more nuclear energy capacity, said sources familiar with the conversation.

Nuclear energy startups jockeyed to spend time at Mar-a-Lago in the months before the start of Trump’s second term. Balerion Space Ventures, a venture capital firm that has invested in multiple companies, convened an investor summit there in January 2025, according to an invitation viewed by ProPublica. Balerion did not reply to a request for comment.

A few months later, when Trump was drawing up the executive orders, leaders at many of those nuclear companies were given advance access to drafts of the text and the opportunity to suggest edits, documents viewed by ProPublica show.

Those orders created a new program to test out experimental reactor designs, addressing a common complaint that companies are not given opportunities to experiment. There are currently about a dozen advanced reactor companies planning to participate. Each has a concierge team within the DOE to help navigate bureaucracy. As NPR reported in January, the DOE quietly overhauled a series of safety rules that would apply to these new reactors and shared the new regulations with these companies before making them public.

Secretary of Energy Chris Wright — who served on the board of one of those companies, Oklo — has said fast nuclear build-out is a priority: “We are moving as quickly as we can to permit, build and enable the rapid construction of as much nuke capacity as possible,” he told CNBC last fall. Oklo noted that Wright stepped down from the board when he was confirmed.

The Trump administration hopes some of the companies will have their reactors “go critical” (a key first step toward building a functioning power plant) by July 2026. Then the NRC, which signs off on the safety designs of commercial nuclear power plants, could be expected to quickly OK these new reactors to get to market.

According to people familiar with the conversations, at least one nuclear energy startup CEO personally recruited potential members of the DOGE nuclear team, though it’s not clear if Cohen was brought aboard this way. Cohen has told colleagues and industry contacts that he reports to Emily Underwood, one of Trump adviser Stephen Miller’s top aides for economic policy. He is perceived inside government as a key avatar of the White House’s nuclear agenda.

In its email to ProPublica, the White House said, “Seth Cohen is a Department of Energy employee and does not report to Emily Underwood or Stephen Miller in any capacity.”

The DOE spokesperson added, “Seth’s role at the Department of Energy is to support the Trump administration’s mission to unleash American Energy Dominance.”

Cohen has been pushing to raise the legal limit of radiation that nuclear energy companies are allowed to emit from their facilities. One nuclear industry insider, who spoke on the condition of anonymity, said many firms are fixating on changing these radiation rules: Their business model requires moving nuclear reactors around the country, often near workers or the general public.

Building thick shielding walls can be prohibitively expensive, they said.

Valar CEO Isaiah Taylor has called limits on exposure to radiation a top barrier to industry growth. A recent DOE memo seen by ProPublica cites cost savings on shielding for Valar’s reactor to justify changing those limits. “Shielding-related cost reductions,” the memo said, “could range from $1-2 million per reactor.” The debate over the precise rule change is ongoing.

The DOE has been considering a fivefold increase to the limit for public exposure to radiation, which would allow some nuclear reactor companies to cut costs on these expensive safety shields, internal DOE documents seen by ProPublica show.

A presentation prepared by DOE staffers in their Idaho offices that has circulated inside the department makes the “business case” for changing the radiation dose rules: It could cut the cost of some new reactors by as much as 5%. These more relaxed standards are likely to be adopted by the NRC and apply to reactors nationwide, documents show.

In February, Wright accompanied Valar’s executive team on a first-of-its-kind flight, as a U.S. military plane was conscripted to fly the company’s reactor from Los Angeles to Utah. Valar does not yet have a working nuclear reactor, and a number of industry sources told ProPublica they viewed the airlift as a PR exercise. Internal government memos justified the airlift by designating it as “critical” to the U.S. “national security interests.”

Cohen posted smiling pictures of himself from the cargo bay of the military plane.

Cohen told an audience at the American Nuclear Society that the rapid build-out was essential to powering Silicon Valley’s AI data centers. He framed the policy in existential terms: “I can’t emphasize this strongly enough that losing the AI war is an outcome akin to the Nazis developing the bomb before the United States.”

While deliberating rule changes, the DOE has cut out its internal team of health experts who work on radiation safety at the Office of Environment, Health, Safety and Security, said sources familiar with the decision. The advice of outside experts on radiation protection has been largely cast aside.

The DOE spokesperson said its radiation standards “are aligned with Gold Standard Science … with a focus on protecting people and the environment while avoiding unnecessary bureaucracy.”

The department has already decided to abandon the long-standing radiation protection principle known as “ALARA” — the “As Low As Reasonably Achievable” standard — which directs anyone dealing with radioactive materials to minimize exposure.

It often pushes exposure well below legal thresholds. Many experts agreed that the ALARA principle was sometimes applied too strictly, but the move to entirely throw it out was opposed by many prominent radiation health experts.

Whether the agencies will actually change the legal thresholds for radiation exposure is an open question, said sources familiar with the deliberations.

Internal DOE documents arguing for changing dose rules cite a report produced at the Idaho National Laboratory, which was compiled with the help of the AI assistant Claude. “It’s really strange,” said Kathryn Higley, president of the National Council on Radiation Protection and Measurements, a congressionally chartered group studying radiation safety. “They fundamentally mistake the science.”

John Wagner, the head of the Idaho National Laboratory and the report’s lead author, acknowledged to ProPublica that the science over changing radiation exposure rules is hotly contested. “We recognize that respected experts interpret aspects of this literature differently,” he wrote. His analysis was not meant to be the final word, he said, but was “intended to inform debate.”

The impact of radiation levels at very low doses is hard to measure, so the U.S. has historically struck a cautious note. Raising dose limits could put the U.S. out of step with international standards.

For his part, Cohen has told the nuclear industry that he sees his job as making sure the government “is no longer a barrier” to them.

In June, he shot down the notion of companies putting money into a fund for workplace accidents. “Put yourself in the shoes of one of these startups,” he said. “They’re raising hundreds of millions of dollars to do this. And then they would have to go to their VCs and their board and say, listen, guys, we actually need a few hundred million dollars more to put into a trust fund?”

He also suggested that regulators should not fret about preparing for so-called 100-year events: disasters that have roughly a 1% chance of taking place in any given year but can be catastrophic for nuclear facilities.

“When SpaceX started building rockets, they sort of expected the first ones to blow up,” he said.

Filed Under: adam blake, chris wright, david taggart, doge, donald trump, elon musk, nicholas gallagher, nrc, nuclear reactors, peter thiel, regulations, scott morris, seth cohen, sydney volanski

Companies: oklo, valar atomics



Monitoring LLM behavior: Drift, retries, and refusal patterns

Published


25 April 2026

NewsAdmin

The stochastic challenge

Traditional software is predictable: Input A plus function B always equals output C. This determinism allows engineers to develop robust tests. On the other hand, generative AI is stochastic and unpredictable. The exact same prompt often yields different results on Monday versus Tuesday, breaking the traditional unit testing that engineers know and love.

To ship enterprise-ready AI, engineers cannot rely on mere “vibe checks” that pass today but fail when customers use the product. Product builders need to adopt a new infrastructure layer: The AI Evaluation Stack.

This framework is informed by my extensive experience shipping AI products for Fortune 500 enterprise customers in high-stakes industries, where “hallucination” is not funny — it’s a huge compliance risk.

Defining the AI evaluation paradigm

Traditional software tests are binary assertions (pass/fail). While some AI evals use binary asserts, many evaluate on a gradient. An eval is not a single script; it is a structured pipeline of assertions — ranging from strict code syntax to nuanced semantic checks — that verify the AI system’s intended function.
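The difference between binary and graded assertions can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed framework. `AssertionResult`, `run_eval`, the `term_coverage` check, and its 0.6 threshold are all hypothetical names and values chosen for the example. Every assertion returns a score, and binary checks only ever return 0.0 or 1.0.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AssertionResult:
    name: str
    score: float      # 0.0 to 1.0; binary assertions only ever return 0.0 or 1.0
    threshold: float  # minimum score that counts as a pass

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


# An assertion inspects one model output and returns a scored result.
Assertion = Callable[[str], AssertionResult]


def non_empty(output: str) -> AssertionResult:
    # Binary assertion: behaves like a traditional pass/fail unit test.
    return AssertionResult("non_empty", 1.0 if output.strip() else 0.0, threshold=1.0)


def term_coverage(output: str) -> AssertionResult:
    # Graded assertion (hypothetical): share of required terms that appear.
    required = ["refund", "order", "days"]
    hits = sum(term in output.lower() for term in required)
    return AssertionResult("term_coverage", hits / len(required), threshold=0.6)


def run_eval(output: str, assertions: List[Assertion]) -> List[AssertionResult]:
    # The eval is the whole pipeline of assertions, not any single check.
    return [check(output) for check in assertions]
```

Here is a usage example. `run_eval("Your refund for order 123 is on its way.", [non_empty, term_coverage])` passes `non_empty`. It scores `term_coverage` at 2/3 because "days" is missing, which still clears the 0.6 threshold.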

The taxonomy of evaluation checks

To build a robust, cost-effective pipeline, automated asserts must be separated into two distinct architectural layers, both of which run before any human review:

Layer 1: Deterministic assertions

A surprisingly large share of production AI failures aren’t semantic “hallucinations” — they are basic syntax and routing failures. Deterministic assertions serve as the pipeline’s first gate, using traditional code and regex to validate structural integrity.

Instead of asking if a response is “helpful,” these assertions ask strict, binary questions:

Did the model generate the correct JSON key/value schema?
Did it invoke the correct tool call with the required arguments?
Did it successfully slot-fill a valid GUID or email address?

Example: Layer 1 deterministic tool call assertion

```json
{
  "test_scenario": "User asks to look up an account",
  "assertion_type": "schema_validation",
  "expected_action": "Call API: get_customer_record",
  "actual_ai_output": "I found the customer.",
  "eval_result": "FAIL – AI hallucinated conversational text instead of generating the required API payload."
}
```

In the example above, the test failed instantly because the model generated conversational text instead of the required tool call payload.
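The three structural questions listed earlier can be checked with ordinary code and no model calls. This sketch assumes a hypothetical tool-call contract in which the model emits JSON shaped like `{"tool": "get_customer_record", "arguments": {"customer_id": ..., "email": ...}}`. The argument names and the simplified email regex are illustrative; the regex is not a full RFC 5322 validator. The check returns a list of failure reasons, and an empty list means pass.

```python
import json
import re
import uuid

# Hypothetical contract for the tool call the model is expected to emit.
EXPECTED_TOOL = "get_customer_record"
REQUIRED_ARGS = {"customer_id", "email"}
# Deliberately simple email pattern; not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_guid(value: object) -> bool:
    # uuid.UUID raises ValueError for strings that are not well-formed UUIDs.
    if not isinstance(value, str):
        return False
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False


def check_tool_call(raw_output: str) -> list[str]:
    """Return failure reasons for one model output; an empty list means it passed."""
    # 1. Is the output valid JSON at all? Conversational text fails here.
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON (e.g. conversational text)"]
    if not isinstance(payload, dict):
        return ["top-level JSON value is not an object"]

    failures = []
    # 2. Did the model invoke the correct tool with the required arguments?
    if payload.get("tool") != EXPECTED_TOOL:
        failures.append(f"expected tool {EXPECTED_TOOL!r}, got {payload.get('tool')!r}")
    args = payload.get("arguments")
    if not isinstance(args, dict):
        return failures + ["missing 'arguments' object"]
    missing = REQUIRED_ARGS - set(args)
    if missing:
        failures.append(f"missing required arguments: {sorted(missing)}")

    # 3. Are the slot-filled values well formed?
    if "customer_id" in args and not is_guid(args["customer_id"]):
        failures.append("customer_id is not a valid GUID")
    if "email" in args and not (isinstance(args["email"], str) and EMAIL_RE.match(args["email"])):
        failures.append("email is not a valid address")
    return failures
```

Returning reasons instead of a bare boolean makes a failing CI run easier to triage. For example, `check_tool_call("I found the customer.")` reports the JSON failure from the example above.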

Architecturally, deterministic assertions must be the first layer of the stack, operating on a computationally inexpensive “fail-fast” principle. If a downstream API requires a specific schema, a malformed JSON string is a fatal error. By failing the evaluation immediately at this layer, engineering teams prevent the pipeline from triggering expensive semantic checks (Layer 2) or wasting valuable human review time (Layer 3).
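The fail-fast ordering can be sketched as a small driver. This is a minimal illustration, assuming deterministic checks return lists of failure reasons and the Layer 2 judge is any callable that returns a score. `evaluate_case` and the 0.7 pass score are hypothetical, and human review (Layer 3) is left out.

```python
from typing import Callable, List


def evaluate_case(
    output: str,
    deterministic_checks: List[Callable[[str], List[str]]],
    semantic_judge: Callable[[str], float],
    pass_score: float = 0.7,
) -> dict:
    """Run the layered eval for one test case, stopping at the first failing layer."""
    # Layer 1: cheap deterministic checks run first; each returns a list of failure reasons.
    for check in deterministic_checks:
        reasons = check(output)
        if reasons:
            # Fail fast: the expensive judge is never called for malformed output.
            return {"layer": 1, "passed": False, "reasons": reasons, "judge_called": False}
    # Layer 2: only structurally valid output reaches the model-based judge.
    score = semantic_judge(output)
    return {"layer": 2, "passed": score >= pass_score, "score": score, "judge_called": True}
```

The `judge_called` flag makes the cost saving measurable. Counting how often it is `False` across a test suite shows how many judge calls Layer 1 avoided.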

Layer 2: Model-based assertions

When deterministic assertions pass, the pipeline must evaluate semantic quality. Because natural language is fluid, traditional code cannot easily assert if a response is “helpful” or “empathetic.” This introduces model-based evaluation, commonly referred to as “LLM-as-a-Judge” or “LLM-Judge.”

While using one non-deterministic system to evaluate another seems counterintuitive, it is an exceptionally powerful architectural pattern for use cases requiring nuance. It is virtually impossible to write a reliable regex to verify if a response is “actionable” or “polite.” While human reviewers excel at this nuance, they cannot scale to evaluate tens of thousands of CI/CD test cases. Thus, the LLM-as-a-Judge becomes the scalable proxy for human discernment.

3 critical inputs for model-based assertions

However, model-based assertions only yield reliable data when the LLM-as-a-Judge is provisioned with three critical inputs:

A state-of-the-art reasoning model: The Judge must possess superior reasoning capabilities compared to the production model. If your app runs on a smaller, faster model for latency, the judge must be a frontier reasoning model to approximate human-level discernment.
A strict assessment rubric: Vague evaluation prompts (“Rate how good this answer is”) yield noisy, stochastic evaluations. A robust rubric explicitly defines the gradients of failure and success. (For example, a “Helpfulness” rubric should define Score 1 as an irrelevant refusal, Score 2 as addressing the prompt but lacking actionable steps, and Score 3 as providing actionable next steps strictly within context.)
Ground truth (golden outputs): While the rubric provides the rules, a human-vetted “expected answer” acts as the answer key. When the LLM-Judge can compare the production model’s output against a verified Golden Output, its scoring reliability increases dramatically.
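To make these three inputs concrete, here is a Python sketch. It assembles the rubric, the golden output and the answer under test into one judge prompt, then validates the judge's structured reply, including the reasoning behind its score. The prompt wording, the JSON reply format and the function names are assumptions. The actual call to the judge model is omitted because it depends on your provider.

```python
import json

# The three-level "Helpfulness" rubric described above.
HELPFULNESS_RUBRIC = {
    1: "Irrelevant refusal.",
    2: "Addresses the prompt but lacks actionable steps.",
    3: "Provides actionable next steps strictly within context.",
}


def build_judge_prompt(user_input: str, model_output: str, golden_output: str, rubric: dict) -> str:
    """Combine the rubric, the golden output, and the answer under test into one prompt."""
    rubric_text = "\n".join(f"Score {score}: {text}" for score, text in sorted(rubric.items()))
    return (
        "Grade the ANSWER against the RUBRIC, using the REFERENCE answer as ground truth.\n\n"
        f"RUBRIC:\n{rubric_text}\n\n"
        f"USER INPUT:\n{user_input}\n\n"
        f"REFERENCE:\n{golden_output}\n\n"
        f"ANSWER:\n{model_output}\n\n"
        'Reply with JSON only: {"score": <integer>, "reasoning": "<one or two sentences>"}'
    )


def parse_judge_reply(raw_reply: str, rubric: dict) -> tuple:
    """Extract (score, reasoning); reject scores that are not on the rubric's scale."""
    data = json.loads(raw_reply)
    score = data.get("score")
    if not isinstance(score, int) or isinstance(score, bool) or score not in rubric:
        raise ValueError(f"judge returned an off-rubric score: {score!r}")
    return score, data.get("reasoning", "")
```

Keeping the reasoning alongside the score gives engineers something to inspect when a judged case fails.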

Architecture: The offline vs online pipeline

A robust evaluation architecture requires two complementary pipelines. The online pipeline monitors post-deployment telemetry, while the offline pipeline provides the foundational baseline and deterministic constraints required to evaluate stochastic models safely.

The offline evaluation pipeline

The offline pipeline’s primary objective is regression testing: identifying failures, drift, and latency regressions before they reach production. Deploying an enterprise LLM feature without a gating offline evaluation suite is an architectural anti-pattern; it is the equivalent of merging uncompiled code into a main branch.

Process

1. Curating the golden dataset

The offline lifecycle begins by curating a “golden dataset” — a static, version-controlled repository of 200 to 500 test cases representing the AI’s full operational envelope. Each case pairs an exact input payload with an expected “golden output” (ground truth).

Crucially, this dataset must reflect expected real-world traffic distributions. While most cases cover standard “happy-path” interactions, engineers must systematically incorporate edge cases, jailbreaks, and adversarial inputs. Evaluating “refusal capabilities” under stress remains a strict compliance requirement.

Example test case payload (standard tool use):

Input: “Schedule a 30-minute follow-up meeting with the client for next Tuesday at 10 a.m.”
Expected output (golden): The system successfully invokes the schedule_meeting tool with the correct JSON payload: {"duration_minutes": 30, "day": "Tuesday", "time": "10 AM", "attendee": "client_email"}.

While manually curating hundreds of edge cases is tedious, the process can be accelerated with synthetic data generation pipelines that use a specialized LLM to produce diverse TSV/CSV test payloads. However, relying entirely on AI-generated test cases introduces the risk of data contamination and bias. A human-in-the-loop (HITL) architecture is mandatory at this stage; domain experts must manually review, edit, and validate the synthetic dataset to ensure it accurately reflects real-world user intent and enterprise policy before it is committed to the repository.
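One way to store a version-controlled golden dataset is a JSONL file with one case per line. The sketch below assumes a hypothetical schema (`case_id`, `category`, `user_input`, `golden_output`, `human_reviewed`). It refuses to load any case that lacks human review, and it reports the category mix so the dataset can be compared against expected production traffic.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass
class GoldenCase:
    case_id: str
    category: str         # e.g. "happy_path", "edge_case", "adversarial"
    user_input: str
    golden_output: dict   # the expected tool-call payload (ground truth)
    human_reviewed: bool  # HITL sign-off; required before a case is committed


def load_golden_dataset(jsonl_text: str) -> list:
    """Parse JSONL test cases, refusing any case without expert review."""
    cases = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        case = GoldenCase(**json.loads(line))
        if not case.human_reviewed:
            raise ValueError(f"line {line_no}: case {case.case_id!r} lacks human review")
        cases.append(case)
    return cases


def category_mix(cases: list) -> dict:
    """Share of each category, for comparison with the expected traffic distribution."""
    counts = Counter(case.category for case in cases)
    return {category: count / len(cases) for category, count in counts.items()}
```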

2. Defining the evaluation criteria

Once the dataset is curated, engineers must design the evaluation criteria to compute a composite score for each model output. A robust architecture achieves this by assigning weighted points across a hybrid of Layer 1 (deterministic) and Layer 2 (model-based) asserts.

Consider an AI agent executing a “send email” tool. An evaluation framework might utilize a 10-point scoring system:

Layer 1: Deterministic asserts (6 points): Did the agent invoke the correct tool? (2 pts). Did it produce a valid JSON object? (2 pts). Does the JSON strictly adhere to the expected schema? (2 pts).
Layer 2: Model-based asserts (4 points): (Note: Semantic rubrics must be highly use-case specific). Does the subject line reflect user intent? (1 pt). Does the email body match expected outputs without hallucination? (1 pt). Were CC/BCC fields leveraged accurately? (1 pt). Was the appropriate priority flag inferred? (1 pt).

To understand why the LLM-Judge awarded these points, the engineer must prompt the judge to supply its reasoning for each score. This is crucial for debugging failures.

The passing threshold and short-circuit logic

In this example, the passing threshold is 8/10: an output must earn at least 8 points to pass. Crucially, the evaluation pipeline must enforce strict short-circuit evaluation (fail-fast logic). If the model fails any deterministic assertion, such as emitting malformed JSON or a payload that violates the expected schema, the system must instantly fail the entire test case (0/10). There is zero architectural value in invoking an expensive LLM-Judge to assess the semantic “politeness” of an email if the underlying API call is structurally broken.
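The scoring and short-circuit rules above can be sketched in Python as follows. `run_judge` is a hypothetical callable standing in for the LLM-Judge. It returns one boolean per Layer 2 assert and is only invoked once every Layer 1 check has passed. For example, if all three deterministic asserts pass (6 points) and three of the four semantic asserts pass (3 points), the case scores 9/10 and clears the threshold.

```python
from typing import Callable, Sequence

LAYER1_POINTS = 2    # per deterministic assert: correct tool, valid JSON, schema match
LAYER2_POINTS = 1    # per model-based assert: subject, body, CC/BCC, priority
PASS_THRESHOLD = 8   # out of 10


def score_case(layer1: Sequence[bool], run_judge: Callable[[], Sequence[bool]]) -> int:
    """Composite 0-10 score; the judge is invoked only if every Layer 1 check passed."""
    if not all(layer1):
        return 0  # short-circuit: a structurally broken call never reaches the LLM-Judge
    layer2 = run_judge()  # hypothetical wrapper around the LLM-Judge
    return LAYER1_POINTS * len(layer1) + LAYER2_POINTS * sum(layer2)


def passes(score: int) -> bool:
    return score >= PASS_THRESHOLD
```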

3. Executing the pipeline and aggregating signals

Using an evaluation infrastructure of choice, the system executes the offline pipeline — typically integrated as a blocking CI/CD step during a pull request. The infrastructure iterates through the golden dataset, injecting each test payload into the production model, capturing the output, and executing defined assertions against it.

Each output is scored against the passing threshold. Once batch execution is complete, results are aggregated into an overall pass rate. For enterprise-grade applications, the baseline pass rate must typically exceed 95%, scaling to 99%-plus for strict compliance or high-risk domains.
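A minimal sketch of the aggregation step as a blocking CI check follows. It assumes each case has already been reduced to pass/fail against the per-case threshold. `pass_rate` and `ci_gate` are illustrative names. Returning a process exit code is one common way to block a pull request, not the only one.

```python
def pass_rate(case_passed: list) -> float:
    """Fraction of golden-dataset cases that met the per-case passing threshold."""
    if not case_passed:
        raise ValueError("no results to aggregate")
    return sum(case_passed) / len(case_passed)


def ci_gate(case_passed: list, required_rate: float = 0.95) -> int:
    """Return a process exit code: 0 lets the pull request proceed, 1 blocks it."""
    rate = pass_rate(case_passed)
    print(f"pass rate {rate:.1%} (required {required_rate:.0%})")
    return 0 if rate >= required_rate else 1
```

In a CI job, the result would typically be passed to `sys.exit(...)`. For strict compliance or high-risk domains, raise the bar with `ci_gate(results, required_rate=0.99)`.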

4. Assessment, iteration, and alignment

Based on aggregated failure data, engineering teams conduct a root-cause analysis of failing test cases. This assessment drives iterative updates to core components: refining system prompts, modifying tool descriptions, augmenting knowledge sources, or adjusting hyperparameters (like temperature or top-p). Continuous optimization remains best practice even after achieving a 95% pass rate.

Crucially, any system modification necessitates a full regression test. Because LLMs are inherently non-deterministic, an update intended to fix one specific edge case can easily cause unforeseen degradations in other areas. The entire offline pipeline must be rerun to validate that the update improved quality without introducing regressions.
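A fix and a new regression can cancel out in the aggregate pass rate. It can therefore help to diff per-case results between the baseline run and the candidate run. The sketch below assumes each run is stored as a mapping from a hypothetical `case_id` to a pass/fail boolean.

```python
def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Per-case diff between two full runs of the golden dataset (case_id -> passed)."""
    shared = baseline.keys() & candidate.keys()
    return {
        # Passed before the change, fails after it.
        "regressed": sorted(c for c in shared if baseline[c] and not candidate[c]),
        # Failed before the change, passes after it.
        "fixed": sorted(c for c in shared if not baseline[c] and candidate[c]),
        # Present in the baseline but missing from the new run.
        "not_run": sorted(baseline.keys() - candidate.keys()),
    }
```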

The online evaluation pipeline

While the offline pipeline acts as a strict pre-deployment gatekeeper, the online pipeline is the post-deployment telemetry system. Its objective is to monitor real-world behavior, capture emergent edge cases, and quantify model drift. Architects must instrument applications to capture four distinct categories of telemetry:

1. Explicit user signals

Direct, explicit feedback from users on model performance:

Thumbs up/down: A disproportionate share of negative feedback is the most immediate leading indicator of system degradation and should trigger an engineering investigation.
Verbatim in-app feedback: Systematically parsing written comments identifies novel failure modes to integrate back into the offline “golden dataset.”

2. Implicit behavioral signals

Behavioral telemetry reveals silent failures where users give up without explicit feedback:

Regeneration and retry rates: High frequencies of retries indicate the initial output failed to resolve user intent.
Apology rate: Programmatically scanning for heuristic triggers (“I’m sorry”) detects degraded capabilities or broken tool routing.
Refusal rate: Artificially high refusal rates (“I can’t do that”) indicate over-calibrated safety filters rejecting benign user queries.
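The explicit and implicit signals above reduce to simple rates over logged turns. The sketch below assumes a hypothetical `Turn` log record with a regeneration flag and a thumbs-down flag. The apology and refusal phrase lists are illustrative heuristics that would need tuning for each product and language.

```python
import re
from dataclasses import dataclass

# Illustrative heuristics; both straight and curly apostrophes are accepted.
APOLOGY = re.compile(r"\b(?:i[’']m sorry|i apologi[sz]e)", re.IGNORECASE)
REFUSAL = re.compile(r"\b(?:i can[’']t do that|i cannot help|i[’']m unable to)\b", re.IGNORECASE)


@dataclass
class Turn:
    response: str
    regenerated: bool = False  # the user hit "regenerate" or retried this turn
    thumbs_down: bool = False  # explicit negative feedback on this turn


def signal_rates(turns: list) -> dict:
    """Per-turn rates of explicit and implicit failure signals."""
    if not turns:
        return {}
    n = len(turns)
    return {
        "thumbs_down_rate": sum(t.thumbs_down for t in turns) / n,
        "retry_rate": sum(t.regenerated for t in turns) / n,
        "apology_rate": sum(bool(APOLOGY.search(t.response)) for t in turns) / n,
        "refusal_rate": sum(bool(REFUSAL.search(t.response)) for t in turns) / n,
    }
```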

3. Production deterministic asserts (synchronous)

Because deterministic code checks execute in milliseconds, teams can seamlessly reuse Layer 1 offline asserts (schema conformity, tool validity) to synchronously evaluate 100% of production traffic. Logging these pass/fail rates instantly detects anomalous spikes in malformed outputs — the earliest warning sign of silent model drift or provider-side API changes.
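The monitor below is a sketch of synchronous production asserts. It keeps a rolling window of Layer 1 pass/fail results and flags a spike when the failure rate in that window exceeds a threshold. The window size, alert rate and minimum-sample guard are hypothetical defaults, not recommended values.

```python
from collections import deque


class FailureRateMonitor:
    """Rolling Layer 1 failure rate over the most recent `window` production requests."""

    def __init__(self, window: int = 1000, alert_rate: float = 0.02, min_samples: int = 100):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.results = deque(maxlen=window)
        self.failures = 0  # running count of failures currently inside the window
        self.alert_rate = alert_rate
        self.min_samples = min_samples

    def record(self, passed: bool) -> bool:
        """Log one request's Layer 1 result; return True if the failure rate is spiking."""
        if len(self.results) == self.results.maxlen and not self.results[0]:
            self.failures -= 1  # the oldest entry, a failure, is about to be evicted
        self.results.append(passed)
        if not passed:
            self.failures += 1
        if len(self.results) < self.min_samples:
            return False  # too little data to call a spike
        return self.failures / len(self.results) > self.alert_rate
```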

4. Production LLM-as-a-Judge (asynchronous)

If strict data privacy agreements (DPAs) permit logging user inputs, teams can deploy model-based asserts. Architecturally, production LLM-Judges must never execute synchronously on the critical path, because running a second model call inline with every request would double latency and compute costs. Instead, a background LLM-Judge asynchronously samples a small fraction of daily sessions (for example, 5%), grading outputs against the offline rubric to generate a continuous quality dashboard.
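The sampling half of this asynchronous pattern might look like the sketch below. It hashes the session ID, so sampling is deterministic: the same session always gets the same decision. On the request path, the code only computes a hash and performs a non-blocking enqueue. The in-process `queue.Queue` stands in for whatever durable task queue a real deployment would use. The worker that drains the queue and calls the judge is omitted.

```python
import hashlib
import queue

SAMPLE_RATE = 0.05  # grade roughly 5% of sessions, as described above

# Work queue drained by a separate background worker that calls the LLM-Judge.
judge_queue = queue.Queue()


def should_sample(session_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling: the same session ID always gets the same decision."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # roughly uniform over [0, 2**64)
    return bucket < rate * 2**64


def enqueue_for_judging(session_id: str, user_input: str, output: str, rate: float = SAMPLE_RATE) -> bool:
    """Runs on the request path: one hash and a non-blocking put, no model call."""
    if not should_sample(session_id, rate):
        return False
    judge_queue.put_nowait({"session_id": session_id, "input": user_input, "output": output})
    return True
```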

Engineering the feedback loop (the “flywheel”)

Evaluation pipelines are not “set-it-and-forget-it” infrastructure. Without continuous updates, static datasets suffer from “rot” (concept drift) as user behavior evolves and customers discover novel use cases.

For example, an HR chatbot might boast a pristine 99% offline pass rate for standard payroll questions. However, if the company suddenly announces a new equity plan, users will immediately begin prompting the AI about vesting schedules — a domain entirely missing from the offline evaluations.

To make the system smarter over time, engineers must architect a closed feedback loop that mines production telemetry for continuous improvement.

The continuous improvement workflow:

Capture: A user triggers an explicit negative signal (a “thumbs down”) or an implicit behavioral flag in production.
Triage: The specific session log is automatically flagged and routed for human review.
Root-cause analysis: A domain expert investigates the failure, identifies the gap, and updates the AI system to successfully handle similar requests.
Dataset augmentation: The novel user input, paired with the newly corrected expected output, is appended to the offline Golden Dataset alongside several synthetic variations.
Regression testing: The model is continuously re-evaluated against this newly discovered edge case in all future runs.
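The dataset-augmentation step could be sketched as follows. An expert-corrected production case is appended to the golden dataset unless an equivalent input already exists, ignoring case and whitespace. The dataset is then serialized to JSONL for version control. The record fields, the `prod-` ID prefix and the `production_feedback` category are hypothetical. Generating the synthetic variations the workflow mentions is out of scope here.

```python
import hashlib
import json


def normalize(text: str) -> str:
    """Case- and whitespace-insensitive form used to detect duplicate inputs."""
    return " ".join(text.lower().split())


def append_to_golden(dataset: list, user_input: str, corrected_output: dict,
                     category: str = "production_feedback") -> bool:
    """Add an expert-corrected production case unless an equivalent input is already present."""
    case_id = "prod-" + hashlib.sha256(normalize(user_input).encode("utf-8")).hexdigest()[:12]
    if any(case["case_id"] == case_id for case in dataset):
        return False
    dataset.append({
        "case_id": case_id,
        "category": category,
        "user_input": user_input,
        "golden_output": corrected_output,
        "human_reviewed": True,  # set only after the domain expert's root-cause review
    })
    return True


def to_jsonl(dataset: list) -> str:
    """Serialize for the version-controlled golden dataset file."""
    return "".join(json.dumps(case, sort_keys=True) + "\n" for case in dataset)
```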

Building an evaluation pipeline without monitoring production logs and updating datasets is fundamentally insufficient. Users are unpredictable. Evaluating on stale data creates a dangerous illusion: high offline pass rates masking a rapidly degrading real-world experience.

Conclusion: The new “definition of done”

In the era of generative AI, a feature or product is no longer “done” simply because the code compiles and the prompt returns a coherent response. It is only done when a rigorous, automated evaluation pipeline is deployed and stable — and when the model consistently passes against both a curated golden dataset and newly discovered production edge cases.

This guide has equipped you with a comprehensive blueprint for building that reality. From architecting offline regression pipelines and online telemetry to the continuous feedback flywheel and navigating enterprise anti-patterns, you now have the structural foundation required to deploy AI systems with greater confidence.

Now, it is your turn. Share this framework with your engineering, product, and legal teams to establish a unified, cross-functional standard for AI quality in your organization. Stop guessing whether your models are degrading in production, and start measuring.

Derah Onuorah is a Microsoft senior product manager.

Welcome to the VentureBeat community!

Our guest posting program is where technical experts share insights and provide neutral, non-vested deep dives on AI, data infrastructure, cybersecurity and other cutting-edge technologies shaping the future of enterprise.



Tech

PenPal, A Robotic Drawing Assistant

Published

53 minutes ago

25 April 2026

NewsAdmin

Emergent properties include examples like murmurations of starlings, which can’t be predicted from looking at a single bird; weather, which can’t be predicted by looking at a few air molecules; and consciousness, which can’t be predicted by looking at a neuron. Likewise, adding a new tool to a workflow can produce emergent properties. A group at Chicago University developed a robotic drawing tool, and a few artists developed some unique drawing methods using it.

The robotic pen uses a pair of tendons to extend its working end out a certain amount. From there, a set of servos can be programmed to move the tip around a defined path, making repeating movements while the artist makes larger movements over the paper. The tool was originally meant for shading, with small circles or simple back-and-forth movements preset, but with full control over the pen’s behavior, the artist can shift focus to other tasks within the creative process. In a study with ten participants, some artists came up with novel ways of using a tool like this, and others reported that it felt almost like drawing together with another person.

Looking for novel ways that humans can interact with computers and robots can often lead to surprising outcomes like this. Members of this group aren’t new to novel human interface devices either; they’ve also built a squishy dynamic button.


Tech

Microsoft rolls out revamped Windows Insider Program

Published

1 hour ago

25 April 2026

NewsAdmin

Windows 11

Microsoft says it’s rolling out a revamped Windows Insider Program experience as part of the broader plans to address reliability concerns in Windows 11.

For those unaware, the Windows Insider Program is a beta testing program that allows you to test early Windows releases and provide your feedback to Microsoft.

Until now, Microsoft has not really listened to all the feedback from testers, and all that has added up to a poor Windows experience.

To address this, Microsoft is now making the Windows Insider Program simpler and more transparent in the hope that it will help with the development of Windows 11.

In a blog post, Microsoft admitted that the current channel structure is confusing.

The Insider Program was simple when Microsoft first replaced Insider Rings with Channels (Beta, Dev, and Canary), similar to Chromium, but over time the structure has become more and more confusing.

There’s no clarity on what channel you should pick if you want to be on the edge and test new features as they develop internally at Microsoft. In fact, most testers never get access to experimental features, thanks to Microsoft’s Controlled Feature Rollout (CFR).

Microsoft has acknowledged that the experience is frustrating: you read about a new feature on the internet, update your PC, hoping to test and provide feedback, and then find out it’s not there.

“That experience, where features are announced but only some of you receive them due to how we gradually roll things out, is the single biggest frustration we hear,” writes Alec Oot, who is responsible for the Windows Update experience at Microsoft.

While you can use third-party tools like ViVeTool to enable experimental features, it’s not an ideal experience and isn’t what you signed up for.

Microsoft says the Windows Insider Program is now simpler and more transparent

Microsoft says it’s listening to feedback, making all channels simpler, and moving the Insider Program to just two channels.

The first new channel is ‘Experimental,’ which replaces the Dev and Canary channels. The name makes it obvious that it’s the channel you should sign up for if all you want to do is test experimental features, which may never ship in production.

The second channel keeps the ‘Beta’ name and is an updated version of the original Beta Channel.

**Windows Insider Program now has only two channels**

Source: Microsoft

In the Beta Channel, Microsoft is ending gradual feature rollouts, which means all new features mentioned in the release notes will be immediately available.

In the Experimental channel, you’ll be given access to some features out of the box, but others will be locked behind a flag.

Feature flags in Experimental Channel: **Feature flags to turn on features gradually rolling out.**

Source: Microsoft

The good news is you can manually toggle experimental features from Windows Settings.

For example, if you want to try out new haptic features for the mouse but the feature isn’t showing due to a gradual rollout, you can open Settings > Windows Update > Windows Insider Program > Feature flags, then turn on the feature.

Microsoft explains how it’s rolling out the new channels to Windows Insiders

Microsoft says it is moving Insiders to the new channels in phases, starting with Dev Channel users, who will now move to Experimental.

If you are in Dev and do not see the new Experimental channel UI yet, Microsoft says you can manually turn it on by going to Settings > Windows Update > Windows Insider Program > Feature flags and enabling the new experience.

Over the next few weeks, Microsoft will also move Canary users to specific versions of Experimental.

Those on the Canary 28000 series will move to Experimental (26H1), while users who installed the optional 29500 series update will move to Experimental (Future Platforms).

Future platforms: **Advanced Insider Program controls to test future platform releases**

Source: Microsoft

Beta Channel users will move to the new Beta experience, but Microsoft says some minor feature changes may happen during the transition.

If you want to keep access to all existing experimental features, Microsoft recommends moving from Beta to Dev before the transition, as Dev is being moved to Experimental. Microsoft is also changing how it shares build details.

As part of today’s rollout, Microsoft is shipping Build 26220.8283 for Beta, Build 26300.8289 for Experimental, Build 28020.1873 for Experimental 26H1, and Build 29576.1000 for Experimental Future Platforms.

Today’s update includes early access to a new Windows Update experience where you can pause updates as you desire, avoid forced reboots, and more.


Tech

BMW brings color changing tech closer to production with the iX3 Flow Edition

Published

1 hour ago

25 April 2026

NewsAdmin

Unveiled at the 2026 Beijing Auto Show, the BMW iX3 Flow Edition integrates E Ink’s Prism technology directly into the vehicle’s hood, bringing the concept closer to real-world application. Unlike earlier efforts that relied on external layers of segmented panels, this version embeds the electrophoretic system into the structure of…

Tech

What’s The Difference Between Kelly And Goodyear Tires?

Published

1 hour ago

25 April 2026

NewsAdmin


If you’re shopping for Kelly tires, you might be surprised to find yourself on the Goodyear site. No, this isn’t a fluke: Goodyear and Kelly have been sister brands since 1935. Today, Goodyear is the Goodyear Tire & Rubber Company’s premium flagship brand. It’s the more high-end of the two, offering more durability across a wider range of driving conditions than Kelly. Rain, snow, or rugged terrain, Goodyear probably has a tire for you.

Kelly Tires is more straightforward. Of the two, it’s definitely the more budget-friendly option. The Kelly brand is technically older than Goodyear itself, but it has existed under the Goodyear corporate umbrella since the 1930s. It might not be on the cutting edge of innovation, and it might not advertise the same top-tier performance specs, but Kelly does one thing better than Goodyear: it gives you good-enough tires at a lower price point. You still get all-season traction and year-round reliability, just at a much more accessible cost per tire. Beyond pricing, the product lines are pretty different. Goodyear has six different tire types for over half a dozen kinds of vehicles, but Kelly’s lineup is much simpler.

Differences in warranty and product lineup


Goodyear’s full lineup covers snow, sport, heavy-duty, and all-season tires for cars, trucks, SUVs, trailers, and more. Kelly’s selection is much smaller and more streamlined: just five tire models, all of them all-season, with no winter or summer tires. That’s not a lot of variety compared to Goodyear, but that’s okay. It’s not trying to be Goodyear.

Then there are the respective warranties. Goodyear has one of the best tire warranties around: a 60-day satisfaction guarantee that gives drivers two whole months to think about their purchase. Kelly also has a satisfaction guarantee, but it’s a little more limited: 45 days (about a month and a half) compared to Goodyear’s 60. Still, both Goodyear and Kelly offer price matching and access to post-purchase customer support. When it comes down to it, the difference is less about quality and more about intended use and budget. Goodyear is more premium, while Kelly is more affordable.


Tech

Linux Drops ISDN Subsystem and Other Old Network Drivers

Published

2 hours ago

25 April 2026

NewsAdmin

“Old code like amateur radio and NFC have long been a burden to core networking developers,” reads the pull request.

And so on Thursday, Linus Torvalds merged the pull request “to rid the Linux kernel of the old Integrated Services Digital Network (ISDN) subsystem,” reports Phoronix, “and various other old network drivers largely for PCMCIA era network adapters.”

This was the code suggested for removal given the recent influx of AI/LLM-generated bug reports against this dated code that likely has no active upstream users remaining… [W]ith the large language models and increased code fuzzing finding potential issues with these drivers for obsolete hardware, it’s easier to just get rid of these drivers if no one is actively using the hardware from decades ago…

This merge lightens the kernel by 138,161 lines of code with ISDN gone and numerous old network adapters and also getting rid of legacy ATM device drivers as well as the amateur ham radio support. The main networking drivers removed affect the 3com 3c509 / 3c515 / 3c574 / 3c589, AMD Lance, AMD NMCLAN, SMSC SMC9194 / SMC91C92, Fujitsu FMVJ18X, and 8390 AX88190 / Ultra / WD80X3.

Linux 7.1 also has removed the long-obsolete bus mouse support as well as beginning to phase out Intel 486 CPU support and removing support for Russia’s Baikal CPUs.


Tech

HBO Max: The 26 Absolute Best Movies to Watch

Published

2 hours ago

25 April 2026

NewsAdmin

Here are some highly rated films to try, plus a look at what’s new in April.


Tech

Best Indoor Security Camera 2026: Keep your home secure

Published

2 hours ago

25 April 2026

NewsAdmin

Whether you want to keep tabs on your home whilst on holiday or just want to see what your pets get up to when you’re at the office, an indoor security camera can provide peace of mind when you need it. There is no shortage of great options in 2026, but if you’re not quite sure where to start, our guide to the best indoor security cameras to buy can help you out.

In just a short time, we’ve seen home security go from something that usually involves a fairly laborious installation process (sometimes with a hired professional) to an aspect of the tech industry that, much like the latest smartphones and laptops, is designed to be far more accessible to the masses.

What has helped make indoor security cameras more approachable is their inclusion in wider smart home ecosystems. You no longer have to worry about proprietary software or a system that operates in a vacuum, as most of the latest security cameras can be integrated into your existing smart home dashboards, whether that be the Alexa app, Apple Home or Google Home.

With more compatibility at play, you can dive into the settings of these cameras, play back footage and see movement alerts in real time, all from the comfort of your smartphone. It’s made a big difference in allowing more people to set up a robust home security system, even those with little experience in this area.

Our team of tech experts has tested countless indoor security cameras, and this list compiles their findings into a simple, easy-to-understand guide so you can shop with confidence. If you want to keep tabs on your garden or the front of your home, you’ll be better served by our list of the best outdoor security cameras.

Best indoor security cameras at a glance


Learn more about how we test indoor security cameras

All of our indoor security cameras are installed inside our test lab, monitoring real people. We run them for at least a week, so that we can tweak motion detection and find out how reliable or annoying each model is. We download sample footage from each camera, too, so that we can compare image quality between devices.

Arlo Pro 6 2K

Pros

Excellent video quality
Flexible and powerful app
Hugely flexible object detection (with subscription)

Cons

Arlo subscriptions are expensive

Blink Mini 2

Pros

Great value
Indoor and outdoor
Simple to set up and use

Cons

Basic video quality
Person detection requires subscription

Philips Hue Secure Wired Camera

Pros

Well made
Integrates with the Hue Bridge for lighting control
Sharp daytime footage

Cons

Slightly basic motion controls
Night footage is a bit soft

Aqara Camera G100

Pros

Very low price
Local storage option
Wide platform support
Strong night vision

Cons

1080p in HomeKit
Weak speaker audio
No power plug

Aqara Camera Hub G350

Pros

Excellent image quality
Smart AI tracking
Doubles as hub
Loads of storage options

Cons

Matter not really ready
Some AI behind paywall
Cutesy look not for all

Imou Versa

Pros

Quick and easy to install and use
Prop-up stand makes it easy to get the view you want
Good night vision

Cons

Imou Protect safety subscription costs extra money after a 14-day free trial

Arlo Essential Indoor Camera

Pros

Low cost
Strong image quality
Integrates with other Arlo cameras

Cons

Feels a bit cheap
Arm not that flexible

Eufy Security Indoor Cam S350

Pros

Excellent 4K image quality
Clever automated tracking
No monthly fees

Cons

Dual view mode reduces image quality

	Arlo Pro 6 2K Review	Blink Mini 2 Review	Philips Hue Secure Wired Camera Review	Aqara Camera G100 Review	Aqara Camera Hub G350 Review	Imou Versa Review	Arlo Essential Indoor Camera Review	Eufy Security Indoor Cam S350 Review
UK RRP	–	£24.99	–	–	–	–	£119	£126
USA RRP	–	$40	–	–	–	–	$99.99	–
Manufacturer	–	Blink	Philips	–	–	–	Arlo	Eufy
Size (Dimensions)	52 x 78 x 89 MM	50 x 49 x 36 MM	92 x 92 x 74 MM	58 x 58 x 72 MM	85 x 68 x 123 MM	72 x 51 x 52 MM	2 x 1.9 x 4.5 INCHES	65 x 80 x 104 MM
Weight	–	48 G	–	–	–	120 G	0.27 LB	610 G
ASIN	–	B09N6P323M	–	–	–	–	–	B0CD9YQMKS
Release Date	2026	2024	2024	2025	2026	2023	2023	2023
First Reviewed Date	17/03/2026	26/07/2024	20/10/2025	–	07/04/2026	23/11/2023	24/08/2023	16/02/2024
Model Number	Arlo Pro 6 2K	Blink Mini 2	Philips Hue Secure Wired Camera	Aqara Camera G100	–	Imou Versa	Arlo Essential Indoor Camera	Eufy Security Indoor Cam S350
Resolution	2560 x 1440	1920 x 1080	1920 x 1080	2304 x 1296	3840 x 2160	1920 x 1080	1920 x 1080	3840 x 2160
Voice Assistant	–	Amazon Alexa	–	–	–	–	Amazon Alexa, Google Assistant	Amazon Alexa
Battery Length	8 months	–	–	–	–	–	–	–
Smart assistants	Yes	Yes	Yes	Yes	Yes	–	Yes	Yes
App Control	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
IFTTT	–	Yes	Yes	–	–	–	Yes	–
Camera Type	Indoor/outdoor wireless	Indoor/outdoor wired camera	Wired indoor/outdoor	Indoor/outdoor wired	Indoor camera with smart hub	Wired indoor/outdoor camera	Wired indoor security camera	Indoor pan and tilt
Mounting option	Wall	Wall or bookshelf	Wall	Wall or bookshelf	Desk	–	Wall, bookshelf	Desk or wall
View Field	160 degrees	110 degrees	141.2 degrees	140 degrees	133 degrees	114 degrees	130 degrees	358 degrees
Recording option	Cloud (with subscription), offline (requires hub)	Cloud or local (requires Sync Module 2)	Cloud	microSD or Apple HomeKit Secure	microSD	microSD, cloud	Cloud	MicroSD, cloud
Two-way audio	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Night vision	Yes (full colour)	Yes (IR)	Yes (IR)	Yes (full colour or IR)	Yes	Yes (IR)	Yes (IR)	Yes (IR)
Light	Spotlight	Spotlight	No	Spotlight	–	–	No	No
Motion detection	Yes	Yes	Yes	Yes	Yes	Yes	Yes (PIR)	Yes
Activity zones	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Object detection	People, vehicles, animals, custom	People (requires cloud subscription)	People, animals, vehicles	People, pets, faces	–	Human	People, animals, vehicles	People, pets
Audio detection	Alarms	No	Smoke alarms	–	–	No	–	No
Power source	Battery	USB-C	Mains	USB-C	USB-C	USB	USB	USB-C

WordUp News
