Microsoft launches 3 new AI models in direct shot at OpenAI and Google

Microsoft on Wednesday launched three new foundational AI models it built entirely in-house — a state-of-the-art speech transcription system, a voice generation engine, and an upgraded image creator — marking the most concrete evidence yet that the $3 trillion software giant intends to compete directly with OpenAI, Google, and other frontier labs on model development, not just distribution.

The trio of models — MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 — are available immediately through Microsoft Foundry and a new MAI Playground. They span three of the most commercially valuable modalities in enterprise AI: converting speech to text, generating realistic human voice, and creating images. Together, they represent the opening salvo from Microsoft’s superintelligence team, which Microsoft AI CEO Mustafa Suleyman formed just six months ago to pursue what he calls “AI self-sufficiency.”

“I’m very excited that we’ve now got the first models out, which are the very best in the world for transcription,” Suleyman told VentureBeat in an exclusive interview ahead of the launch. “Not only that, we’re able to deliver the model with half the GPUs of the state-of-the-art competition.”

The announcement lands at a precarious moment for Microsoft. The company’s stock just closed its worst quarter since the 2008 financial crisis, as investors increasingly demand proof that hundreds of billions of dollars in AI infrastructure spending will translate into revenue. These models — priced aggressively and positioned to reduce Microsoft’s own cost of goods sold — are Suleyman’s first answer to that pressure.

Microsoft’s new transcription model claims best-in-class accuracy across 25 languages

MAI-Transcribe-1 is the headline release. The speech-to-text model achieves the lowest average Word Error Rate (WER) on the FLEURS benchmark — the industry-standard multilingual test — across the top 25 languages by Microsoft product usage, averaging 3.8% WER. According to Microsoft’s benchmarks, it beats OpenAI’s Whisper-large-v3 on all 25 languages, Google’s Gemini 3.1 Flash on 22 of 25, and ElevenLabs’ Scribe v2 and OpenAI’s GPT-Transcribe on 15 of 25 each.
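For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the model’s output, divided by the number of reference words — so 3.8% means roughly four errors per hundred words. A minimal sketch of the computation, with illustrative transcripts rather than actual FLEURS test data:

```python
# Word Error Rate: Levenshtein distance over words between a reference
# transcript and a hypothesis, normalized by reference length. The example
# transcripts below are made up for illustration.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown fox"))  # 0.0
print(wer("the quick brown fox", "the quik brown"))       # 0.5 (one sub, one del)
```

Production evaluations typically also normalize casing and punctuation before scoring, which benchmark suites handle with language-specific text normalizers.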

The model uses a transformer-based text decoder with a bi-directional audio encoder. It accepts MP3, WAV, and FLAC files up to 200MB, and Microsoft says its batch transcription speed is 2.5 times faster than the existing Microsoft Azure Fast offering. Diarization, contextual biasing, and streaming are listed as “coming soon.” Microsoft is already testing MAI-Transcribe-1 inside Copilot’s Voice mode and Microsoft Teams for conversation transcription — a detail that underscores how quickly the company intends to replace third-party or older internal models with its own.

Alongside it, MAI-Voice-1 is Microsoft’s text-to-speech model, capable of generating 60 seconds of natural-sounding audio in a single second. The model preserves speaker identity across long-form content and now supports custom voice creation from just a few seconds of audio through Microsoft Foundry. Microsoft is pricing it at $22 per 1 million characters.

MAI-Image-2, meanwhile, debuted as a top-three model family on the Arena.ai leaderboard and now delivers at least 2x faster generation times on Foundry and Copilot compared to its predecessor. Microsoft is rolling it out across Bing and PowerPoint, pricing it at $5 per 1 million tokens for text input and $33 per 1 million tokens for image output. WPP, one of the world’s largest advertising holding companies, is among the first enterprise partners building with MAI-Image-2 at scale.
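To put those prices in concrete terms, here is a back-of-envelope cost sketch. The per-million rates are the ones Microsoft quoted; the workload sizes (script length, token counts) are illustrative assumptions, not figures from the announcement:

```python
# Launch prices quoted by Microsoft (USD per 1 million units).
VOICE_PER_MILLION_CHARS = 22.00   # MAI-Voice-1, per character of input text
IMAGE_IN_PER_MILLION = 5.00       # MAI-Image-2, per text input token
IMAGE_OUT_PER_MILLION = 33.00     # MAI-Image-2, per image output token

def voice_cost(characters: int) -> float:
    """Cost of synthesizing a script of the given character count."""
    return characters / 1_000_000 * VOICE_PER_MILLION_CHARS

def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one image request: prompt tokens in, image tokens out."""
    return (input_tokens / 1_000_000 * IMAGE_IN_PER_MILLION
            + output_tokens / 1_000_000 * IMAGE_OUT_PER_MILLION)

# A 6,000-character narration script (an assumed size) costs about 13 cents:
print(f"${voice_cost(6_000):.2f}")      # $0.13

# An image request with an assumed 100-token prompt and 1,000 output tokens:
print(f"${image_cost(100, 1_000):.4f}")  # $0.0335
```

At these rates, per-request costs are small enough that the economics are dominated by volume — which is exactly where Microsoft’s distribution through Foundry, Copilot, and Bing gives it leverage.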

The contract renegotiation with OpenAI that made Microsoft’s model ambitions possible

To understand why these models matter, you have to understand the contractual tectonic shift that made them possible. Until October 2025, Microsoft was contractually prohibited from independently pursuing artificial general intelligence. The original deal with OpenAI, signed in 2019, gave Microsoft a license to OpenAI’s models in exchange for building the cloud infrastructure OpenAI needed. But when OpenAI sought to expand its compute footprint beyond Microsoft — striking deals with SoftBank and others — Microsoft renegotiated. As Suleyman explained in a December 2025 interview with Bloomberg, the revised agreement meant that “up until a few weeks ago, Microsoft was not allowed — by contract — to pursue artificial general intelligence or superintelligence independently.” The new terms freed Microsoft to build its own frontier models while retaining license rights to everything OpenAI builds through 2032.

Suleyman described the dynamic to VentureBeat in characteristically blunt terms. “Back in September of last year, we renegotiated the contract with OpenAI, and that enabled us to independently pursue our own superintelligence,” he said. “Since then, we’ve been convening the compute and the team and buying up the data that we need.”

He was quick to emphasize that the OpenAI partnership remains intact. “Nothing’s changing with the OpenAI partnership. We will be in partnership with them at least until 2032 and hopefully a lot longer,” Suleyman said. “They have been a phenomenal partner to us.” He also highlighted that Microsoft provides access to Anthropic’s Claude through its Foundry API, framing the company as “a platform of platforms.” But the subtext is unmistakable: Microsoft is building the capability to stand on its own. In March, as Business Insider first reported, Suleyman wrote in an internal memo that his goal is to “focus all my energy on our Superintelligence efforts and be able to deliver world class models for Microsoft over the next 5 years.” CNBC reported that the structural shift freed Suleyman from day-to-day Copilot product responsibilities, with former Snap executive Jacob Andreou taking over as EVP of the combined consumer and commercial Copilot experience.

How teams of fewer than 10 engineers built models that rival Big Tech’s best

Perhaps the most striking detail Suleyman shared with VentureBeat is how small the teams behind these models actually are. “The audio model was built by 10 people, and the vast majority of the speed, efficiency and accuracy gains come from the model architecture and the data that we have used,” Suleyman said. “My philosophy has always been that we need fewer people who are more empowered. So we operate an extremely flat structure.” He added: “Our image team, equally, is less than 10 people. So this is all about model and data innovation, which has delivered state of the art performance.”

This matters for two reasons. First, it challenges the prevailing industry narrative that frontier AI development requires thousands of researchers and billions in headcount costs. Meta, by contrast, has pursued what Suleyman described in his Bloomberg interview as a strategy of “hiring a lot of individuals, rather than maybe creating a team” — including reported compensation packages of $100 million to $200 million for top researchers. Second, small teams producing state-of-the-art results dramatically improve the economics. If Microsoft can build best-in-class transcription with 10 engineers and half the GPUs of competitors, the margin structure of its AI business looks fundamentally different from companies burning through cash to achieve similar benchmarks.

The lean-team philosophy also echoes Suleyman’s broader views on how AI is already reshaping the work of building AI itself. When asked by VentureBeat how his own team works, Suleyman described an environment that resembles a startup trading floor more than a traditional Microsoft engineering org. “There are groups of people around round tables, circular tables, not traditional desks, on laptops instead of big screens,” he said. “They’re basically vibe coding, side by side all day, morning till night, in rooms of 50 or 60 people.”

Why Suleyman’s “humanist AI” pitch is aimed squarely at enterprise buyers

Suleyman has been steadily building a philosophical brand around Microsoft’s AI efforts that he calls “humanist AI” — a term that appeared prominently in the blog post he authored for the launch and that he elaborated on in our interview. “I think that the motivation of a humanist super intelligence is to create something that is truly in service of humanity,” he told VentureBeat. “Humans will remain in control at the top of the food chain, and they will be always aligned to human interests.”

The framing serves multiple purposes. It differentiates Microsoft from the more acceleration-oriented rhetoric coming from OpenAI and Meta. It resonates with enterprise buyers who need governance, compliance, and safety assurances before deploying AI in regulated industries. And it provides a narrative hedge: if something goes wrong in the broader AI ecosystem, Microsoft can point to its stated commitment to human control. In his December Bloomberg interview, Suleyman went further, describing containment and alignment as “red lines” and arguing that no one should release a superintelligence tool until they are “confident it can be controlled.”

Suleyman also stressed data provenance as a competitive advantage, describing a conversation with CEO Satya Nadella about developing “a clean lineage of models where the data is extremely clean.” He drew an implicit contrast with open-source alternatives, noting that “many of the open-source models have been trained on data in, let’s say, inappropriate ways. And there are potentially security issues with that.” For enterprise customers evaluating AI vendors amid a thicket of copyright lawsuits across the industry, that is a meaningful commercial argument — if Microsoft can credibly claim that its training data was acquired through properly licensed channels, it reduces the legal and reputational risk of deploying these models in production.

Microsoft’s aggressive pricing puts pressure on Amazon, Google, and the AI startup ecosystem

Today’s launch positions Microsoft on three competitive fronts simultaneously. MAI-Transcribe-1 directly targets the transcription workloads that OpenAI’s Whisper models have dominated in the open-source community, with Microsoft claiming superior accuracy on all 25 benchmarked languages. The FLEURS results also show it winning against Google’s Gemini 3.1 Flash Lite on 22 of 25 languages — a direct challenge as Google aggressively pushes Gemini across its own product suite. And MAI-Voice-1’s ability to clone voices from seconds of audio and generate speech at 60x real-time puts it in competition with ElevenLabs, Resemble AI, and the growing ecosystem of voice AI startups, with Microsoft’s distribution advantage — any Foundry developer can now access these capabilities through the same API they use for GPT-4 and Claude — acting as a powerful moat.

Suleyman framed the competitive position confidently: “We’re now a top three lab just under OpenAI and Gemini,” he told VentureBeat. The pricing strategy — MAI-Voice-1 at $22 per million characters, MAI-Image-2 at $5 per million input tokens — reflects a deliberate decision to compete on cost. “We’re pricing them to be the very best of any hyperscaler. So there will be the cheapest of any of the hyperscalers out there, Amazon. And obviously Google,” Suleyman said. “And that’s a very conscious decision.”

This makes strategic sense for Microsoft, which can amortize model development costs across its enormous installed base of enterprise customers. But it also speaks to the question investors have been asking with increasing urgency: when does AI spending start generating returns? Microsoft’s stock has fallen roughly 17% year-to-date, according to CNBC, part of a broader selloff in software stocks. By building models that run on half the GPUs of competitors, Microsoft reduces its own infrastructure costs for internal products — Teams, Copilot, Bing, PowerPoint — while offering developers pricing designed to undercut the rest of the market. In his March memo, Suleyman wrote that his models would “enable us to deliver the COGS efficiencies necessary to be able to serve AI workloads at the immense scale required in the coming years.” These three models are the first tangible delivery on that promise.
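The COGS logic is simple enough to sketch. Every number below is an assumption for illustration — Microsoft disclosed only the 2x GPU-efficiency claim, not its actual serving costs or throughput:

```python
# Illustrative serving-cost model: if a model delivers the same quality on
# half the GPUs, per-request cost of goods sold halves. All inputs assumed.

GPU_HOUR_COST = 2.50           # assumed fully loaded cost per GPU-hour, USD
REQUESTS_PER_GPU_HOUR = 900    # assumed throughput of a baseline model

def cost_per_request(relative_gpus: float) -> float:
    # relative_gpus = 1.0 for the baseline; 0.5 for a model needing half the GPUs
    return GPU_HOUR_COST * relative_gpus / REQUESTS_PER_GPU_HOUR

baseline = cost_per_request(1.0)
half_gpu = cost_per_request(0.5)
print(f"baseline: ${baseline:.5f}/request, half-GPU model: ${half_gpu:.5f}/request")
```

Under these assumed inputs the half-GPU model serves each request at exactly half the cost, which is the margin argument behind Suleyman’s “half the GPUs of the state-of-the-art competition” claim.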

Suleyman says a frontier large language model is coming — and Microsoft plans to be “completely independent”

Suleyman made clear that transcription, voice, and image generation are just the beginning. When asked whether Microsoft would build a large language model to compete directly with GPT at the frontier level, he was unequivocal. “We absolutely are going to be delivering state of the art models across all modalities,” he said. “Our mission is to make sure that if Microsoft ever needs it, we will be able to provide state of the art at the best efficiency, the cheapest price, and be completely independent.”

He described a multi-year roadmap to “set up the GPU clusters at the appropriate scale,” noting that the superintelligence team was formally stood up only in October 2025. Suleyman spoke to VentureBeat from Miami, where the full team was convening for one of its regular week-long in-person sessions. He described Nadella flying in for the gathering to lay out “the roadmap of everything that we need to achieve for our AI self-sufficiency mission over the next 2, 3, 4 years, and all the compute roadmap that that would involve.”

Building a competitive frontier LLM, of course, is a different order of magnitude in complexity, data requirements, and compute cost from what Microsoft demonstrated Wednesday. The models launched today are specialized — they handle audio and images, not the general reasoning and text generation that underpin products like ChatGPT or Copilot’s core intelligence. Suleyman has the organizational mandate, Nadella’s public backing, and the contractual freedom. What he doesn’t yet have is a track record at Microsoft of delivering on the hardest problem in AI.

But consider what he does have: three models that are best-in-class or near it in their respective domains, built by teams smaller than most seed-stage startups, running on half the industry-standard GPU footprint, and priced below every major cloud competitor. Two years ago, Suleyman proposed in MIT Technology Review what he called the “Modern Turing Test” — not whether AI could fool a human in conversation, but whether it could go out into the world and accomplish real economic tasks with minimal oversight. On Wednesday, his own models took a step toward that vision. The question now is whether Microsoft’s superintelligence team can repeat the trick at the scale that actually matters — and whether they can do it before the market’s patience runs out.
