Microsoft takes on AI rivals with three new foundational models

TechCrunch • April 2, 2026 • technology

Key Points:

Microsoft AI has launched three foundational multimodal AI models (MAI-Transcribe-1 for speech-to-text, MAI-Voice-1 for audio generation, and MAI-Image-2 for image generation), aiming to enhance its AI capabilities and compete with other AI labs.
MAI-Transcribe-1 supports transcription in 25 languages and is 2.5 times faster than Microsoft’s previous Azure Fast service. MAI-Voice-1 can generate 60 seconds of audio in one second and allows custom voice creation. MAI-Image-2 was initially available on MAI Playground and is now also available on Microsoft Foundry.
These models were developed by Microsoft’s MAI Superintelligence team, led by Microsoft AI CEO Mustafa Suleyman, who emphasized a human-centered approach to AI focused on practical communication and use cases.
Microsoft positions these models as cost-effective alternatives to offerings from Google and OpenAI, with pricing starting at $0.36 per hour for transcription, $22 per million characters for voice, and $5 to $33 per million tokens for image-related tasks.
Despite developing its own AI models, Microsoft maintains its partnership with OpenAI, with recent renegotiations enabling Microsoft to advance its superintelligence research independently while continuing collaboration.
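
To put the quoted starting prices in perspective, here is a rough back-of-the-envelope sketch in Python. It is not an official pricing calculator. It assumes "per hour" means per hour of audio transcribed, that billing scales linearly with no tiers, minimums, or discounts, and that the $5 to $33 figure is simply a low-to-high range. The function names are hypothetical and chosen only for illustration.

```python
# Starting prices quoted in the key points (USD). Actual billing terms may differ.
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36        # assumed: per hour of audio transcribed
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_TOKENS = (5.0, 33.0)  # low and high ends of the quoted range


def transcription_cost(audio_hours: float) -> float:
    """Estimated cost of transcribing the given hours of audio."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated cost of generating speech for the given number of characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost_range(tokens: int) -> tuple[float, float]:
    """Estimated (low, high) cost for the given number of image-related tokens."""
    low, high = IMAGE_USD_PER_MILLION_TOKENS
    millions = tokens / 1_000_000
    return millions * low, millions * high
```

Under these assumptions, `transcription_cost(100)` estimates the cost of 100 hours of audio, and `image_cost_range(2_000_000)` returns a low and a high bound rather than a single figure, since the article quotes a range without saying what determines where a given job falls within it.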
