Gemini 2.0 Flash ushers in a new era of real-time multimodal AI
Google’s release of Gemini 2.0 Flash this week, offering users a way to interact live with video of their surroundings, has set the stage for what could be a pivotal shift in how enterprises and consumers engage with technology.
This release — alongside announcements from OpenAI, Microsoft, and others — is part of a transformative leap forward in the area of technology called "multimodal AI." The technology lets you feed live video, audio, or images from your computer or phone into an AI model and ask questions about them.
It also signals an intensification of the competitive race among Google and its chief rivals — OpenAI and Microsoft — for dominance in AI capabilities. But more importantly, it feels like it is defining the next era of interactive, agentic computing.
This moment in AI feels to me like an "iPhone moment," by which I mean 2007-2008, when Apple released the iPhone and, through its internet connection and slick user interface, transformed daily life by giving people a powerful computer in their pocket.
While OpenAI’s ChatGPT may have kicked off this latest AI moment with its powerful human-like chatbot in November 2022, Google’s release here at the end of 2024 feels like a major continuation of that moment, arriving at a time when many observers had been worried about a possible slowdown in AI progress.
Gemini 2.0 Flash: The catalyst of AI’s multimodal revolution
Google’s Gemini 2.0 Flash offers groundbreaking functionality, allowing real-time interaction with video captured via a smartphone. Unlike prior staged demonstrations (e.g. Google’s Project Astra in May), this technology is now available to everyday users through Google’s AI Studio.
I encourage you to try it yourself. I used it to view and interact with my surroundings, which for me this morning were my kitchen and dining room. You can see instantly how this offers breakthroughs for education and other use cases. You can see why content creator Jerrod Lew reacted on X yesterday with astonishment when he used Gemini 2.0’s real-time AI to edit a video in Adobe Premiere Pro. “This is absolutely insane,” he said, after Google guided him within seconds on how to add a basic blur effect even though he was a novice user.
Sam Witteveen, a prominent AI developer and cofounder of Red Dragon AI, was given early access to test Gemini 2.0 Flash. He highlighted that the model’s speed — it is twice as fast as Google’s flagship model until now, Gemini 1.5 Pro — and “insanely cheap” pricing make it not just a showcase for developers to test new products with, but a practical tool for enterprises managing AI budgets. (To be clear, Google hasn’t actually announced pricing for Gemini 2.0 Flash yet. It is a free preview. But Witteveen is basing his assumptions on the precedent set by Google’s Gemini 1.5 series.)
For developers, the new Multimodal Live API offers significant potential, because it enables seamless integration of these capabilities into applications. The API is already available to use, along with a demo app, and Google has published a blog post for developers.
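To give a sense of what that integration looks like, here is a minimal sketch of a text-only session against the Multimodal Live API using Google’s google-genai Python SDK. The model name, config fields, and session methods follow Google’s launch-day examples, but this is a preview API, so exact names and signatures may shift; treat it as an illustration rather than a definitive implementation.

```python
# Minimal sketch: a text-only session with the Gemini Multimodal Live API.
# Assumes the google-genai Python SDK (pip install google-genai) and an API
# key from Google AI Studio. Streaming camera or microphone frames would use
# the same session object, but is omitted here for brevity.
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

async def main():
    config = {"response_modalities": ["TEXT"]}  # could also request "AUDIO"
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp", config=config
    ) as session:
        # Send one user turn and mark it complete so the model responds.
        await session.send(
            input="In one sentence, what can a live multimodal session do?",
            end_of_turn=True,
        )
        # Stream the model's reply as it arrives.
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```

The same session can carry audio and video frames alongside text, which is what makes the real-time “talk to it about what it sees” experience possible.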
Programmer Simon Willison called the streaming API next-level: “This stuff is straight out of science fiction: being able to have an audio conversation with a capable LLM about things that it can ‘see’ through your camera is one of those ‘we live in the future’ moments.” He noted that you can ask the API to enable a code execution mode, which lets the model write Python code, run it and fold the result into its response — all part of an agentic future.
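As a rough illustration of the code execution mode Willison describes, the sketch below asks Gemini 2.0 Flash to write and run Python as part of answering a question. It assumes the google-genai SDK’s code-execution tool as shown in Google’s published examples; field names could change while the model is in preview.

```python
# Sketch: enabling the code execution tool so the model can write and run
# Python, then use the result in its answer. Assumes the google-genai SDK
# and an AI Studio API key; based on Google's published examples.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="What is the sum of the first 50 prime numbers? "
             "Generate and run Python code to check.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())]
    ),
)

# The response interleaves text, the generated code, and its execution result.
for part in response.candidates[0].content.parts:
    if part.text:
        print(part.text)
    if part.executable_code:
        print(part.executable_code.code)
    if part.code_execution_result:
        print(part.code_execution_result.output)
```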
The technology is clearly a harbinger of new application ecosystems and user expectations. Imagine being able to analyze live video during a presentation, suggest edits, or troubleshoot in real time.
Yes, the technology is cool for consumers, but it’s important for enterprise users and leaders to grasp as well. The new features are the foundation of an entirely new way of working and interacting with technology — suggesting coming productivity gains and creative workflows.
The competitive landscape: A race to define the future
Wednesday’s release of Google’s Gemini 2.0 Flash comes amid a flurry of releases by Google and by its major competitors, which are rushing to ship their latest technologies by the end of the year. They all promise to deliver consumer-ready multimodal capabilities — live video interaction, image generation, and voice synthesis — but some of them aren’t fully baked or even fully available.
One reason for the rush is that some of these companies offer their employees bonuses for delivering key products before the end of the year. Another is bragging rights for getting new features out first. Being first can win major user traction, as OpenAI showed in 2022, when its ChatGPT became the fastest-growing consumer product in history. Even though Google had similar technology, it was not prepared for a public release and was left flat-footed. Observers have sharply criticized Google ever since for being too slow.
Here’s what the other companies have announced in the past few days, all helping introduce this new era of multimodal AI.
- OpenAI’s Advanced Voice Mode with Vision: Launched yesterday but still rolling out, it offers features like real-time video analysis and screen sharing. While promising, early access issues have limited its immediate impact. For example, I couldn’t access it yet even though I’m a Plus subscriber.
- Microsoft’s Copilot Vision: Last week, Microsoft launched a similar technology in preview — only for a select group of its Pro users. Its browser-integrated design hints at enterprise applications but lacks the polish and accessibility of Gemini 2.0. Microsoft also released a fast, powerful Phi-4 model to boot.
- Anthropic’s Claude 3.5 Haiku: Anthropic, until now in a heated race for large language model (LLM) leadership with OpenAI, hasn’t delivered anything as bleeding-edge on the multimodal side. It did just release 3.5 Haiku, notable for efficiency and speed. But its focus on cost reduction and smaller model sizes contrasts with the boundary-pushing features of Google’s latest release, and those of OpenAI’s Voice Mode with Vision.
Navigating challenges and embracing opportunities
While these technologies are revolutionary, challenges remain:
- Accessibility and scalability: OpenAI and Microsoft have faced rollout bottlenecks, and Google must ensure it avoids similar pitfalls. Google noted that its live-streaming feature (Project Astra) is limited to about 10 minutes of in-session contextual memory, although that is likely to increase over time.
- Privacy and security: AI systems that analyze real-time video or personal data need robust safeguards to maintain trust. Google’s Gemini 2.0 Flash model has native image generation built in, access to third-party APIs, and the ability to tap Google search and execute code. All of that is powerful, but can make it dangerously easy for someone to accidentally release private information while playing around with this stuff.
- Ecosystem integration: As Microsoft leverages its enterprise suite and Google anchors itself in Chrome, the question remains: Which platform offers the most seamless experience for enterprises?
However, all of these hurdles are outweighed by the technology’s potential benefits, and there’s no doubt that developers and enterprise companies will be rushing to embrace them over the next year.
Conclusion: A new dawn, led for now by Google
As developer Sam Witteveen and I discuss in our podcast taped Wednesday night after Google’s announcement, Gemini 2.0 Flash is truly an impressive release, marking the moment when multimodal AI became real. Google’s advancements have set a new benchmark, although that edge could be extremely fleeting. OpenAI and Microsoft are hot on its tail. We’re still very early in this revolution, just as in 2008, when even after the iPhone’s release it wasn’t clear how Google, Nokia, and RIM would respond. History showed that Nokia and RIM didn’t, and they died. Google responded really well, and has given the iPhone a run.
Likewise, it’s clear that Microsoft and OpenAI are very much in this race with Google. Apple, meanwhile, has decided to partner on the technology, and this week announced a further integration with ChatGPT — but it’s certainly not trying to win outright in this new era of multimodal offerings.
In our podcast, Sam and I also cover Google’s special strategic advantage around the browser. For example, its Project Mariner release, a Chrome extension, allows you to do real-world web browsing tasks with even more functionality than competing technologies offered by Anthropic (called Computer Use) and Microsoft’s OmniParser (still in research). (It’s true that Anthropic’s feature gives you more access to your computer’s local resources.) All of this gives Google a head start in the race to push forward agentic AI technologies in 2025 as well, even if Microsoft appears to be ahead on the actual execution side of delivering agentic solutions to enterprises. AI agents do complex tasks autonomously, with minimal human intervention — for example, they’ll soon do advanced research tasks and database checks before performing ecommerce, stock trading or even real estate buying.
Google’s focus on making these Gemini 2.0 capabilities accessible to both developers and consumers is smart, because it ensures it is addressing the industry with a comprehensive plan. Until now, Google has had a reputation for not being as aggressively focused on developers as Microsoft is.
The question for decision-makers is not whether to adopt these tools, but how quickly they can integrate them into workflows. It is going to be fascinating to see where the next year takes us. Make sure to listen to our takeaways for enterprise users in the video below: