11 Jul 2025

New Gemini AI tool animates photos into short video clips

Google’s Gemini AI now turns still photos into eight-second videos with audio using Veo 3, and the feature is live in India.

Google has rolled out a new feature for Gemini AI that transforms still photos into short, animated eight-second videos with sound. The capability is powered by Veo 3, Google’s latest video generation model, and is currently available to Google AI Ultra and Pro subscribers.

The tool supports background noise, ambient audio, and even spoken dialogue, and is gradually rolling out to users in select countries, including India. At launch, access is limited to the web interface, though Google has announced that mobile support will follow later in the week.

To use the tool, users upload a photo, describe the intended motion, and optionally add prompts for sound effects or narration. Gemini then generates a 720p MP4 video in a 16:9 landscape format, automatically synchronising visuals and audio.

Josh Woodward, Vice President of the Gemini app and Google Labs, showcased the feature on X (formerly Twitter), animating a child’s drawing. ‘Still experimental, but we wanted our Pro and Ultra members to try it first,’ he said, calling the result fun and expressive.

To maintain authenticity, each video includes a visible ‘Veo’ watermark in the bottom-right corner and an invisible SynthID watermark. This hidden digital signature, developed by Google DeepMind, helps identify AI-generated content and preserve transparency around synthetic media.

The company has emphasised its commitment to responsible AI deployment by embedding traceable markers in all output from this tool. These safeguards come amid increasing scrutiny of generative video tools and deepfakes across digital platforms.

To animate a photo using Gemini AI’s new tool, click on the ‘tools’ icon in the prompt bar, then choose the ‘video’ option from the menu. Upload the still image, describe the desired motion, and optionally provide instructions for sound or narration.

The underlying Veo 3 model was first introduced at Google I/O as the company’s most advanced video generation engine. It can produce high-quality visuals, simulate real-world physics, and even lip-sync dialogue from text and image-based prompts.

A Google blog post explains: ‘Veo 3 excels from text and image prompting to real-world physics and accurate lip syncing.’ The company says users can craft short story prompts and expect realistic, cinematic responses from the model.
