1 Jul 2025

Text-to-image and beyond: Alibaba launches Qwen VLo AI model

Qwen VLo improves upon earlier versions by addressing semantic inconsistencies in object recognition.

Alibaba Group has launched a new AI model called Qwen VLo, designed to generate and edit images based on text prompts and visual inputs. The model is an upgrade of the earlier Qwen2.5-VL and forms part of Alibaba’s expanding suite of AI services.

Qwen VLo introduces a ‘progressive generation’ feature, which lets users watch the image develop in real time. Users can request creations with simple prompts such as ‘generate a picture of a dog’, or upload existing photos for editing.

According to a company blog post on GitHub, Qwen VLo is a unified multimodal understanding and generation model. It not only interprets visual and textual data, but also produces high-quality, context-aware image outputs.

Previous models had difficulty with semantic consistency, often misidentifying objects or altering key features, such as the shape or colour of a car. According to Alibaba, Qwen VLo addresses these issues, offering improved object recognition and detail retention.

Users can issue open-ended editing commands, such as ‘add a sun to the sky’ or ‘make this photo look like it’s from the 19th century’. The model also supports traditional vision tasks like depth estimation, edge detection, and segmentation.

Multiple image tasks can be performed simultaneously, making Qwen VLo suitable for more advanced use cases. Thanks to the model’s multilingual capabilities, instructions can be given in several languages, including English and Chinese.

Alibaba, best known for its e-commerce services, has been steadily advancing its AI research and development. In February, CEO Eddie Wu said the company’s primary focus is now on artificial general intelligence, meaning AI systems that match or exceed human-level cognition.
