New method helps AI models locate personalised objects in scenes

MIT researchers have devised a training method that lets vision-language models find a specific object (like your pet) in new scenes with far greater accuracy.


MIT and the MIT-IBM Watson AI Lab have developed a training approach that enables generative vision-language models to localise personalised objects (for example, a specific cat) across new scenes, a task at which they previously performed poorly.

While vision-language models (VLMs) are good at recognising generic object categories (dogs, chairs, etc.), they struggle when asked to point out your specific dog or chair under different conditions.

To remedy this, the researchers designed a fine-tuning regime built on video-tracking datasets, in which the same object appears across multiple frames.

Crucially, they used pseudo-names (e.g. ‘Charlie’) instead of real category names to prevent the model from relying on memorised label associations. This encourages it to reason from context, scene layout, appearance cues, and relative position, rather than falling back on category matches.
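The researchers’ exact data pipeline is not described here, but the idea can be illustrated with a minimal Python sketch: a few tracked frames introduce the object under a pseudo-name, and a later frame becomes the query the model must localise it in. The frame paths, box format, pseudo-name list and prompt wording below are illustrative assumptions, not the actual training code.

```python
# Hypothetical sketch: building one in-context training example from a
# video-tracking clip. Paths, boxes and prompt wording are illustrative
# assumptions, not the researchers' actual data pipeline.
import random
from dataclasses import dataclass

@dataclass
class Frame:
    image_path: str   # path to one video frame
    box: tuple        # (x1, y1, x2, y2) box around the tracked object

PSEUDO_NAMES = ["Charlie", "Milo", "Pip", "Nova"]  # stand-ins for real category labels

def build_example(track: list[Frame], n_context: int = 3) -> dict:
    """Turn frames of one tracked object into a personalised-localisation example.

    The first n_context frames (with boxes) introduce the object under a
    pseudo-name; the next frame is held out as the query to localise in.
    """
    name = random.choice(PSEUDO_NAMES)              # avoid memorised label associations
    context, query = track[:n_context], track[n_context]
    prompt = [
        {"image": f.image_path, "text": f"This is {name}, at box {f.box}."}
        for f in context
    ]
    prompt.append({"image": query.image_path,
                   "text": f"Where is {name} in this image? Answer with a box."})
    return {"prompt": prompt, "target_box": query.box}

# Example usage with placeholder frames from one tracking clip:
clip = [Frame(f"clip01/frame_{i}.jpg", (10 + i, 20, 80 + i, 120)) for i in range(4)]
print(build_example(clip))
```

The intent is that, during fine-tuning, the model learns to predict the target box for the query frame from the pseudo-named context frames, rather than from a familiar category label.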

AI models trained with the method showed a 12% average improvement in personalised localisation. In some settings, especially with pseudo-naming, gains reached 21%. Importantly, this enhanced ability did not degrade the model’s overall object recognition performance.

Potential applications include smart home cameras recognising your pet, assistive devices helping visually impaired users find items, robotics, surveillance, and ecological monitoring (e.g. tracking particular animals). The approach helps models better generalise from a few example images rather than needing full retraining for each new object.
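At inference time, that few-shot personalisation amounts to packing a handful of reference photos of the user’s own object into the prompt before asking about a new scene. The sketch below shows roughly what that could look like; the file paths and the vlm_localise() call are illustrative placeholders, since the actual model interface is not specified.

```python
# Hypothetical sketch: personalising a fine-tuned VLM at inference time with a
# few reference photos, no per-object retraining. vlm_localise() is a
# placeholder for whatever model interface is actually used.
def personalise_prompt(reference_images: list[str], new_scene: str,
                       name: str = "Charlie") -> list[dict]:
    """Build an in-context prompt: reference images first, query scene last."""
    messages = [{"image": img, "text": f"This is {name}."} for img in reference_images]
    messages.append({"image": new_scene,
                     "text": f"Find {name} in this image and return a bounding box."})
    return messages

prompt = personalise_prompt(
    ["photos/my_cat_1.jpg", "photos/my_cat_2.jpg", "photos/my_cat_3.jpg"],
    "snapshots/living_room_today.jpg",
)
# box = vlm_localise(prompt)   # placeholder: call the fine-tuned VLM here
print(prompt)
```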
