Deep Learning

Getting Started with Zero-Shot Learning

February 26, 2026 · 5 min read

Zero-shot learning (ZSL) is one of the most fascinating challenges in modern machine learning. The core idea is simple but profound: can a model recognize or classify something it has never seen during training?

Why it matters

Traditional supervised learning requires labeled examples for every class you want to recognize. This is expensive, time-consuming, and often impossible — especially for rare events like unusual crimes in surveillance footage, rare diseases in medical imaging, or novel objects in robotics.

Zero-shot learning sidesteps this by transferring knowledge from seen classes to unseen ones using auxiliary information — typically semantic descriptions, attributes, or embeddings.
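To make the idea of auxiliary information concrete, here is a minimal sketch of attribute-based zero-shot classification. Everything in it is an illustrative assumption: the three attributes, the class signatures, and the `classify_by_attributes` helper are hypothetical, and the "predicted" attribute scores stand in for the output of attribute predictors that would have been trained on seen classes only. The unseen class ("zebra") has no training images, just a description in attribute space.

```python
import math

# Hypothetical attribute order: [has_stripes, has_hooves, has_wings]
CLASS_ATTRIBUTES = {
    "horse": [0.0, 1.0, 0.0],  # seen class
    "tiger": [1.0, 0.0, 0.0],  # seen class
    "zebra": [1.0, 1.0, 0.0],  # unseen class: described, never shown
}


def classify_by_attributes(predicted, class_attributes):
    """Return the class whose attribute signature is nearest to the prediction."""
    return min(
        class_attributes,
        key=lambda name: math.dist(class_attributes[name], predicted),
    )


# Stand-in for attribute scores predicted from an image of a zebra.
predicted = [0.9, 0.8, 0.1]
label = classify_by_attributes(predicted, CLASS_ATTRIBUTES)
print(label)
```

The classifier never needs a zebra image; it only needs the zebra's description to live in the same space as the predictions.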

How CLIP changed everything

OpenAI's CLIP (Contrastive Language-Image Pretraining) marked a turning point. By training on 400 million image-text pairs from the internet, CLIP learned a shared embedding space where images and their natural language descriptions are close together.

This means you can classify an image into any category, even one with no labeled training examples, by comparing its embedding to the text embedding of each candidate category name (often wrapped in a prompt such as "a photo of a dog") and picking the closest match.
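The comparison step can be sketched in a few lines of plain Python. This is not CLIP itself: the three-dimensional vectors below are made-up stand-ins for embeddings that a real image encoder and text encoder would produce, and `zero_shot_classify` and its `logit_scale` default are assumptions chosen for illustration. The sketch only shows the scoring logic: cosine similarity between one image embedding and each prompt embedding, scaled and passed through a softmax.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, logit_scale=100.0):
    """Map each prompt to a probability based on similarity to the image."""
    prompts = list(text_embeddings)
    logits = [
        logit_scale * cosine_similarity(image_embedding, text_embeddings[p])
        for p in prompts
    ]
    return dict(zip(prompts, softmax(logits)))


# Toy embeddings (assumed values, not real encoder output).
text_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

probabilities = zero_shot_classify(image_embedding, text_embeddings)
best_prompt = max(probabilities, key=probabilities.get)
print(best_prompt)
```

Adding a new category is just adding a new prompt to the dictionary; nothing is retrained.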

A simple example

Suppose you want to detect "a person climbing a fence" in a surveillance video. With a traditional model, you'd need hundreds of labeled examples of that exact behavior. With a CLIP-based zero-shot approach, you encode the text prompt once and compare its embedding against the embedding of each video frame; frames that score high are candidates for the behavior. No labeled anomaly data required.
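A rough sketch of that frame-by-frame comparison might look like the following. It rests on several assumptions that are not part of the description above: the frame and prompt embeddings are invented toy vectors rather than real encoder output, the `flag_frames` helper is hypothetical, and comparing against a second "baseline" prompt with a fixed margin is just one simple way to turn similarity scores into a decision.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def flag_frames(frame_embeddings, target_embedding, baseline_embedding, margin=0.1):
    """Return indices of frames that match the target prompt better than
    the baseline prompt by at least `margin` (an assumed threshold)."""
    flagged = []
    for index, frame in enumerate(frame_embeddings):
        target_score = cosine_similarity(frame, target_embedding)
        baseline_score = cosine_similarity(frame, baseline_embedding)
        if target_score - baseline_score >= margin:
            flagged.append(index)
    return flagged


# Toy embeddings (assumed values, not real encoder output).
target = [0.1, 0.9, 0.2]    # stands in for "a person climbing a fence"
baseline = [0.9, 0.1, 0.1]  # stands in for a hypothetical "a person walking"
frames = [
    [0.8, 0.2, 0.1],  # looks like the baseline
    [0.7, 0.3, 0.1],  # looks like the baseline
    [0.2, 0.8, 0.3],  # looks like the target
    [0.5, 0.5, 0.1],  # ambiguous, stays below the margin
]

flagged = flag_frames(frames, target, baseline)
print(flagged)
```

In a real pipeline the margin would need tuning on held-out footage, since raw similarity scores from a pretrained model are not calibrated for any particular camera or scene.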

What's next

The frontier is now context-aware zero-shot learning — making models sensitive not just to what is happening, but where and under what circumstances. A person running is normal on a track but suspicious in a bank vault. This is the problem my thesis addresses.

If you're interested in exploring ZSL further, I recommend starting with the original CLIP paper by Radford et al. (2021) and the survey by Wang et al. on generalized zero-shot learning.

Zero-Shot Learning · CLIP · Deep Learning