Practice and reinforce the concepts from Lesson 17
In this activity, you'll build a complete multi-modal AI system that connects vision and language. You'll implement CLIP from scratch, perform zero-shot image classification, create an image-text retrieval system, and generate images from text using Stable Diffusion. This hands-on experience will show you how AI systems like Google Lens and DALL-E actually work.
What makes this special: You're building the core technology behind modern multi-modal AI systems - the same techniques used in GPT-4 Vision, Google's Gemini, and image search engines.
Time Required: 90-120 minutes
By completing this activity, you will be able to:
Implement CLIP's contrastive learning objective and train it on image-caption pairs
Perform zero-shot image classification using text prompts
Build an image-text retrieval system on shared embeddings
Generate images from text prompts with Stable Diffusion
Before starting this activity, you should:
1. Download the template from the course Templates folder: `AI25-Template-activity-17-multi-modal.zip`
2. Upload the `.ipynb` file to Google Colab
3. Run the first cell to verify your environment and see a working demo
This activity provides a 65-70% working implementation. You'll complete the missing pieces:
Example Output:
Epoch 1: Loss = 5.87, Accuracy = 23.4%
Epoch 5: Loss = 3.12, Accuracy = 61.2%
Epoch 10: Loss = 2.48, Accuracy = 78.5%
Similarity Scores:
Image: [dog photo] + Text: "a dog" = 0.87 ✓
Image: [dog photo] + Text: "a cat" = 0.23 ✗
Example Output:
Zero-Shot CIFAR-10 Results:
Accuracy: 67.3%
Per-Class Performance:
Airplane: 72% ✓
Automobile: 81% ✓
Bird: 58% (confused with airplane)
Cat: 64%
Deer: 69%
Dog: 71%
Frog: 75%
Horse: 66%
Ship: 78%
Truck: 83%
Example Output:
Query: "a person riding a bicycle"
Top 5 Results:
1. [Image: cyclist on road] - Similarity: 0.92
2. [Image: bike in park] - Similarity: 0.88
3. [Image: mountain biking] - Similarity: 0.85
4. [Image: person with bike] - Similarity: 0.81
5. [Image: bicycle race] - Similarity: 0.78
Example Output:
Prompt: "A cozy cabin in the snowy mountains at sunset, digital art"
Guidance Scale = 3.0: Creative but loose interpretation
Guidance Scale = 7.5: Balanced (most realistic)
Guidance Scale = 15.0: Highly detailed but may be over-saturated
Your implementation is successful when:
Start with Pre-Trained: Use pre-trained CLIP from HuggingFace for faster results (see the loading sketch after these tips). Train from scratch only if you have time.
Batch Size: Contrastive learning benefits from large batches (32-128); use a smaller batch if GPU memory is limited.
Temperature Tuning: The temperature parameter (τ = 0.07) in contrastive loss is critical. Too high -> the similarity distribution becomes too soft and the model barely learns; too low -> training becomes unstable and fixates on the hardest negatives.
Normalization: Always normalize embeddings before computing similarity (crucial for contrastive learning).
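If you take the pre-trained route, a minimal loading sketch with the HuggingFace `transformers` library (assuming the `openai/clip-vit-base-patch32` checkpoint and a hypothetical local image file) looks like this:
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pre-trained CLIP checkpoint (assumption: openai/clip-vit-base-patch32)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Score one image against two candidate captions
image = Image.open("dog.jpg")  # hypothetical local file
texts = ["a photo of a dog", "a photo of a cat"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarities (scaled by CLIP's learned
# temperature); softmax turns them into match probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))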
Hint: InfoNCE loss is symmetric - compute loss for both image->text and text->image directions.
Common Mistakes:
Forgetting to normalize embeddings before computing similarity (use `F.normalize(embedding, dim=-1)`)
Debug Checklist:
Check that embedding norms are close to 1 (inspect `torch.norm(embedding, dim=-1)`)
Code Pattern:
import torch
import torch.nn.functional as F

# Normalize embeddings so cosine similarity is a plain dot product
img_emb = F.normalize(img_emb, dim=-1)
txt_emb = F.normalize(txt_emb, dim=-1)

# Compute similarity matrix, scaled by the temperature
logits = torch.matmul(img_emb, txt_emb.T) / temperature  # [batch, batch]

# Diagonal elements are the positive (matching) pairs
labels = torch.arange(logits.size(0), device=logits.device)

# Symmetric InfoNCE loss: image->text and text->image
loss_i2t = F.cross_entropy(logits, labels)
loss_t2i = F.cross_entropy(logits.T, labels)
loss = (loss_i2t + loss_t2i) / 2
Hint: Create text prompts for each class (e.g., "a photo of a dog", "a photo of a cat").
Common Mistakes:
Debug Checklist:
Prompt Engineering:
# Good prompts
templates = [
    "a photo of a {}",
    "a picture of a {}",
    "a {} in a natural setting",
]

# Average embeddings over templates for each class
for cls in classes:
    cls_embeddings = []
    for template in templates:
        text = template.format(cls)
        embedding = text_encoder(text)
        cls_embeddings.append(embedding)
    final_embedding = torch.stack(cls_embeddings).mean(dim=0)
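Building on the prompt-ensembling pattern above, zero-shot classification is just a cosine similarity between each image embedding and the per-class prompt embeddings. A minimal sketch, assuming an `image_encoder`, a batch of test `images` with `labels`, and a `class_embeddings_list` that collects the `final_embedding` of every class:
import torch
import torch.nn.functional as F

# One averaged prompt embedding per class: [num_classes, d]
class_emb = F.normalize(torch.stack(class_embeddings_list), dim=-1)

# Encode and normalize a batch of test images: [batch, d]
img_emb = F.normalize(image_encoder(images), dim=-1)

# Cosine similarity between every image and every class prompt
similarity = img_emb @ class_emb.T  # [batch, num_classes]

# Zero-shot prediction = the class whose prompt embedding is closest
predictions = similarity.argmax(dim=-1)
accuracy = (predictions == labels).float().mean()
print(f"Zero-shot accuracy: {accuracy:.1%}")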
Hint: For image->text retrieval, find the text with highest similarity to the query image.
Common Mistakes:
Debug Checklist:
Efficient Retrieval:
# Precompute all embeddings
image_embeddings = encode_images(all_images) # [N, d]
text_embeddings = encode_texts(all_captions) # [N, d]
# Query: text → images
query_text_emb = encode_text(query) # [1, d]
similarities = torch.matmul(query_text_emb, image_embeddings.T) # [1, N]
top_k_indices = similarities.topk(k=5).indices
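The same precomputed embeddings also support the reverse, image->text direction from the hint. A minimal sketch, assuming a hypothetical `encode_image` helper and a `query_image`:
import torch

# Query: image -> captions (reverse of the snippet above)
query_img_emb = encode_image(query_image)  # [1, d], assumed normalized
similarities = torch.matmul(query_img_emb, text_embeddings.T)  # [1, N]
top_k = similarities.topk(k=5)

for score, idx in zip(top_k.values[0], top_k.indices[0]):
    print(f"{score:.2f}  {all_captions[idx]}")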
Hint: Use the StableDiffusionPipeline from diffusers library. Experiment with guidance_scale (7.5 is typical).
Common Mistakes:
Debug Checklist:
Hyperparameter Exploration:
guidance_scales = [1.0, 5.0, 7.5, 10.0, 15.0]
inference_steps = [20, 50, 100]

for gs in guidance_scales:
    for steps in inference_steps:
        image = pipe(
            prompt=prompt,
            guidance_scale=gs,
            num_inference_steps=steps,
        ).images[0]
        # Compare visual quality across settings
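The loop above assumes a `pipe` object already exists. A minimal loading sketch with the `diffusers` library (assuming the `runwayml/stable-diffusion-v1-5` checkpoint and a CUDA GPU, as on Colab):
import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline in half precision so it fits comfortably on a Colab GPU
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumption: any SD 1.x checkpoint works similarly
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "A cozy cabin in the snowy mountains at sunset, digital art"
image = pipe(prompt, guidance_scale=7.5, num_inference_steps=50).images[0]
image.save("cabin.png")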
Task: Use t-SNE or UMAP to visualize the learned image and text embeddings in 2D.
Expected Outcome: Plot showing clusters of semantically similar items (e.g., all "dog" images and "dog" text near each other).
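A minimal t-SNE sketch, assuming you already have `image_embeddings` and `text_embeddings` as PyTorch tensors plus matching integer class `labels` for coloring:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stack image and text embeddings into one matrix
emb = np.concatenate([
    image_embeddings.cpu().numpy(),
    text_embeddings.cpu().numpy(),
])

# Project to 2D (perplexity ~30 is a reasonable default for a few hundred points)
coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(emb)

# Images as circles, text as crosses; matching classes should cluster together
n = len(image_embeddings)
plt.scatter(coords[:n, 0], coords[:n, 1], c=labels, marker="o", label="images")
plt.scatter(coords[n:, 0], coords[n:, 1], c=labels, marker="x", label="text")
plt.legend()
plt.title("Joint image-text embedding space (t-SNE)")
plt.show()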
Task: Build an encoder-decoder model that generates captions for images.
Approach:
Expected Outcome: Given an image, generate a descriptive caption (e.g., "A dog playing fetch in a park").
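Before building your own encoder-decoder, it can help to see the target behavior with an off-the-shelf captioning model. A minimal inference sketch using BLIP from the `transformers` library (assuming the `Salesforce/blip-image-captioning-base` checkpoint and a hypothetical local image):
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("dog_park.jpg")  # hypothetical local file
inputs = processor(images=image, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))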
Task: Create a model that answers questions about images.
Steps:
Expected Outcome: Model answers questions like "What color is the car?" -> "Red".
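As with captioning, a pre-trained model is a useful baseline before training your own. A minimal VQA sketch (assuming the `Salesforce/blip-vqa-base` checkpoint and a hypothetical local image):
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("car.jpg")  # hypothetical local file
question = "What color is the car?"
inputs = processor(images=image, text=question, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))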
Task: Train CLIP on a domain-specific dataset (e.g., medical images + radiology reports).
Challenges:
Expected Outcome: Custom CLIP model that outperforms general-purpose CLIP on your domain.
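A minimal fine-tuning sketch using the HuggingFace `CLIPModel`, which can compute its own symmetric contrastive loss via `return_loss=True`; the domain-specific `dataloader` yielding lists of PIL images and matching caption strings is an assumption:
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR to avoid catastrophic forgetting

model.train()
for epoch in range(3):
    for images, captions in dataloader:  # assumption: your domain-specific DataLoader
        inputs = processor(text=captions, images=images, return_tensors="pt",
                           padding=True, truncation=True).to(device)
        outputs = model(**inputs, return_loss=True)  # symmetric image<->text contrastive loss
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()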
Completed Jupyter Notebook (.ipynb file)
Results Gallery (Markdown cells in notebook)
Analysis Report (Markdown cell)
| Criterion | Points | Description |
|---|---|---|
| CLIP Implementation | 25 | Correct contrastive loss, embeddings normalized |
| Zero-Shot Classification | 20 | Accuracy > 60%, reasonable prompts |
| Image-Text Retrieval | 20 | R@5 > 70%, efficient implementation |
| Text-to-Image | 15 | Generated images match prompts |
| Analysis | 10 | Thoughtful discussion of results |
| Code Quality | 10 | Clean, commented, efficient |
| Total | 100 | |
Bonus Points (+10 each):
Explore Pre-Trained Models:
Read Research Papers:
Prepare for Activity 18:
Skills from this activity are directly applicable to:
Solution: Check temperature parameter (should be ~0.07) and ensure embeddings are normalized.
<30%)"Solution: Improve prompt templates. Try "a photo of a {class}" instead of just "{class}".
Solution: Reduce batch size or use gradient accumulation:
# Instead of batch_size=128
batch_size = 32
accumulation_steps = 4 # Effective batch size: 128
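A minimal accumulation-loop sketch, assuming a `dataloader`, a `compute_loss` helper, and an `optimizer` you have already defined:
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = compute_loss(batch) / accumulation_steps  # scale so accumulated gradients average out
    loss.backward()                                  # gradients add up across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one update per effective batch of 128
        optimizer.zero_grad()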
Solution: Increase guidance scale (try 10.0-15.0) or use more descriptive prompts.
Solution: Verify embeddings are normalized and similarity computation is correct.
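A quick sanity-check sketch, assuming the precomputed `image_embeddings` and `text_embeddings` from the retrieval task (matching pairs share the same index):
import torch

# Embedding norms should all be ~1.0 after F.normalize
norms = image_embeddings.norm(dim=-1)
print("min/max image-embedding norm:", norms.min().item(), norms.max().item())

# For matching pairs, the diagonal of the similarity matrix should be clearly
# larger than the average off-diagonal (mismatched) similarity
sims = image_embeddings @ text_embeddings.T
diag = sims.diag().mean().item()
off_diag = (sims.sum() - sims.diag().sum()).item() / (sims.numel() - sims.size(0))
print(f"matched pairs: {diag:.3f}  vs  mismatched: {off_diag:.3f}")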
Your submission will be evaluated on:
Passing Criteria: Score >= 70/100 and all success criteria met.
After completing the activity, reflect on:
What makes contrastive learning effective? Why does pulling matching pairs together and pushing non-matching pairs apart work?
How does CLIP achieve zero-shot classification? Why can it handle categories it's never seen?
What are the limitations of multi-modal AI? When does CLIP fail?
How does CLIP enable text-to-image generation? Trace the path from text prompt to final image.
What are the ethical implications? Consider deepfakes, bias in generated images, and misuse potential.
How would you improve multi-modal models? Think about compositionality, fine-grained understanding, and efficiency.
Congratulations! By completing this activity, you've built the core technology behind Google Lens, DALL-E, and GPT-4 Vision. You now understand how AI systems connect vision and language - one of the most exciting frontiers in modern AI.
Next Activity: Activity 18 - AI Agent with Safety ->