Practice and reinforce the concepts from Lesson 17
In this activity, you'll build a complete multi-modal AI system that connects vision and language. You'll implement CLIP from scratch, perform zero-shot image classification, create an image-text retrieval system, and generate images from text using Stable Diffusion. This hands-on experience will show you how AI systems like Google Lens and DALL-E actually work.
What makes this special: You're building the core technology behind modern multi-modal AI systems - the same techniques used in GPT-4 Vision, Google's Gemini, and image search engines.
Time Required: 90-120 minutes
By completing this activity, you will be able to:
Implement CLIP's contrastive learning objective and train it on image-caption pairs
Perform zero-shot image classification using text prompts
Build an image-text retrieval system on shared embeddings
Generate images from text prompts with Stable Diffusion
Before starting this activity, you should:
1. Download the template from the course Templates folder: `AI25-Template-activity-17-multi-modal.zip`
2. Upload the `.ipynb` file to Google Colab
3. Run the first cell to verify your environment and see a working demo
This activity provides a 65-70% working implementation. You'll complete the missing pieces:
Example Output:
Epoch 1: Loss = 5.87, Accuracy = 23.4%
Epoch 5: Loss = 3.12, Accuracy = 61.2%
Epoch 10: Loss = 2.48, Accuracy = 78.5%
Similarity Scores:
Image: [dog photo] + Text: "a dog" = 0.87 ✓
Image: [dog photo] + Text: "a cat" = 0.23 ✗
Example Output:
Zero-Shot CIFAR-10 Results:
Accuracy: 67.3%
Per-Class Performance:
Airplane: 72% ✓
Automobile: 81% ✓
Bird: 58% (confused with airplane)
Cat: 64%
Deer: 69%
Dog: 71%
Frog: 75%
Horse: 66%
Ship: 78%
Truck: 83%
Example Output:
Query: "a person riding a bicycle"
Top 5 Results:
1. [Image: cyclist on road] - Similarity: 0.92
2. [Image: bike in park] - Similarity: 0.88
3. [Image: mountain biking] - Similarity: 0.85
4. [Image: person with bike] - Similarity: 0.81
5. [Image: bicycle race] - Similarity: 0.78
Example Output:
Prompt: "A cozy cabin in the snowy mountains at sunset, digital art"
Guidance Scale = 3.0: Creative but loose interpretation
Guidance Scale = 7.5: Balanced (most realistic)
Guidance Scale = 15.0: Highly detailed but may be over-saturated
Your implementation is successful when:
Start with Pre-Trained: Use pre-trained CLIP from HuggingFace for faster results (see the loading sketch after these tips). Train from scratch only if you have time.
Batch Size: Contrastive learning benefits from large batches (32-128); use a smaller batch if GPU memory is limited.
Temperature Tuning: The temperature parameter (τ = 0.07) in contrastive loss is critical. Too high -> the similarity distribution becomes too soft and the model barely learns; too low -> training becomes unstable and fixates on the hardest negatives.
Normalization: Always normalize embeddings before computing similarity (crucial for contrastive learning).
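If you take the pre-trained route, a minimal loading sketch with the HuggingFace `transformers` library (assuming the `openai/clip-vit-base-patch32` checkpoint and a hypothetical local image file) looks like this:
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pre-trained CLIP checkpoint (assumption: openai/clip-vit-base-patch32)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Score one image against two candidate captions
image = Image.open("dog.jpg")  # hypothetical local file
texts = ["a photo of a dog", "a photo of a cat"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarities (scaled by CLIP's learned
# temperature); softmax turns them into match probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))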
Hint: InfoNCE loss is symmetric - compute loss for both image->text and text->image directions.
Common Mistakes:
Forgetting to normalize embeddings before computing similarity (use `F.normalize(embedding, dim=-1)`)
Debug Checklist:
Check that embedding norms are close to 1 (inspect `torch.norm(embedding, dim=-1)`)
Code Pattern:
import torch
import torch.nn.functional as F

# Normalize embeddings so cosine similarity is a plain dot product
img_emb = F.normalize(img_emb, dim=-1)
txt_emb = F.normalize(txt_emb, dim=-1)

# Compute similarity matrix, scaled by the temperature
logits = torch.matmul(img_emb, txt_emb.T) / temperature  # [batch, batch]

# Diagonal elements are the positive (matching) pairs
labels = torch.arange(logits.size(0), device=logits.device)

# Symmetric InfoNCE loss: image->text and text->image
loss_i2t = F.cross_entropy(logits, labels)
loss_t2i = F.cross_entropy(logits.T, labels)
loss = (loss_i2t + loss_t2i) / 2
Hint: Create text prompts for each class (e.g., "a photo of a dog", "a photo of a cat").
Common Mistakes:
Debug Checklist:
Prompt Engineering:
# Good prompts
templates = [
    "a photo of a {}",
    "a picture of a {}",
    "a {} in a natural setting",
]

# Average embeddings over templates for each class
for cls in classes:
    cls_embeddings = []
    for template in templates:
        text = template.format(cls)
        embedding = text_encoder(text)
        cls_embeddings.append(embedding)
    final_embedding = torch.stack(cls_embeddings).mean(dim=0)
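Building on the prompt-ensembling pattern above, zero-shot classification is just a cosine similarity between each image embedding and the per-class prompt embeddings. A minimal sketch, assuming an `image_encoder`, a batch of test `images` with `labels`, and a `class_embeddings_list` that collects the `final_embedding` of every class:
import torch
import torch.nn.functional as F

# One averaged prompt embedding per class: [num_classes, d]
class_emb = F.normalize(torch.stack(class_embeddings_list), dim=-1)

# Encode and normalize a batch of test images: [batch, d]
img_emb = F.normalize(image_encoder(images), dim=-1)

# Cosine similarity between every image and every class prompt
similarity = img_emb @ class_emb.T  # [batch, num_classes]

# Zero-shot prediction = the class whose prompt embedding is closest
predictions = similarity.argmax(dim=-1)
accuracy = (predictions == labels).float().mean()
print(f"Zero-shot accuracy: {accuracy:.1%}")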
Hint: For image->text retrieval, find the text with highest similarity to the query image.
Common Mistakes:
Debug Checklist:
Efficient Retrieval:
# Precompute all embeddings
image_embeddings = encode_images(all_images) # [N, d]
text_embeddings = encode_texts(all_captions) # [N, d]
# Query: text → images
query_text_emb = encode_text(query) # [1, d]
similarities = torch.matmul(query_text_emb, image_embeddings.T) # [1, N]
top_k_indices = similarities.topk(k=5).indices
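The same precomputed embeddings also support the reverse, image->text direction from the hint. A minimal sketch, assuming a hypothetical `encode_image` helper and a `query_image`:
import torch

# Query: image -> captions (reverse of the snippet above)
query_img_emb = encode_image(query_image)  # [1, d], assumed normalized
similarities = torch.matmul(query_img_emb, text_embeddings.T)  # [1, N]
top_k = similarities.topk(k=5)

for score, idx in zip(top_k.values[0], top_k.indices[0]):
    print(f"{score:.2f}  {all_captions[idx]}")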
Hint: Use the StableDiffusionPipeline from diffusers library. Experiment with guidance_scale (7.5 is typical).
Common Mistakes:
Debug Checklist:
Hyperparameter Exploration:
guidance_scales = [1.0, 5.0, 7.5, 10.0, 15.0]
inference_steps = [20, 50, 100]

for gs in guidance_scales:
    for steps in inference_steps:
        image = pipe(
            prompt=prompt,
            guidance_scale=gs,
            num_inference_steps=steps,
        ).images[0]
        # Compare visual quality across settings
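The loop above assumes a `pipe` object already exists. A minimal loading sketch with the `diffusers` library (assuming the `runwayml/stable-diffusion-v1-5` checkpoint and a CUDA GPU, as on Colab):
import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline in half precision so it fits comfortably on a Colab GPU
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumption: any SD 1.x checkpoint works similarly
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "A cozy cabin in the snowy mountains at sunset, digital art"
image = pipe(prompt, guidance_scale=7.5, num_inference_steps=50).images[0]
image.save("cabin.png")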
Task: Use t-SNE or UMAP to visualize the learned image and text embeddings in 2D.
Expected Outcome: Plot showing clusters of semantically similar items (e.g., all "dog" images and "dog" text near each other).
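A minimal t-SNE sketch, assuming you already have `image_embeddings` and `text_embeddings` as PyTorch tensors plus matching integer class `labels` for coloring:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stack image and text embeddings into one matrix
emb = np.concatenate([
    image_embeddings.cpu().numpy(),
    text_embeddings.cpu().numpy(),
])

# Project to 2D (perplexity ~30 is a reasonable default for a few hundred points)
coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(emb)

# Images as circles, text as crosses; matching classes should cluster together
n = len(image_embeddings)
plt.scatter(coords[:n, 0], coords[:n, 1], c=labels, marker="o", label="images")
plt.scatter(coords[n:, 0], coords[n:, 1], c=labels, marker="x", label="text")
plt.legend()
plt.title("Joint image-text embedding space (t-SNE)")
plt.show()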
Task: Build an encoder-decoder model that generates captions for images.
Approach:
Expected Outcome: Given an image, generate a descriptive caption (e.g., "A dog playing fetch in a park").
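Before building your own encoder-decoder, it can help to see the target behavior with an off-the-shelf captioning model. A minimal inference sketch using BLIP from the `transformers` library (assuming the `Salesforce/blip-image-captioning-base` checkpoint and a hypothetical local image):
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("dog_park.jpg")  # hypothetical local file
inputs = processor(images=image, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))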
Task: Create a model that answers questions about images.
Steps:
Expected Outcome: Model answers questions like "What color is the car?" -> "Red".
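As with captioning, a pre-trained model is a useful baseline before training your own. A minimal VQA sketch (assuming the `Salesforce/blip-vqa-base` checkpoint and a hypothetical local image):
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("car.jpg")  # hypothetical local file
question = "What color is the car?"
inputs = processor(images=image, text=question, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))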
Task: Train CLIP on a domain-specific dataset (e.g., medical images + radiology reports).
Challenges:
Expected Outcome: Custom CLIP model that outperforms general-purpose CLIP on your domain.
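A minimal fine-tuning sketch using the HuggingFace `CLIPModel`, which can compute its own symmetric contrastive loss via `return_loss=True`; the domain-specific `dataloader` yielding lists of PIL images and matching caption strings is an assumption:
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR to avoid catastrophic forgetting

model.train()
for epoch in range(3):
    for images, captions in dataloader:  # assumption: your domain-specific DataLoader
        inputs = processor(text=captions, images=images, return_tensors="pt",
                           padding=True, truncation=True).to(device)
        outputs = model(**inputs, return_loss=True)  # symmetric image<->text contrastive loss
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()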
Completed Jupyter Notebook (.ipynb file)
Results Gallery (Markdown cells in notebook)
Analysis Report (Markdown cell)
| Criterion | Points | Description |
|---|---|---|
| CLIP Implementation | 25 | Correct contrastive loss, embeddings normalized |
| Zero-Shot Classification | 20 | Accuracy > 60%, reasonable prompts |
| Image-Text Retrieval | 20 | R@5 > 70%, efficient implementation |
| Text-to-Image | 15 | Generated images match prompts |
| Analysis | 10 | Thoughtful discussion of results |
| Code Quality | 10 | Clean, commented, efficient |
| Total | 100 | |
Bonus Points (+10 each):
Explore Pre-Trained Models:
Read Research Papers:
Prepare for Activity 18:
Skills from this activity are directly applicable to:
Solution: Check temperature parameter (should be ~0.07) and ensure embeddings are normalized.
<30%)"Solution: Improve prompt templates. Try "a photo of a {class}" instead of just "{class}".
Solution: Reduce batch size or use gradient accumulation:
# Instead of batch_size=128
batch_size = 32
accumulation_steps = 4 # Effective batch size: 128
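A minimal accumulation-loop sketch, assuming a `dataloader`, a `compute_loss` helper, and an `optimizer` you have already defined:
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = compute_loss(batch) / accumulation_steps  # scale so accumulated gradients average out
    loss.backward()                                  # gradients add up across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one update per effective batch of 128
        optimizer.zero_grad()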
Solution: Increase guidance scale (try 10.0-15.0) or use more descriptive prompts.
Solution: Verify embeddings are normalized and similarity computation is correct.
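A quick sanity-check sketch, assuming the precomputed `image_embeddings` and `text_embeddings` from the retrieval task (matching pairs share the same index):
import torch

# Embedding norms should all be ~1.0 after F.normalize
norms = image_embeddings.norm(dim=-1)
print("min/max image-embedding norm:", norms.min().item(), norms.max().item())

# For matching pairs, the diagonal of the similarity matrix should be clearly
# larger than the average off-diagonal (mismatched) similarity
sims = image_embeddings @ text_embeddings.T
diag = sims.diag().mean().item()
off_diag = (sims.sum() - sims.diag().sum()).item() / (sims.numel() - sims.size(0))
print(f"matched pairs: {diag:.3f}  vs  mismatched: {off_diag:.3f}")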
Your submission will be evaluated on:
Passing Criteria: Score >= 70/100 and all success criteria met.
After completing the activity, reflect on:
What makes contrastive learning effective? Why does pulling matching pairs together and pushing non-matching pairs apart work?
How does CLIP achieve zero-shot classification? Why can it handle categories it's never seen?
What are the limitations of multi-modal AI? When does CLIP fail?
How does CLIP enable text-to-image generation? Trace the path from text prompt to final image.
What are the ethical implications? Consider deepfakes, bias in generated images, and misuse potential.
How would you improve multi-modal models? Think about compositionality, fine-grained understanding, and efficiency.
Congratulations! By completing this activity, you've built the core technology behind Google Lens, DALL-E, and GPT-4 Vision. You now understand how AI systems connect vision and language - one of the most exciting frontiers in modern AI.
Next Activity: Activity 18 - AI Agent with Safety ->