ℹ️ Definition: Multi-modal AI systems process and generate information across multiple modalities (vision, language, audio) by learning joint representations. These models understand relationships between different data types, enabling applications like image captioning, visual question answering, and text-to-image generation.
By the end of this lesson, you will be able to:
For the past 16 lessons, we've worked primarily with single-modality data:
The Next Frontier: What if AI could understand the relationship between images AND text?
Real-World Examples:
The Core Challenge: Vision and language are fundamentally different:
| Property | Vision | Language |
|---|---|---|
| Structure | 2D/3D spatial (pixels) | 1D sequential (words) |
| Size | 224×224×3 ≈ 150K continuous values | ~50 tokens = ~50 discrete IDs |
| Semantics | Continuous (colors, shapes) | Discrete (words, grammar) |
| Ambiguity | One image, many descriptions | One description, many images |
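To make the size and structure contrast concrete, here is a small sketch (using a BERT tokenizer purely as an example; any tokenizer would do):

import torch
from transformers import BertTokenizer

# An RGB image: a dense 2-D grid of continuous values
image = torch.rand(3, 224, 224)
print(image.numel())  # 150528 values (~150K)

# A sentence: a short 1-D sequence of discrete token IDs
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
token_ids = tokenizer("a dog playing fetch in a park")["input_ids"]
print(len(token_ids), token_ids)  # roughly a dozen discrete IDs for this short sentence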
Solution: Learn a joint embedding space where vision and language align.
Goal: Map images and text to the same vector space, where semantically similar items are close together.
Intuition:
Mathematical Formulation:
f: Images → R^d (image encoder)
g: Text → R^d (text encoder)
Where d is the embedding dimension (e.g., 512 or 1024).
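In code, f and g are simply modules that output d-dimensional vectors. The toy encoders below are stand-ins (not real CLIP components), included only to show the shapes involved:

import torch
import torch.nn as nn
import torch.nn.functional as F

d = 512  # shared embedding dimension

# Stand-in encoders: any backbone ending in a projection to R^d would do
f = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224 * 3, d))  # f: Images -> R^d
g = nn.EmbeddingBag(30000, d)                                 # g: Token IDs -> R^d (mean-pooled)

image = torch.rand(1, 3, 224, 224)
tokens = torch.randint(0, 30000, (1, 12))

# Semantic similarity = cosine similarity in the shared space
print(F.cosine_similarity(f(image), g(tokens)))  # one similarity score per pair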
Training Objective: Given a batch of N (image, text) pairs, maximize the similarity of the N matching pairs while minimizing the similarity of all mismatched pairs in the batch.
Contrastive Loss (InfoNCE):
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """
    image_embeddings: [batch_size, d]
    text_embeddings:  [batch_size, d]
    """
    # Normalize embeddings so the dot product equals cosine similarity
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # Compute similarity matrix: [batch_size, batch_size]
    logits = torch.matmul(image_embeddings, text_embeddings.T) / temperature

    # Labels: diagonal elements are the positive pairs
    labels = torch.arange(len(logits), device=logits.device)

    # Symmetric cross-entropy loss (image-to-text and text-to-image)
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.T, labels)
    return (loss_i2t + loss_t2i) / 2
Why This Works: Each matching (image, text) pair in the batch is pulled together in the embedding space, while the other captions in the same batch act as negatives for each image (and vice versa), pushing mismatched pairs apart.
Visualization:
Before Training:
Image: [dog photo] → [0.1, 0.2, 0.3, ...]
Text: "a dog" → [0.5, 0.1, 0.8, ...]
Similarity: 0.23 (random)
After Training:
Image: [dog photo] → [0.7, 0.3, 0.1, ...]
Text: "a dog" → [0.6, 0.4, 0.2, ...]
Similarity: 0.91 (high!)
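The similarity score is simply the cosine between the two embeddings. A quick sketch using the (truncated, illustrative) 3-D vectors from the example above:

import torch
import torch.nn.functional as F

image_emb = torch.tensor([[0.7, 0.3, 0.1]])  # first three components shown above
text_emb = torch.tensor([[0.6, 0.4, 0.2]])

# Cosine similarity = dot product of the L2-normalized vectors
similarity = F.cosine_similarity(image_emb, text_emb).item()
print(f"{similarity:.2f}")  # high for a matching pair (~0.97 for these truncated vectors)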
Paper: "Learning Transferable Visual Models From Natural Language Supervision" (Radford et al., 2021)
Big Idea: Train vision and text encoders jointly on 400 million (image, caption) pairs from the internet using contrastive learning.
┌─────────────────────┐
│ Image Encoder │
│ (ViT or ResNet) │
│ 224×224×3 │
│ ↓ │
│ [0.7, 0.3, ...] │ ← Image embedding (512-D)
└─────────────────────┘
┌─────────────────────┐
│ Text Encoder │
│ (Transformer) │
│ "a dog playing" │
│ ↓ │
│ [0.6, 0.4, ...] │ ← Text embedding (512-D)
└─────────────────────┘
Contrastive Loss: Pull matching pairs together

Option 1: ResNet (CNN-based)
import torch.nn as nn
from torchvision.models import resnet50

class CLIPImageEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.resnet = resnet50(pretrained=False)
        # Replace the final classification layer with a projection to embed_dim
        self.resnet.fc = nn.Linear(2048, embed_dim)

    def forward(self, images):
        # images: [batch, 3, 224, 224]
        embeddings = self.resnet(images)  # [batch, embed_dim]
        return embeddings
Option 2: Vision Transformer (ViT)
from transformers import ViTModel

class CLIPImageEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
        self.projection = nn.Linear(768, embed_dim)  # ViT hidden size → embed_dim

    def forward(self, images):
        # images: [batch, 3, 224, 224]
        outputs = self.vit(pixel_values=images)
        # Take the [CLS] token embedding
        cls_embedding = outputs.last_hidden_state[:, 0, :]  # [batch, 768]
        embeddings = self.projection(cls_embedding)  # [batch, embed_dim]
        return embeddings
from transformers import BertModel

class CLIPTextEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.projection = nn.Linear(768, embed_dim)

    def forward(self, input_ids, attention_mask):
        # input_ids: [batch, seq_len]
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Take the [CLS] token embedding
        cls_embedding = outputs.last_hidden_state[:, 0, :]  # [batch, 768]
        embeddings = self.projection(cls_embedding)  # [batch, embed_dim]
        return embeddings
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIP(nn.Module):
    def __init__(self, embed_dim=512, temperature=0.07):
        super().__init__()
        self.image_encoder = CLIPImageEncoder(embed_dim)
        self.text_encoder = CLIPTextEncoder(embed_dim)
        self.temperature = temperature

    def forward(self, images, input_ids, attention_mask):
        # Encode images and text
        image_embeddings = self.image_encoder(images)  # [batch, embed_dim]
        text_embeddings = self.text_encoder(input_ids, attention_mask)  # [batch, embed_dim]

        # Normalize embeddings (important for contrastive learning)
        image_embeddings = F.normalize(image_embeddings, dim=-1)
        text_embeddings = F.normalize(text_embeddings, dim=-1)

        # Compute similarity matrix
        logits = torch.matmul(image_embeddings, text_embeddings.T) / self.temperature
        return logits, image_embeddings, text_embeddings

    def compute_loss(self, logits):
        # Labels: diagonal elements are the positive pairs
        labels = torch.arange(len(logits), device=logits.device)
        # Symmetric loss (image-to-text + text-to-image)
        loss_i2t = F.cross_entropy(logits, labels)
        loss_t2i = F.cross_entropy(logits.T, labels)
        return (loss_i2t + loss_t2i) / 2

# Training loop
model = CLIP(embed_dim=512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for epoch in range(10):
    for images, input_ids, attention_mask in dataloader:
        # Forward pass
        logits, img_emb, txt_emb = model(images, input_ids, attention_mask)

        # Compute loss
        loss = model.compute_loss(logits)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
OpenAI's CLIP was trained on roughly 400 million (image, caption) pairs collected from the public internet — no manual class labels, just naturally occurring captions.
Example Pairs:
Image: [Photo of golden retriever]
Text: "A golden retriever playing fetch in a park"
Image: [Sunset over mountains]
Text: "Beautiful sunset with orange sky over mountain range"
Image: [Person cooking]
Text: "Chef preparing pasta in a kitchen"
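To make the training loop above concrete, here is a minimal sketch of a dataset that yields the (images, input_ids, attention_mask) batches it consumes; the file paths and the use of a BERT tokenizer are illustrative assumptions, not part of CLIP itself:

import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from transformers import BertTokenizer

class ImageCaptionDataset(Dataset):
    """Yields (image, input_ids, attention_mask) triples for the CLIP training loop above."""

    def __init__(self, pairs, max_length=50):
        # pairs: list of (image_path, caption) tuples -- a stand-in for a web-scale dataset
        self.pairs = pairs
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])
        self.max_length = max_length

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        image_path, caption = self.pairs[idx]
        image = self.transform(Image.open(image_path).convert("RGB"))
        encoded = self.tokenizer(
            caption,
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
        return image, encoded["input_ids"].squeeze(0), encoded["attention_mask"].squeeze(0)

# Example usage with a hypothetical image file
pairs = [("dog.jpg", "A golden retriever playing fetch in a park")]
dataloader = DataLoader(ImageCaptionDataset(pairs), batch_size=32, shuffle=True)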
The Magic: Once trained, CLIP can classify images into categories it's never seen before!
Traditional Supervised Learning: the classifier can only output the fixed set of labels it was trained on; adding a class means collecting labeled data and retraining.
CLIP Zero-Shot: embed the image and embed a text description of each candidate class, then pick the class whose text embedding is most similar to the image embedding.
Example:
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

# Load pre-trained CLIP
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load image
image = Image.open("mystery_animal.jpg")

# Define candidate categories (can be anything!)
texts = [
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a bird",
    "a photo of a fish",
]

# Process inputs
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

# Compute similarities
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # [1, 4] - image-text similarity scores
probs = logits_per_image.softmax(dim=1)      # Convert to probabilities

# Get prediction
pred_idx = probs.argmax().item()
print(f"Prediction: {texts[pred_idx]}")
print(f"Confidence: {probs[0][pred_idx].item():.2%}")
Output:
Prediction: a photo of a dog
Confidence: 87.32%
Before CLIP: To add a new category to an image classifier, you needed to collect and label many example images for that category and then retrain (or fine-tune) the model.
With CLIP: To add a new category, you just add one more text description to the candidate list; no new labels and no retraining, as the sketch below shows.
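As a small sketch (the label names and the prompt template are illustrative), extending the zero-shot classifier from the previous example to brand-new categories is just a matter of editing a Python list:

from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Adding a category = adding a string; no labeled data, no retraining
labels = ["golden retriever", "siamese cat", "capybara", "red panda"]
texts = [f"a photo of a {label}" for label in labels]  # simple prompt template

image = Image.open("mystery_animal.jpg")  # same example image as above
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)

print(labels[probs.argmax().item()], f"{probs.max().item():.2%}")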
Applications:
Connection to Lesson 13: We learned diffusion models denoise random noise into images. But how do we condition the generation on text?
Answer: Use CLIP text embeddings to guide the diffusion process!
Text Prompt: "A cat wearing a space suit"
↓
Text Encoder (CLIP)
↓
Text Embedding [512-D]
↓
┌───────────────────────────────┐
│ U-Net Denoiser │
│ (conditioned on text) │
│ │
│ Noise → ... → Image Latent │
└───────────────────────────────┘
↓
VAE Decoder
↓
Final Image (512×512×3)
Key Idea: At each denoising step, ensure the image is moving toward the text description.
Classifier-Free Guidance (CFG):
def diffusion_step_with_guidance(x_t, t, text_embedding, guidance_scale=7.5):
    """
    x_t: Noisy image (latent) at timestep t
    t: Current timestep
    text_embedding: CLIP text embedding
    guidance_scale: How much to emphasize the text conditioning
    (`unet` and `denoise` are assumed to be defined elsewhere: the conditional
    noise-prediction network and the scheduler's update step, respectively.)
    """
    # Predict noise with text conditioning
    noise_cond = unet(x_t, t, context=text_embedding)

    # Predict noise without text conditioning (unconditional)
    noise_uncond = unet(x_t, t, context=None)

    # Classifier-free guidance: extrapolate toward the text-conditioned prediction
    noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)

    # Denoise one step
    x_t_minus_1 = denoise(x_t, noise_pred, t)
    return x_t_minus_1
Why CFG Works: The difference between the text-conditioned and unconditional noise predictions points in the direction that makes the sample agree more with the prompt; scaling that difference by guidance_scale > 1 amplifies text adherence (at some cost to diversity).
from diffusers import StableDiffusionPipeline
import torch

# Load Stable Diffusion model
model_id = "stabilityai/stable-diffusion-2-1"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate image from text
prompt = "A cat wearing a space suit, digital art"
image = pipe(
    prompt=prompt,
    height=512,
    width=512,
    num_inference_steps=50,  # More steps = higher quality
    guidance_scale=7.5,      # Text adherence
).images[0]
image.save("cat_astronaut.png")
How Text Influences Generation: the guidance scale controls how strongly each denoising step follows the CLIP text embedding, as the sketch below illustrates.
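A quick way to see this influence is to sweep the guidance scale for a fixed prompt and random seed, reusing the pipeline above (the specific scale values are illustrative):

import torch

# Low scale -> looser, more varied images; high scale -> stricter prompt adherence
for scale in [1.5, 7.5, 15.0]:
    generator = torch.Generator("cuda").manual_seed(0)  # fix the noise so only the scale changes
    image = pipe(prompt=prompt, guidance_scale=scale, generator=generator).images[0]
    image.save(f"cat_astronaut_scale_{scale}.png")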

Task: Given an image, generate a descriptive text caption.
Example:
Input Image: [Photo of children playing soccer in a park]
Output Caption: "A group of children playing soccer in a park on a sunny day"
┌─────────────────────┐
│ Image Encoder │ ← CNN or ViT (pre-trained)
│ (ResNet/ViT) │
│ ↓ │
│ Visual Features │ ← [batch, 49, 2048] (ResNet 7×7 spatial features)
└─────────────────────┘
↓
┌─────────────────────┐
│ Decoder │ ← Transformer decoder with cross-attention
│ (Transformer) │
│ │
│ Cross-Attention: │ ← Attend to visual features
│ Query: Word │
│ Key/Value: Image │
│ ↓ │
│ "A" → "group" → │ ← Autoregressive generation
│ "of" → "children" │
└─────────────────────┘
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Use ResNet without the final classification layers
        resnet = resnet50(pretrained=True)
        modules = list(resnet.children())[:-2]  # Remove avgpool and fc
        self.resnet = nn.Sequential(*modules)
        # Output: [batch, 2048, 7, 7] spatial features

    def forward(self, images):
        # images: [batch, 3, 224, 224]
        features = self.resnet(images)  # [batch, 2048, 7, 7]
        # Reshape to a sequence: [batch, 49, 2048]
        batch_size, channels, h, w = features.size()
        features = features.view(batch_size, channels, h * w).permute(0, 2, 1)
        return features  # [batch, 49, 2048]
class CaptioningDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, num_heads=8, num_layers=6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.pos_encoding = nn.Parameter(torch.randn(1, 100, embed_dim))  # Max length 100
        # Project 2048-D visual features to the decoder dimension so
        # cross-attention keys/values match d_model
        self.visual_projection = nn.Linear(2048, embed_dim)

        # Transformer decoder layers
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=2048,
        )
        self.transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)

        # Output projection
        self.fc_out = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens, visual_features, tgt_mask=None):
        # tokens: [batch, seq_len] - caption tokens
        # visual_features: [batch, 49, 2048] - image features

        # Embed tokens and add learned positional encodings
        embeddings = self.embedding(tokens) + self.pos_encoding[:, :tokens.size(1), :]

        # Project visual features to embed_dim for cross-attention
        memory = self.visual_projection(visual_features)  # [batch, 49, embed_dim]

        # Transformer decoder (with cross-attention to the visual features)
        output = self.transformer_decoder(
            tgt=embeddings.permute(1, 0, 2),  # [seq_len, batch, embed_dim]
            memory=memory.permute(1, 0, 2),   # [49, batch, embed_dim]
            tgt_mask=tgt_mask,  # Causal mask for autoregressive generation
        )

        # Project to vocabulary
        logits = self.fc_out(output.permute(1, 0, 2))  # [batch, seq_len, vocab_size]
        return logits
class ImageCaptioningModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.encoder = ImageEncoder()
        self.decoder = CaptioningDecoder(vocab_size)

    def forward(self, images, captions):
        # Encode image
        visual_features = self.encoder(images)  # [batch, 49, 2048]

        # Causal mask so each position only attends to earlier tokens
        seq_len = captions.size(1)
        tgt_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=captions.device),
            diagonal=1,
        )

        # Decode caption (teacher forcing during training)
        logits = self.decoder(captions, visual_features, tgt_mask=tgt_mask)
        return logits

    def generate_caption(self, image, tokenizer, max_length=50):
        # Encode image
        visual_features = self.encoder(image.unsqueeze(0))

        # Start with the <START> token
        caption = [tokenizer.start_token_id]

        # Greedy autoregressive generation
        for _ in range(max_length):
            tokens = torch.tensor([caption], device=image.device)
            logits = self.decoder(tokens, visual_features)
            # Get the next token
            next_token = logits[0, -1, :].argmax().item()
            caption.append(next_token)
            # Stop at the <END> token
            if next_token == tokenizer.end_token_id:
                break

        # Decode to text
        return tokenizer.decode(caption)
# Training loop
vocab_size = 10000
model = ImageCaptioningModel(vocab_size=vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for images, captions in dataloader:
    # Forward pass (teacher forcing: input is all but the last token)
    logits = model(images, captions[:, :-1])

    # Compute loss (predict the next token at every position)
    loss = F.cross_entropy(
        logits.reshape(-1, vocab_size),
        captions[:, 1:].reshape(-1)  # Target: all but the first token
    )

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
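Once trained, generating a caption for a single image might look like the sketch below; the image path is a placeholder, and `tokenizer` is assumed to expose the start_token_id, end_token_id, and decode used above:

from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
image = preprocess(Image.open("park_photo.jpg").convert("RGB"))  # hypothetical file

model.eval()
with torch.no_grad():
    caption = model.generate_caption(image, tokenizer)
print(caption)  # e.g., "a group of children playing soccer in a park"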

Task: Given an image and a question, generate an answer.
Example:
Image: [Photo of a kitchen with a red kettle on the stove]
Question: "What color is the kettle?"
Answer: "Red"
Image → Visual Encoder → Visual Features [2048-D]
↓
Question → Text Encoder → Question Features [512-D]
↓
┌─────────────────┐
│ Fusion Module │ ← Combine vision + language
│ (Multi-modal │
│ Attention) │
└─────────────────┘
↓
Answer Classifier [3129 classes]
Key Idea: Let the model attend to relevant image regions based on the question.
Example:
class CoAttentionVQA(nn.Module):
    def __init__(self, num_answers=3129):
        super().__init__()
        self.visual_encoder = ImageEncoder()  # ResNet (see the image captioning section)
        # Questions are assumed to arrive as word embeddings (e.g., 300-D GloVe vectors)
        self.text_encoder = nn.LSTM(input_size=300, hidden_size=512,
                                    num_layers=2, batch_first=True)

        # Co-attention layers
        self.visual_attention = nn.Linear(2048, 512)
        self.text_attention = nn.Linear(512, 512)

        # Answer classifier
        self.classifier = nn.Linear(512 + 2048, num_answers)

    def forward(self, images, questions):
        # Encode image: [batch, 49, 2048]
        visual_features = self.visual_encoder(images)

        # Encode question: questions is [batch, seq_len, 300]
        _, (question_features, _) = self.text_encoder(questions)
        question_features = question_features[-1]  # Last layer hidden state: [batch, 512]

        # Compute attention weights over image regions
        query = self.text_attention(question_features).unsqueeze(1)  # [batch, 1, 512]
        keys = self.visual_attention(visual_features)  # [batch, 49, 512]

        # Attention scores
        scores = torch.bmm(query, keys.permute(0, 2, 1))  # [batch, 1, 49]
        attention_weights = F.softmax(scores, dim=-1)

        # Weighted sum of visual features
        attended_visual = torch.bmm(attention_weights, visual_features).squeeze(1)  # [batch, 2048]

        # Fuse visual + question features
        fused = torch.cat([attended_visual, question_features], dim=-1)  # [batch, 2560]

        # Classify the answer
        logits = self.classifier(fused)  # [batch, 3129]
        return logits
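Training this model is ordinary multi-class classification over the answer vocabulary. A minimal sketch, assuming a hypothetical loader that yields (images, question_embeddings, answer_ids):

model = CoAttentionVQA(num_answers=3129)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for images, question_embeddings, answer_ids in vqa_dataloader:  # hypothetical loader
    logits = model(images, question_embeddings)  # [batch, 3129]
    loss = F.cross_entropy(logits, answer_ids)   # answer_ids: [batch] of answer-class indices

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()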
From Lesson 16: RLHF aligns language models with human preferences.
Extension: Use RLHF to align text-to-image models with human aesthetic preferences!
Stage 1: Train base diffusion model (Stable Diffusion)
Stage 2: Train aesthetic reward model
r = aesthetic_model(image, prompt)
Stage 3: Fine-tune diffusion model with RLHF
Example:
# Reward model
from transformers import CLIPModel

class AestheticRewardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
        self.mlp = nn.Sequential(
            nn.Linear(768, 256),
            nn.ReLU(),
            nn.Linear(256, 1)  # Scalar reward
        )

    def forward(self, images, prompts=None):
        # CLIP image features (768-D for ViT-L/14); the prompt could additionally be
        # encoded and used for an image-text alignment term
        features = self.clip.get_image_features(pixel_values=images)
        # MLP predicts an aesthetic score from the image features
        aesthetic_score = self.mlp(features)
        return aesthetic_score

# RLHF training loop (simplified; diffusion_model, aesthetic_model, ppo_trainer,
# and prompts are assumed to be set up elsewhere)
for prompt_batch in prompts:
    # Generate images with the current diffusion policy
    images = diffusion_model.generate(prompt_batch)

    # Get rewards
    rewards = aesthetic_model(images, prompt_batch)

    # PPO update
    ppo_trainer.step(prompts=prompt_batch, images=images, rewards=rewards)
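For completeness, the Stage 2 reward model would itself be fit to human ratings before the PPO loop above runs. A minimal sketch, assuming a hypothetical loader of (images, prompts, human_scores) triples:

reward_model = AestheticRewardModel()
optimizer = torch.optim.Adam(reward_model.mlp.parameters(), lr=1e-4)  # train only the head; keep CLIP frozen

for images, prompt_batch, human_scores in rating_dataloader:  # hypothetical human-rating loader
    predicted = reward_model(images, prompt_batch).squeeze(-1)  # [batch]
    loss = F.mse_loss(predicted, human_scores)  # regress onto human aesthetic scores

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()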
Results: Models learn to generate more aesthetically pleasing, higher-quality images.
| Application | Input | Output | Example |
|---|---|---|---|
| Image Search | Query image or text | Ranked matching images/products | Upload photo -> Find similar products |
| Visual QA | Image + Question | Answer | "How many people?" -> "3" |
| Image Captioning | Image | Description | Photo -> "A dog playing in a park" |
| Text-to-Image | Text | Image | "Sunset over ocean" -> Generated image |
| Image Editing | Image + Text | Edited image | "Make the sky blue" -> Modified image |
| Document Understanding | Document image | Extracted info | Invoice -> Total amount |
| Medical Imaging | X-ray + Query | Diagnosis | "Is there a fracture?" -> "Yes, left tibia" |
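As one concrete example from the table, image search with CLIP reduces to nearest-neighbor lookup in the joint embedding space. A minimal sketch, where the catalog descriptions and the query image path are illustrative placeholders:

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A tiny "product catalog" described by text (illustrative entries)
catalog = ["red running shoes", "leather office chair", "stainless steel kettle"]
text_inputs = processor(text=catalog, return_tensors="pt", padding=True)
text_emb = model.get_text_features(**text_inputs)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Embed the query photo and rank catalog entries by cosine similarity
image_inputs = processor(images=Image.open("query_photo.jpg"), return_tensors="pt")
img_emb = model.get_image_features(**image_inputs)
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

scores = (img_emb @ text_emb.T).squeeze(0)
best = scores.argmax().item()
print(f"Best match: {catalog[best]} (similarity {scores[best]:.3f})")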
| Advantage | Description |
|---|---|
| ✅ Zero-Shot Transfer | CLIP can classify new categories without training |
| ✅ Unified Representation | One model for multiple vision-language tasks |
| ✅ Rich Semantics | Learns high-level concepts, not just pixels |
| ✅ Scalability | Trained on web-scale data (millions of examples) |
| ✅ Interpretability | Can visualize attention to understand decisions |
| Limitation | Description | Mitigation |
|---|---|---|
| ⚠️ Compositionality | Struggles with complex relationships | Fine-tune on compositional datasets |
| ⚠️ Bias | Inherits biases from web data | Careful data curation, fairness audits |
| ⚠️ Hallucination | May generate false captions | Reward models for factuality |
| ⚠️ Compute Cost | Large models require GPUs | Model distillation, efficient architectures |
| ⚠️ Fine-Grained Details | Misses small objects or subtle differences | Higher-resolution encoders, attention mechanisms |
| Lesson | Connection to Multi-Modal AI |
|---|---|
| Lesson 13 (Diffusion) | Stable Diffusion uses CLIP text embeddings to condition image generation |
| Lesson 14 (Transformers) | Image captioning decoder uses Transformer architecture |
| Lesson 15 (LLMs) | Text encoders in CLIP are based on BERT/GPT-style transformers |
| Lesson 16 (RLHF) | Multi-modal RLHF aligns text-to-image models with human preferences |
Joint Embedding Space: Multi-modal AI maps images and text to the same vector space using contrastive learning.
CLIP Architecture: Dual encoders (image + text) trained with InfoNCE loss on 400M pairs.
Zero-Shot Classification: CLIP can classify images into new categories by comparing with text descriptions.
Text-to-Image: Stable Diffusion uses CLIP text embeddings to guide denoising diffusion.
Image Captioning: Encoder-decoder with cross-attention generates descriptions from images.
Visual QA: Fusion modules (co-attention) combine visual and textual information for question answering.
Multi-Modal RLHF: Extends RLHF to vision domains for aesthetic and quality alignment.
Cross-Attention: Key mechanism that connects visual features with text queries.
Applications: Image search, VQA, captioning, text-to-image, document understanding.
Challenges: Compositionality, bias, hallucination, compute cost.
In Lesson 18 (Future of AI), we'll explore:
Preview: Imagine an AI agent that can:
This is the future we're building toward!
Multi-modal AI bridges vision and language through joint embedding spaces learned with contrastive learning.
CLIP (Contrastive Language-Image Pre-training):
Applications:
Key Techniques:
Impact: Powers Google Lens, DALL-E, GPT-4 Vision, and countless other multi-modal applications.
Next: Lesson 18 - The Future of AI ->