ℹ️ Definition: Multi-modal AI systems process and generate information across multiple modalities (vision, language, audio) by learning joint representations. These models understand relationships between different data types, enabling applications like image captioning, visual question answering, and text-to-image generation.
By the end of this lesson, you will be able to:
For the past 16 lessons, we've worked primarily with single-modality data:
The Next Frontier: What if AI could understand the relationship between images AND text?
Real-World Examples:
The Core Challenge: Vision and language are fundamentally different:
| Property | Vision | Language |
|---|---|---|
| Structure | 2D/3D spatial (pixels) | 1D sequential (words) |
| Size | 224×224×3 ≈ 150K continuous values | ~50 tokens = ~50 discrete IDs |
| Semantics | Continuous (colors, shapes) | Discrete (words, grammar) |
| Ambiguity | One image, many descriptions | One description, many images |
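To make the size and structure contrast concrete, here is a small sketch (using a BERT tokenizer purely as an example; any tokenizer would do):

import torch
from transformers import BertTokenizer

# An RGB image: a dense 2-D grid of continuous values
image = torch.rand(3, 224, 224)
print(image.numel())  # 150528 values (~150K)

# A sentence: a short 1-D sequence of discrete token IDs
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
token_ids = tokenizer("a dog playing fetch in a park")["input_ids"]
print(len(token_ids), token_ids)  # roughly a dozen discrete IDs for this short sentence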
Solution: Learn a joint embedding space where vision and language align.
Goal: Map images and text to the same vector space, where semantically similar items are close together.
Intuition:
Mathematical Formulation:
f: Images → R^d (image encoder)
g: Text → R^d (text encoder)
Where d is the embedding dimension (e.g., 512 or 1024).
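In code, f and g are simply modules that output d-dimensional vectors. The toy encoders below are stand-ins (not real CLIP components), included only to show the shapes involved:

import torch
import torch.nn as nn
import torch.nn.functional as F

d = 512  # shared embedding dimension

# Stand-in encoders: any backbone ending in a projection to R^d would do
f = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224 * 3, d))  # f: Images -> R^d
g = nn.EmbeddingBag(30000, d)                                 # g: Token IDs -> R^d (mean-pooled)

image = torch.rand(1, 3, 224, 224)
tokens = torch.randint(0, 30000, (1, 12))

# Semantic similarity = cosine similarity in the shared space
print(F.cosine_similarity(f(image), g(tokens)))  # one similarity score per pair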
Training Objective: Given a batch of N (image, text) pairs, maximize the similarity of the N matching pairs while minimizing the similarity of all mismatched pairs in the batch.
Contrastive Loss (InfoNCE):
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """
    image_embeddings: [batch_size, d]
    text_embeddings:  [batch_size, d]
    """
    # Normalize embeddings so the dot product equals cosine similarity
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # Compute similarity matrix: [batch_size, batch_size]
    logits = torch.matmul(image_embeddings, text_embeddings.T) / temperature

    # Labels: diagonal elements are the positive pairs
    labels = torch.arange(len(logits), device=logits.device)

    # Symmetric cross-entropy loss (image-to-text and text-to-image)
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.T, labels)
    return (loss_i2t + loss_t2i) / 2
Why This Works: Each matching (image, text) pair in the batch is pulled together in the embedding space, while the other captions in the same batch act as negatives for each image (and vice versa), pushing mismatched pairs apart.
Visualization:
Before Training:
Image: [dog photo] → [0.1, 0.2, 0.3, ...]
Text: "a dog" → [0.5, 0.1, 0.8, ...]
Similarity: 0.23 (random)
After Training:
Image: [dog photo] → [0.7, 0.3, 0.1, ...]
Text: "a dog" → [0.6, 0.4, 0.2, ...]
Similarity: 0.91 (high!)
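The similarity score is simply the cosine between the two embeddings. A quick sketch using the (truncated, illustrative) 3-D vectors from the example above:

import torch
import torch.nn.functional as F

image_emb = torch.tensor([[0.7, 0.3, 0.1]])  # first three components shown above
text_emb = torch.tensor([[0.6, 0.4, 0.2]])

# Cosine similarity = dot product of the L2-normalized vectors
similarity = F.cosine_similarity(image_emb, text_emb).item()
print(f"{similarity:.2f}")  # high for a matching pair (~0.97 for these truncated vectors)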
Paper: "Learning Transferable Visual Models From Natural Language Supervision" (Radford et al., 2021)
Big Idea: Train vision and text encoders jointly on 400 million (image, caption) pairs from the internet using contrastive learning.
┌─────────────────────┐
│ Image Encoder │
│ (ViT or ResNet) │
│ 224×224×3 │
│ ↓ │
│ [0.7, 0.3, ...] │ ← Image embedding (512-D)
└─────────────────────┘
┌─────────────────────┐
│ Text Encoder │
│ (Transformer) │
│ "a dog playing" │
│ ↓ │
│ [0.6, 0.4, ...] │ ← Text embedding (512-D)
└─────────────────────┘
Contrastive Loss: Pull matching pairs together

Option 1: ResNet (CNN-based)
import torch.nn as nn
from torchvision.models import resnet50

class CLIPImageEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.resnet = resnet50(pretrained=False)
        # Replace the final classification layer with a projection to embed_dim
        self.resnet.fc = nn.Linear(2048, embed_dim)

    def forward(self, images):
        # images: [batch, 3, 224, 224]
        embeddings = self.resnet(images)  # [batch, embed_dim]
        return embeddings
Option 2: Vision Transformer (ViT)
from transformers import ViTModel

class CLIPImageEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
        self.projection = nn.Linear(768, embed_dim)  # ViT hidden size → embed_dim

    def forward(self, images):
        # images: [batch, 3, 224, 224]
        outputs = self.vit(pixel_values=images)
        # Take the [CLS] token embedding
        cls_embedding = outputs.last_hidden_state[:, 0, :]  # [batch, 768]
        embeddings = self.projection(cls_embedding)  # [batch, embed_dim]
        return embeddings
from transformers import BertModel

class CLIPTextEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.projection = nn.Linear(768, embed_dim)

    def forward(self, input_ids, attention_mask):
        # input_ids: [batch, seq_len]
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Take the [CLS] token embedding
        cls_embedding = outputs.last_hidden_state[:, 0, :]  # [batch, 768]
        embeddings = self.projection(cls_embedding)  # [batch, embed_dim]
        return embeddings
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIP(nn.Module):
    def __init__(self, embed_dim=512, temperature=0.07):
        super().__init__()
        self.image_encoder = CLIPImageEncoder(embed_dim)
        self.text_encoder = CLIPTextEncoder(embed_dim)
        self.temperature = temperature

    def forward(self, images, input_ids, attention_mask):
        # Encode images and text
        image_embeddings = self.image_encoder(images)  # [batch, embed_dim]
        text_embeddings = self.text_encoder(input_ids, attention_mask)  # [batch, embed_dim]

        # Normalize embeddings (important for contrastive learning)
        image_embeddings = F.normalize(image_embeddings, dim=-1)
        text_embeddings = F.normalize(text_embeddings, dim=-1)

        # Compute similarity matrix
        logits = torch.matmul(image_embeddings, text_embeddings.T) / self.temperature
        return logits, image_embeddings, text_embeddings

    def compute_loss(self, logits):
        # Labels: diagonal elements are the positive pairs
        labels = torch.arange(len(logits), device=logits.device)
        # Symmetric loss (image-to-text + text-to-image)
        loss_i2t = F.cross_entropy(logits, labels)
        loss_t2i = F.cross_entropy(logits.T, labels)
        return (loss_i2t + loss_t2i) / 2

# Training loop
model = CLIP(embed_dim=512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for epoch in range(10):
    for images, input_ids, attention_mask in dataloader:
        # Forward pass
        logits, img_emb, txt_emb = model(images, input_ids, attention_mask)

        # Compute loss
        loss = model.compute_loss(logits)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
OpenAI's CLIP was trained on roughly 400 million (image, caption) pairs collected from the public internet — no manual class labels, just naturally occurring captions.
Example Pairs:
Image: [Photo of golden retriever]
Text: "A golden retriever playing fetch in a park"
Image: [Sunset over mountains]
Text: "Beautiful sunset with orange sky over mountain range"
Image: [Person cooking]
Text: "Chef preparing pasta in a kitchen"
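To make the training loop above concrete, here is a minimal sketch of a dataset that yields the (images, input_ids, attention_mask) batches it consumes; the file paths and the use of a BERT tokenizer are illustrative assumptions, not part of CLIP itself:

import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from transformers import BertTokenizer

class ImageCaptionDataset(Dataset):
    """Yields (image, input_ids, attention_mask) triples for the CLIP training loop above."""

    def __init__(self, pairs, max_length=50):
        # pairs: list of (image_path, caption) tuples -- a stand-in for a web-scale dataset
        self.pairs = pairs
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])
        self.max_length = max_length

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        image_path, caption = self.pairs[idx]
        image = self.transform(Image.open(image_path).convert("RGB"))
        encoded = self.tokenizer(
            caption,
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
        return image, encoded["input_ids"].squeeze(0), encoded["attention_mask"].squeeze(0)

# Example usage with a hypothetical image file
pairs = [("dog.jpg", "A golden retriever playing fetch in a park")]
dataloader = DataLoader(ImageCaptionDataset(pairs), batch_size=32, shuffle=True)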
The Magic: Once trained, CLIP can classify images into categories it's never seen before!
Traditional Supervised Learning: the classifier can only output the fixed set of labels it was trained on; adding a class means collecting labeled data and retraining.
CLIP Zero-Shot: embed the image and embed a text description of each candidate class, then pick the class whose text embedding is most similar to the image embedding.
Example:
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

# Load pre-trained CLIP
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load image
image = Image.open("mystery_animal.jpg")

# Define candidate categories (can be anything!)
texts = [
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a bird",
    "a photo of a fish",
]

# Process inputs
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

# Compute similarities
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # [1, 4] - image-text similarity scores
probs = logits_per_image.softmax(dim=1)      # Convert to probabilities

# Get prediction
pred_idx = probs.argmax().item()
print(f"Prediction: {texts[pred_idx]}")
print(f"Confidence: {probs[0][pred_idx].item():.2%}")
Output:
Prediction: a photo of a dog
Confidence: 87.32%
Before CLIP: To add a new category to an image classifier, you needed to collect and label many example images for that category and then retrain (or fine-tune) the model.
With CLIP: To add a new category, you just add one more text description to the candidate list; no new labels and no retraining, as the sketch below shows.
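As a small sketch (the label names and the prompt template are illustrative), extending the zero-shot classifier from the previous example to brand-new categories is just a matter of editing a Python list:

from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Adding a category = adding a string; no labeled data, no retraining
labels = ["golden retriever", "siamese cat", "capybara", "red panda"]
texts = [f"a photo of a {label}" for label in labels]  # simple prompt template

image = Image.open("mystery_animal.jpg")  # same example image as above
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)

print(labels[probs.argmax().item()], f"{probs.max().item():.2%}")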
Applications:
Connection to Lesson 13: We learned diffusion models denoise random noise into images. But how do we condition the generation on text?
Answer: Use CLIP text embeddings to guide the diffusion process!
Text Prompt: "A cat wearing a space suit"
↓
Text Encoder (CLIP)
↓
Text Embedding [512-D]
↓
┌───────────────────────────────┐
│ U-Net Denoiser │
│ (conditioned on text) │
│ │
│ Noise → ... → Image Latent │
└───────────────────────────────┘
↓
VAE Decoder
↓
Final Image (512×512×3)
Key Idea: At each denoising step, ensure the image is moving toward the text description.
Classifier-Free Guidance (CFG):
def diffusion_step_with_guidance(x_t, t, text_embedding, guidance_scale=7.5):
    """
    x_t: Noisy image (latent) at timestep t
    t: Current timestep
    text_embedding: CLIP text embedding
    guidance_scale: How much to emphasize the text conditioning
    (`unet` and `denoise` are assumed to be defined elsewhere: the conditional
    noise-prediction network and the scheduler's update step, respectively.)
    """
    # Predict noise with text conditioning
    noise_cond = unet(x_t, t, context=text_embedding)

    # Predict noise without text conditioning (unconditional)
    noise_uncond = unet(x_t, t, context=None)

    # Classifier-free guidance: extrapolate toward the text-conditioned prediction
    noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)

    # Denoise one step
    x_t_minus_1 = denoise(x_t, noise_pred, t)
    return x_t_minus_1
Why CFG Works: The difference between the text-conditioned and unconditional noise predictions points in the direction that makes the sample agree more with the prompt; scaling that difference by guidance_scale > 1 amplifies text adherence (at some cost to diversity).
from diffusers import StableDiffusionPipeline
import torch

# Load Stable Diffusion model
model_id = "stabilityai/stable-diffusion-2-1"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate image from text
prompt = "A cat wearing a space suit, digital art"
image = pipe(
    prompt=prompt,
    height=512,
    width=512,
    num_inference_steps=50,  # More steps = higher quality
    guidance_scale=7.5,      # Text adherence
).images[0]
image.save("cat_astronaut.png")
How Text Influences Generation: the guidance scale controls how strongly each denoising step follows the CLIP text embedding, as the sketch below illustrates.
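A quick way to see this influence is to sweep the guidance scale for a fixed prompt and random seed, reusing the pipeline above (the specific scale values are illustrative):

import torch

# Low scale -> looser, more varied images; high scale -> stricter prompt adherence
for scale in [1.5, 7.5, 15.0]:
    generator = torch.Generator("cuda").manual_seed(0)  # fix the noise so only the scale changes
    image = pipe(prompt=prompt, guidance_scale=scale, generator=generator).images[0]
    image.save(f"cat_astronaut_scale_{scale}.png")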

Task: Given an image, generate a descriptive text caption.
Example:
Input Image: [Photo of children playing soccer in a park]
Output Caption: "A group of children playing soccer in a park on a sunny day"
┌─────────────────────┐
│ Image Encoder │ ← CNN or ViT (pre-trained)
│ (ResNet/ViT) │
│ ↓ │
│ Visual Features │ ← [batch, 49, 2048] (ResNet 7×7 spatial features)
└─────────────────────┘
↓
┌─────────────────────┐
│ Decoder │ ← Transformer decoder with cross-attention
│ (Transformer) │
│ │
│ Cross-Attention: │ ← Attend to visual features
│ Query: Word │
│ Key/Value: Image │
│ ↓ │
│ "A" → "group" → │ ← Autoregressive generation
│ "of" → "children" │
└─────────────────────┘
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Use ResNet without the final classification layers
        resnet = resnet50(pretrained=True)
        modules = list(resnet.children())[:-2]  # Remove avgpool and fc
        self.resnet = nn.Sequential(*modules)
        # Output: [batch, 2048, 7, 7] spatial features

    def forward(self, images):
        # images: [batch, 3, 224, 224]
        features = self.resnet(images)  # [batch, 2048, 7, 7]
        # Reshape to a sequence: [batch, 49, 2048]
        batch_size, channels, h, w = features.size()
        features = features.view(batch_size, channels, h * w).permute(0, 2, 1)
        return features  # [batch, 49, 2048]
class CaptioningDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, num_heads=8, num_layers=6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.pos_encoding = nn.Parameter(torch.randn(1, 100, embed_dim))  # Max length 100
        # Project 2048-D visual features to the decoder dimension so
        # cross-attention keys/values match d_model
        self.visual_projection = nn.Linear(2048, embed_dim)

        # Transformer decoder layers
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=2048,
        )
        self.transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)

        # Output projection
        self.fc_out = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens, visual_features, tgt_mask=None):
        # tokens: [batch, seq_len] - caption tokens
        # visual_features: [batch, 49, 2048] - image features

        # Embed tokens and add learned positional encodings
        embeddings = self.embedding(tokens) + self.pos_encoding[:, :tokens.size(1), :]

        # Project visual features to embed_dim for cross-attention
        memory = self.visual_projection(visual_features)  # [batch, 49, embed_dim]

        # Transformer decoder (with cross-attention to the visual features)
        output = self.transformer_decoder(
            tgt=embeddings.permute(1, 0, 2),  # [seq_len, batch, embed_dim]
            memory=memory.permute(1, 0, 2),   # [49, batch, embed_dim]
            tgt_mask=tgt_mask,  # Causal mask for autoregressive generation
        )

        # Project to vocabulary
        logits = self.fc_out(output.permute(1, 0, 2))  # [batch, seq_len, vocab_size]
        return logits
class ImageCaptioningModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.encoder = ImageEncoder()
        self.decoder = CaptioningDecoder(vocab_size)

    def forward(self, images, captions):
        # Encode image
        visual_features = self.encoder(images)  # [batch, 49, 2048]

        # Causal mask so each position only attends to earlier tokens
        seq_len = captions.size(1)
        tgt_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=captions.device),
            diagonal=1,
        )

        # Decode caption (teacher forcing during training)
        logits = self.decoder(captions, visual_features, tgt_mask=tgt_mask)
        return logits

    def generate_caption(self, image, tokenizer, max_length=50):
        # Encode image
        visual_features = self.encoder(image.unsqueeze(0))

        # Start with the <START> token
        caption = [tokenizer.start_token_id]

        # Greedy autoregressive generation
        for _ in range(max_length):
            tokens = torch.tensor([caption], device=image.device)
            logits = self.decoder(tokens, visual_features)
            # Get the next token
            next_token = logits[0, -1, :].argmax().item()
            caption.append(next_token)
            # Stop at the <END> token
            if next_token == tokenizer.end_token_id:
                break

        # Decode to text
        return tokenizer.decode(caption)
# Training loop
vocab_size = 10000
model = ImageCaptioningModel(vocab_size=vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for images, captions in dataloader:
    # Forward pass (teacher forcing: input is all but the last token)
    logits = model(images, captions[:, :-1])

    # Compute loss (predict the next token at every position)
    loss = F.cross_entropy(
        logits.reshape(-1, vocab_size),
        captions[:, 1:].reshape(-1)  # Target: all but the first token
    )

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
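Once trained, generating a caption for a single image might look like the sketch below; the image path is a placeholder, and `tokenizer` is assumed to expose the start_token_id, end_token_id, and decode used above:

from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
image = preprocess(Image.open("park_photo.jpg").convert("RGB"))  # hypothetical file

model.eval()
with torch.no_grad():
    caption = model.generate_caption(image, tokenizer)
print(caption)  # e.g., "a group of children playing soccer in a park"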

Task: Given an image and a question, generate an answer.
Example:
Image: [Photo of a kitchen with a red kettle on the stove]
Question: "What color is the kettle?"
Answer: "Red"
Image → Visual Encoder → Visual Features [2048-D]
↓
Question → Text Encoder → Question Features [512-D]
↓
┌─────────────────┐
│ Fusion Module │ ← Combine vision + language
│ (Multi-modal │
│ Attention) │
└─────────────────┘
↓
Answer Classifier [3129 classes]
Key Idea: Let the model attend to relevant image regions based on the question.
Example:
class CoAttentionVQA(nn.Module):
    def __init__(self, num_answers=3129):
        super().__init__()
        self.visual_encoder = ImageEncoder()  # ResNet (see the image captioning section)
        # Questions are assumed to arrive as word embeddings (e.g., 300-D GloVe vectors)
        self.text_encoder = nn.LSTM(input_size=300, hidden_size=512,
                                    num_layers=2, batch_first=True)

        # Co-attention layers
        self.visual_attention = nn.Linear(2048, 512)
        self.text_attention = nn.Linear(512, 512)

        # Answer classifier
        self.classifier = nn.Linear(512 + 2048, num_answers)

    def forward(self, images, questions):
        # Encode image: [batch, 49, 2048]
        visual_features = self.visual_encoder(images)

        # Encode question: questions is [batch, seq_len, 300]
        _, (question_features, _) = self.text_encoder(questions)
        question_features = question_features[-1]  # Last layer hidden state: [batch, 512]

        # Compute attention weights over image regions
        query = self.text_attention(question_features).unsqueeze(1)  # [batch, 1, 512]
        keys = self.visual_attention(visual_features)  # [batch, 49, 512]

        # Attention scores
        scores = torch.bmm(query, keys.permute(0, 2, 1))  # [batch, 1, 49]
        attention_weights = F.softmax(scores, dim=-1)

        # Weighted sum of visual features
        attended_visual = torch.bmm(attention_weights, visual_features).squeeze(1)  # [batch, 2048]

        # Fuse visual + question features
        fused = torch.cat([attended_visual, question_features], dim=-1)  # [batch, 2560]

        # Classify the answer
        logits = self.classifier(fused)  # [batch, 3129]
        return logits
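Training this model is ordinary multi-class classification over the answer vocabulary. A minimal sketch, assuming a hypothetical loader that yields (images, question_embeddings, answer_ids):

model = CoAttentionVQA(num_answers=3129)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for images, question_embeddings, answer_ids in vqa_dataloader:  # hypothetical loader
    logits = model(images, question_embeddings)  # [batch, 3129]
    loss = F.cross_entropy(logits, answer_ids)   # answer_ids: [batch] of answer-class indices

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()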
From Lesson 16: RLHF aligns language models with human preferences.
Extension: Use RLHF to align text-to-image models with human aesthetic preferences!
Stage 1: Train base diffusion model (Stable Diffusion)
Stage 2: Train aesthetic reward model
r = aesthetic_model(image, prompt)
Stage 3: Fine-tune diffusion model with RLHF
Example:
# Reward model
from transformers import CLIPModel

class AestheticRewardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
        self.mlp = nn.Sequential(
            nn.Linear(768, 256),
            nn.ReLU(),
            nn.Linear(256, 1)  # Scalar reward
        )

    def forward(self, images, prompts=None):
        # CLIP image features (768-D for ViT-L/14); the prompt could additionally be
        # encoded and used for an image-text alignment term
        features = self.clip.get_image_features(pixel_values=images)
        # MLP predicts an aesthetic score from the image features
        aesthetic_score = self.mlp(features)
        return aesthetic_score

# RLHF training loop (simplified; diffusion_model, aesthetic_model, ppo_trainer,
# and prompts are assumed to be set up elsewhere)
for prompt_batch in prompts:
    # Generate images with the current diffusion policy
    images = diffusion_model.generate(prompt_batch)

    # Get rewards
    rewards = aesthetic_model(images, prompt_batch)

    # PPO update
    ppo_trainer.step(prompts=prompt_batch, images=images, rewards=rewards)
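For completeness, the Stage 2 reward model would itself be fit to human ratings before the PPO loop above runs. A minimal sketch, assuming a hypothetical loader of (images, prompts, human_scores) triples:

reward_model = AestheticRewardModel()
optimizer = torch.optim.Adam(reward_model.mlp.parameters(), lr=1e-4)  # train only the head; keep CLIP frozen

for images, prompt_batch, human_scores in rating_dataloader:  # hypothetical human-rating loader
    predicted = reward_model(images, prompt_batch).squeeze(-1)  # [batch]
    loss = F.mse_loss(predicted, human_scores)  # regress onto human aesthetic scores

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()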
Results: Models learn to generate more aesthetically pleasing, higher-quality images.
| Application | Input | Output | Example |
|---|---|---|---|
| Image Search | Query image or text | Ranked matching images/products | Upload photo -> Find similar products |
| Visual QA | Image + Question | Answer | "How many people?" -> "3" |
| Image Captioning | Image | Description | Photo -> "A dog playing in a park" |
| Text-to-Image | Text | Image | "Sunset over ocean" -> Generated image |
| Image Editing | Image + Text | Edited image | "Make the sky blue" -> Modified image |
| Document Understanding | Document image | Extracted info | Invoice -> Total amount |
| Medical Imaging | X-ray + Query | Diagnosis | "Is there a fracture?" -> "Yes, left tibia" |
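As one concrete example from the table, image search with CLIP reduces to nearest-neighbor lookup in the joint embedding space. A minimal sketch, where the catalog descriptions and the query image path are illustrative placeholders:

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A tiny "product catalog" described by text (illustrative entries)
catalog = ["red running shoes", "leather office chair", "stainless steel kettle"]
text_inputs = processor(text=catalog, return_tensors="pt", padding=True)
text_emb = model.get_text_features(**text_inputs)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Embed the query photo and rank catalog entries by cosine similarity
image_inputs = processor(images=Image.open("query_photo.jpg"), return_tensors="pt")
img_emb = model.get_image_features(**image_inputs)
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

scores = (img_emb @ text_emb.T).squeeze(0)
best = scores.argmax().item()
print(f"Best match: {catalog[best]} (similarity {scores[best]:.3f})")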
| Advantage | Description |
|---|---|
| ✅ Zero-Shot Transfer | CLIP can classify new categories without training |
| ✅ Unified Representation | One model for multiple vision-language tasks |
| ✅ Rich Semantics | Learns high-level concepts, not just pixels |
| ✅ Scalability | Trained on web-scale data (millions of examples) |
| ✅ Interpretability | Can visualize attention to understand decisions |
| Limitation | Description | Mitigation |
|---|---|---|
| ⚠️ Compositionality | Struggles with complex relationships | Fine-tune on compositional datasets |
| ⚠️ Bias | Inherits biases from web data | Careful data curation, fairness audits |
| ⚠️ Hallucination | May generate false captions | Reward models for factuality |
| ⚠️ Compute Cost | Large models require GPUs | Model distillation, efficient architectures |
| ⚠️ Fine-Grained Details | Misses small objects or subtle differences | Higher-resolution encoders, attention mechanisms |
| Lesson | Connection to Multi-Modal AI |
|---|---|
| Lesson 13 (Diffusion) | Stable Diffusion uses CLIP text embeddings to condition image generation |
| Lesson 14 (Transformers) | Image captioning decoder uses Transformer architecture |
| Lesson 15 (LLMs) | Text encoders in CLIP are based on BERT/GPT-style transformers |
| Lesson 16 (RLHF) | Multi-modal RLHF aligns text-to-image models with human preferences |
Joint Embedding Space: Multi-modal AI maps images and text to the same vector space using contrastive learning.
CLIP Architecture: Dual encoders (image + text) trained with InfoNCE loss on 400M pairs.
Zero-Shot Classification: CLIP can classify images into new categories by comparing with text descriptions.
Text-to-Image: Stable Diffusion uses CLIP text embeddings to guide denoising diffusion.
Image Captioning: Encoder-decoder with cross-attention generates descriptions from images.
Visual QA: Fusion modules (co-attention) combine visual and textual information for question answering.
Multi-Modal RLHF: Extends RLHF to vision domains for aesthetic and quality alignment.
Cross-Attention: Key mechanism that connects visual features with text queries.
Applications: Image search, VQA, captioning, text-to-image, document understanding.
Challenges: Compositionality, bias, hallucination, compute cost.
In Lesson 18 (Future of AI), we'll explore:
Preview: Imagine an AI agent that can:
This is the future we're building toward!
Multi-modal AI bridges vision and language through joint embedding spaces learned with contrastive learning.
CLIP (Contrastive Language-Image Pre-training):
Applications:
Key Techniques:
Impact: Powers Google Lens, DALL-E, GPT-4 Vision, and countless other multi-modal applications.
Next: Lesson 18 - The Future of AI ->