Large Language Models (LLMs) Fundamentals - Discovery Challenge

🎯 Learning Objectives

By completing this activity, you will:

Understand Large Language Models (LLMs) architecture and capabilities
Implement zero-shot prompting for task completion without examples
Master few-shot learning with in-context examples
Apply chain-of-thought reasoning for complex problem-solving
Fine-tune LLMs using LoRA (Low-Rank Adaptation) on custom datasets
Evaluate prompt engineering strategies with quantitative metrics

🚀 Getting Started (See Results in 30 Seconds!)

Open in Google Colab: Upload this notebook to Google Colab
Run All Cells: Click Runtime -> Run all (or press Ctrl+F9)
Watch the Magic: You'll see:
- ✅ Pre-trained LLM (GPT-2) loaded and ready
- ✅ Tokenization pipeline working
- ✅ LLM responding to prompts immediately
- ✅ Evaluation metrics and visualizations

Expected First Run Time: ~90 seconds (includes model download)

🎯 What's Already Working

The template comes with 70% working code:

✅ Pre-trained LLM Loading: GPT-2/DistilGPT-2 from HuggingFace Transformers
✅ Tokenization Pipeline: Text -> tokens -> model inputs
✅ Inference API: Generate text completions with temperature/top-k sampling
✅ Prompt Templates: Structured templates for different task types
✅ Response Parsing: Extract answers from LLM outputs
✅ Evaluation Harness: Accuracy, BLEU, perplexity metrics
✅ Visualization Dashboard: Performance comparison across prompt strategies
✅ Fine-tuning Framework: PEFT/LoRA integration setup

What Needs Your Work (30%):

⚠️ TODO 1: Implement zero-shot prompting for classification/QA/summarization
⚠️ TODO 2: Implement few-shot prompting with 3-5 examples per task
⚠️ TODO 3: Implement chain-of-thought prompting with reasoning steps
⚠️ TODO 4: Fine-tune LLM with LoRA on instruction dataset (Extension)

📋 Tasks to Complete

TODO 1: Implement Zero-Shot Prompting (Medium)

Location: Section 4 - "Zero-Shot Prompting"

Current State: Prompt templates exist but task descriptions are incomplete

Your Task: Create effective zero-shot prompts for three task types:

Text Classification: Sentiment analysis (positive/negative/neutral)
Question Answering: Answer questions based on context passages
Summarization: Generate concise summaries of long texts

Starter Code Provided:

python

def zero_shot_classify(text, labels):
    # TODO: Create prompt that asks LLM to classify text
    # Template: "Classify the following text as {labels}: {text}\nAnswer:"
    prompt = ""  # Fill this in
    return generate_completion(prompt)

def zero_shot_qa(context, question):
    # TODO: Create prompt for question answering
    # Include context, question, and instruction
    prompt = ""  # Fill this in
    return generate_completion(prompt)

def zero_shot_summarize(text, max_length=50):
    # TODO: Create prompt for summarization
    # Specify desired summary length
    prompt = ""  # Fill this in
    return generate_completion(prompt)

Success Criteria:

Classification achieves >60% accuracy on test set (20 examples)
QA extracts correct answers ``for >50``% of questions
Summaries are concise (<100
Prompts are clear and unambiguous

TODO 2: Implement Few-Shot Prompting (Medium)

Location: Section 5 - "Few-Shot Learning"

Your Task: Enhance prompts with 3-5 in-context examples to improve performance

Requirements:

For each task type (classification, QA, summarization), create few-shot prompts
Use 3-5 labeled examples as demonstrations
Format: Example One:\nInput: ...\nOutput: ...\n\nExample 2:\n...
Ensure examples are diverse and representative

Example Structure:

python

def few_shot_classify(text, labels, examples):
    """
    examples = [
        {"text": "I love this movie!", "label": "positive"},
        {"text": "Terrible experience.", "label": "negative"},
        {"text": "It was okay, nothing special.", "label": "neutral"}
    ]
    """
    # TODO: Build prompt with examples + test input
    pass

Success Criteria:

Few-shot accuracy >70% (10+ points improvement over zero-shot)
Examples are well-formatted and consistent
Test 3-shot vs 5-shot: observe performance scaling
Prompts demonstrate input-output mappings clearly

TODO 3: Implement Chain-of-Thought Prompting (Hard)

Location: Section 6 - "Chain-of-Thought Reasoning"

Your Task: Implement CoT prompting for multi-step reasoning tasks

CoT Strategy:

Add instruction: "Let's think step by step:"
Break down problem into intermediate reasoning steps
Guide LLM to show its work before final answer
Extract final answer from reasoning chain

Task Types:

Math Word Problems: "If John has 5 apples and buys 3 more..."
Logical Reasoning: "If all A are B, and some B are C, then..."
Multi-hop QA: Questions requiring multiple inference steps

Starter Code:

python

def chain_of_thought_prompt(problem, task_type="math"):
    # TODO: Add "Let's think step by step" instruction
    # TODO: Structure prompt to elicit reasoning
    # TODO: Parse final answer from reasoning chain
    prompt = f"""
Solve the following {task_type} problem step by step.

Problem: {problem}

Let's think step by step:
"""
    # Your completion here
    pass

Success Criteria:

CoT improves accuracy on reasoning tasks by >15% vs zero-shot
LLM generates intermediate reasoning steps (not just final answer)
Final answer extraction works reliably (>90% parse success)
Comparison: CoT vs direct answer prompting

TODO 4: Fine-Tune LLM with LoRA (Very Hard - Extension)

Location: Section 8 - "LoRA Fine-Tuning"

Your Task: Fine-tune GPT-2 on instruction-following dataset using PEFT/LoRA

Requirements:

Use LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning
Dataset: Instruction-response pairs (e.g., Alpaca-style)
LoRA hyperparameters: rank=8, alpha=16, dropout=0.05
Train for 500-1000 steps, monitor loss convergence

PEFT Configuration:

python

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,              # Rank
    lora_alpha=16,    # Scaling factor
    target_modules=["c_attn"],  # GPT-2 attention layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

Success Criteria:

Fine-tuning runs without OOM errors (monitor GPU memory)
Training loss decreases steadily
Fine-tuned model outperforms base model on instruction tasks
LoRA adapters ``are <10``MB (vs 500MB+ full fine-tuning)

🚀 Extension Challenges

Once you've completed all TODOs, try these advanced challenges:

Challenge One: Instruction Tuning (Hard)

Create a custom instruction dataset (50+ examples) and fine-tune GPT-2 to follow specific instructions (e.g., "Rewrite this formally", "Explain like I'm 5"). Compare instruction-tuned vs base model.

Challenge 2: Prompt Optimization (Medium)

Implement automatic prompt engineering:

Generate 10 prompt variations for classification task
Evaluate each on validation set
Select best-performing prompt template
Test on held-out test set

Challenge 3: Retrieval-Augmented Generation (RAG) (Hard)

Implement simple RAG pipeline:

Embed document chunks with sentence-transformers
Retrieve top-3 relevant chunks for query (cosine similarity)
Pass retrieved context + query to LLM
Compare RAG vs no-retrieval QA accuracy

Challenge 4: Multi-Task Agent with Tools (Very Hard)

Create LLM agent that can:

Use calculator tool for math (regex to detect calculations)
Search knowledge base for facts
Chain multiple tool calls to solve complex queries
Implement ReAct (Reasoning + Acting) pattern

📊 Expected Results

Zero-Shot (TODO 1):

Classification Accuracy: 55-65% (vs 33% random for 3-class)
QA Exact Match: 40-50%
Summarization BLEU: 15-25

Few-Shot (TODO 2):

Classification Accuracy: 70-80% (+10-15 points)
QA Exact Match: 55-65% (+10-15 points)
Summarization BLEU: 25-35 (+10 points)

Chain-of-Thought (TODO 3):

Math Reasoning Accuracy: 60-75% (vs 35-45% zero-shot)
Logical Reasoning: 55-70% (vs 40-50% zero-shot)
Multi-hop QA: 50-65% (vs 30-40% zero-shot)

LoRA Fine-Tuning (TODO 4):

Instruction-Following Accuracy: 75-85% (vs 50-60% base model)
Training Time: ~10-15 minutes (500 steps, Colab T4 GPU)
Model Size: Base (500MB) + LoRA adapters (8MB)
Loss Reduction: 2.5 -> 1.2 (training loss)

🎓 Success Criteria Checklist

Minimum Requirements (for passing):

Notebook runs without errors in Google Colab
TODO 1 completed (zero-shot prompts implemented)
TODO 2 completed (few-shot prompts with 3+ examples)
Zero-shot -> few-shot shows measurable improvement (>5% accuracy)
All visualizations render correctly

Target Grade (for excellent work):

All 3 core TODOs completed (1-3)
Chain-of-thought prompting implemented with reasoning steps
Performance improvements demonstrated quantitatively
Prompt comparison dashboard shows clear insights
At least 1 extension challenge attempted

Exceptional Work (bonus points):

TODO 4 completed (LoRA fine-tuning successful)
Multiple extension challenges completed
Novel prompt engineering strategies tested
Production-quality code (modular, reusable, well-documented)
Ablation studies (e.g., 3-shot vs 5-shot, temperature tuning)

🛠️ Troubleshooting

Issue: "OutOfMemoryError during model loading"

Solution:

Use DistilGPT-2 instead of GPT-2: model = "distilgpt2" (smaller, faster)
Clear GPU cache: torch.cuda.empty_cache()
Reduce batch size to 1 for inference
Switch to CPU if GPU memory is limited: device = "cpu"

Issue: "Model generates repetitive or nonsensical text"

Solution:

Increase temperature: temperature=0.7 -> 0.9 (more randomness)
Add top-k sampling: top_k=50 (limits vocabulary per step)
Set max_length: max_new_tokens=100 (prevent runaway generation)
Use do_sample=True for non-greedy decoding

Issue: "Zero-shot prompts don't work (accuracy near random)"

Solution:

Make instructions more explicit: "You must choose exactly one of: positive, negative, neutral"
Add output format constraints: "Answer with a single word:"
Try prompt variations: "Classify the sentiment" vs "What is the sentiment?"
Check for off-by-one errors in answer extraction (parsing)

Issue: "LoRA fine-tuning runs out of GPU memory"

Solution:

Reduce LoRA rank: r=8 -> r=4 (fewer parameters)
Decrease batch size: per_device_train_batch_size=4 -> 2
Enable gradient checkpointing: model.gradient_checkpointing_enable()
Use smaller base model: gpt2 -> distilgpt2
Reduce max sequence length: max_length=512 -> 256

Issue: "Chain-of-thought prompts don't generate reasoning steps"

Solution:

Add explicit step markers: "Step 1:", "Step 2:", etc.
Provide CoT examples in few-shot format (show reasoning before answer)
Increase max_tokens: Allow 200+ tokens for reasoning chains
Use stronger instruction: "You MUST show your work step by step before answering"

Issue: "Tokenization errors or encoding issues"

Solution:

python

# Add padding token (GPT-2 doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token

# Handle long inputs
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

# Decode properly
output = tokenizer.decode(tokens, skip_special_tokens=True)

📚 Resources

Core Papers

GPT-3 (Brown et al., 2020): "Language Models are Few-Shot Learners" - Introduced in-context learning
Chain-of-Thought (Wei et al., 2022): "Chain-of-Thought Prompting Elicits Reasoning"
LoRA (Hu et al., 2021): "Low-Rank Adaptation of Large Language Models"
LLaMA (Touvron et al., 2023): "LLaMA: Open and Efficient Foundation Language Models"
InstructGPT (Ouyang et al., 2022): "Training language models to follow instructions"

Documentation

HuggingFace Transformers - Model loading and inference
PEFT Library - LoRA and parameter-efficient fine-tuning
Datasets Library - Load instruction datasets

Prompt Engineering Guides

OpenAI Best Practices - Prompt engineering strategies
Anthropic Prompt Guide - Advanced prompting techniques
Learn Prompting - Comprehensive prompt engineering course

Concept 15: Large Language Models and Transformers (architecture deep dive)
Concept 16: Prompt Engineering and In-Context Learning (theory)
Concept 17: Fine-Tuning and RLHF (alignment techniques)

Tools and Datasets

Model Hub: HuggingFace Hub - 100K+ pre-trained models
Instruction Datasets: Alpaca, Dolly-15K, FLAN, OpenAssistant
Evaluation: MMLU, HellaSwag, TruthfulQA benchmarks

📤 Submission

Complete required TODOs (minimum: TODO 1-2)
Run entire notebook to generate all outputs and visualizations
Export evaluation results: Save performance comparison table as CSV
Download notebook: File -> Download -> Download .ipynb
Submit via portal: Upload .ipynb and results.csv

Submission Checklist:

Filename: activity-15-[YourName].ipynb
All code cells executed successfully
Zero-shot and few-shot prompts implemented
Performance metrics show improvements (>5% accuracy gain)
Visualizations display correctly (comparison charts)
Comments explain prompt design choices

Optional Deliverables (for bonus points):

Chain-of-thought examples with reasoning traces
Fine-tuned model adapters (LoRA checkpoint)
Prompt ablation study results
Extension challenge implementations

🎉 What's Next?

After mastering LLM fundamentals:

Move to Activity 16: Reinforcement Learning from Human Feedback (RLHF)
Learn how to align LLMs with human preferences
Implement reward modeling and PPO fine-tuning

Key Insight: This activity taught you prompt engineering (zero-shot, few-shot, CoT) and basic fine-tuning. In Activity 16, you'll learn how models like ChatGPT are trained to be helpful, harmless, and honest through RLHF!

Real-World Applications:

Zero-Shot: Quick prototyping without training data
Few-Shot: Rapid task adaptation with minimal examples
Chain-of-Thought: Complex reasoning (math, logic, coding)
LoRA Fine-Tuning: Domain adaptation with limited compute

Good luck! LLMs are the foundation of modern AI applications. Master prompting and fine-tuning, and you'll be able to build production-ready AI systems! 🚀

Template 15: Large Language Models Fundamentals

📦 Project Files Included:

Large Language Models (LLMs) Fundamentals - Discovery Challenge

🎯 Learning Objectives

🚀 Getting Started (See Results in 30 Seconds!)

🎯 What's Already Working

What Needs Your Work (30%):

📋 Tasks to Complete

TODO 1: Implement Zero-Shot Prompting (Medium)

TODO 2: Implement Few-Shot Prompting (Medium)

TODO 3: Implement Chain-of-Thought Prompting (Hard)

TODO 4: Fine-Tune LLM with LoRA (Very Hard - Extension)

🚀 Extension Challenges

Challenge One: Instruction Tuning (Hard)

Challenge 2: Prompt Optimization (Medium)

Challenge 3: Retrieval-Augmented Generation (RAG) (Hard)

Challenge 4: Multi-Task Agent with Tools (Very Hard)

📊 Expected Results

Zero-Shot (TODO 1):

Few-Shot (TODO 2):

Chain-of-Thought (TODO 3):

LoRA Fine-Tuning (TODO 4):

🎓 Success Criteria Checklist

🛠️ Troubleshooting

Issue: "OutOfMemoryError during model loading"

Issue: "Model generates repetitive or nonsensical text"

Issue: "Zero-shot prompts don't work (accuracy near random)"

Issue: "LoRA fine-tuning runs out of GPU memory"

Issue: "Chain-of-thought prompts don't generate reasoning steps"

Issue: "Tokenization errors or encoding issues"

📚 Resources

Core Papers

Documentation

Prompt Engineering Guides

Tools and Datasets

📤 Submission

🎉 What's Next?