Practice and reinforce the concepts from Lesson 15
In this final GenAI activity, you'll work with state-of-the-art Large Language Models. You'll master prompt engineering techniques, fine-tune an LLM with LoRA, implement chain-of-thought reasoning, and build a practical LLM application using HuggingFace. This activity bridges to Lesson 16 (RLHF) and prepares you for the integration module.
By completing this activity, you will practice each of these techniques hands-on.

Download the activity template from the Templates folder and set up your notebook:

1. Download `AI25-Template-activity-15-large-language-models.zip` (`Templates/AI25-Template-activity-15-large-language-models.zip`)
2. Upload `activity-15-large-language-models.ipynb` to Google Colab
3. Execute the first few cells to set up the environment
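The TODOs below all assume you have a model, a tokenizer, and some helper for text generation. If your template's setup cells don't already provide one, here is a minimal sketch (assumptions: GPT-2 from HuggingFace as a lightweight stand-in model, and a hypothetical `generate` helper; swap in a larger model if your GPU allows):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: small stand-in model for experimentation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token by default

def generate(prompt, max_new_tokens=64, temperature=0.7):
    """Return only the newly generated text for a given prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        pad_token_id=tokenizer.eos_token_id,
    )
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```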
**TODO 1: Implement and evaluate zero-shot prompts**

```python
def zero_shot_prompt(model, tokenizer, task_description, input_text):
    """
    Zero-shot prompting: Ask directly without examples

    Args:
        model: LLM (GPT-2, LLaMA, etc.)
        tokenizer: Tokenizer
        task_description: What to do
        input_text: Input to process
    Returns:
        Generated response
    """
    # TODO 1: Construct zero-shot prompt
    # Template:
    # {task_description}
    #
    # Input: {input_text}
    # Output:
    # Your code here
    pass

# Test tasks
tasks = [
    ("Translate to French", "Hello, how are you?"),
    ("Classify sentiment (Positive/Negative)", "This movie was terrible!"),
    ("Summarize in one sentence", "Long article text..."),
    ("Answer the question", "What is the capital of France?"),
]

# TODO 1: Run zero-shot on all tasks, analyze success rate
```
**TODO 2: Implement few-shot prompting**

```python
def few_shot_prompt(model, tokenizer, examples, input_text):
    """
    Few-shot prompting: Provide examples in prompt

    Args:
        model: LLM
        tokenizer: Tokenizer
        examples: List of (input, output) tuples
        input_text: New input to process
    Returns:
        Generated response
    """
    # TODO 2: Construct few-shot prompt
    # Template:
    # Input: {example_1_input}
    # Output: {example_1_output}
    #
    # Input: {example_2_input}
    # Output: {example_2_output}
    #
    # ...
    #
    # Input: {input_text}
    # Output:
    # Your code here
    pass

# Example: Sentiment classification
examples = [
    ("I loved this restaurant!", "Positive"),
    ("The service was terrible.", "Negative"),
    ("It was okay, nothing special.", "Neutral"),
]

# TODO 2: Test few-shot vs zero-shot, compare accuracy
```
**TODO 3: Implement CoT for complex reasoning**

```python
def chain_of_thought_prompt(model, tokenizer, question, enable_cot=True):
    """
    Chain-of-thought prompting: Ask model to show reasoning steps

    Args:
        model: LLM
        tokenizer: Tokenizer
        question: Problem to solve
        enable_cot: Whether to use "Let's think step by step"
    Returns:
        Generated response with reasoning
    """
    # TODO 3: Construct CoT prompt
    # Without CoT:
    # Q: {question}
    # A:
    #
    # With CoT:
    # Q: {question}
    # A: Let's think step by step.
    # Your code here
    pass

# Test problems (math, logic, common sense)
problems = [
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now?",
    "If it takes 5 machines 5 minutes to make 5 widgets, how long does it take 100 machines to make 100 widgets?",
    "A farmer has 17 sheep, all but 9 die. How many are left?",
]

# TODO 3: Compare accuracy with and without CoT
```
**TODO 4: Implement role prompting, constraints, and format control**

```python
def role_prompt(model, tokenizer, role, task):
    """
    Role prompting: Assign persona to model

    Args:
        role: Persona (e.g., "expert Python developer", "creative writer")
        task: Task to perform
    Returns:
        Generated response
    """
    # TODO 4a: Construct role prompt
    # Template:
    # You are {role}. {task}
    # Your code here
    pass


def constrained_prompt(model, tokenizer, task, constraints):
    """
    Constrained prompting: Specify output format/length/style

    Args:
        task: What to do
        constraints: List of constraints,
            e.g., ["Respond in exactly 50 words", "Use simple language"]
    Returns:
        Generated response
    """
    # TODO 4b: Construct constrained prompt
    # Template:
    # {task}
    #
    # Constraints:
    # - {constraint_1}
    # - {constraint_2}
    # Your code here
    pass


def json_format_prompt(model, tokenizer, task, schema):
    """
    Format control: Request specific output format (JSON, etc.)

    Args:
        task: What to extract/generate
        schema: Expected JSON structure
    Returns:
        Generated JSON string
    """
    # TODO 4c: Construct format control prompt
    # Template:
    # {task}
    #
    # Respond in JSON format:
    # {schema}
    # Your code here
    pass
```
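For TODO 4c, remember that the model still returns plain text even when you ask for JSON, so parse defensively. A minimal sketch (the `parse_json_response` helper is illustrative, not part of the template):

```python
import json
import re

def parse_json_response(response_text):
    """Extract the first {...} block from a model response and parse it."""
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output")
    return json.loads(match.group(0))
```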
**TODO 5: Fine-tune LLaMA 2 with LoRA**

```python
from peft import LoraConfig, get_peft_model, TaskType


def setup_lora_model(base_model_name="meta-llama/Llama-2-7b-hf", lora_r=8):
    """
    Set up model for LoRA fine-tuning

    Args:
        base_model_name: Base LLM to fine-tune
        lora_r: LoRA rank (lower = fewer parameters)
    Returns:
        model: PEFT model with LoRA adapters
        tokenizer: Tokenizer
    """
    # TODO 5a: Load base model
    from transformers import AutoModelForCausalLM, AutoTokenizer
    # model = AutoModelForCausalLM.from_pretrained(base_model_name)
    # tokenizer = AutoTokenizer.from_pretrained(base_model_name)

    # TODO 5b: Configure LoRA
    # lora_config = LoraConfig(
    #     task_type=TaskType.CAUSAL_LM,
    #     r=lora_r,                             # Rank
    #     lora_alpha=32,                        # Scaling
    #     lora_dropout=0.1,
    #     target_modules=["q_proj", "v_proj"],  # Which attention layers
    # )

    # TODO 5c: Apply LoRA to model
    # model = get_peft_model(model, lora_config)
    # model.print_trainable_parameters()  # Should be ~0.1% of total
    # Your code here
    pass


def fine_tune_lora(model, tokenizer, train_dataset, output_dir="./lora-llama2"):
    """
    Fine-tune with LoRA on custom dataset

    Args:
        model: PEFT model
        tokenizer: Tokenizer
        train_dataset: HuggingFace Dataset
        output_dir: Where to save adapter weights
    Returns:
        Fine-tuned model
    """
    # TODO 5d: Set up training arguments
    from transformers import TrainingArguments, Trainer
    # training_args = TrainingArguments(
    #     output_dir=output_dir,
    #     num_train_epochs=3,
    #     per_device_train_batch_size=4,
    #     gradient_accumulation_steps=4,
    #     learning_rate=2e-4,
    #     fp16=True,  # Mixed precision
    #     save_steps=100,
    #     logging_steps=10,
    # )

    # TODO 5e: Train
    # trainer = Trainer(
    #     model=model,
    #     args=training_args,
    #     train_dataset=train_dataset,
    # )
    # trainer.train()
    # Your code here
    pass
```
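After training, only the small adapter needs to be saved; the base model stays untouched. A sketch of saving and reloading LoRA weights with `peft` (paths and model names follow the defaults above):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Save only the LoRA adapter weights (a few MB), not the full base model.
model.save_pretrained("./lora-llama2")

# Later: reload the base model and attach the trained adapter.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "./lora-llama2")
```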
**TODO 6: Implement self-consistency for improved accuracy**

```python
def self_consistency(model, tokenizer, question, num_samples=5):
    """
    Self-consistency: Generate multiple reasoning paths, take majority vote

    Args:
        model: LLM
        tokenizer: Tokenizer
        question: Problem to solve
        num_samples: Number of reasoning paths to generate
    Returns:
        Final answer (majority vote)
        All generated answers
    """
    # TODO 6: Implement self-consistency
    # Step 1: Generate num_samples different reasoning paths
    #         Use temperature sampling (T=0.7) for diversity
    #
    # Step 2: Extract final answer from each path
    #         Parse last line or use regex
    #
    # Step 3: Take majority vote
    #         Count occurrences of each answer
    #         Return most common
    # Your code here
    pass

# Test on math problems
math_problems = [
    "A store sells apples for $2 each and oranges for $3 each. If John buys 5 apples and 3 oranges, how much does he spend?",
    "If a train travels 60 miles in 1 hour, how far does it travel in 2.5 hours?",
]

# TODO 6: Compare single-path vs self-consistency accuracy
```
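The majority vote in Step 3 is only a few lines once you have extracted one answer string per reasoning path; a minimal sketch:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and the full tally (assumes answers are strings)."""
    counts = Counter(a.strip().lower() for a in answers)
    best_answer, _ = counts.most_common(1)[0]
    return best_answer, counts

# e.g. majority_vote(["11", "11", "12", "11", "11"]) -> ("11", Counter({"11": 4, "12": 1}))
```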
**TODO 7: Implement evaluation metrics**

```python
def compute_perplexity(model, tokenizer, text):
    """
    Compute perplexity on text
    Lower perplexity = model is less "surprised" by text

    Args:
        model: LLM
        tokenizer: Tokenizer
        text: Text to evaluate
    Returns:
        perplexity: Scalar value
    """
    # TODO 7a: Implement perplexity
    # Formula: exp(average negative log-likelihood)
    #
    # Step 1: Tokenize text
    # Step 2: Forward pass to get logits
    # Step 3: Compute cross-entropy loss
    # Step 4: perplexity = exp(loss)
    # Your code here
    pass


def evaluate_zero_shot_classification(model, tokenizer, test_data, task_description):
    """
    Evaluate zero-shot classification accuracy

    Args:
        model: LLM
        tokenizer: Tokenizer
        test_data: List of (input, label) tuples
        task_description: Task prompt
    Returns:
        accuracy: Fraction of correct predictions
        predictions: List of (input, true_label, predicted_label)
    """
    # TODO 7b: Implement zero-shot eval
    # For each test example:
    #   1. Generate prediction using zero-shot prompt
    #   2. Parse predicted label
    #   3. Compare to true label
    # Compute accuracy = correct / total
    # Your code here
    pass


def human_eval_comparison(model, tokenizer, prompts, baseline_responses):
    """
    Human evaluation: Compare model outputs to baseline

    Args:
        model: Fine-tuned LLM
        tokenizer: Tokenizer
        prompts: List of prompts
        baseline_responses: Responses from base model
    Returns:
        Comparison results for human review
    """
    # TODO 7c: Generate and format for human eval
    # For each prompt:
    #   1. Generate response from fine-tuned model
    #   2. Format: Prompt | Baseline | Fine-tuned
    #   3. Save to file for human reviewers
    # Your code here
    pass
```
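If you get stuck on TODO 7a, the whole computation fits in a few lines; one possible approach, sketched under the assumption of a HuggingFace causal LM and PyTorch:

```python
import torch

def perplexity_sketch(model, tokenizer, text):
    """Perplexity = exp(mean negative log-likelihood) over the tokens of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss directly.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(outputs.loss).item()
```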
**TODO 8: Build practical LLM application (choose one)**

**Option A: Code Assistant**

```python
class CodeAssistant:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def complete_code(self, partial_code):
        """Complete partial code snippet"""
        # TODO 8a: Implement code completion
        pass

    def explain_code(self, code):
        """Explain what code does"""
        # TODO 8a: Implement code explanation
        pass

    def fix_bugs(self, buggy_code, error_message):
        """Suggest bug fixes"""
        # TODO 8a: Implement bug fixing
        pass
```

**Option B: Conversational Chatbot**

```python
class Chatbot:
    def __init__(self, model, tokenizer, system_prompt):
        self.model = model
        self.tokenizer = tokenizer
        self.system_prompt = system_prompt
        self.conversation_history = []

    def chat(self, user_message):
        """
        Multi-turn conversation
        Maintains context across turns
        """
        # TODO 8b: Implement conversational chatbot
        # Step 1: Append user message to history
        # Step 2: Construct prompt with full conversation
        # Step 3: Generate response
        # Step 4: Append response to history
        # Step 5: Return response
        # Your code here
        pass

    def reset(self):
        """Clear conversation history"""
        self.conversation_history = []
```
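Whichever option you pick, exercise it interactively. A usage sketch for Option B (assumes your `chat` implementation is complete; the prompts are just examples):

```python
bot = Chatbot(model, tokenizer, system_prompt="You are a helpful teaching assistant.")
print(bot.chat("What does LoRA stand for?"))
print(bot.chat("And why is it parameter-efficient?"))  # second turn should use the history
bot.reset()  # start a fresh conversation
```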
## Expected Results

### Part 1: Zero-Shot Prompting

Task: Sentiment Classification
Input: "This movie was amazing!"
Zero-shot Output: "Positive"
Accuracy: 70-80% (varies by model)

✓ Works for simple tasks
⚠️ Struggles with complex/ambiguous cases
### Part 2: Few-Shot Prompting

Task: Sentiment Classification (3 examples)
Input: "The food was decent but service was slow."
Few-shot Output: "Neutral"
Accuracy: 85-92%

✓ 10-15% improvement over zero-shot
✓ Learns from examples without fine-tuning
### Part 3: Chain-of-Thought

Math Problem:
Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. How many total?

Without CoT:
A: 11

With CoT:
A: Let's think step by step.
Roger started with 5 balls.
He bought 2 cans, each with 3 balls.
2 × 3 = 6 new balls.
5 + 6 = 11 balls total.
The answer is 11.

✓ CoT improves accuracy from 65% to 92% on math problems
### Part 4: Advanced Prompting

Role Prompting (Code Assistant):
Prompt: "You are an expert Python developer. Write a function to reverse a string."
Output:

```python
def reverse_string(s: str) -> str:
    """
    Reverse a string efficiently.

    Args:
        s: Input string
    Returns:
        Reversed string
    """
    return s[::-1]
```

✓ Better code quality with role prompt
### Part 5: LoRA Fine-Tuning
**Model Statistics:**
Base LLaMA 2 (7B):
- ✓ Fine-tunes on consumer GPU
- ✓ 99.94% of weights frozen
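You can verify the frozen/trainable split yourself; a minimal sketch (assumes `model` is the PEFT model from TODO 5c):

```python
# Count trainable vs. total parameters of the PEFT model.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
# model.print_trainable_parameters() reports the same numbers.
```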
**Fine-Tuning Results (Python code generation):**
Before:
Prompt: "def fibonacci(n):"
Output: "I don't understand what you want."

After (LoRA):
Prompt: "def fibonacci(n):"
Output:

```python
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
```
✓ Adapts to domain (Python code)
### Part 6: Self-Consistency
**Math Problem (5 samples):**
Majority vote across the 5 generated answers: "11 balls" (4/5)
✓ Self-consistency accuracy: 96% (vs 92% single-path)
### Part 7: Evaluation
**Perplexity:**
- GPT-2 (base): 35.2
- Fine-tuned (LoRA): 18.7
✓ Lower perplexity = better fit to domain
**Zero-Shot Classification (sentiment, 1000 examples):**
- Accuracy: 84.3%
- Precision: 0.83
- Recall: 0.85
- F1: 0.84
✓ Strong performance without task-specific training
### Part 8: LLM Application
**Code Assistant Demo:**
User: "Complete this code: def factorial(n):" Assistant:
def factorial(n):
if n == 0 or n == 1:
return 1
return n * factorial(n-1)
User: "Explain this code" Assistant: "This is a recursive implementation of the factorial function..."
✓ Functional code assistant
## Success Criteria
Your implementation is complete when:
- [ ] Zero-shot prompting works for 4+ task types
- [ ] Few-shot improves accuracy by 10%+ over zero-shot
- [ ] Chain-of-thought improves reasoning task accuracy
- [ ] Advanced prompting techniques (role, constraints, format) work
- [ ] LoRA fine-tuning completes successfully (trains < 1% of params)
- [ ] Self-consistency improves accuracy over single-path
- [ ] Evaluation metrics computed correctly
- [ ] LLM application (chatbot or code assistant) functional
## Tips for Success
### Prompt Engineering Best Practices
**Clear Task Description:**
❌ "Translate this"
✅ "Translate the following English text to French:"

**Provide Context:**
❌ "Answer: Q: What is the capital?"
✅ "Context: France is a country in Europe.\nQ: What is the capital of France?"

**Specify Output Format:**
❌ "Classify the sentiment"
✅ "Classify the sentiment as one of: Positive, Negative, or Neutral"
### LoRA Configuration
**Rank (r) Selection:**
- r=4: Minimal parameters, may underfit
- r=8: Good balance (recommended)
- r=16: More expressive, slower
**Target Modules:**
- Attention layers: ["q_proj", "v_proj"] (standard)
- All linear layers: Adds capacity, more parameters
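For reference, the two choices as `LoraConfig` objects (a sketch; the module names assume a LLaMA-style architecture):

```python
from peft import LoraConfig, TaskType

# Standard: adapt only the query and value projections
lora_standard = LoraConfig(
    task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=32, lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)

# Higher capacity: adapt all attention and MLP projections (more trainable parameters)
lora_all_linear = LoraConfig(
    task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=32, lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```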
### Common Pitfalls
**1. Context Length Overflow:**
- Problem: Prompt + generation exceeds the model's maximum context length (4096 tokens for LLaMA 2)
- Solution: Truncate the prompt, or use RoPE scaling / ALiBi-based models for longer contexts
**2. Tokenizer Padding:**
- Problem: Improper padding causes errors
- Solution: Set `tokenizer.pad_token = tokenizer.eos_token`
**3. Fine-Tuning Catastrophic Forgetting:**
- Problem: Model forgets general knowledge
- Solution: Use lower learning rate (2e-5), mix general + domain data
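A quick sketch of the fixes for pitfalls 1 and 2 (assumes the tokenizer from your setup cells; the prompt is a placeholder):

```python
prompt = "Summarize the following article: ..."     # placeholder prompt
tokenizer.pad_token = tokenizer.eos_token           # pitfall 2: GPT-2/LLaMA define no pad token
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,                                # pitfall 1: never exceed the context window
    max_length=1024,                                # leave headroom for generated tokens
)
```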
## Extension Challenges
### Challenge 1: QLoRA (Medium)
Implement 4-bit quantized LoRA:
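A sketch of one way to set this up, loading the base model in 4-bit with `BitsAndBytesConfig` before attaching the LoRA adapters from TODO 5 (the exact quantization settings are an assumption):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # store weights in 4-bit (NF4)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in bfloat16
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
# Then apply get_peft_model(model, lora_config) exactly as in TODO 5c.
```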
**Benefit**: Fine-tune 7B model on 16GB GPU!
### Challenge 2: Retrieval-Augmented Generation (RAG) (Hard)
Combine LLM with document retrieval:
```python
class RAG:
    def __init__(self, llm, retriever):
        self.llm = llm
        self.retriever = retriever

    def answer_question(self, question):
        # Step 1: Retrieve relevant documents
        docs = self.retriever.search(question, top_k=3)
        # Step 2: Construct prompt with documents
        prompt = f"Context: {docs}\n\nQuestion: {question}\nAnswer:"
        # Step 3: Generate answer
        return self.llm.generate(prompt)
```
### Challenge 3: Multi-Turn Conversation with Memory (Medium)
Implement conversation with summarization:
```python
class ConversationWithMemory:
    def __init__(self, llm, max_history=10):
        self.llm = llm
        self.history = []
        self.max_history = max_history

    def chat(self, user_message):
        # If history too long, summarize the older half
        if len(self.history) > self.max_history:
            summary = self.llm.summarize(self.history[:self.max_history // 2])
            self.history = [summary] + self.history[self.max_history // 2:]
        # Continue conversation
        ...
```
### Challenge 4: Agent with Tool Use (Very Hard)
LLM that can use external tools (calculator, web search, code execution):
```python
class AgentWithTools:
    def __init__(self, llm, tools):
        self.llm = llm
        self.tools = tools  # {"calculator": calc_fn, "search": search_fn}

    def solve(self, task):
        # Parse task, decide which tools to use, execute, synthesize answer
        pass
```
## Submission Requirements
### What to Submit
1. **Completed Notebook**: `activity-15-large-language-models.ipynb`
* All TODOs completed
* All experiments run
2. **Prompt Engineering Results**:
* Zero-shot accuracy on 4 tasks
* Few-shot accuracy comparison
* Chain-of-thought examples with reasoning paths
* Advanced prompting examples (role, constraints, JSON)
3. **Fine-Tuning Results**:
* LoRA training logs
* Before/after comparison (5 prompts)
* Trainable parameter count
4. **Application Demo**:
* Code assistant or chatbot working examples
* 5+ user interactions
* Screenshots or outputs
5. **Analysis** (10-15 sentences):
* Which prompting technique worked best? Why?
* How did fine-tuning change model behavior?
* What are the limitations of your LLM application?
* How does self-consistency improve accuracy?
* When would you use LoRA vs full fine-tuning?
### Submission Steps
1. Complete all prompting experiments
2. Fine-tune with LoRA
3. Build and test LLM application
4. Run evaluations
5. Download notebook
6. Submit via \[course portal link]
## Resources
### Documentation
* [HuggingFace Transformers](https://huggingface.co/docs/transformers/)
* [PEFT (LoRA)](https://huggingface.co/docs/peft/)
* [LLaMA 2 Paper](https://arxiv.org/abs/2307.09288) (Touvron et al., 2023)
### Papers
* [Few-Shot Learning](https://arxiv.org/abs/2005.14165) (GPT-3, Brown et al., 2020)
* [Chain-of-Thought Prompting](https://arxiv.org/abs/2201.11903) (Wei et al., 2022)
* [LoRA](https://arxiv.org/abs/2106.09685) (Hu et al., 2021)
* [QLoRA](https://arxiv.org/abs/2305.14314) (Dettmers et al., 2023)
### Related Concepts
* In-context learning
* Prompt engineering
* Parameter-efficient fine-tuning (PEFT)
* Instruction tuning
## Next Steps
**🎉 Congratulations! You've completed the Generative AI Module!**
**Next: Lesson 16** - Reinforcement Learning from Human Feedback (RLHF)
* Align LLMs with human preferences
* Reward modeling
* PPO fine-tuning
* Constitutional AI
**Integration Module (Lessons 16-18):**
* Combine RL + GenAI
* Multi-modal AI (vision + language)
* Future of AI and ethics
## Assessment
This activity is graded on:
* **Prompt Engineering (30%)**: Zero-shot, few-shot, CoT, advanced techniques
* **Fine-Tuning (25%)**: LoRA setup, training, evaluation
* **Application (25%)**: Functional chatbot or code assistant
* **Evaluation (10%)**: Metrics computed correctly
* **Analysis (10%)**: Demonstrates deep understanding
**Passing Grade**: 70% or higher
Congratulations on mastering Large Language Models! 🚀🎓
You're now ready for the Integration Module, where you'll combine RL and GenAI through RLHF!