Practice and reinforce the concepts from Lesson 15
In this final GenAI activity, you'll work with state-of-the-art Large Language Models. You'll master prompt engineering techniques, fine-tune an LLM with LoRA, implement chain-of-thought reasoning, and build a practical LLM application using HuggingFace. This activity bridges to Lesson 16 (RLHF) and prepares you for the integration module.
By completing this activity, you will practice each of these techniques hands-on.

Download the activity template from the Templates folder and set up your notebook:

1. Download `AI25-Template-activity-15-large-language-models.zip` (`Templates/AI25-Template-activity-15-large-language-models.zip`)
2. Upload `activity-15-large-language-models.ipynb` to Google Colab
3. Execute the first few cells to set up the environment
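The TODOs below all assume you have a model, a tokenizer, and some helper for text generation. If your template's setup cells don't already provide one, here is a minimal sketch (assumptions: GPT-2 from HuggingFace as a lightweight stand-in model, and a hypothetical `generate` helper; swap in a larger model if your GPU allows):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: small stand-in model for experimentation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token by default

def generate(prompt, max_new_tokens=64, temperature=0.7):
    """Return only the newly generated text for a given prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        pad_token_id=tokenizer.eos_token_id,
    )
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```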
**TODO 1: Implement and evaluate zero-shot prompts**

```python
def zero_shot_prompt(model, tokenizer, task_description, input_text):
    """
    Zero-shot prompting: Ask directly without examples

    Args:
        model: LLM (GPT-2, LLaMA, etc.)
        tokenizer: Tokenizer
        task_description: What to do
        input_text: Input to process
    Returns:
        Generated response
    """
    # TODO 1: Construct zero-shot prompt
    # Template:
    # {task_description}
    #
    # Input: {input_text}
    # Output:
    # Your code here
    pass

# Test tasks
tasks = [
    ("Translate to French", "Hello, how are you?"),
    ("Classify sentiment (Positive/Negative)", "This movie was terrible!"),
    ("Summarize in one sentence", "Long article text..."),
    ("Answer the question", "What is the capital of France?"),
]

# TODO 1: Run zero-shot on all tasks, analyze success rate
```
**TODO 2: Implement few-shot prompting**

```python
def few_shot_prompt(model, tokenizer, examples, input_text):
    """
    Few-shot prompting: Provide examples in prompt

    Args:
        model: LLM
        tokenizer: Tokenizer
        examples: List of (input, output) tuples
        input_text: New input to process
    Returns:
        Generated response
    """
    # TODO 2: Construct few-shot prompt
    # Template:
    # Input: {example_1_input}
    # Output: {example_1_output}
    #
    # Input: {example_2_input}
    # Output: {example_2_output}
    #
    # ...
    #
    # Input: {input_text}
    # Output:
    # Your code here
    pass

# Example: Sentiment classification
examples = [
    ("I loved this restaurant!", "Positive"),
    ("The service was terrible.", "Negative"),
    ("It was okay, nothing special.", "Neutral"),
]

# TODO 2: Test few-shot vs zero-shot, compare accuracy
```
**TODO 3: Implement CoT for complex reasoning**

```python
def chain_of_thought_prompt(model, tokenizer, question, enable_cot=True):
    """
    Chain-of-thought prompting: Ask model to show reasoning steps

    Args:
        model: LLM
        tokenizer: Tokenizer
        question: Problem to solve
        enable_cot: Whether to use "Let's think step by step"
    Returns:
        Generated response with reasoning
    """
    # TODO 3: Construct CoT prompt
    # Without CoT:
    # Q: {question}
    # A:
    #
    # With CoT:
    # Q: {question}
    # A: Let's think step by step.
    # Your code here
    pass

# Test problems (math, logic, common sense)
problems = [
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now?",
    "If it takes 5 machines 5 minutes to make 5 widgets, how long does it take 100 machines to make 100 widgets?",
    "A farmer has 17 sheep, all but 9 die. How many are left?",
]

# TODO 3: Compare accuracy with and without CoT
```
**TODO 4: Implement role prompting, constraints, and format control**

```python
def role_prompt(model, tokenizer, role, task):
    """
    Role prompting: Assign persona to model

    Args:
        role: Persona (e.g., "expert Python developer", "creative writer")
        task: Task to perform
    Returns:
        Generated response
    """
    # TODO 4a: Construct role prompt
    # Template:
    # You are {role}. {task}
    # Your code here
    pass


def constrained_prompt(model, tokenizer, task, constraints):
    """
    Constrained prompting: Specify output format/length/style

    Args:
        task: What to do
        constraints: List of constraints,
            e.g., ["Respond in exactly 50 words", "Use simple language"]
    Returns:
        Generated response
    """
    # TODO 4b: Construct constrained prompt
    # Template:
    # {task}
    #
    # Constraints:
    # - {constraint_1}
    # - {constraint_2}
    # Your code here
    pass


def json_format_prompt(model, tokenizer, task, schema):
    """
    Format control: Request specific output format (JSON, etc.)

    Args:
        task: What to extract/generate
        schema: Expected JSON structure
    Returns:
        Generated JSON string
    """
    # TODO 4c: Construct format control prompt
    # Template:
    # {task}
    #
    # Respond in JSON format:
    # {schema}
    # Your code here
    pass
```
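For TODO 4c, remember that the model still returns plain text even when you ask for JSON, so parse defensively. A minimal sketch (the `parse_json_response` helper is illustrative, not part of the template):

```python
import json
import re

def parse_json_response(response_text):
    """Extract the first {...} block from a model response and parse it."""
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output")
    return json.loads(match.group(0))
```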
**TODO 5: Fine-tune LLaMA 2 with LoRA**

```python
from peft import LoraConfig, get_peft_model, TaskType


def setup_lora_model(base_model_name="meta-llama/Llama-2-7b-hf", lora_r=8):
    """
    Set up model for LoRA fine-tuning

    Args:
        base_model_name: Base LLM to fine-tune
        lora_r: LoRA rank (lower = fewer parameters)
    Returns:
        model: PEFT model with LoRA adapters
        tokenizer: Tokenizer
    """
    # TODO 5a: Load base model
    from transformers import AutoModelForCausalLM, AutoTokenizer
    # model = AutoModelForCausalLM.from_pretrained(base_model_name)
    # tokenizer = AutoTokenizer.from_pretrained(base_model_name)

    # TODO 5b: Configure LoRA
    # lora_config = LoraConfig(
    #     task_type=TaskType.CAUSAL_LM,
    #     r=lora_r,                             # Rank
    #     lora_alpha=32,                        # Scaling
    #     lora_dropout=0.1,
    #     target_modules=["q_proj", "v_proj"],  # Which attention layers
    # )

    # TODO 5c: Apply LoRA to model
    # model = get_peft_model(model, lora_config)
    # model.print_trainable_parameters()  # Should be ~0.1% of total
    # Your code here
    pass


def fine_tune_lora(model, tokenizer, train_dataset, output_dir="./lora-llama2"):
    """
    Fine-tune with LoRA on custom dataset

    Args:
        model: PEFT model
        tokenizer: Tokenizer
        train_dataset: HuggingFace Dataset
        output_dir: Where to save adapter weights
    Returns:
        Fine-tuned model
    """
    # TODO 5d: Set up training arguments
    from transformers import TrainingArguments, Trainer
    # training_args = TrainingArguments(
    #     output_dir=output_dir,
    #     num_train_epochs=3,
    #     per_device_train_batch_size=4,
    #     gradient_accumulation_steps=4,
    #     learning_rate=2e-4,
    #     fp16=True,  # Mixed precision
    #     save_steps=100,
    #     logging_steps=10,
    # )

    # TODO 5e: Train
    # trainer = Trainer(
    #     model=model,
    #     args=training_args,
    #     train_dataset=train_dataset,
    # )
    # trainer.train()
    # Your code here
    pass
```
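After training, only the small adapter needs to be saved; the base model stays untouched. A sketch of saving and reloading LoRA weights with `peft` (paths and model names follow the defaults above):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Save only the LoRA adapter weights (a few MB), not the full base model.
model.save_pretrained("./lora-llama2")

# Later: reload the base model and attach the trained adapter.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "./lora-llama2")
```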
**TODO 6: Implement self-consistency for improved accuracy**

```python
def self_consistency(model, tokenizer, question, num_samples=5):
    """
    Self-consistency: Generate multiple reasoning paths, take majority vote

    Args:
        model: LLM
        tokenizer: Tokenizer
        question: Problem to solve
        num_samples: Number of reasoning paths to generate
    Returns:
        Final answer (majority vote)
        All generated answers
    """
    # TODO 6: Implement self-consistency
    # Step 1: Generate num_samples different reasoning paths
    #         Use temperature sampling (T=0.7) for diversity
    #
    # Step 2: Extract final answer from each path
    #         Parse last line or use regex
    #
    # Step 3: Take majority vote
    #         Count occurrences of each answer
    #         Return most common
    # Your code here
    pass

# Test on math problems
math_problems = [
    "A store sells apples for $2 each and oranges for $3 each. If John buys 5 apples and 3 oranges, how much does he spend?",
    "If a train travels 60 miles in 1 hour, how far does it travel in 2.5 hours?",
]

# TODO 6: Compare single-path vs self-consistency accuracy
```
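The majority vote in Step 3 is only a few lines once you have extracted one answer string per reasoning path; a minimal sketch:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and the full tally (assumes answers are strings)."""
    counts = Counter(a.strip().lower() for a in answers)
    best_answer, _ = counts.most_common(1)[0]
    return best_answer, counts

# e.g. majority_vote(["11", "11", "12", "11", "11"]) -> ("11", Counter({"11": 4, "12": 1}))
```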
**TODO 7: Implement evaluation metrics**

```python
def compute_perplexity(model, tokenizer, text):
    """
    Compute perplexity on text
    Lower perplexity = model is less "surprised" by text

    Args:
        model: LLM
        tokenizer: Tokenizer
        text: Text to evaluate
    Returns:
        perplexity: Scalar value
    """
    # TODO 7a: Implement perplexity
    # Formula: exp(average negative log-likelihood)
    #
    # Step 1: Tokenize text
    # Step 2: Forward pass to get logits
    # Step 3: Compute cross-entropy loss
    # Step 4: perplexity = exp(loss)
    # Your code here
    pass


def evaluate_zero_shot_classification(model, tokenizer, test_data, task_description):
    """
    Evaluate zero-shot classification accuracy

    Args:
        model: LLM
        tokenizer: Tokenizer
        test_data: List of (input, label) tuples
        task_description: Task prompt
    Returns:
        accuracy: Fraction of correct predictions
        predictions: List of (input, true_label, predicted_label)
    """
    # TODO 7b: Implement zero-shot eval
    # For each test example:
    #   1. Generate prediction using zero-shot prompt
    #   2. Parse predicted label
    #   3. Compare to true label
    # Compute accuracy = correct / total
    # Your code here
    pass


def human_eval_comparison(model, tokenizer, prompts, baseline_responses):
    """
    Human evaluation: Compare model outputs to baseline

    Args:
        model: Fine-tuned LLM
        tokenizer: Tokenizer
        prompts: List of prompts
        baseline_responses: Responses from base model
    Returns:
        Comparison results for human review
    """
    # TODO 7c: Generate and format for human eval
    # For each prompt:
    #   1. Generate response from fine-tuned model
    #   2. Format: Prompt | Baseline | Fine-tuned
    #   3. Save to file for human reviewers
    # Your code here
    pass
```
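If you get stuck on TODO 7a, the whole computation fits in a few lines; one possible approach, sketched under the assumption of a HuggingFace causal LM and PyTorch:

```python
import torch

def perplexity_sketch(model, tokenizer, text):
    """Perplexity = exp(mean negative log-likelihood) over the tokens of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss directly.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(outputs.loss).item()
```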
**TODO 8: Build practical LLM application (choose one)**

**Option A: Code Assistant**

```python
class CodeAssistant:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def complete_code(self, partial_code):
        """Complete partial code snippet"""
        # TODO 8a: Implement code completion
        pass

    def explain_code(self, code):
        """Explain what code does"""
        # TODO 8a: Implement code explanation
        pass

    def fix_bugs(self, buggy_code, error_message):
        """Suggest bug fixes"""
        # TODO 8a: Implement bug fixing
        pass
```

**Option B: Conversational Chatbot**

```python
class Chatbot:
    def __init__(self, model, tokenizer, system_prompt):
        self.model = model
        self.tokenizer = tokenizer
        self.system_prompt = system_prompt
        self.conversation_history = []

    def chat(self, user_message):
        """
        Multi-turn conversation
        Maintains context across turns
        """
        # TODO 8b: Implement conversational chatbot
        # Step 1: Append user message to history
        # Step 2: Construct prompt with full conversation
        # Step 3: Generate response
        # Step 4: Append response to history
        # Step 5: Return response
        # Your code here
        pass

    def reset(self):
        """Clear conversation history"""
        self.conversation_history = []
```
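Whichever option you pick, exercise it interactively. A usage sketch for Option B (assumes your `chat` implementation is complete; the prompts are just examples):

```python
bot = Chatbot(model, tokenizer, system_prompt="You are a helpful teaching assistant.")
print(bot.chat("What does LoRA stand for?"))
print(bot.chat("And why is it parameter-efficient?"))  # second turn should use the history
bot.reset()  # start a fresh conversation
```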
## Expected Results

### Part 1: Zero-Shot Prompting

Task: Sentiment Classification
Input: "This movie was amazing!"
Zero-shot Output: "Positive"
Accuracy: 70-80% (varies by model)

✓ Works for simple tasks
⚠️ Struggles with complex/ambiguous cases
### Part 2: Few-Shot Prompting

Task: Sentiment Classification (3 examples)
Input: "The food was decent but service was slow."
Few-shot Output: "Neutral"
Accuracy: 85-92%

✓ 10-15% improvement over zero-shot
✓ Learns from examples without fine-tuning
### Part 3: Chain-of-Thought

Math Problem:
Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. How many total?

Without CoT:
A: 11

With CoT:
A: Let's think step by step.
Roger started with 5 balls.
He bought 2 cans, each with 3 balls.
2 × 3 = 6 new balls.
5 + 6 = 11 balls total.
The answer is 11.

✓ CoT improves accuracy from 65% to 92% on math problems
### Part 4: Advanced Prompting

Role Prompting (Code Assistant):
Prompt: "You are an expert Python developer. Write a function to reverse a string."
Output:

```python
def reverse_string(s: str) -> str:
    """
    Reverse a string efficiently.

    Args:
        s: Input string
    Returns:
        Reversed string
    """
    return s[::-1]
```

✓ Better code quality with role prompt
### Part 5: LoRA Fine-Tuning
**Model Statistics:**
Base LLaMA 2 (7B):
- ✓ Fine-tunes on consumer GPU
- ✓ 99.94% of weights frozen
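You can verify the frozen/trainable split yourself; a minimal sketch (assumes `model` is the PEFT model from TODO 5c):

```python
# Count trainable vs. total parameters of the PEFT model.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
# model.print_trainable_parameters() reports the same numbers.
```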
**Fine-Tuning Results (Python code generation):**
Before:
Prompt: "def fibonacci(n):"
Output: "I don't understand what you want."

After (LoRA):
Prompt: "def fibonacci(n):"
Output:

```python
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
```
✓ Adapts to domain (Python code)
### Part 6: Self-Consistency
**Math Problem (5 samples):**
Majority vote across the 5 generated answers: "11 balls" (4/5)
✓ Self-consistency accuracy: 96% (vs 92% single-path)
### Part 7: Evaluation
**Perplexity:**
- GPT-2 (base): 35.2
- Fine-tuned (LoRA): 18.7
✓ Lower perplexity = better fit to domain
**Zero-Shot Classification (sentiment, 1000 examples):**
- Accuracy: 84.3%
- Precision: 0.83
- Recall: 0.85
- F1: 0.84
✓ Strong performance without task-specific training
### Part 8: LLM Application
**Code Assistant Demo:**
User: "Complete this code: def factorial(n):" Assistant:
def factorial(n):
if n == 0 or n == 1:
return 1
return n * factorial(n-1)
User: "Explain this code" Assistant: "This is a recursive implementation of the factorial function..."
✓ Functional code assistant
## Success Criteria
Your implementation is complete when:
- [ ] Zero-shot prompting works for 4+ task types
- [ ] Few-shot improves accuracy by 10%+ over zero-shot
- [ ] Chain-of-thought improves reasoning task accuracy
- [ ] Advanced prompting techniques (role, constraints, format) work
- [ ] LoRA fine-tuning completes successfully (trains < 1% of params)
- [ ] Self-consistency improves accuracy over single-path
- [ ] Evaluation metrics computed correctly
- [ ] LLM application (chatbot or code assistant) functional
## Tips for Success
### Prompt Engineering Best Practices
**Clear Task Description:**
❌ "Translate this"
✅ "Translate the following English text to French:"

**Provide Context:**
❌ "Answer: Q: What is the capital?"
✅ "Context: France is a country in Europe.\nQ: What is the capital of France?"

**Specify Output Format:**
❌ "Classify the sentiment"
✅ "Classify the sentiment as one of: Positive, Negative, or Neutral"
### LoRA Configuration
**Rank (r) Selection:**
- r=4: Minimal parameters, may underfit
- r=8: Good balance (recommended)
- r=16: More expressive, slower
**Target Modules:**
- Attention layers: ["q_proj", "v_proj"] (standard)
- All linear layers: Adds capacity, more parameters
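For reference, the two choices as `LoraConfig` objects (a sketch; the module names assume a LLaMA-style architecture):

```python
from peft import LoraConfig, TaskType

# Standard: adapt only the query and value projections
lora_standard = LoraConfig(
    task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=32, lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)

# Higher capacity: adapt all attention and MLP projections (more trainable parameters)
lora_all_linear = LoraConfig(
    task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=32, lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```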
### Common Pitfalls
**1. Context Length Overflow:**
- Problem: Prompt + generation exceeds the model's maximum context length (4096 tokens for LLaMA 2)
- Solution: Truncate the prompt, or use RoPE scaling / ALiBi-based models for longer contexts
**2. Tokenizer Padding:**
- Problem: Improper padding causes errors
- Solution: Set `tokenizer.pad_token = tokenizer.eos_token`
**3. Fine-Tuning Catastrophic Forgetting:**
- Problem: Model forgets general knowledge
- Solution: Use lower learning rate (2e-5), mix general + domain data
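A quick sketch of the fixes for pitfalls 1 and 2 (assumes the tokenizer from your setup cells; the prompt is a placeholder):

```python
prompt = "Summarize the following article: ..."     # placeholder prompt
tokenizer.pad_token = tokenizer.eos_token           # pitfall 2: GPT-2/LLaMA define no pad token
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,                                # pitfall 1: never exceed the context window
    max_length=1024,                                # leave headroom for generated tokens
)
```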
## Extension Challenges
### Challenge 1: QLoRA (Medium)
Implement 4-bit quantized LoRA:
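A sketch of one way to set this up, loading the base model in 4-bit with `BitsAndBytesConfig` before attaching the LoRA adapters from TODO 5 (the exact quantization settings are an assumption):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # store weights in 4-bit (NF4)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in bfloat16
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
# Then apply get_peft_model(model, lora_config) exactly as in TODO 5c.
```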
**Benefit**: Fine-tune 7B model on 16GB GPU!
### Challenge 2: Retrieval-Augmented Generation (RAG) (Hard)
Combine LLM with document retrieval:
```python
class RAG:
    def __init__(self, llm, retriever):
        self.llm = llm
        self.retriever = retriever

    def answer_question(self, question):
        # Step 1: Retrieve relevant documents
        docs = self.retriever.search(question, top_k=3)
        # Step 2: Construct prompt with documents
        prompt = f"Context: {docs}\n\nQuestion: {question}\nAnswer:"
        # Step 3: Generate answer
        return self.llm.generate(prompt)
```
### Challenge 3: Multi-Turn Conversation with Memory (Medium)
Implement conversation with summarization:
```python
class ConversationWithMemory:
    def __init__(self, llm, max_history=10):
        self.llm = llm
        self.history = []
        self.max_history = max_history

    def chat(self, user_message):
        # If history too long, summarize the older half
        if len(self.history) > self.max_history:
            summary = self.llm.summarize(self.history[:self.max_history // 2])
            self.history = [summary] + self.history[self.max_history // 2:]
        # Continue conversation
        ...
```
### Challenge 4: Agent with Tool Use (Very Hard)
LLM that can use external tools (calculator, web search, code execution):
```python
class AgentWithTools:
    def __init__(self, llm, tools):
        self.llm = llm
        self.tools = tools  # {"calculator": calc_fn, "search": search_fn}

    def solve(self, task):
        # Parse task, decide which tools to use, execute, synthesize answer
        pass
```
## Submission Requirements
### What to Submit
1. **Completed Notebook**: `activity-15-large-language-models.ipynb`
* All TODOs completed
* All experiments run
2. **Prompt Engineering Results**:
* Zero-shot accuracy on 4 tasks
* Few-shot accuracy comparison
* Chain-of-thought examples with reasoning paths
* Advanced prompting examples (role, constraints, JSON)
3. **Fine-Tuning Results**:
* LoRA training logs
* Before/after comparison (5 prompts)
* Trainable parameter count
4. **Application Demo**:
* Code assistant or chatbot working examples
* 5+ user interactions
* Screenshots or outputs
5. **Analysis** (10-15 sentences):
* Which prompting technique worked best? Why?
* How did fine-tuning change model behavior?
* What are the limitations of your LLM application?
* How does self-consistency improve accuracy?
* When would you use LoRA vs full fine-tuning?
### Submission Steps
1. Complete all prompting experiments
2. Fine-tune with LoRA
3. Build and test LLM application
4. Run evaluations
5. Download notebook
6. Submit via \[course portal link]
## Resources
### Documentation
* [HuggingFace Transformers](https://huggingface.co/docs/transformers/)
* [PEFT (LoRA)](https://huggingface.co/docs/peft/)
* [LLaMA 2 Paper](https://arxiv.org/abs/2307.09288) (Touvron et al., 2023)
### Papers
* [Few-Shot Learning](https://arxiv.org/abs/2005.14165) (GPT-3, Brown et al., 2020)
* [Chain-of-Thought Prompting](https://arxiv.org/abs/2201.11903) (Wei et al., 2022)
* [LoRA](https://arxiv.org/abs/2106.09685) (Hu et al., 2021)
* [QLoRA](https://arxiv.org/abs/2305.14314) (Dettmers et al., 2023)
### Related Concepts
* In-context learning
* Prompt engineering
* Parameter-efficient fine-tuning (PEFT)
* Instruction tuning
## Next Steps
**🎉 Congratulations! You've completed the Generative AI Module!**
**Next: Lesson 16** - Reinforcement Learning from Human Feedback (RLHF)
* Align LLMs with human preferences
* Reward modeling
* PPO fine-tuning
* Constitutional AI
**Integration Module (Lessons 16-18):**
* Combine RL + GenAI
* Multi-modal AI (vision + language)
* Future of AI and ethics
## Assessment
This activity is graded on:
* **Prompt Engineering (30%)**: Zero-shot, few-shot, CoT, advanced techniques
* **Fine-Tuning (25%)**: LoRA setup, training, evaluation
* **Application (25%)**: Functional chatbot or code assistant
* **Evaluation (10%)**: Metrics computed correctly
* **Analysis (10%)**: Demonstrates deep understanding
**Passing Grade**: 70% or higher
Congratulations on mastering Large Language Models! 🚀🎓
You're now ready for the Integration Module, where you'll combine RL and GenAI through RLHF!