Lesson 7 of 9

Concept 7: Evaluation Models - Model Cards & Fine-Tuning

Evaluation Models: Model Cards & Fine-Tuning

:dart: Learning Objectives

By the end of this lesson, you'll be able to:

Understand different ways to check if AI models work well
Create and read model cards that explain how AI systems work
Explain how fine-tuning helps AI models learn new skills
Compare different ways to measure AI performance
Check if AI models are fair and unbiased
Design your own evaluation plans for AI projects

:bar_chart: Introduction to Model Evaluation

:information_source: Model evaluation is like giving AI a report card. It's the process of checking how well an AI system does its job. Just like teachers test students to see what they've learned, we test AI models to make sure they work correctly and safely.

:emoji: Why Evaluation Matters

Performance Assurance: Making sure AI does what it's supposed to do
Bias Detection: Finding unfair or mean outputs before they hurt anyone
Safety Assessment: Checking that AI behaves nicely and safely
Transparency: Explaining clearly what AI can and cannot do
Continuous Improvement: Helping AI get better over time

:mag: Types of Evaluation

:emoji: Automated Evaluation

Computers check how well AI models work using math and formulas.

Advantages:

Fast and consistent - Computers can check thousands of examples quickly
Objective and repeatable - Same test gives same results every time
Saves money - No need to pay human testers

Limitations:

Misses subtle problems - Can't understand context like humans
Only as good as the test - Bad tests give bad results
Can't judge creativity - Doesn't know if something feels right

:emoji: Human Evaluation

Real people check and rate AI outputs.

Advantages:

Understands meaning - Humans get jokes, sarcasm, and context
Spots weird problems - People notice when something feels off
Real-world testing - Shows how actual users will react

Limitations:

Takes time and money - People need to be paid for their work
Everyone's different - What one person likes, another might not
Hard to scale up - Can't test millions of examples easily

:emoji: Hybrid Evaluation

Using both computers AND humans to check AI - the best of both worlds!

:bulb: Best Practices for Hybrid Evaluation:

Let computers do the first check (fast and cheap)

Have humans look at the important stuff (quality control)

Mix different viewpoints together

Keep improving based on what you learn
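The steps above can be sketched as a simple triage: a computer scores every output first, and only the uncertain ones are sent to a human. This is a minimal sketch, not a standard tool. The word-overlap score, the two thresholds, and the sample sentences are all made up for illustration.

```python
def auto_score(output: str, reference: str) -> float:
    """Toy automated check: fraction of reference words found in the output."""
    ref_words = reference.lower().split()
    out_words = set(output.lower().split())
    if not ref_words:
        return 0.0
    return sum(word in out_words for word in ref_words) / len(ref_words)


def triage(samples, accept_at=0.8, reject_at=0.3):
    """Sort outputs into auto-accepted, auto-rejected, and needs-human-review."""
    buckets = {"accept": [], "reject": [], "human_review": []}
    for output, reference in samples:
        score = auto_score(output, reference)
        if score >= accept_at:
            buckets["accept"].append(output)
        elif score <= reject_at:
            buckets["reject"].append(output)
        else:
            buckets["human_review"].append(output)  # humans check the unclear cases
    return buckets


# Hypothetical (output, reference) pairs
samples = [
    ("the cat sat on the mat", "the cat sat on the mat"),
    ("a cat was sitting on a rug", "the cat sat on the mat"),
    ("stock prices fell today", "the cat sat on the mat"),
]
result = triage(samples)
print(result)
```

Only the middle sentence lands in `human_review`, so people spend their time where the automatic score is least sure.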

:emoji: Evaluation Metrics

:memo: Text Generation Metrics

BLEU (Bilingual Evaluation Understudy)

Checks how similar AI's text is to what humans would write.

When to use it:

Translating languages (like Spanish to English)
Making summaries shorter
Writing computer code

How it works:

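$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where:
- BP = Brevity penalty, which lowers the score when the AI's text is shorter than the reference
- p_n = Precision for word groups of size n: the share of the AI's word groups that also appear in the reference
- w_n = Importance (weight) of each word-group size, usually equal for every n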

:memo: Think of BLEU like comparing your homework to the answer key - the more it matches, the higher your score!
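To make the formula concrete, here is a simplified sentence-level BLEU in plain Python. It is a sketch under stated assumptions: one reference sentence, word groups up to size 2 (`max_n=2`, real BLEU usually uses 4), equal weights, and no smoothing. The function names are made up for this example. For real projects, use an established library instead.

```python
import math
from collections import Counter


def ngrams(words, n):
    """All word groups of size n, in order."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def clipped_precision(candidate, reference, n):
    """p_n: share of candidate n-grams that also appear in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matches / total if total else 0.0


def simple_bleu(candidate, reference, max_n=2):
    cand, ref = candidate.split(), reference.split()
    precisions = [clipped_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; real toolkits apply smoothing here
    weights = [1 / max_n] * max_n  # w_n: equal importance for each n
    log_avg = sum(w * math.log(p) for w, p in zip(weights, precisions))
    # BP: no penalty if the candidate is longer than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_avg)


score = simple_bleu("the cat is on the mat", "the cat sat on the mat")
print(round(score, 3))  # 0.707
```

Here 5 of 6 single words and 3 of 5 word pairs match, so the score is the square root of (5/6 × 3/5), about 0.707.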

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

Checks if AI includes all the important information.

Different types:

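ROUGE-N: Counts how many of the reference's word groups (of size N) also show up in the AI's text

ROUGE-L: Finds the longest sequence of words that appears in both texts in the same order (the words don't have to be next to each other)

ROUGE-W: Like ROUGE-L, but gives extra points when the matching words are right next to each other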
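Here is a small sketch of the recall side of ROUGE-N and ROUGE-L in plain Python. It assumes a single reference and simple whitespace splitting. The function names are made up for this example, and full ROUGE toolkits also report precision and F-scores.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Share of the reference's n-grams that also appear in the candidate."""
    cand, ref = candidate.split(), reference.split()
    cand_grams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_grams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(count, cand_grams[gram]) for gram, count in ref_grams.items())
    total = sum(ref_grams.values())
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l_recall(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    return lcs_length(cand, ref) / len(ref) if ref else 0.0


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 words
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 word pairs
print(rouge_l_recall(candidate, reference))       # LCS "the cat on the mat" = 5 of 6
```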

Perplexity

Shows how confused the AI is when predicting words.

The math behind it:

$$\text{Perplexity} = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 P(w_i)}$$

where:
- N = Total number of words
- P(w_i) = How likely the AI thought each word was

:bulb: Remember: Lower perplexity = Less confused AI = Better performance! Think of it like a test score where lower is better (like golf!)
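A tiny worked example helps here. The sketch below computes perplexity from a list of word probabilities, using base-2 logarithms. The probabilities are made up; a real model would supply them.

```python
import math


def perplexity(word_probs):
    """2 raised to the negative average log2-probability of the words."""
    n = len(word_probs)
    avg_log2 = sum(math.log2(p) for p in word_probs) / n
    return 2 ** (-avg_log2)


# Hypothetical probabilities a model gave to each correct next word
confident = perplexity([0.5, 0.5, 0.5, 0.5])
unsure = perplexity([0.25, 0.25, 0.25, 0.25])
print(confident, unsure)  # 2.0 4.0
```

A perplexity of 2 means the model was, on average, as unsure as if it were choosing between 2 equally likely words. A perplexity of 4 means it was choosing between 4.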

:emoji: Semantic Similarity Metrics

BERTScore

Uses AI to understand if two sentences mean the same thing.

Why it's cool:

Gets the meaning - Knows "happy" and "joyful" are similar
Flexible matching - Doesn't need exact same words
Thinks like humans - Matches how people judge similarity
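The matching idea can be sketched with toy numbers. Real BERTScore gets its word vectors from a BERT-style model; the 3-number vectors below are invented by hand purely to show the mechanics (match each word to its most similar partner, then combine precision and recall into F1).

```python
import math

# Hand-made toy vectors (NOT real BERT embeddings)
TOY_EMBEDDINGS = {
    "happy": (0.9, 0.1, 0.0),
    "joyful": (0.85, 0.2, 0.0),
    "sad": (-0.8, 0.1, 0.1),
    "dog": (0.0, 0.1, 0.95),
    "puppy": (0.05, 0.2, 0.9),
}


def cosine(u, v):
    """Cosine similarity: close to 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def greedy_f1(candidate, reference):
    cand = [TOY_EMBEDDINGS[w] for w in candidate.split()]
    ref = [TOY_EMBEDDINGS[w] for w in reference.split()]
    # Each word is matched with its most similar word on the other side
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    if precision + recall <= 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


similar = greedy_f1("happy puppy", "joyful dog")
different = greedy_f1("sad puppy", "joyful dog")
print(round(similar, 2), round(different, 2))
```

Even though "happy puppy" and "joyful dog" share no exact words, they score much higher than "sad puppy" and "joyful dog".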

Semantic Textual Similarity (STS)

Gives a score for how similar two sentences are in meaning.

Scoring system: 0 (totally different) to 5 (exactly the same meaning)

Real examples:

Score 5: "The cat is sleeping" / "A cat is asleep" (same meaning!)
Score 3: "The cat is sleeping" / "The dog is resting" (similar but different animals)
Score 0: "The cat is sleeping" / "Mathematics is difficult" (nothing in common)

:dart: Task-Specific Metrics

:emoji: Question Answering

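Exact Match: Did AI give the exact right answer?
F1 Score: How many words does the AI's answer share with the right answer? (balances missing words against extra words)
SQuAD: A well-known reading test (Stanford Question Answering Dataset) where AI reads a passage and answers questions about it, scored with Exact Match and F1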
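Exact Match and F1 are easy to compute by hand. The sketch below uses very simple text cleanup (lowercase and split on spaces); official scoring scripts typically also strip punctuation and small words like "the", so treat this as a simplified version.

```python
from collections import Counter


def normalize(text):
    """Simplified cleanup: lowercase and split into words."""
    return text.lower().strip().split()


def exact_match(prediction, truth):
    return normalize(prediction) == normalize(truth)


def token_f1(prediction, truth):
    pred, gold = normalize(prediction), normalize(truth)
    common = sum((Counter(pred) & Counter(gold)).values())  # shared words
    if common == 0:
        return 0.0
    precision = common / len(pred)  # how much of the prediction is right
    recall = common / len(gold)     # how much of the truth was covered
    return 2 * precision * recall / (precision + recall)


print(exact_match("paris", "Paris"))         # True
print(exact_match("in Paris France", "Paris"))  # False
print(token_f1("in Paris France", "Paris"))  # 0.5
```

The answer "in Paris France" fails Exact Match but still earns partial credit from F1, which is why the two metrics are usually reported together.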

:computer: Code Generation

Compilation Rate: Does the code compile (or load) without errors?
Functional Correctness: Does the code do what it's supposed to?
CodeBLEU: Special BLEU score just for computer code
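Functional correctness is usually checked by running the generated code against test cases. Here is a minimal sketch: the three "AI-written" functions and the test cases are made up, and a real setup would run untrusted code inside a sandbox.

```python
def passes_all_tests(candidate, test_cases):
    """True only if the candidate returns the expected value for every case."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False  # crashing counts as failing
    return True


# Three hypothetical AI-written attempts at "add two numbers"
def add_correct(a, b):
    return a + b


def add_wrong(a, b):
    return a - b


def add_crashes(a, b):
    raise ValueError("boom")


test_cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
candidates = [add_correct, add_wrong, add_crashes]
pass_rate = sum(passes_all_tests(c, test_cases) for c in candidates) / len(candidates)
print(f"{pass_rate:.0%} of candidates pass every test")
```

Notice that `add_wrong` runs without errors, so it would count toward a "runs without errors" score, but it still fails functional correctness.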

:clipboard: Model Cards

What are Model Cards?

:information_source: Model cards are like instruction manuals for AI models. They tell you everything you need to know about an AI system - what it does, how it works, and what to watch out for. Think of them as a nutrition label, but for AI!

:page_facing_up: Model Card Components

1. :emoji: Model Details

Name and version - What's it called and which version is it?
Type and structure - What kind of AI is it?
Birth certificate - When was it made and by whom?
Size info - How big is it (number of parameters, memory needed)?
What it's for - Its main jobs and purposes

2. :dart: Intended Use

Main jobs - What the AI was built to do
Who should use it - Teachers? Students? Everyone?
Don't use it for - Things it shouldn't do
Boundaries - Where it works well and where it doesn't

3. :books: Training Data

Where data came from - Books? Websites? Conversations?
How much data - Gigabytes, number of examples
Data preparation - How it was cleaned and organized
Known problems - Biases or gaps in the data
Coverage - Which countries, languages, or time periods

4. :bar_chart: Performance Metrics

How we tested it - The evaluation methods used
Report card scores - Key performance numbers
Fairness check - How it performs for different groups
Confidence levels - How sure we are about the results

5. :emoji:️ Ethical Considerations

Bias check - Unfair patterns we found
Fairness tests - Making sure it treats everyone equally
Planet impact - Energy use and carbon footprint
Privacy protection - How it handles personal info
Misuse warnings - Ways people might use it badly

6. :warning: Limitations and Recommendations

Known problems - Where it might fail
Tricky situations - Edge cases to watch for
Monitoring tips - How to keep an eye on it
Update plans - When it needs refreshing

:pencil2: Creating Effective Model Cards

Best Practices

:bulb: Three Rules for Great Model Cards:

Be Honest and Complete

Tell the whole truth about your AI

Don't hide problems or limitations

Explain what your numbers mean

Write for Everyone

Use simple words when you can

Define technical terms clearly

Add pictures and diagrams to help

Keep it Fresh

Update when the model changes

Add new test results

Fix any mistakes you find

:memo: Example Model Card

```markdown
# GPT-3.5 Turbo Model Card

## Model Details
- **Model Name**: GPT-3.5 Turbo
- **Model Type**: Large Language Model (a really smart chatbot)
- **Architecture**: Transformer (the AI's brain structure)
- **Parameters**: Not published by OpenAI (the earlier GPT-3 had 175 billion, that's a LOT of connections!)
- **Release Date**: March 2023

## Intended Use
- **Primary Use**: Having conversations and answering questions
- **Intended Users**: Developers, teachers, students
- **Out of Scope**: DON'T use for medical or legal advice!

## Training Data
- **Sources**: Internet articles, books, research papers
- **Size**: Not published by OpenAI (the earlier GPT-3 used roughly 570GB of filtered web text, plus books and Wikipedia)
- **Cutoff Date**: September 2021 (doesn't know newer stuff)
- **Languages**: Mostly English, but works in many other languages

## Performance
- **MMLU**: 70.0% (general knowledge test)
- **HumanEval**: 48.1% (coding test)
- **HellaSwag**: 85.5% (common sense test)

## Limitations
- Can't know about events after September 2021
- Sometimes makes up facts (called "hallucination")
- Might say unfair or mean things by accident
```
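One handy habit is to check a model card against the six components listed earlier. The sketch below looks at the `##` headings of a card and reports which components seem to be missing. The keyword list and the function name are assumptions made for this example, not part of any official tool.

```python
# One keyword per component; a heading "counts" if it contains the keyword
REQUIRED_SECTIONS = [
    "model details",
    "intended use",
    "training data",
    "performance",
    "ethical",
    "limitations",
]


def missing_sections(card_markdown):
    """Return the required keywords that no '## ' heading contains."""
    headings = [
        line.lstrip("#").strip().lower()
        for line in card_markdown.splitlines()
        if line.startswith("## ")
    ]
    return [s for s in REQUIRED_SECTIONS if not any(s in h for h in headings)]


# Headings only, mirroring the example card's structure
card = """# Example Model Card
## Model Details
## Intended Use
## Training Data
## Performance
## Limitations
"""
print(missing_sections(card))  # ['ethical']
```

A card with these five headings has no Ethical Considerations section, so that is a good thing to add before sharing it.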

:art: Fine-Tuning

What is Fine-Tuning?

:information_source: Fine-tuning is like teaching an already-smart AI a new skill. Imagine you know how to play piano, and now you want to learn guitar - you already understand music, so learning guitar is easier! Fine-tuning takes an AI that already knows a lot and teaches it something specific.

:wrench: Types of Fine-Tuning

:emoji:️ Full Fine-Tuning

Teaching the AI by updating all of its internal numbers (called weights or parameters).

The good stuff:

Maximum learning - The AI can learn completely new behaviors
Big changes possible - Can transform how the AI thinks

The challenges:

Super expensive - Needs powerful computers
Forgetfulness risk - Might forget old skills while learning new ones
Needs lots of examples - Like needing thousands of practice problems

:bulb: Parameter-Efficient Fine-Tuning (PEFT)

Low-Rank Adaptation (LoRA)

Adding a small "learning boost" to the AI without changing its core.

How it works:

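$$W = W_0 \quad \text{(original AI brain, kept frozen)}$$

$$W = W_0 + \Delta W \quad \text{(with LoRA: original plus a tiny boost)}$$

$$\Delta W = A \times B \quad \text{(two small helper matrices)}$$

If W_0 has d rows and k columns, then A is d × r and B is r × k, where r is a small number. Only A and B are trained.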

:memo: Think of LoRA like adding a small app to your phone instead of replacing the whole operating system!

Why it's awesome:

Much faster - Only the small helper matrices are trained, so training usually takes far less time

Saves memory - Far fewer numbers to update means less computer memory is needed

Easy switching - Can swap skills in and out like game cartridges
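Here is a sketch of the LoRA idea with tiny matrices in plain Python, followed by a count of how many numbers need training. It assumes the frozen weights have d rows and k columns, with helper matrices A (d × r) and B (r × k). The matrix values and the sizes passed to `lora_param_counts` are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(a, b):
    """Add two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Frozen original weights W0 (4 x 4); these are never changed
W0 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Trainable helpers with rank r = 1: A is 4 x 1, B is 1 x 4
A = [[1.0], [0.0], [2.0], [0.0]]
B = [[0.5, 0.0, 0.0, 1.0]]

delta = matmul(A, B)  # the "tiny boost", a full 4 x 4 matrix built from 8 numbers
W = add(W0, delta)    # what the model actually uses


def lora_param_counts(d, k, r):
    """Numbers to train: full fine-tuning vs. LoRA for one weight matrix."""
    return d * k, r * (d + k)


full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(full, lora, full // lora)  # 16777216 65536 256
```

For one 4096 × 4096 matrix with r = 8, LoRA trains 256 times fewer numbers. Swapping skills means swapping A and B while W0 stays untouched.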

:emoji: Adapters

Adding small "skill modules" to the AI's brain.

How they connect:

```text
Input -> AI's Original Brain -> Skill Module -> Output
```

Cool features:

Tiny additions - Like adding LEGO blocks

Specific skills - Each adapter does one job well

Plug and play - Add or remove skills easily

:books: Fine-Tuning Process

1. :bar_chart: Data Preparation

What you need:

Quality examples - Like having good study materials

Same format - Keep everything organized the same way

Different types - Mix of easy and hard examples

Fair balance - Equal amounts of each type

How to format your data (JSON does not allow comments, so the labels sit outside the examples):

Teaching AI to follow instructions:

```json
{
  "instruction": "Make this long text shorter",
  "input": "Once upon a time in a land far away...",
  "output": "A story about a distant land..."
}
```

Teaching AI to chat:

```json
{
  "messages": [
    {"role": "user", "content": "What's the weather like?"},
    {"role": "assistant", "content": "I'd be happy to help! Where are you located?"}
  ]
}
```

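Before training, it helps to check that every example really follows one of the two formats above. This is a minimal sketch: the function name, the allowed role names (including `system`, which is an assumption), and the sample lines are made up, and each line is assumed to hold one JSON object.

```python
import json

ALLOWED_ROLES = {"user", "assistant", "system"}  # "system" is an assumption


def validate_record(record):
    """True if the record matches the instruction format or the chat format."""
    if "messages" in record:
        msgs = record["messages"]
        return isinstance(msgs, list) and bool(msgs) and all(
            isinstance(m, dict)
            and m.get("role") in ALLOWED_ROLES
            and isinstance(m.get("content"), str)
            and bool(m["content"].strip())
            for m in msgs
        )
    fields_ok = all(
        isinstance(record.get(key), str) for key in ("instruction", "input", "output")
    )
    return fields_ok and bool(record["instruction"].strip()) and bool(record["output"].strip())


# Hypothetical training file, one JSON object per line
lines = [
    '{"instruction": "Make this long text shorter", "input": "Once upon a time...", "output": "A story..."}',
    '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}',
    '{"instruction": "Translate", "input": "Hola"}',
    '{"messages": [{"role": "robot", "content": "beep"}]}',
]
results = [validate_record(json.loads(line)) for line in lines]
print(results)  # [True, True, False, False]
```

The third line is missing its `output` and the fourth uses an unknown role, so both are flagged before they can confuse the training run.
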
2. :emoji:️ Training Settings

Important numbers to set:

Learning Rate: How fast AI learns (like study speed: 0.00005)

Batch Size: How many examples at once (like flashcards: 16)

Epochs: How many times to review (like reading a book: 3 times)

Warmup Steps: Starting slowly (like stretching before exercise)

Example settings file:

```yaml
learning_rate: 0.00005      # Slow and steady
batch_size: 16              # 16 examples at a time
num_epochs: 3               # Review everything 3 times
warmup_ratio: 0.1           # Start slow for 10%
weight_decay: 0.01          # Prevent memorizing too much
```

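To see what these settings mean in practice, the sketch below works out how many training steps they produce and what the learning rate looks like during warmup. The dataset size of 1,600 examples is made up, and the straight-line decay after warmup is an assumption (one common choice; other schedules exist).

```python
import math

num_examples = 1600  # hypothetical dataset size
batch_size = 16
num_epochs = 3

# One step = one batch, so steps per epoch = examples / batch size (rounded up)
total_steps = math.ceil(num_examples / batch_size) * num_epochs


def learning_rate_at(step, total_steps, peak_lr=0.00005, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then linear decay toward zero (assumed schedule)."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)


print(total_steps)  # 300
for step in (0, 14, 29, 165, 299):
    print(step, learning_rate_at(step, total_steps))
```

With 300 steps and `warmup_ratio: 0.1`, the first 30 steps ramp up gently, the peak of 0.00005 is reached at step 29, and the rate then shrinks for the rest of training.
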
3. :chart_with_upwards_trend: Checking Progress

What to watch:

Training score - Is it getting better?

Validation score - Does it work on new examples it didn't train on?

Special tests - How well does it do the specific job?

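Overfitting signs - Is it just memorizing?

:bulb: Early Stopping - Know When to Quit:

```python
# Early stopping: quit when the validation loss stops improving.
# val_losses holds made-up validation losses, one per epoch (lower is better).
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.64]
patience = 2  # how many epochs to wait for an improvement
best_loss = float("inf")
epochs_without_improvement = 0

for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:
        best_loss = loss
        epochs_without_improvement = 0
        print(f"epoch {epoch}: new best ({loss}), saving this version")
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= patience:
        print(f"epoch {epoch}: no improvement for {patience} epochs, stopping")
        break  # Don't waste time!
```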

:rocket: Fine-Tuning Applications

:emoji: Domain Adaptation

Teaching AI to speak different "languages" for different fields.

Real-world examples:

Medical AI - Understanding doctor notes and medical terms
Legal AI - Reading contracts and legal documents
Science AI - Reviewing research papers
Tech AI - Writing code documentation

:dart: Task Specialization

Training AI to be really good at one specific job.

Cool applications:

Feeling detector - Understanding if text is happy, sad, or angry
Name finder - Spotting names of people, places, and companies
Code writer - Creating computer programs
Story maker - Writing creative tales and poems

:art: Style and Tone Adaptation

Teaching AI different ways to talk.

Communication styles:

Business formal - "Dear Sir/Madam, I hope this finds you well..."
Friendly chat - "Hey! What's up? How can I help?"
Teacher mode - "Let me explain this step by step..."
Sales pitch - "This amazing product will change your life!"

:white_check_mark: Evaluation Best Practices

:trophy: Building a Complete Testing System

1. :emoji: Use Multiple Tests

Don't rely on just one way to check:

Computer tests - Fast and automatic
Human judges - Real people's opinions
A/B testing - Which version do users prefer?
Real use - How does it work in the wild?

2. :emoji: Test Everything

Normal cases - Regular, expected uses
Weird cases - Unusual or tricky situations
Edge cases - The absolute limits
Trick questions - Trying to fool the AI

3. :emoji:️ Checking for Fairness

:memo: Three Ways to Measure Fairness:

Equal Results - Everyone gets similar outcomes

Equal Accuracy - Works equally well for all groups

Individual Fairness - Treats similar people similarly
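The first two checks above can be measured with simple counting. In this sketch, "Equal Results" is measured as the share of positive predictions per group and "Equal Accuracy" as accuracy per group. The groups, predictions, and labels are made-up data.

```python
from collections import defaultdict


def group_report(records):
    """records: (group, prediction, label) triples with 0/1 predictions and labels."""
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "positive": 0})
    for group, prediction, label in records:
        s = stats[group]
        s["n"] += 1
        s["correct"] += prediction == label
        s["positive"] += prediction == 1
    return {
        group: {
            "accuracy": s["correct"] / s["n"],
            "positive_rate": s["positive"] / s["n"],
        }
        for group, s in stats.items()
    }


# Hypothetical results for two groups of users
records = [
    ("A", 1, 1), ("A", 1, 0), ("A", 0, 0), ("A", 1, 1),
    ("B", 0, 1), ("B", 0, 0), ("B", 1, 1), ("B", 0, 1),
]
report = group_report(records)
accuracy_gap = abs(report["A"]["accuracy"] - report["B"]["accuracy"])
positive_gap = abs(report["A"]["positive_rate"] - report["B"]["positive_rate"])
print(report)
print(accuracy_gap, positive_gap)  # 0.25 0.5
```

Group A gets a positive result 75% of the time and group B only 25%, and accuracy differs too. Gaps like these are a signal to investigate, not proof of the cause.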

:mag: Continuous Monitoring

:bar_chart: Keeping Track of Performance

Watch the numbers - Check scores every day or week
Set up alarms - Get alerts when something goes wrong
Regular checkups - Monthly or quarterly reviews
Listen to users - What do real people think?

:emoji: Spotting Changes (Drift Detection)

Data changes - Is the input different than before?
Performance drops - Is the AI getting worse?
Meaning shifts - Do words mean different things now?
AI aging - Like milk, AI can go bad over time!
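A very simple "alarm" for performance drops compares recent scores with a baseline. This is a sketch only: the scores and the tolerance of 0.05 are made up, and real monitoring systems often use more careful statistical tests.

```python
from statistics import mean


def detect_drift(baseline_scores, recent_scores, tolerance=0.05):
    """Flag drift when the average score drops by more than the tolerance."""
    drop = mean(baseline_scores) - mean(recent_scores)
    return drop > tolerance, drop


baseline = [0.90, 0.92, 0.91, 0.89]   # hypothetical scores at launch
this_week = [0.84, 0.82, 0.85, 0.83]  # hypothetical recent scores

drifted, drop = detect_drift(baseline, this_week)
if drifted:
    print(f"ALERT: average score dropped by {drop:.3f}")
```

Running this check every day or week on fresh scores is one way to combine "watch the numbers" with "set up alarms".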

:hammer_and_wrench: Tools and Platforms

:emoji: Evaluation Tools

Hugging Face Evaluate: Huge collection of testing tools
MLflow: Keeps track of all your experiments
Weights & Biases: Beautiful charts and graphs
TensorBoard: See what's happening inside AI

:emoji: Fine-Tuning Platforms

Hugging Face Transformers: Free tools anyone can use
OpenAI Fine-Tuning API: Pay-to-use professional service
Google Vertex AI: Google's cloud AI training
Amazon SageMaker: Amazon's AI workshop

:clipboard: Model Card Tools

Model Cards Toolkit: Google's free template maker
Hugging Face Model Cards: Built-in documentation system
Papers with Code: Where researchers share their work
Custom templates: Make your own special formats

:construction: Common Challenges

:bar_chart: Evaluation Challenges

Picking the Right Tests

Too many choices - Which metric should you use?
Competing goals - What if speed and accuracy conflict?
Test limitations - No test is perfect
Understanding results - What do the numbers really mean?

Data Problems

Good test data - Is your test set fair and complete?
Cheating prevention - Make sure training data isn't in tests
Missing answers - What if you don't know the right answer?
Human mistakes - Even people label things wrong
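The "cheating prevention" check can start with something as simple as looking for test questions that also appear in the training data. This sketch only catches copies that match after lowercasing and tidying spaces; reworded copies need fuzzier matching. The example questions are made up.

```python
def normalize(text):
    """Lowercase and collapse repeated spaces so trivial differences don't hide a match."""
    return " ".join(text.lower().split())


def find_leaks(train_texts, test_texts):
    """Return test items that also appear in the training data."""
    seen = {normalize(t) for t in train_texts}
    return [t for t in test_texts if normalize(t) in seen]


train = ["What is 2 + 2?", "Name the capital of France."]
test = ["what is  2 + 2?", "Name the capital of Spain."]

leaks = find_leaks(train, test)
print(leaks)  # ['what is  2 + 2?']
```

Any leaked item should be removed from the test set, because the AI may have simply memorized its answer.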

:art: Fine-Tuning Challenges

:books: Overfitting (Memorizing Instead of Learning)

Signs: Amazing on practice tests (training data), terrible on real tests (new data)
Fixes:
- Add rules to prevent memorizing (called regularization, like weight decay or dropout)
- Stop training earlier
- Get more diverse examples

:emoji: Catastrophic Forgetting

Problem: AI forgets skills it had before while learning the new one!
Solutions:
- Learn slowly (tiny learning rate)
- Keep some old knowledge frozen
- Mix old and new examples
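The "mix old and new examples" fix can be sketched as building one training set that is mostly new examples plus a slice of old ones. The 20% share of old examples, the function name, and the placeholder data are assumptions for illustration; the right mix depends on the project.

```python
import random


def mix_examples(new_examples, old_examples, old_fraction=0.2, seed=0):
    """Blend old examples into the new data so old skills keep getting practiced."""
    rng = random.Random(seed)  # fixed seed so the mix is repeatable
    # How many old examples are needed so they make up old_fraction of the mix
    n_old = round(len(new_examples) * old_fraction / (1 - old_fraction))
    replay = rng.sample(old_examples, min(n_old, len(old_examples)))
    mixed = new_examples + replay
    rng.shuffle(mixed)
    return mixed


new_examples = [f"new-{i}" for i in range(8)]   # placeholder new-skill examples
old_examples = [f"old-{i}" for i in range(50)]  # placeholder earlier examples

mixed = mix_examples(new_examples, old_examples)
print(len(mixed), sum(x.startswith("old") for x in mixed))  # 10 2
```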

:bar_chart: Data Needs

:bulb: Remember the Four Rules of Training Data:

Quality beats quantity - 100 great examples > 1000 bad ones

Mix it up - Different types keep AI flexible

Stay organized - Same format for everything

Check regularly - Make sure data stays good

:books: Real-World Case Studies

:emoji: Case Study 1: Medical Chatbot Evaluation

The Challenge: Make sure a medical AI gives safe, accurate advice

How They Tested It:

Medical accuracy - Do doctors agree with the answers?
Safety checks - Will it ever suggest dangerous things?
Fairness testing - Does it work for everyone equally?
User happiness - Do patients like using it?

What They Measured:

:white_check_mark: 95% accurate medical facts
:white_check_mark: Zero harmful advice given
:white_check_mark: 4.5/5 user satisfaction
:white_check_mark: Appropriate for all ages

:computer: Case Study 2: Code Generation Fine-Tuning

The Challenge: Help AI write better Python code

The Process:

Gathered examples - 10,000 excellent Python programs
Made tests - Created 500 coding challenges
Fine-tuned AI - Trained for 3 days on Python
Checked results - Tested if code actually works

Amazing Results:

:chart_with_upwards_trend: 40% more code that runs without errors
:chart_with_upwards_trend: 25% more tests passed successfully
:chart_with_upwards_trend: Much cleaner, readable code

:emoji: Case Study 3: Multilingual Model Assessment

The Challenge: Make AI work well in 20 different languages

Testing Plan:

Language tests - Check each language separately
Translation check - Can it move between languages?
Culture test - Does it respect different cultures?
Resource check - Which languages have enough data?

Key Discoveries:

:mag: English works best (most training data)
:mag: Some languages need more examples
:mag: Cultural context really matters!

:emoji: Key Terms Glossary

Model Card: AI's instruction manual and report card combined
Fine-Tuning: Teaching an old AI new tricks
BLEU Score: How closely AI matches human writing (0-100)
Perplexity: How confused AI is (lower = better)
LoRA: A clever way to add skills without changing everything
Catastrophic Forgetting: When AI forgets its old skills while learning new ones
Bias Assessment: Checking if AI is being unfair to anyone
Drift Detection: Watching for AI getting worse over time

:dart: Summary

Let's recap what we learned about evaluating and improving AI models:

:mag: Evaluation is Essential

We need to test AI like teachers test students
Use both computer tests AND human judgment
Check for fairness and safety, not just accuracy

:clipboard: Model Cards Matter

They're like nutrition labels for AI
Tell us what AI can and can't do
Help everyone use AI responsibly

:art: Fine-Tuning is Powerful

Teaches general AI specific skills
LoRA makes it fast and efficient
Watch out for overfitting and forgetting

:white_check_mark: Best Practices

Test everything multiple ways
Keep monitoring after deployment
Document everything clearly
Always check for bias and fairness

:rocket: What's Next?

In our next lesson, we'll put all this knowledge into practice! We'll design an ethical chatbot from start to finish, using:

Evaluation strategies to ensure quality
Model cards for transparency
Fine-tuning for customization
Responsible AI principles throughout

Get ready to build something amazing! :emoji::sparkles:

:bulb: Practice Activities

Activity 1: Create Your Own Model Card

Pick your favorite AI tool (like ChatGPT or Claude) and create a simple model card for it. Include:

What it does well
What it shouldn't be used for
Any limitations you've noticed

Activity 2: Design an Evaluation Plan

Imagine you're building an AI tutor for math. List:

3 automated tests you'd use
2 human evaluation methods
1 fairness check

Activity 3: Fine-Tuning Scenarios

Think of 3 situations where you'd want to fine-tune an AI model. For each, explain:

What specific skill you'd teach it
What kind of data you'd need
How you'd know if it worked

Discussion Questions

Why is it important to test AI in multiple ways?
How do model cards help build trust in AI systems?
What's the difference between an AI memorizing examples and actually learning?
How can we make sure AI is fair to everyone who uses it?

Lesson 7 of 9

Concept 7: Evaluation Models - Model Cards & Fine-Tuning

Evaluation Models: Model Cards & Fine-Tuning

:dart: Learning Objectives

By the end of this lesson, you'll be able to:

Understand different ways to check if AI models work well
Create and read model cards that explain how AI systems work
Explain how fine-tuning helps AI models learn new skills
Compare different ways to measure AI performance
Check if AI models are fair and unbiased
Design your own evaluation plans for AI projects

:bar_chart: Introduction to Model Evaluation

:information_source: Model evaluation is like giving AI a report card. It's the process of checking how well an AI system does its job. Just like teachers test students to see what they've learned, we test AI models to make sure they work correctly and safely.

:emoji: Why Evaluation Matters

Performance Assurance: Making sure AI does what it's supposed to do
Bias Detection: Finding unfair or mean outputs before they hurt anyone
Safety Assessment: Checking that AI behaves nicely and safely
Transparency: Explaining clearly what AI can and cannot do
Continuous Improvement: Helping AI get better over time

:mag: Types of Evaluation

:emoji: Automated Evaluation

Computers check how well AI models work using math and formulas.

Advantages:

Fast and consistent - Computers can check thousands of examples quickly
Fair and repeatable - Same test gives same results every time
Saves money - No need to pay human testers

Limitations:

Misses subtle problems - Can't understand context like humans
Only as good as the test - Bad tests give bad results
Can't judge creativity - Doesn't know if something feels right

:emoji: Human Evaluation

Real people check and rate AI outputs.

Advantages:

Understands meaning - Humans get jokes, sarcasm, and context
Spots weird problems - People notice when something feels off
Real-world testing - Shows how actual users will react

Limitations:

Takes time and money - People need to be paid for their work
Everyone's different - What one person likes, another might not
Hard to scale up - Can't test millions of examples easily

:emoji: Hybrid Evaluation

Using both computers AND humans to check AI - the best of both worlds!

:bulb: Best Practices for Hybrid Evaluation:

Let computers do the first check (fast and cheap)

Have humans look at the important stuff (quality control)

Mix different viewpoints together

Keep improving based on what you learn

:emoji: Evaluation Metrics

:memo: Text Generation Metrics

BLEU (Bilingual Evaluation Understudy)

Checks how similar AI's text is to what humans would write.

When to use it:

Translating languages (like Spanish to English)
Making summaries shorter
Writing computer code

How it works:

ini

BLEU = BP exp( (wn log(pn)))
where:
- BP = Penalty for being too short
- pn = How many words match
- wn = Importance of different word groups

:memo: Think of BLEU like comparing your homework to the answer key - the more it matches, the higher your score!

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

Checks if AI includes all the important information.

Different types:

ROUGE-N: Counts matching word groups

ROUGE-L: Finds the longest matching sequence

ROUGE-W: Gives extra points for keeping words in order

Perplexity

Shows how confused the AI is when predicting words.

The math behind it:
ini
Perplexity = 2^(-1/N  log(P(wi)))
where:
- N = Total number of words
- P(wi) = How likely each word is
tip Remember: Lower perplexity = Less confused AI = Better performance! Think of it like a test score where lower is better (like golf!)

:emoji: Semantic Similarity Metrics

BERTScore

Uses AI to understand if two sentences mean the same thing.

Why it's cool:

Gets the meaning - Knows "happy" and "joyful" are similar
Flexible matching - Doesn't need exact same words
Thinks like humans - Matches how people judge similarity

Semantic Textual Similarity (STS)

Gives a score for how similar two sentences are in meaning.

Scoring system: 0 (totally different) to 5 (exactly the same meaning)

Real examples:

Score 5: "The cat is sleeping" / "A cat is asleep" (same meaning!)
Score 3: "The cat is sleeping" / "The dog is resting" (similar but different animals)
Score 0: "The cat is sleeping" / "Mathematics is difficult" (nothing in common)

:dart: Task-Specific Metrics

:emoji: Question Answering

Exact Match: Did AI give the exact right answer?
F1 Score: How much of the answer is correct?
SQUAD Score: Special test for reading stories and answering questions

:computer: Code Generation

Compilation Rate: Does the code actually run without errors?
Functional Correctness: Does the code do what it's supposed to?
CodeBLEU: Special BLEU score just for computer code

:clipboard: Model Cards

What are Model Cards?

:information_source: Model cards are like instruction manuals for AI models. They tell you everything you need to know about an AI system - what it does, how it works, and what to watch out for. Think of them as a nutrition label, but for AI!

:page_facing_up: Model Card Components

One. :emoji:️ Model Details

Name and version - What's it called and which version is it?
Type and structure - What kind of AI is it?
Birth certificate - When was it made and by whom?
Size info - How big is it (like computer memory)?
What it's for - Its main jobs and purposes

2. :dart: Intended Use

Main jobs - What the AI was built to do
Who should use it - Teachers? Students? Everyone?
Don't use it for - Things it shouldn't do
Boundaries - Where it works well and where it doesn't

3. :books: Training Data

Where data came from - Books? Websites? Conversations?
How much data - Gigabytes, number of examples
Data preparation - How it was cleaned and organized
Known problems - Biases or gaps in the data
Coverage - Which countries, languages, or time periods

4. :bar_chart: Performance Metrics

How we tested it - The evaluation methods used
Report card scores - Key performance numbers
Fairness check - How it performs for different groups
Confidence levels - How sure we are about the results

5. :emoji:️ Ethical Considerations

Bias check - Unfair patterns we found
Fairness tests - Making sure it treats everyone equally
Planet impact - Energy use and carbon footprint
Privacy protection - How it handles personal info
Misuse warnings - Ways people might use it badly

6. :warning: Limitations and Recommendations

Known problems - Where it might fail
Tricky situations - Edge cases to watch for
Monitoring tips - How to keep an eye on it
Update plans - When it needs refreshing

:pencil2: Creating Effective Model Cards

Best Practices

:bulb: Three Rules for Great Model Cards:

Be Honest and Complete

Tell the whole truth about your AI

Don't hide problems or limitations

Explain what your numbers mean

Write for Everyone

Use simple words when you can

Define technical terms clearly

Add pictures and diagrams to help

Keep it Fresh

Update when the model changes

Add new test results

Fix any mistakes you find

:memo: Example Model Card

markdown

# GPT-3.5 Turbo Model Card

## Model Details
- **Model Name**: GPT-3.5 Turbo
- **Model Type**: Large Language Model (a really smart chatbot)
- **Architecture**: Transformer (the AI's brain structure)
- **Parameters**: ~175 billion (that's a LOT of connections!)
- **Release Date**: March 2023

## Intended Use
- **Primary Use**: Having conversations and answering questions
- **Intended Users**: Developers, teachers, students
- **Out of Scope**: DON'T use for medical or legal advice!

## Training Data
- **Sources**: Internet articles, books, research papers
- **Size**: ~570GB of text (like a huge library!)
- **Cutoff Date**: September 2021 (doesn't know newer stuff)
- **Languages**: Mostly English, but knows 95+ languages

## Performance
- **MMLU**: 70.0% (general knowledge test)
- **HumanEval**: 48.1% (coding test)
- **HellaSwag**: 85.5% (common sense test)

## Limitations
- Can't know about events after September 2021
- Sometimes makes up facts (called "hallucination")
- Might say unfair or mean things by accident

:art: Fine-Tuning

What is Fine-Tuning?

:information_source: Fine-tuning is like teaching an already-smart AI a new skill. Imagine you know how to play piano, and now you want to learn guitar - you already understand music, so learning guitar is easier! Fine-tuning takes an AI that already knows a lot and teaches it something specific.

:wrench: Types of Fine-Tuning

:emoji:️ Full Fine-Tuning

Teaching the AI by changing everything it knows.

The good stuff:

Maximum learning - The AI can learn completely new behaviors
Big changes possible - Can transform how the AI thinks

The challenges:

Super expensive - Needs powerful computers
Forgetfulness risk - Might forget old skills while learning new ones
Needs lots of examples - Like needing thousands of practice problems

:bulb: Parameter-Efficient Fine-Tuning (PEFT)

Low-Rank Adaptation (LoRA)

Adding a small "learning boost" to the AI without changing its core.

How it works:

ini

Original AI brain: W = W
With LoRA: W = W + (tiny boost)
The tiny boost = A × B (small helpers)

:memo: Think of LoRA like adding a small app to your phone instead of replacing the whole operating system!

Why it's awesome:

Super fast - Trains in hours instead of days

Saves memory - Uses way less computer power

Easy switching - Can swap skills in and out like game cartridges

:emoji: Adapters

Adding small "skill modules" to the AI's brain.

How they connect:
rust
Input -> AI's Original Brain -> Skill Module -> Output
Cool features:

Tiny additions - Like adding LEGO blocks

Specific skills - Each adapter does one job well

Plug and play - Add or remove skills easily

:books: Fine-Tuning Process

One. :bar_chart: Data Preparation

What you need:

Quality examples - Like having good study materials

Same format - Keep everything organized the same way

Different types - Mix of easy and hard examples

Fair balance - Equal amounts of each type

How to format your data:
json
// Teaching AI to follow instructions
{
  "instruction": "Make this long text shorter",
  "input": "Once upon a time in a land far away...",
  "output": "A story about a distant land..."
}

// Teaching AI to chat
{
  "messages": [
    {"role": "user", "content": "What's the weather like?"},
    {"role": "assistant", "content": "I'd be happy to help! Where are you located?"}
  ]
}
2. :emoji:️ Training Settings

Important numbers to set:

Learning Rate: How fast AI learns (like study speed: 0.00005)

Batch Size: How many examples at once (like flashcards: 16)

Epochs: How many times to review (like reading a book: 3 times)

Warmup Steps: Starting slowly (like stretching before exercise)

Example settings file:
yaml
learning_rate: 0.00005      # Slow and steady
batch_size: 16              # 16 examples at a time
num_epochs: 3               # Review everything 3 times
warmup_ratio: 0.1           # Start slow for 10%
weight_decay: 0.01          # Prevent memorizing too much
3. :chart_with_upwards_trend: Checking Progress

What to watch:

Training score - Is it getting better?

Test score - Does it work on new examples?

Special tests - How well does it do the specific job?

Overfitting signs - Is it just memorizing? tip Early Stopping - Know When to Quit:

python

# If the AI stops improving, pause training
if test_score_gets_worse:
    wait_a_bit += 1
else:
    wait_a_bit = 0
    save_best_version()
    
if waited_too_long:
    stop_training()  # Don't waste time!

:rocket: Fine-Tuning Applications

:emoji: Domain Adaptation

Teaching AI to speak different "languages" for different fields.

Real-world examples:

Medical AI - Understanding doctor notes and medical terms
Legal AI - Reading contracts and legal documents
Science AI - Reviewing research papers
Tech AI - Writing code documentation

:dart: Task Specialization

Training AI to be really good at one specific job.

Cool applications:

Feeling detector - Understanding if text is happy, sad, or angry
Name finder - Spotting names of people, places, and companies
Code writer - Creating computer programs
Story maker - Writing creative tales and poems

:art: Style and Tone Adaptation

Teaching AI different ways to talk.

Communication styles:

Business formal - "Dear Sir/Madam, I hope this finds you well..."
Friendly chat - "Hey! What's up? How can I help?"
Teacher mode - "Let me explain this step by step..."
Sales pitch - "This amazing product will change your life!"

:white_check_mark: Evaluation Best Practices

:trophy: Building a Complete Testing System

1. :emoji: Use Multiple Tests

Don't rely on just one way to check:

Computer tests - Fast and automatic
Human judges - Real people's opinions
A/B testing - Which version do users prefer?
Real use - How does it work in the wild?

2. :emoji: Test Everything

Normal cases - Regular, expected uses
Weird cases - Unusual or tricky situations
Edge cases - The absolute limits
Trick questions - Trying to fool the AI

3. :emoji: Checking for Fairness

:memo: Three Ways to Measure Fairness:

Equal Results - Everyone gets similar outcomes

Equal Accuracy - Works equally well for all groups

Individual Fairness - Treats similar people similarly
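
Here is a small sketch of how the first two checks could be computed. It assumes each record is a `(group, prediction, true_label)` tuple with 0/1 values. The function names and the toy data are invented for illustration. Individual Fairness needs a way to decide which people count as "similar", so it is left out here.

```python
from collections import defaultdict


def rates_by_group(records):
    """Return {group: (positive_rate, accuracy)}.

    Each record is a (group, prediction, true_label) tuple with 0/1 values.
    """
    counts = defaultdict(lambda: {"n": 0, "positive": 0, "correct": 0})
    for group, prediction, label in records:
        stats = counts[group]
        stats["n"] += 1
        stats["positive"] += prediction
        stats["correct"] += int(prediction == label)
    return {
        group: (s["positive"] / s["n"], s["correct"] / s["n"])
        for group, s in counts.items()
    }


def largest_gap(values):
    """Difference between the best-off and worst-off group."""
    values = list(values)
    return max(values) - min(values)


if __name__ == "__main__":
    # Made-up toy data: group name, model prediction, correct answer.
    records = [
        ("A", 1, 1), ("A", 1, 0), ("A", 0, 0), ("A", 1, 1),
        ("B", 0, 1), ("B", 1, 1), ("B", 0, 0), ("B", 0, 0),
    ]
    report = rates_by_group(records)
    print("Equal Results gap:", largest_gap(r[0] for r in report.values()))
    print("Equal Accuracy gap:", largest_gap(r[1] for r in report.values()))
```

In the toy data, both groups get the same accuracy but very different shares of positive outcomes, so a model can pass one fairness check and still fail another.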

:mag: Continuous Monitoring

:bar_chart: Keeping Track of Performance

Watch the numbers - Check scores every day or week
Set up alarms - Get alerts when something goes wrong
Regular checkups - Monthly or quarterly reviews
Listen to users - What do real people think?
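
A simple alarm can be sketched in a few lines. This example compares the newest score with the average of recent scores and returns a warning when the drop is too big. The function name `check_score` and the 0.05 threshold are assumptions chosen for illustration, not recommended values.

```python
def check_score(history, latest, max_drop=0.05):
    """Return an alert message if `latest` falls too far below the recent average.

    Returns None when there is no history yet or the score looks normal.
    """
    if not history:
        return None
    baseline = sum(history) / len(history)
    if baseline - latest > max_drop:
        return f"ALERT: score {latest:.2f} is below the usual {baseline:.2f}"
    return None


if __name__ == "__main__":
    recent_scores = [0.90, 0.92, 0.91]  # made-up weekly accuracy numbers
    print(check_score(recent_scores, 0.89))  # small dip, no alert
    print(check_score(recent_scores, 0.80))  # big drop, alert message
```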

:emoji: Spotting Changes (Drift Detection)

Data changes - Is the input different than before?
Performance drops - Is the AI getting worse?
Meaning shifts - Do words mean different things now?
AI aging - Like milk, AI can go bad over time!
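
One rough way to spot data changes is to count how many words in new inputs were never seen before. The sketch below is only an illustration: the function name is made up, and real drift checks usually use stronger statistics than simple word counts.

```python
def unseen_word_rate(baseline_texts, recent_texts):
    """Fraction of words in recent texts that never appeared in the baseline texts."""
    known = {word for text in baseline_texts for word in text.lower().split()}
    recent_words = [word for text in recent_texts for word in text.lower().split()]
    if not recent_words:
        return 0.0
    unseen = sum(1 for word in recent_words if word not in known)
    return unseen / len(recent_words)


if __name__ == "__main__":
    # Made-up example inputs.
    baseline = ["reset my password", "track my order"]
    recent = ["cancel my subscription"]
    print("unseen word rate:", unseen_word_rate(baseline, recent))
```

A rate that keeps climbing over the weeks suggests users are asking about things the AI did not see during training.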

:hammer_and_wrench: Tools and Platforms

:emoji: Evaluation Tools

Hugging Face Evaluate: Huge collection of testing tools
MLflow: Keeps track of all your experiments
Weights & Biases: Beautiful charts and graphs
TensorBoard: Charts that show how training is going

:emoji: Fine-Tuning Platforms

Hugging Face Transformers: Free tools anyone can use
OpenAI Fine-Tuning API: Pay-to-use professional service
Google Vertex AI: Google's cloud AI training
Amazon SageMaker: Amazon's AI workshop

:clipboard: Model Card Tools

Model Cards Toolkit: Google's free template maker
Hugging Face Model Cards: Built-in documentation system
Papers with Code: Where researchers share their work
Custom templates: Make your own special formats

:construction: Common Challenges

:bar_chart: Evaluation Challenges

Picking the Right Tests

Too many choices - Which metric should you use?
Competing goals - What if speed and accuracy conflict?
Test limitations - No test is perfect
Understanding results - What do the numbers really mean?

Data Problems

Good test data - Is your test set fair and complete?
Cheating prevention - Make sure training data isn't in tests
Missing answers - What if you don't know the right answer?
Human mistakes - Even people label things wrong
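
For cheating prevention, a quick first check is to look for test questions that also appear in the training data. This sketch ignores capital letters and extra spaces. The function names are invented for illustration, and it only catches exact copies, so reworded duplicates would need a fuzzier comparison.

```python
def normalize(text):
    """Lowercase the text and collapse repeated spaces."""
    return " ".join(text.lower().split())


def find_leaks(train_examples, test_examples):
    """Return the test examples that also appear in the training data."""
    train_set = {normalize(text) for text in train_examples}
    return [text for text in test_examples if normalize(text) in train_set]


if __name__ == "__main__":
    # Made-up example data.
    train = ["What is 2 + 2?", "Name a fruit."]
    test = ["what is  2 + 2?", "Name a color."]
    print("leaked test items:", find_leaks(train, test))
```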

:art: Fine-Tuning Challenges

:books: Overfitting (Memorizing Instead of Learning)

Signs: Amazing on practice tests, terrible on real tests
Fixes:
- Add rules to prevent memorizing
- Stop training earlier
- Get more diverse examples
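
The "amazing on practice, terrible on real tests" sign can be checked with numbers. This sketch finds the first epoch where the training score is much higher than the test score. The function name and the 0.10 gap are assumptions for illustration. The right gap depends on the task.

```python
def first_overfit_epoch(train_scores, test_scores, max_gap=0.10):
    """Return the first epoch (counting from 1) where the training score
    beats the test score by more than `max_gap`, or None if it never does."""
    for epoch, (train, test) in enumerate(zip(train_scores, test_scores), start=1):
        if train - test > max_gap:
            return epoch
    return None


if __name__ == "__main__":
    # Made-up scores after each epoch.
    train_scores = [0.70, 0.85, 0.95, 0.99]
    test_scores = [0.68, 0.80, 0.81, 0.78]
    print("overfitting starts at epoch:", first_overfit_epoch(train_scores, test_scores))
```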

:emoji: Catastrophic Forgetting

Problem: AI forgets things it knew before while learning something new!
Solutions:
- Learn slowly (tiny learning rate)
- Keep some old knowledge frozen
- Mix old and new examples
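
The "mix old and new examples" idea can be sketched like this. The function adds enough old examples so that they make up a chosen share of the final training list. The name `mix_examples` and the 20% share are assumptions for illustration, not tuned values.

```python
import random


def mix_examples(new_examples, old_examples, old_fraction=0.2, seed=0):
    """Return a shuffled list where about `old_fraction` of the items are old examples."""
    if not 0 <= old_fraction < 1:
        raise ValueError("old_fraction must be at least 0 and less than 1")
    rng = random.Random(seed)  # fixed seed so the mix is repeatable
    # Solve n_old / (n_new + n_old) = old_fraction for n_old.
    n_old = round(len(new_examples) * old_fraction / (1 - old_fraction))
    n_old = min(n_old, len(old_examples))  # cannot use more than we have
    mixed = list(new_examples) + rng.sample(list(old_examples), n_old)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    new = [f"new-{i}" for i in range(80)]   # made-up new task examples
    old = [f"old-{i}" for i in range(100)]  # made-up general examples
    mixed = mix_examples(new, old)
    print(len(mixed), "examples,", sum(x.startswith("old") for x in mixed), "of them old")
```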

:bar_chart: Data Needs

:bulb: Remember the Four Rules of Training Data:

Quality beats quantity - 100 great examples > 1000 bad ones

Mix it up - Different types keep AI flexible

Stay organized - Same format for everything

Check regularly - Make sure data stays good
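
The "stay organized" and "check regularly" rules can be automated with a small format checker. This sketch assumes chat-style training examples with `role` and `content` fields. The `messages` key and the list of allowed roles are assumptions, so adjust them to match your own data.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed role names


def validate_example(example):
    """Return a list of problems in one training example (empty list = OK)."""
    if not isinstance(example, dict):
        return ["example is not a JSON object"]
    messages = example.get("messages")  # assumed key name
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index} is not an object")
            continue
        role = message.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {index} has an unknown role")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {index} has empty content")
    return problems


if __name__ == "__main__":
    example = {"messages": [{"role": "robot", "content": "Hi"}]}
    print(validate_example(example))
```

Running a check like this on every file before training catches format mistakes early.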

:books: Real-World Case Studies

:emoji: Case Study 1: Medical Chatbot Evaluation

The Challenge: Make sure a medical AI gives safe, accurate advice

How They Tested It:

Medical accuracy - Do doctors agree with the answers?
Safety checks - Will it ever suggest dangerous things?
Fairness testing - Does it work for everyone equally?
User happiness - Do patients like using it?

What They Measured:

:white_check_mark: 95% accurate medical facts
:white_check_mark: Zero harmful advice given
:white_check_mark: 4.5/5 user satisfaction
:white_check_mark: Appropriate for all ages

:computer: Case Study 2: Code Generation Fine-Tuning

The Challenge: Help AI write better Python code

The Process:

Gathered examples - 10,000 excellent Python programs
Made tests - Created 500 coding challenges
Fine-tuned AI - Trained for 3 days on Python
Checked results - Tested if code actually works

Amazing Results:

:chart_with_upwards_trend: 40% more code that runs without errors
:chart_with_upwards_trend: 25% more tests passed successfully
:chart_with_upwards_trend: Much cleaner, readable code
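
"Tested if code actually works" usually means running the generated code against known answers. The sketch below shows one way that could look. It is not the method used in the case study. The function names and the `add` example are invented, and `exec` should only be used on code you trust or inside a sandbox.

```python
def passes_tests(candidate_source, function_name, test_cases):
    """Run candidate code and check it against (args, expected) pairs."""
    namespace = {}
    try:
        # WARNING: exec runs the code for real. Use a sandbox for untrusted code.
        exec(candidate_source, namespace)
        func = namespace[function_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, crashes, and missing functions all count as failures.
        return False


def pass_rate(candidates, function_name, test_cases):
    """Share of candidate programs that pass every test case."""
    if not candidates:
        return 0.0
    results = [passes_tests(code, function_name, test_cases) for code in candidates]
    return sum(results) / len(results)


if __name__ == "__main__":
    # Made-up AI outputs for the task "write add(a, b)".
    candidates = [
        "def add(a, b):\n    return a + b",
        "def add(a, b):\n    return a - b",
        "def add(a, b) return a + b",
    ]
    tests = [((1, 2), 3), ((0, 0), 0)]
    print("pass rate:", pass_rate(candidates, "add", tests))
```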

:emoji: Case Study 3: Multilingual Model Assessment

The Challenge: Make AI work well in 20 different languages

Testing Plan:

Language tests - Check each language separately
Translation check - Can it move between languages?
Culture test - Does it respect different cultures?
Resource check - Which languages have enough data?

Key Discoveries:

:mag: English works best (most training data)
:mag: Some languages need more examples
:mag: Cultural context really matters!

:emoji: Key Terms Glossary

Model Card: AI's instruction manual and report card combined
Fine-Tuning: Teaching an old AI new tricks
BLEU Score: How much AI text overlaps with human reference text (0-100, higher = better)
Perplexity: How confused AI is (lower = better)
LoRA (Low-Rank Adaptation): A clever way to add skills by training a small set of extra weights instead of changing everything
Catastrophic Forgetting: When AI forgets its old skills while learning new ones
Bias Assessment: Checking if AI is being unfair to anyone
Drift Detection: Watching for changes in the data or the AI's performance over time
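
To make the Perplexity entry concrete, here is a short sketch of the standard formula: take the probability the AI gave to each correct word, average the negative logs, and apply `exp`. The function name and the example probabilities are made up for illustration.

```python
import math


def perplexity(token_probabilities):
    """Perplexity = exp(average negative log-probability of the correct tokens)."""
    if not token_probabilities:
        raise ValueError("need at least one probability")
    average_nll = -sum(math.log(p) for p in token_probabilities) / len(token_probabilities)
    return math.exp(average_nll)


if __name__ == "__main__":
    confident = [0.9, 0.8, 0.95]  # AI was fairly sure of each correct word
    confused = [0.2, 0.1, 0.25]   # AI gave the correct words low probability
    print("confident model:", perplexity(confident))
    print("confused model:", perplexity(confused))
```

If the AI gives every correct word a probability of 0.25, its perplexity is about 4, as if it were guessing between four equally likely words each time.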