By the end of this lesson, you'll be able to:
Understand different ways to check if AI models work well
Create and read model cards that explain how AI systems work
Explain how fine-tuning helps AI models learn new skills
Compare different ways to measure AI performance
Check if AI models are fair and unbiased
Design your own evaluation plans for AI projects
:information_source: Model evaluation is like giving AI a report card. It's the process of checking how well an AI system does its job. Just like teachers test students to see what they've learned, we test AI models to make sure they work correctly and safely.
Performance Assurance : Making sure AI does what it's supposed to do
Bias Detection : Finding unfair or mean outputs before they hurt anyone
Safety Assessment : Checking that AI behaves nicely and safely
Transparency : Explaining clearly what AI can and cannot do
Continuous Improvement : Helping AI get better over time
Computers check how well AI models work using math and formulas.
Advantages:
Fast and consistent - Computers can check thousands of examples quickly
Fair and repeatable - Same test gives same results every time
Saves money - No need to pay human testers
Limitations:
Misses subtle problems - Can't understand context like humans
Only as good as the test - Bad tests give bad results
Can't judge creativity - Doesn't know if something feels right
Real people check and rate AI outputs.
Advantages:
Understands meaning - Humans get jokes, sarcasm, and context
Spots weird problems - People notice when something feels off
Real-world testing - Shows how actual users will react
Limitations:
Takes time and money - People need to be paid for their work
Everyone's different - What one person likes, another might not
Hard to scale up - Can't test millions of examples easily
Using both computers AND humans to check AI - the best of both worlds!
:bulb: Best Practices for Hybrid Evaluation:
Let computers do the first check (fast and cheap)
Have humans look at the important stuff (quality control)
Mix different viewpoints together
Keep improving based on what you learn
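The first two practices above can be pictured with a tiny sorting step. This is only a sketch: the automated scores (0 to 1) and the 0.6 threshold are made-up values, and `triage` is a hypothetical helper name, not part of any library.

```python
# Hybrid evaluation sketch: a cheap automated score decides which
# outputs a human reviewer still needs to look at.
# The scores and the threshold below are invented for illustration.

def triage(outputs, threshold=0.6):
    """Split (text, auto_score) pairs into auto-approved and human-review piles."""
    auto_ok, needs_human = [], []
    for text, score in outputs:
        if score >= threshold:
            auto_ok.append(text)
        else:
            needs_human.append(text)
    return auto_ok, needs_human

outputs = [
    ("Paris is the capital of France.", 0.95),
    ("The moon is made of cheese.", 0.20),
    ("Water boils at 100 degrees Celsius at sea level.", 0.88),
]

auto_ok, needs_human = triage(outputs)
print(len(auto_ok), "auto-approved;", len(needs_human), "sent to a human")
```

In a real project the threshold would be tuned by comparing the automated scores against human ratings on a sample.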
Checks how similar AI's text is to what humans would write.
When to use it:
Translating languages (like Spanish to English)
Making summaries shorter
Writing computer code
How it works:
ini
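BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
where:
- BP = Penalty for being too short (brevity penalty)
- p_n = Fraction of the AI's n-word groups that also appear in the human text
- w_n = Importance (weight) of each word-group size, usually 1/N each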
:memo: Think of BLEU like comparing your homework to the answer key - the more it matches, the higher your score!
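To see the formula in action, here is a simplified, single-sentence sketch. It assumes one reference, word groups of size 1 and 2 only, and equal weights. Real BLEU tools usually go up to 4-word groups and score a whole test set at once, so treat the numbers as illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All groups of n consecutive words."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU with equal weights w_n = 1 / max_n."""
    cand, ref = candidate.split(), reference.split()
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # A word group only counts as often as it appears in the reference
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # log(0) is undefined, so the score collapses to 0
        log_sum += (1 / max_n) * math.log(matches / total)
    # Brevity penalty: 1 unless the candidate is shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_sum)

reference = "the cat is on the mat"
print(round(bleu("the cat is on the mat", reference), 3))   # perfect match
print(round(bleu("the cat sat on the mat", reference), 3))  # one word differs
```

Changing a single word lowers the score noticeably because it breaks two of the five 2-word groups as well as one single word.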
Checks if AI includes all the important information.
Different types:
ROUGE-N : Counts matching word groups
ROUGE-L : Finds the longest matching sequence
ROUGE-W : Like ROUGE-L, but gives extra points when the matching words sit right next to each other
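A minimal sketch of the first two ROUGE types, assuming plain whitespace splitting and a single reference summary. The function names are illustrative; full ROUGE tools also report precision and an F-score, which are left out here.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of the reference's n-word groups that also appear in the candidate."""
    cand, ref = candidate.split(), reference.split()
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence, the core of ROUGE-L."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

reference = "the cat sat on the mat"
summary = "the cat lay on the mat"
print(rouge_n_recall(summary, reference, n=1))          # single words
print(rouge_n_recall(summary, reference, n=2))          # word pairs
print(lcs_length(summary.split(), reference.split()))   # longest in-order match
```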
Shows how confused the AI is when predicting words.
The math behind it:
ini
Perplexity = 2^{ -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i) }
where:
- N = Total number of words
- P(w_i) = How likely the AI thought each actual word was
tip
Remember: Lower perplexity = Less confused AI = Better performance!
Think of it like a test score where lower is better (like golf!)
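Here is the perplexity formula as a few lines of code. The probability lists are made up for illustration; in practice they come from the model itself.

```python
import math

def perplexity(word_probs):
    """2 raised to the average negative log2-probability of the words."""
    n = len(word_probs)
    avg_neg_log = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** avg_neg_log

# Probability the model gave to each word that actually came next
confident = [0.5, 0.5, 0.5, 0.5]
confused = [0.125, 0.125, 0.125, 0.125]

print(perplexity(confident))  # 2.0
print(perplexity(confused))   # 8.0
```

A perplexity of 2 means the model was, on average, as unsure as if it were choosing between 2 equally likely words. A perplexity of 8 means it was choosing between 8.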
Uses AI to understand if two sentences mean the same thing.
Why it's cool:
Gets the meaning - Knows "happy" and "joyful" are similar
Flexible matching - Doesn't need exact same words
Thinks like humans - Matches how people judge similarity
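Under the hood, this kind of metric usually turns text into lists of numbers (embeddings) and measures how closely they point in the same direction. The sketch below uses cosine similarity on made-up 3-number embeddings; a real model would supply vectors with hundreds of numbers.

```python
import math

def cosine_similarity(u, v):
    """1.0 means same direction, values near 0 mean unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented embeddings, for illustration only
embeddings = {
    "happy": [0.9, 0.8, 0.1],
    "joyful": [0.85, 0.75, 0.2],
    "tractor": [0.1, 0.2, 0.95],
}

print(round(cosine_similarity(embeddings["happy"], embeddings["joyful"]), 2))
print(round(cosine_similarity(embeddings["happy"], embeddings["tractor"]), 2))
```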
Gives a score for how similar two sentences are in meaning.
Scoring system: 0 (totally different) to 5 (exactly the same meaning)
Real examples:
Score 5 : "The cat is sleeping" / "A cat is asleep" (same meaning!)
Score 3 : "The cat is sleeping" / "The dog is resting" (similar but different animals)
Score 0 : "The cat is sleeping" / "Mathematics is difficult" (nothing in common)
Exact Match : Did AI give the exact right answer?
F1 Score : How much of the answer is correct?
SQuAD Score : Results on SQuAD, a well-known reading test where AI answers questions about short passages
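A rough sketch of the first two metrics, assuming answers are compared word by word after lowercasing. Official question-answering scripts typically also strip punctuation and small words like "the", which this version skips.

```python
from collections import Counter

def normalize(text):
    """Lowercase and split into words."""
    return text.lower().strip().split()

def exact_match(prediction, truth):
    return normalize(prediction) == normalize(truth)

def token_f1(prediction, truth):
    """Balance of precision (no extra words) and recall (no missing words)."""
    pred, gold = normalize(prediction), normalize(truth)
    common = sum((Counter(pred) & Counter(gold)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(gold)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))           # True
print(token_f1("in Paris France", "Paris"))    # 0.5
```

The second answer contains the right word but adds two extra ones, so it gets partial credit from F1 even though it would fail Exact Match.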
Compilation Rate : Does the code compile (or parse) without errors?
Functional Correctness : Does the code do what it's supposed to?
CodeBLEU : Special BLEU score just for computer code
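The first two checks can be sketched in a few lines. This example assumes the AI was asked to write a function called `add`, which is a hypothetical task chosen for illustration. Running generated code is risky, so real systems do this inside a sandbox.

```python
def compiles(source):
    """True if Python can parse the source (syntax check only, nothing runs)."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def passes_tests(source, tests):
    """Run the candidate and check its add(a, b) against known answers."""
    namespace = {}
    try:
        exec(source, namespace)  # only do this with trusted or sandboxed code
        return all(namespace["add"](*args) == expected for args, expected in tests)
    except Exception:
        return False

tests = [((1, 2), 3), ((-1, 1), 0)]
good = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"      # runs, but gives wrong answers
broken = "def add(a, b)\n    return a + b\n"      # missing colon

for name, src in [("good", good), ("buggy", buggy), ("broken", broken)]:
    print(name, compiles(src), passes_tests(src, tests))
```

The buggy version shows why compilation alone is not enough: it parses fine but fails the functional check.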
:information_source: Model cards are like instruction manuals for AI models. They tell you everything you need to know about an AI system - what it does, how it works, and what to watch out for. Think of them as a nutrition label, but for AI!
Name and version - What's it called and which version is it?
Type and structure - What kind of AI is it?
Birth certificate - When was it made and by whom?
Size info - How big is it (like computer memory)?
What it's for - Its main jobs and purposes
Main jobs - What the AI was built to do
Who should use it - Teachers? Students? Everyone?
Don't use it for - Things it shouldn't do
Boundaries - Where it works well and where it doesn't
Where data came from - Books? Websites? Conversations?
How much data - Gigabytes, number of examples
Data preparation - How it was cleaned and organized
Known problems - Biases or gaps in the data
Coverage - Which countries, languages, or time periods
How we tested it - The evaluation methods used
Report card scores - Key performance numbers
Fairness check - How it performs for different groups
Confidence levels - How sure we are about the results
Bias check - Unfair patterns we found
Fairness tests - Making sure it treats everyone equally
Planet impact - Energy use and carbon footprint
Privacy protection - How it handles personal info
Misuse warnings - Ways people might use it badly
Known problems - Where it might fail
Tricky situations - Edge cases to watch for
Monitoring tips - How to keep an eye on it
Update plans - When it needs refreshing
:bulb: Three Rules for Great Model Cards:
Be Honest and Complete
Tell the whole truth about your AI
Don't hide problems or limitations
Explain what your numbers mean
Write for Everyone
Use simple words when you can
Define technical terms clearly
Add pictures and diagrams to help
Keep it Fresh
Update when the model changes
Add new test results
Fix any mistakes you find
markdown
# GPT-3.5 Turbo Model Card
## Model Details
- **Model Name** : GPT-3.5 Turbo
- **Model Type** : Large Language Model (a really smart chatbot)
- **Architecture** : Transformer (the AI's brain structure)
- **Parameters** : Not publicly disclosed (its predecessor GPT-3 has ~175 billion - that's a LOT of connections!)
- **Release Date** : March 2023
## Intended Use
- **Primary Use** : Having conversations and answering questions
- **Intended Users** : Developers, teachers, students
- **Out of Scope** : DON'T use for medical or legal advice!
## Training Data
- **Sources** : Internet articles, books, research papers
- **Size** : Not publicly disclosed (GPT-3's filtered web data alone was ~570GB of text - like a huge library!)
- **Cutoff Date** : September 2021 (doesn't know newer stuff)
- **Languages** : Mostly English, but knows 95+ languages
## Performance
- **MMLU** : 70.0% (general knowledge test)
- **HumanEval** : 48.1% (coding test)
- **HellaSwag** : 85.5% (common sense test)
## Limitations
- Can't know about events after September 2021
- Sometimes makes up facts (called "hallucination")
- Might say unfair or mean things by accident
:information_source: Fine-tuning is like teaching an already-smart AI a new skill. Imagine you know how to play piano, and now you want to learn guitar - you already understand music, so learning guitar is easier! Fine-tuning takes an AI that already knows a lot and teaches it something specific.
Teaching the AI by changing everything it knows.
The good stuff:
Maximum learning - The AI can learn completely new behaviors
Big changes possible - Can transform how the AI thinks
The challenges:
Super expensive - Needs powerful computers
Forgetfulness risk - Might forget old skills while learning new ones
Needs lots of examples - Like needing thousands of practice problems
Adding a small "learning boost" to the AI without changing its core.
How it works:
ini
Original AI brain: W = W0 (frozen, never changes)
With LoRA: W = W0 + (tiny boost)
The tiny boost = A × B (two small matrices that get trained)
:memo: Think of LoRA like adding a small app to your phone instead of replacing the whole operating system!
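The sketch below shows why the boost is "tiny". It follows the A × B naming used above and assumes A has shape d × r and B has shape r × k; the LoRA paper labels the two matrices the other way around. All numbers are toy values.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

d, k, r = 4, 4, 1  # toy sizes; real layers have thousands of rows and columns

# Frozen original weights (an identity matrix, just for illustration)
W0 = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]
A = [[0.1], [0.2], [0.3], [0.4]]   # d x r, trainable
B = [[1.0, 0.0, -1.0, 0.0]]        # r x k, trainable

boost = matmul(A, B)               # same shape as W0
W = [[w + b for w, b in zip(w_row, b_row)] for w_row, b_row in zip(W0, boost)]

full_params = d * k                # numbers to train in full fine-tuning
lora_params = d * r + r * k        # numbers to train with LoRA
print(full_params, lora_params)

# The same arithmetic for a more realistic layer size
big_full = 4096 * 4096
big_lora = 8 * (4096 + 4096)       # rank r = 8
print(big_full, big_lora, big_full // big_lora)
```

With the realistic sizes, LoRA trains 256 times fewer numbers for that layer, and W0 itself is never modified.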
Why it's awesome:
Super fast - Only a small fraction of the numbers get trained, so training is often much quicker
Saves memory - Needs far less GPU memory than changing the whole model
Easy switching - Can swap skills in and out like game cartridges
Adding small "skill modules" to the AI's brain.
How they connect:
rust
Input -> [Original Layer -> Skill Module] (repeated for each layer) -> Output
Cool features:
Tiny additions - Like adding LEGO blocks
Specific skills - Each adapter does one job well
Plug and play - Add or remove skills easily
What you need:
Quality examples - Like having good study materials
Same format - Keep everything organized the same way
Different types - Mix of easy and hard examples
Fair balance - Equal amounts of each type
How to format your data:
json
{
  "instruction": "Make this long text shorter",
  "input": "Once upon a time in a land far away...",
  "output": "A story about a distant land..."
}

{
  "messages": [
    {"role": "user", "content": "What's the weather like?"},
    {"role": "assistant", "content": "I'd be happy to help! Where are you located?"}
  ]
}
Important numbers to set:
Learning Rate : How fast AI learns (like study speed: 0.00005)
Batch Size : How many examples at once (like flashcards: 16)
Epochs : How many times to review (like reading a book: 3 times)
Warmup Steps : Starting slowly (like stretching before exercise)
Example settings file:
yaml
learning_rate: 0.00005
batch_size: 16
num_epochs: 3
warmup_ratio: 0.1
weight_decay: 0.01
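To show what "starting slowly" means with these settings, here is a sketch of a learning rate schedule. The settings file only gives the warmup ratio, so the shape used here (a straight-line warmup, then a straight-line decay to zero) is an assumption, though a common one. The function name and the 1000-step total are made up.

```python
def learning_rate_at(step, total_steps, peak_lr=0.00005, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then linear decay toward zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: climb from a small value up to the peak
        return peak_lr * (step + 1) / warmup_steps
    # Decay: slide back down for the rest of training
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / remaining)

total_steps = 1000
for step in (0, 50, 99, 100, 550, 999):
    print(step, learning_rate_at(step, total_steps))
```

With `warmup_ratio: 0.1`, the first 10% of steps are spent warming up before the learning rate reaches 0.00005.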
What to watch:
Training score - Is it getting better?
Test score - Does it work on new examples?
Special tests - How well does it do the specific job?
Overfitting signs - Is it just memorizing?
tip
Early Stopping - Know When to Quit:
python
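val_scores = [0.70, 0.74, 0.73, 0.72, 0.71]  # example scores on held-out data, one per epoch
best_score, wait_a_bit, patience = float("-inf"), 0, 2
for epoch, score in enumerate(val_scores):
    if score > best_score:
        # Score improved: remember it and reset the waiting counter
        best_score, wait_a_bit = score, 0
        print(f"epoch {epoch}: saving best version ({score})")
    else:
        # Score got worse (or stalled): wait a bit longer
        wait_a_bit += 1
    if wait_a_bit >= patience:
        print(f"epoch {epoch}: waited too long, stopping training")
        break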
Teaching AI to speak different "languages" for different fields.
Real-world examples:
Medical AI - Understanding doctor notes and medical terms
Legal AI - Reading contracts and legal documents
Science AI - Reviewing research papers
Tech AI - Writing code documentation
Training AI to be really good at one specific job.
Cool applications:
Feeling detector - Understanding if text is happy, sad, or angry
Name finder - Spotting names of people, places, and companies
Code writer - Creating computer programs
Story maker - Writing creative tales and poems
Teaching AI different ways to talk.
Communication styles:
Business formal - "Dear Sir/Madam, I hope this finds you well..."
Friendly chat - "Hey! What's up? How can I help?"
Teacher mode - "Let me explain this step by step..."
Sales pitch - "This amazing product will change your life!"
Don't rely on just one way to check:
Computer tests - Fast and automatic
Human judges - Real people's opinions
A/B testing - Which version do users prefer?
Real use - How does it work in the wild?
Normal cases - Regular, expected uses
Weird cases - Unusual or tricky situations
Edge cases - The absolute limits
Trick questions - Trying to fool the AI
:memo: Three Ways to Measure Fairness:
Equal Results - Everyone gets similar outcomes
Equal Accuracy - Works equally well for all groups
Individual Fairness - Treats similar people similarly
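The first two measures can be computed with simple counting. The records below are invented: each one holds a group label, the model's decision, and the correct answer (1 = approved). The helper names are illustrative only.

```python
def rate(values):
    """Average of a list of 0/1 values."""
    return sum(values) / len(values) if values else 0.0

# Hypothetical records: (group, model_prediction, correct_label)
records = [
    ("A", 1, 1), ("A", 1, 0), ("A", 0, 0), ("A", 1, 1),
    ("B", 0, 1), ("B", 1, 1), ("B", 0, 0), ("B", 0, 0),
]

def by_group(records):
    """Positive rate (Equal Results) and accuracy (Equal Accuracy) per group."""
    stats = {}
    for group, pred, label in records:
        entry = stats.setdefault(group, {"preds": [], "hits": []})
        entry["preds"].append(pred)
        entry["hits"].append(int(pred == label))
    return {
        g: {"positive_rate": rate(s["preds"]), "accuracy": rate(s["hits"])}
        for g, s in stats.items()
    }

report = by_group(records)
for group, numbers in report.items():
    print(group, numbers)
```

In this toy data both groups get 75% accuracy, yet group A is approved three times as often as group B. A model can pass one fairness check and fail another, which is why it helps to look at more than one.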
Watch the numbers - Check scores every day or week
Set up alarms - Get alerts when something goes wrong
Regular checkups - Monthly or quarterly reviews
Listen to users - What do real people think?
Data changes - Is the input different than before?
Performance drops - Is the AI getting worse?
Meaning shifts - Do words mean different things now?
AI aging - Like milk, AI can go bad over time!
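A simple alarm for performance drops might look like the sketch below. The baseline, the weekly accuracy numbers, the 3-week window, and the 0.05 tolerance are all made-up values for illustration.

```python
def accuracy_dropped(baseline, recent, tolerance=0.05):
    """True if the recent average fell more than `tolerance` below the baseline."""
    recent_avg = sum(recent) / len(recent)
    return baseline - recent_avg > tolerance

baseline_accuracy = 0.90                         # measured at launch
weekly_accuracy = [0.91, 0.89, 0.86, 0.82, 0.80]

# Look at a sliding window of the last 3 weeks
for week in range(3, len(weekly_accuracy) + 1):
    window = weekly_accuracy[week - 3:week]
    if accuracy_dropped(baseline_accuracy, window):
        print(f"Week {week}: alert, average accuracy {sum(window) / 3:.3f}")
```

Averaging over a window keeps one bad week from setting off the alarm, while a steady slide still gets caught.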
Hugging Face Evaluate : Huge collection of testing tools
MLflow : Keeps track of all your experiments
Weights & Biases : Beautiful charts and graphs
TensorBoard : See what's happening inside AI
Hugging Face Transformers : Free tools anyone can use
OpenAI Fine-Tuning API : Pay-to-use professional service
Google Vertex AI : Google's cloud AI training
Amazon SageMaker : Amazon's AI workshop
Model Cards Toolkit : Google's free template maker
Hugging Face Model Cards : Built-in documentation system
Papers with Code : Where researchers share their work
Custom templates : Make your own special formats
Too many choices - Which metric should you use?
Competing goals - What if speed and accuracy conflict?
Test limitations - No test is perfect
Understanding results - What do the numbers really mean?
Good test data - Is your test set fair and complete?
Cheating prevention - Make sure training data isn't in tests
Missing answers - What if you don't know the right answer?
Human mistakes - Even people label things wrong
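For the cheating-prevention point above, one basic check is to look for test questions that also appear in the training data. This sketch only catches copies that match after lowercasing and tidying spaces; reworded duplicates need fuzzier matching. The example questions are made up.

```python
def normalize(text):
    """Lowercase and collapse extra spaces so near-identical copies match."""
    return " ".join(text.lower().split())

def find_leaks(train_examples, test_examples):
    """Return test examples that also appear in the training data."""
    train_set = {normalize(t) for t in train_examples}
    return [t for t in test_examples if normalize(t) in train_set]

train = ["What is 2 + 2?", "Name the capital of France."]
test = ["what is  2 + 2?", "Name the capital of Spain."]

leaks = find_leaks(train, test)
print(leaks)
```

Any leaked examples should be removed from the test set before scoring, otherwise the AI gets credit for memorizing.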
Signs of overfitting : Amazing on the training examples, terrible on new ones
Fixes :
Add rules to prevent memorizing (regularization, such as weight decay or dropout)
Stop training earlier
Get more diverse examples
Problem : AI forgets skills it had before while learning new ones (catastrophic forgetting)
Solutions :
Learn slowly (tiny learning rate)
Keep some old knowledge frozen
Mix old and new examples
:bulb: Remember the Four Rules of Training Data:
Quality beats quantity - 100 great examples > 1000 bad ones
Mix it up - Different types keep AI flexible
Stay organized - Same format for everything
Check regularly - Make sure data stays good
The Challenge : Make sure a medical AI gives safe, accurate advice
How They Tested It:
Medical accuracy - Do doctors agree with the answers?
Safety checks - Will it ever suggest dangerous things?
Fairness testing - Does it work for everyone equally?
User happiness - Do patients like using it?
What They Measured:
:white_check_mark: 95% accurate medical facts
:white_check_mark: Zero harmful advice given
:white_check_mark: 4.5/5 user satisfaction
:white_check_mark: Appropriate for all ages
The Challenge : Help AI write better Python code
The Process:
Gathered examples - 10,000 excellent Python programs
Made tests - Created 500 coding challenges
Fine-tuned AI - Trained for 3 days on Python
Checked results - Tested if code actually works
Amazing Results:
:chart_with_upwards_trend: 40% more code that runs without errors
:chart_with_upwards_trend: 25% more tests passed successfully
:chart_with_upwards_trend: Much cleaner, readable code
The Challenge : Make AI work well in 20 different languages
Testing Plan:
Language tests - Check each language separately
Translation check - Can it move between languages?
Culture test - Does it respect different cultures?
Resource check - Which languages have enough data?
Key Discoveries:
:mag: English works best (most training data)
:mag: Some languages need more examples
:mag: Cultural context really matters!
Model Card : AI's instruction manual and report card combined
Fine-Tuning : Teaching an old AI new tricks
BLEU Score : How closely AI matches human writing (0-100)
Perplexity : How confused AI is (lower = better)
LoRA : A clever way to add skills without changing everything
Catastrophic Forgetting : When AI forgets its old skills while learning new ones
Bias Assessment : Checking if AI is being unfair to anyone
Drift Detection : Watching for AI getting worse over time
Let's recap what we learned about evaluating and improving AI models:
:mag: Evaluation is Essential
We need to test AI like teachers test students
Use both computer tests AND human judgment
Check for fairness and safety, not just accuracy
:clipboard: Model Cards Matter
They're like nutrition labels for AI
Tell us what AI can and can't do
Help everyone use AI responsibly
:art: Fine-Tuning is Powerful
Teaches general AI specific skills
LoRA makes it fast and efficient
Watch out for overfitting and forgetting
:white_check_mark: Best Practices
Test everything multiple ways
Keep monitoring after deployment
Document everything clearly
Always check for bias and fairness
In our next lesson, we'll put all this knowledge into practice! We'll design an ethical chatbot from start to finish, using:
Evaluation strategies to ensure quality
Model cards for transparency
Fine-tuning for customization
Responsible AI principles throughout
Get ready to build something amazing! :sparkles:
Pick your favorite AI tool (like ChatGPT or Claude) and create a simple model card for it. Include:
What it does well
What it shouldn't be used for
Any limitations you've noticed
Imagine you're building an AI tutor for math. List:
3 automated tests you'd use
2 human evaluation methods
1 fairness check
Think of 3 situations where you'd want to fine-tune an AI model. For each, explain:
What specific skill you'd teach it
What kind of data you'd need
How you'd know if it worked
Why is it important to test AI in multiple ways?
How do model cards help build trust in AI systems?
What's the difference between an AI memorizing examples and actually learning?
How can we make sure AI is fair to everyone who uses it?