ℹ️ **Definition:** Large Language Models (LLMs) are Transformer-based neural networks with billions to trillions of parameters, pre-trained on massive text corpora to understand and generate human-like text. They form the foundation of modern AI assistants like GPT-4, Claude, and Gemini.
By the end of this lesson, you will:

* Understand how LLMs evolved from GPT-1 to GPT-4 and what scaling laws predict
* Apply prompt engineering techniques (zero-shot, few-shot, chain-of-thought)
* Fine-tune LLMs efficiently with LoRA and QLoRA
* Run pre-trained models with HuggingFace and evaluate them properly
* Recognize LLM capabilities, limitations, and safety considerations
In Lesson 14, we learned about Transformer architectures. Now we'll explore Large Language Models (LLMs): Transformers scaled to billions of parameters!
What makes LLMs special:

* Scale: billions to trillions of parameters, trained on massive corpora
* Emergent abilities: reasoning, coding, and few-shot learning appear only at scale
* In-context learning: new tasks from examples in the prompt, with no weight updates
* Generality: one model handles generation, translation, QA, coding, and more
Real-world LLMs:
### GPT-1 (2018)

* Architecture: 12-layer Transformer decoder
* Parameters: 117 million
* Training data: BooksCorpus (~7,000 books)
* Key insight: Pre-training + fine-tuning works!
Capabilities:

* Coherent short-form text generation
* Strong transfer to downstream NLP tasks after task-specific fine-tuning
### GPT-2 (2019)

* Architecture: 48-layer Transformer decoder
* Parameters: 1.5 billion (13x larger!)
* Training data: WebText (40GB, 8M documents)
* Key insight: Zero-shot learning emerges at scale
Capabilities:

* Long-form, coherent text generation
* Zero-shot performance on tasks like summarization, translation, and QA
Controversy: Initially withheld (released in stages) due to misuse concerns
### GPT-3 (2020)

* Architecture: 96-layer Transformer decoder
* Parameters: 175 billion (117x larger!)
* Training data: 570GB (CommonCrawl, WebText2, Books, Wikipedia)
* Key insight: Few-shot learning with in-context examples
Capabilities:

* Few-shot translation, question answering, and arithmetic from in-context examples
* Basic code generation and creative writing
Breakthrough: In-context learning (no fine-tuning needed!)
### GPT-4 (2023)

* Architecture: Rumored mixture-of-experts (unconfirmed)
* Parameters: Estimated 1.7 trillion (disputed)
* Training data: Unknown (multimodal: text + images)
* Key insight: RLHF for alignment, multimodal understanding
Capabilities:

* Multimodal input (text + images)
* Much stronger reasoning; strong performance on professional exams and benchmarks
## Scaling Laws

Compute scaling:

* GPT-1: 1 petaflop-day (~$50K)
* GPT-2: 10 petaflop-days (~$500K)
* GPT-3: 3,640 petaflop-days (~$4.6M)
* GPT-4: ~25,000 petaflop-days (~$100M estimated)
Data scaling:

* GPT-1: 5GB
* GPT-2: 40GB
* GPT-3: 570GB
* GPT-4: Multi-terabyte (estimated)
Key finding: Loss scales predictably with compute, data, and parameters
Power law relationships:

```
Loss ∝ Compute^(-α)
Loss ∝ Data^(-β)
Loss ∝ Parameters^(-γ)
```
Implication: Bigger is (usually) better!
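To make the power laws concrete, here is a minimal sketch that plugs the compute numbers above into a toy power law. The exponent and constants below are invented for illustration; they are not the measured values from the scaling-laws literature:

```python
# Toy power law: loss falls smoothly as compute grows.
# ALPHA, L_INF, and K are made-up illustration values.
ALPHA = 0.05   # hypothetical compute exponent
L_INF = 1.7    # hypothetical irreducible loss
K = 10.0      # hypothetical constant

def predicted_loss(compute_pflop_days: float) -> float:
    return L_INF + K * compute_pflop_days ** (-ALPHA)

for c in [1, 10, 3640, 25000]:  # roughly GPT-1 .. GPT-4 compute budgets
    print(f"{c:>6} petaflop-days -> loss ≈ {predicted_loss(c):.2f}")
```

Each 10x of compute buys a smaller absolute improvement, but the improvement is predictable; that predictability is what makes planning $100M training runs possible.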
Key finding (Chinchilla, 2022): GPT-3 was undertrained! Models should use more data, not just more parameters.
Optimal ratio:
For every parameter, train on ~20 tokens
Example: a 70B-parameter model should see roughly 70B × 20 = 1.4T training tokens.
Result: Chinchilla (70B) outperforms GPT-3 (175B) despite being 2.5x smaller!
Lesson: Data quality and quantity matter as much as model size
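The 20-tokens-per-parameter rule is easy to apply directly; a quick check (the first number matches the 1.4T tokens Chinchilla was actually trained on, while GPT-3 saw only ~0.3T):

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return n_params * tokens_per_param

print(chinchilla_optimal_tokens(70e9) / 1e12)   # 1.4  -> Chinchilla's 1.4T tokens
print(chinchilla_optimal_tokens(175e9) / 1e12)  # 3.5  -> vs. the ~0.3T GPT-3 saw
```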
## Emergent Abilities

Definition: Capabilities that appear suddenly at scale and are absent in smaller models
Examples:
1. Chain-of-thought reasoning: small models (<10B) can't solve multi-step math problems; the ability emerges in much larger models
2. Instruction following: larger models begin to follow natural-language instructions reliably, without task-specific training
3. Few-shot learning: learning a task from a handful of in-context examples emerges around GPT-3 scale
4. Coding: generating working programs from natural-language descriptions appears only at scale
Open question: Are emergent abilities truly discontinuous, or just measurement artifacts?
## Prompt Engineering

Definition: Designing input prompts to elicit the desired model behavior
Why it matters:

* No training required: the same frozen model performs many different tasks
* Prompt quality can make the difference between useless and excellent output
* It's the cheapest, fastest way to adapt an LLM to your problem
### Zero-Shot Prompting

Method: Ask directly, without examples
Example:
Prompt: "Translate to French: 'Hello, how are you?'"
Output: "Bonjour, comment allez-vous?"
Use case: Simple, well-defined tasks
### Few-Shot Prompting

Method: Provide examples in the prompt
Example:
Prompt:

```
Translate to French:
English: "Good morning"
French: "Bonjour"
English: "Thank you"
French: "Merci"
English: "How are you?"
French:
```
Output: "Comment allez-vous?"
Key insight: Model learns pattern from examples (no weight updates!)
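Here's a minimal sketch of building that same few-shot prompt programmatically with the HuggingFace pipeline (used more fully later in this lesson). `gpt2` is chosen only so the example runs anywhere; don't expect a strong translation from such a small model:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # small stand-in model

examples = [("Good morning", "Bonjour"), ("Thank you", "Merci")]
query = "How are you?"

# Assemble the few-shot prompt from (English, French) example pairs
prompt = "Translate to French:\n"
for en, fr in examples:
    prompt += f'English: "{en}"\nFrench: "{fr}"\n'
prompt += f'English: "{query}"\nFrench:'

print(generator(prompt, max_new_tokens=10)[0]["generated_text"])
```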
### Chain-of-Thought (CoT) Prompting

Method: Ask the model to show its reasoning steps
Example (Math):
Prompt:

```
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Let's think step by step.
Roger started with 5 balls.
2 cans × 3 balls per can = 6 balls.
5 + 6 = 11 balls.
The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A: Let's think step by step.
```
Output:

```
The cafeteria started with 23 apples.
They used 20, so 23 - 20 = 3 apples left.
Then they bought 6 more, so 3 + 6 = 9 apples.
The answer is 9.
```
Benefit: Dramatically improves reasoning on complex tasks
### Prompt Templates

Task: Sentiment Analysis

```
Classify the sentiment of the following text as Positive, Negative, or Neutral.

Text: "{text}"
Sentiment:
```
Task: Summarization

```
Summarize the following article in 3 sentences:

Article: {article}

Summary:
```
Task: Code Generation

```
Write a Python function that {description}

Requirements:
- {req1}
- {req2}

def {function_name}({params}):
    """
```

(The template deliberately ends with an open docstring, inviting the model to continue the implementation from there.)
### Advanced Prompting Techniques

1. Role Prompting:

```
You are an expert Python developer. Write clean, efficient code with comprehensive docstrings.

Task: {task}
```

2. Constraint Specification:

```
Answer in exactly 50 words. Use simple language suitable for a 10-year-old.

Question: {question}
```
3. Format Control:

```
Respond in JSON format:
{
  "answer": "...",
  "confidence": 0-1,
  "reasoning": "..."
}
```
4. Self-Consistency: sample several reasoning chains with temperature > 0, then take a majority vote over the final answers (see the sketch below).
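A minimal sketch of self-consistency, assuming a hypothetical `sample_answer(prompt)` helper that generates one chain-of-thought completion and returns just the final answer string:

```python
from collections import Counter

def self_consistent_answer(prompt, sample_answer, n_samples=5):
    # Sample several independent reasoning chains (temperature > 0)...
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    # ...and majority-vote over their final answers
    return Counter(answers).most_common(1)[0][0]
```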
## LLM Architectures

### Decoder-Only

* Architecture: Stacked Transformer decoder blocks with causal masking
* Attention: Causal (token i can only attend to tokens ≤ i)
* Training: Next-token prediction
* Use cases: Text generation, completion, dialogue
* Examples: GPT-3, GPT-4, LLaMA, Mistral
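Here's what causal masking looks like concretely: for a 4-token sequence, position i may attend only to positions ≤ i:

```python
import torch

seq_len = 4
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```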
### Encoder-Only

* Architecture: Stacked Transformer encoder blocks with bidirectional attention
* Attention: Bidirectional (token i can attend to all tokens)
* Training: Masked language modeling (predict masked tokens)
* Use cases: Classification, NER, sentence embeddings
* Examples: BERT, RoBERTa, ALBERT
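The masked-LM objective is easy to poke at directly with the `fill-mask` pipeline (using the standard BERT checkpoint):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("Paris is the [MASK] of France.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))  # "capital" should top the list
```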
### Encoder-Decoder

* Architecture: Encoder blocks + decoder blocks
* Attention: Bidirectional in encoder, causal in decoder
* Training: Span corruption (predict masked spans)
* Use cases: Translation, summarization, seq2seq
* Examples: T5, BART, Flan-T5
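Encoder-decoder models shine on seq2seq tasks; a quick sketch using the small T5 checkpoint via the translation pipeline (T5 frames every task as text-to-text, selected by a task prefix the pipeline adds for you):

```python
from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("How are you?")[0]["translation_text"])
```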
| Architecture | Generation | Understanding | Efficiency |
|---|---|---|---|
| Decoder-Only | ✅ Excellent | ✅ Good | ⚠️ Medium |
| Encoder-Only | ❌ Poor | ✅ Excellent | ✅ Fast |
| Encoder-Decoder | ✅ Good | ✅ Good | ❌ Slow |
Trend: Decoder-only dominates (GPT, LLaMA, Claude, Gemini)
Reasons:

* One simple training objective (next-token prediction) scales cleanly
* A single model handles both understanding and generation
* Generation is the dominant product use case (chat, code, writing)
## Fine-Tuning LLMs

### Full Fine-Tuning

Method: Update all model parameters

Pros: Best performance
Cons: Expensive (requires a full model copy per task)
Example (assuming `train_dataset` is a tokenized dataset with `input_ids`):

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained("gpt2-medium")
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")

training_args = TrainingArguments(
    output_dir="./fine-tuned-gpt2",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    save_steps=1000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # your tokenized dataset
    # mlm=False -> labels are the input_ids shifted for causal LM
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
### LoRA (Low-Rank Adaptation)

Key insight: Fine-tune small low-rank matrices; freeze the original weights
Method:

```
Original:  W                         (d × d)
LoRA:      W' = W + (α/r) · A Bᵀ     where A: (d × r), B: (d × r), r ≪ d
```

(The α/r scaling corresponds to `lora_alpha` in the config below.)
Benefits:

* Only ~0.1% of parameters are trainable (see the code below)
* Much lower GPU memory during training
* No full model copy per task: adapters are tiny files that can be swapped in and out
Implementation:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                  # rank of the update matrices
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only ~0.1% of parameters trainable!
```
### QLoRA (Quantized LoRA)

Key insight: Combine LoRA with 4-bit quantization
Benefits:

* 4-bit weights cut memory roughly 4x versus float16
* Makes 7B+ models trainable on a single consumer GPU
Use case: Fine-tune large models on consumer hardware
Implementation:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)
model = get_peft_model(model, lora_config)  # lora_config from the LoRA section
```
### Instruction Tuning

Goal: Teach the model to follow instructions
Dataset format:

```json
{
  "instruction": "Translate this sentence to Spanish",
  "input": "Hello, how are you?",
  "output": "Hola, ¿cómo estás?"
}
```
Training: Minimize loss on output given instruction + input
Result: Model learns to follow diverse instructions
Famous datasets: Alpaca, FLAN, Dolly, OpenAssistant (OASST1)
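A minimal sketch of turning one record in this format into a training string; the exact template varies by dataset (this one is Alpaca-style):

```python
def format_example(ex: dict) -> tuple[str, str]:
    prompt = f"### Instruction:\n{ex['instruction']}\n\n"
    if ex.get("input"):  # the input field is optional
        prompt += f"### Input:\n{ex['input']}\n\n"
    prompt += "### Response:\n"
    return prompt, ex["output"]  # loss is typically computed on the response only

prompt, target = format_example({
    "instruction": "Translate this sentence to Spanish",
    "input": "Hello, how are you?",
    "output": "Hola, ¿cómo estás?",
})
print(prompt + target)
```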
## Using LLMs with HuggingFace

Loading a pre-trained model and generating text:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # auto-distribute across available GPUs
)

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    inputs["input_ids"],
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
Chat-style usage with the pipeline API:

```python
from transformers import pipeline

chatbot = pipeline("text-generation", model=model, tokenizer=tokenizer)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]

response = chatbot(messages, max_new_tokens=200)
print(response[0]["generated_text"][-1]["content"])  # the assistant's reply
```
Streaming tokens as they are generated:

```python
from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)

generation_kwargs = dict(
    inputs=inputs["input_ids"],
    streamer=streamer,
    max_new_tokens=100,
)

# generate() blocks, so run it in a background thread and
# consume tokens from the streamer as they arrive
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for text in streamer:
    print(text, end="", flush=True)
thread.join()
```
## Evaluating LLMs

Automatic metrics:

1. Perplexity: exp(average negative log-likelihood); lower is better

```python
import torch

def perplexity(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors="pt")
    # Passing labels makes the model return the mean cross-entropy loss
    outputs = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(outputs.loss).item()
```
2. BLEU/ROUGE (for summarization/translation): n-gram overlap between generated text and a reference
3. Exact Match / F1 (for QA): whether the answer matches the reference exactly, or how much it overlaps token by token (see the sketch below)
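A minimal sketch of token-level F1 as used in SQuAD-style QA scoring (simplified: no lowercasing or punctuation stripping here):

```python
from collections import Counter

def qa_f1(prediction: str, truth: str) -> float:
    p_tokens, t_tokens = prediction.split(), truth.split()
    common = sum((Counter(p_tokens) & Counter(t_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(p_tokens)
    recall = common / len(t_tokens)
    return 2 * precision * recall / (precision + recall)

print(qa_f1("9 apples", "9"))  # 0.667: partial credit for overlap
```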
Human evaluation is still needed where automatic metrics fall short.

Dimensions: fluency, coherence, factual accuracy, helpfulness
Methods: Likert-scale ratings, pairwise (A/B) comparisons
Benchmarks:

1. MMLU (Massive Multitask Language Understanding): multiple-choice questions across 57 subjects, from STEM to law
2. HumanEval (Code Generation): 164 Python programming problems, scored by whether generated code passes unit tests
3. TruthfulQA: measures whether models avoid repeating common human misconceptions
4. BBH (Big Bench Hard): a suite of tasks requiring multi-step reasoning
Strengths:

✅ Text generation: Essays, stories, dialogue
✅ Code generation: Write working programs
✅ Translation: Between languages
✅ Summarization: Condense long documents
✅ Question answering: Factual and reasoning
✅ Math: Solve equations, word problems
✅ Common sense reasoning: Everyday situations
✅ Creative writing: Poems, scripts, jokes
✅ Instruction following: Complex multi-step tasks
Limitations:

❌ Factual accuracy: Hallucinate plausible-sounding falsehoods
❌ Math (complex): Struggle with multi-step calculations
❌ Long-term memory: Forget earlier parts of long conversations
❌ Real-time info: No knowledge of events after training cutoff
❌ Common sense: Occasional absurd mistakes
❌ Reasoning: Can fail on logic puzzles
❌ Bias: Reflect biases in training data
Key risks:

1. Misinformation: confident, fluent output that is simply wrong (hallucination)
2. Bias and Fairness: models reproduce and can amplify biases in their training data
3. Privacy: models can memorize and leak personal data seen during training
4. Misuse: spam, phishing, disinformation, and malicious code at scale
5. Job Displacement: automation of writing, support, and coding tasks
Safety approaches:

1. RLHF (Reinforcement Learning from Human Feedback): train the model against human preference judgments (covered in Lesson 16)
2. Constitutional AI: the model critiques and revises its own outputs against a written set of principles
3. Red Teaming: adversarial testing to find harmful behaviors before release
4. Content Filtering: block harmful inputs and outputs at the API layer
5. Transparency: model cards, usage policies, and documented limitations
### Writing Good Prompts

1. Be specific:
❌ "Write about AI"
✅ "Write a 200-word introduction to transformer architectures for beginners"
2. Provide context:
❌ "Translate: bonjour"
✅ "Translate this French greeting to English: bonjour"
3. Use examples:
❌ "Classify sentiment"
✅ "Classify sentiment (examples: 'Great!' → Positive, 'Terrible' → Negative): '{text}'"
4. Specify format:
❌ "List benefits"
✅ "List 3 benefits in bullet points"
### Iterative Refinement
**Process:**
1. Start with simple prompt
2. Test output
3. Refine prompt based on errors
4. Add constraints/examples
5. Repeat
### Best Practices
✅ **Verify facts**: LLMs can hallucinate
✅ **Use for drafts**: Human review recommended
✅ **Test edge cases**: Check unusual inputs
✅ **Monitor costs**: API calls add up quickly
✅ **Version prompts**: Track what works
## Key Takeaways
1. **LLMs** are Transformers scaled to billions of parameters
2. **Scaling laws** show predictable improvements with size
3. **Emergent abilities** appear at scale (reasoning, few-shot learning)
4. **Prompt engineering** is key to eliciting desired behavior
5. **Chain-of-thought** prompting improves reasoning
6. **Fine-tuning** adapts models to specific tasks (LoRA/QLoRA for efficiency)
7. **Evaluation** requires both automatic metrics and human judgment
8. **Safety** is critical (RLHF, content filtering, red teaming)
## Looking Ahead
* **Lesson 16**: RLHF - how to align LLMs with human preferences using RL
* **Lesson 17**: Multi-modal AI - combining vision and language (CLIP, GPT-4)
* **Lesson 18**: Future of AI - integration, ethics, and responsible development
## Summary
* **LLMs** are large-scale Transformers (billions of parameters)
* **Evolution**: GPT-1 (117M) -> GPT-4 (1.7T estimated)
* **Scaling laws**: Bigger models + more data = better performance
* **Emergent abilities**: Reasoning, coding, few-shot learning appear at scale
* **Prompt engineering**: Zero-shot, few-shot, chain-of-thought
* **Fine-tuning**: Full, LoRA, QLoRA for adaptation
* **HuggingFace**: Easy access to pre-trained models
* **Evaluation**: Perplexity, BLEU, benchmarks (MMLU, HumanEval)
* **Capabilities**: Generation, reasoning, translation, coding
* **Limitations**: Hallucination, bias, privacy concerns
* **Safety**: RLHF, red teaming, content filtering