ℹ️ **Definition:** Large Language Models (LLMs) are Transformer-based neural networks with billions to trillions of parameters, pre-trained on massive text corpora to understand and generate human-like text. They form the foundation of modern AI assistants like GPT-4, Claude, and Gemini.
By the end of this lesson, you will:

* Understand how LLMs evolved from GPT-1 to GPT-4 and what scaling laws predict
* Apply prompt engineering techniques (zero-shot, few-shot, chain-of-thought)
* Fine-tune LLMs efficiently with LoRA and QLoRA
* Run pre-trained models with HuggingFace and evaluate them properly
* Recognize LLM capabilities, limitations, and safety considerations
In Lesson 14, we learned about Transformer architectures. Now we'll explore Large Language Models (LLMs): Transformers scaled to billions of parameters!
What makes LLMs special:

* Scale: billions to trillions of parameters, trained on massive corpora
* Emergent abilities: reasoning, coding, and few-shot learning appear only at scale
* In-context learning: new tasks from examples in the prompt, with no weight updates
* Generality: one model handles generation, translation, QA, coding, and more
Real-world LLMs:
### GPT-1 (2018)

* Architecture: 12-layer Transformer decoder
* Parameters: 117 million
* Training data: BooksCorpus (~7,000 books)
* Key insight: Pre-training + fine-tuning works!
Capabilities:

* Coherent short-form text generation
* Strong transfer to downstream NLP tasks after task-specific fine-tuning
### GPT-2 (2019)

* Architecture: 48-layer Transformer decoder
* Parameters: 1.5 billion (13x larger!)
* Training data: WebText (40GB, 8M documents)
* Key insight: Zero-shot learning emerges at scale
Capabilities:

* Long-form, coherent text generation
* Zero-shot performance on tasks like summarization, translation, and QA
Controversy: Initially withheld (released in stages) due to misuse concerns
### GPT-3 (2020)

* Architecture: 96-layer Transformer decoder
* Parameters: 175 billion (117x larger!)
* Training data: 570GB (CommonCrawl, WebText2, Books, Wikipedia)
* Key insight: Few-shot learning with in-context examples
Capabilities:

* Few-shot translation, question answering, and arithmetic from in-context examples
* Basic code generation and creative writing
Breakthrough: In-context learning (no fine-tuning needed!)
### GPT-4 (2023)

* Architecture: Rumored mixture-of-experts (unconfirmed)
* Parameters: Estimated 1.7 trillion (disputed)
* Training data: Unknown (multimodal: text + images)
* Key insight: RLHF for alignment, multimodal understanding
Capabilities:

* Multimodal input (text + images)
* Much stronger reasoning; strong performance on professional exams and benchmarks
## Scaling Laws

Compute scaling:

* GPT-1: 1 petaflop-day (~$50K)
* GPT-2: 10 petaflop-days (~$500K)
* GPT-3: 3,640 petaflop-days (~$4.6M)
* GPT-4: ~25,000 petaflop-days (~$100M estimated)
Data scaling:

* GPT-1: 5GB
* GPT-2: 40GB
* GPT-3: 570GB
* GPT-4: Multi-terabyte (estimated)
Key finding: Loss scales predictably with compute, data, and parameters
Power law relationships:

```
Loss ∝ Compute^(-α)
Loss ∝ Data^(-β)
Loss ∝ Parameters^(-γ)
```
Implication: Bigger is (usually) better!
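To make the power laws concrete, here is a minimal sketch that plugs the compute numbers above into a toy power law. The exponent and constants below are invented for illustration; they are not the measured values from the scaling-laws literature:

```python
# Toy power law: loss falls smoothly as compute grows.
# ALPHA, L_INF, and K are made-up illustration values.
ALPHA = 0.05   # hypothetical compute exponent
L_INF = 1.7    # hypothetical irreducible loss
K = 10.0      # hypothetical constant

def predicted_loss(compute_pflop_days: float) -> float:
    return L_INF + K * compute_pflop_days ** (-ALPHA)

for c in [1, 10, 3640, 25000]:  # roughly GPT-1 .. GPT-4 compute budgets
    print(f"{c:>6} petaflop-days -> loss ≈ {predicted_loss(c):.2f}")
```

Each 10x of compute buys a smaller absolute improvement, but the improvement is predictable; that predictability is what makes planning $100M training runs possible.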
Key finding (Chinchilla, 2022): GPT-3 was undertrained! Models should use more data, not just more parameters.
Optimal ratio:
For every parameter, train on ~20 tokens
Example: a 70B-parameter model should see roughly 70B × 20 = 1.4T training tokens.
Result: Chinchilla (70B) outperforms GPT-3 (175B) despite being 2.5x smaller!
Lesson: Data quality and quantity matter as much as model size
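The 20-tokens-per-parameter rule is easy to apply directly; a quick check (the first number matches the 1.4T tokens Chinchilla was actually trained on, while GPT-3 saw only ~0.3T):

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return n_params * tokens_per_param

print(chinchilla_optimal_tokens(70e9) / 1e12)   # 1.4  -> Chinchilla's 1.4T tokens
print(chinchilla_optimal_tokens(175e9) / 1e12)  # 3.5  -> vs. the ~0.3T GPT-3 saw
```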
## Emergent Abilities

Definition: Capabilities that appear suddenly at scale and are absent in smaller models
Examples:
1. Chain-of-thought reasoning: small models (<10B) can't solve multi-step math problems; the ability emerges in much larger models
2. Instruction following: larger models begin to follow natural-language instructions reliably, without task-specific training
3. Few-shot learning: learning a task from a handful of in-context examples emerges around GPT-3 scale
4. Coding: generating working programs from natural-language descriptions appears only at scale
Open question: Are emergent abilities truly discontinuous, or just measurement artifacts?
## Prompt Engineering

Definition: Designing input prompts to elicit the desired model behavior
Why it matters:

* No training required: the same frozen model performs many different tasks
* Prompt quality can make the difference between useless and excellent output
* It's the cheapest, fastest way to adapt an LLM to your problem
### Zero-Shot Prompting

Method: Ask directly, without examples
Example:
Prompt: "Translate to French: 'Hello, how are you?'"
Output: "Bonjour, comment allez-vous?"
Use case: Simple, well-defined tasks
### Few-Shot Prompting

Method: Provide examples in the prompt
Example:
Prompt:

```
Translate to French:
English: "Good morning"
French: "Bonjour"
English: "Thank you"
French: "Merci"
English: "How are you?"
French:
```
Output: "Comment allez-vous?"
Key insight: Model learns pattern from examples (no weight updates!)
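Here's a minimal sketch of building that same few-shot prompt programmatically with the HuggingFace pipeline (used more fully later in this lesson). `gpt2` is chosen only so the example runs anywhere; don't expect a strong translation from such a small model:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # small stand-in model

examples = [("Good morning", "Bonjour"), ("Thank you", "Merci")]
query = "How are you?"

# Assemble the few-shot prompt from (English, French) example pairs
prompt = "Translate to French:\n"
for en, fr in examples:
    prompt += f'English: "{en}"\nFrench: "{fr}"\n'
prompt += f'English: "{query}"\nFrench:'

print(generator(prompt, max_new_tokens=10)[0]["generated_text"])
```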
### Chain-of-Thought (CoT) Prompting

Method: Ask the model to show its reasoning steps
Example (Math):
Prompt:

```
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Let's think step by step.
Roger started with 5 balls.
2 cans × 3 balls per can = 6 balls.
5 + 6 = 11 balls.
The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A: Let's think step by step.
```
Output:

```
The cafeteria started with 23 apples.
They used 20, so 23 - 20 = 3 apples left.
Then they bought 6 more, so 3 + 6 = 9 apples.
The answer is 9.
```
Benefit: Dramatically improves reasoning on complex tasks
### Prompt Templates

Task: Sentiment Analysis

```
Classify the sentiment of the following text as Positive, Negative, or Neutral.

Text: "{text}"
Sentiment:
```
Task: Summarization

```
Summarize the following article in 3 sentences:

Article: {article}

Summary:
```
Task: Code Generation

```
Write a Python function that {description}

Requirements:
- {req1}
- {req2}

def {function_name}({params}):
    """
```

(The template deliberately ends with an open docstring, inviting the model to continue the implementation from there.)
### Advanced Prompting Techniques

1. Role Prompting:

```
You are an expert Python developer. Write clean, efficient code with comprehensive docstrings.

Task: {task}
```

2. Constraint Specification:

```
Answer in exactly 50 words. Use simple language suitable for a 10-year-old.

Question: {question}
```
3. Format Control:

```
Respond in JSON format:
{
  "answer": "...",
  "confidence": 0-1,
  "reasoning": "..."
}
```
4. Self-Consistency: sample several reasoning chains with temperature > 0, then take a majority vote over the final answers (see the sketch below).
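A minimal sketch of self-consistency, assuming a hypothetical `sample_answer(prompt)` helper that generates one chain-of-thought completion and returns just the final answer string:

```python
from collections import Counter

def self_consistent_answer(prompt, sample_answer, n_samples=5):
    # Sample several independent reasoning chains (temperature > 0)...
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    # ...and majority-vote over their final answers
    return Counter(answers).most_common(1)[0][0]
```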
## LLM Architectures

### Decoder-Only

* Architecture: Stacked Transformer decoder blocks with causal masking
* Attention: Causal (token i can only attend to tokens ≤ i)
* Training: Next-token prediction
* Use cases: Text generation, completion, dialogue
* Examples: GPT-3, GPT-4, LLaMA, Mistral
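Here's what causal masking looks like concretely: for a 4-token sequence, position i may attend only to positions ≤ i:

```python
import torch

seq_len = 4
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```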
### Encoder-Only

* Architecture: Stacked Transformer encoder blocks with bidirectional attention
* Attention: Bidirectional (token i can attend to all tokens)
* Training: Masked language modeling (predict masked tokens)
* Use cases: Classification, NER, sentence embeddings
* Examples: BERT, RoBERTa, ALBERT
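The masked-LM objective is easy to poke at directly with the `fill-mask` pipeline (using the standard BERT checkpoint):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("Paris is the [MASK] of France.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))  # "capital" should top the list
```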
### Encoder-Decoder

* Architecture: Encoder blocks + decoder blocks
* Attention: Bidirectional in encoder, causal in decoder
* Training: Span corruption (predict masked spans)
* Use cases: Translation, summarization, seq2seq
* Examples: T5, BART, Flan-T5
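Encoder-decoder models shine on seq2seq tasks; a quick sketch using the small T5 checkpoint via the translation pipeline (T5 frames every task as text-to-text, selected by a task prefix the pipeline adds for you):

```python
from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("How are you?")[0]["translation_text"])
```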
| Architecture | Generation | Understanding | Efficiency |
|---|---|---|---|
| Decoder-Only | ✅ Excellent | ✅ Good | ⚠️ Medium |
| Encoder-Only | ❌ Poor | ✅ Excellent | ✅ Fast |
| Encoder-Decoder | ✅ Good | ✅ Good | ❌ Slow |
Trend: Decoder-only dominates (GPT, LLaMA, Claude, Gemini)
Reasons:

* One simple training objective (next-token prediction) scales cleanly
* A single model handles both understanding and generation
* Generation is the dominant product use case (chat, code, writing)
## Fine-Tuning LLMs

### Full Fine-Tuning

Method: Update all model parameters

Pros: Best performance
Cons: Expensive (requires a full model copy per task)
Example (assuming `train_dataset` is a tokenized dataset with `input_ids`):

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained("gpt2-medium")
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")

training_args = TrainingArguments(
    output_dir="./fine-tuned-gpt2",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    save_steps=1000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # your tokenized dataset
    # mlm=False -> labels are the input_ids shifted for causal LM
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
### LoRA (Low-Rank Adaptation)

Key insight: Fine-tune small low-rank matrices; freeze the original weights
Method:

```
Original:  W                         (d × d)
LoRA:      W' = W + (α/r) · A Bᵀ     where A: (d × r), B: (d × r), r ≪ d
```

(The α/r scaling corresponds to `lora_alpha` in the config below.)
Benefits:

* Only ~0.1% of parameters are trainable (see the code below)
* Much lower GPU memory during training
* No full model copy per task: adapters are tiny files that can be swapped in and out
Implementation:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                  # rank of the update matrices
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only ~0.1% of parameters trainable!
```
### QLoRA (Quantized LoRA)

Key insight: Combine LoRA with 4-bit quantization
Benefits:

* 4-bit weights cut memory roughly 4x versus float16
* Makes 7B+ models trainable on a single consumer GPU
Use case: Fine-tune large models on consumer hardware
Implementation:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)
model = get_peft_model(model, lora_config)  # lora_config from the LoRA section
```
### Instruction Tuning

Goal: Teach the model to follow instructions
Dataset format:

```json
{
  "instruction": "Translate this sentence to Spanish",
  "input": "Hello, how are you?",
  "output": "Hola, ¿cómo estás?"
}
```
Training: Minimize loss on output given instruction + input
Result: Model learns to follow diverse instructions
Famous datasets: Alpaca, FLAN, Dolly, OpenAssistant (OASST1)
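A minimal sketch of turning one record in this format into a training string; the exact template varies by dataset (this one is Alpaca-style):

```python
def format_example(ex: dict) -> tuple[str, str]:
    prompt = f"### Instruction:\n{ex['instruction']}\n\n"
    if ex.get("input"):  # the input field is optional
        prompt += f"### Input:\n{ex['input']}\n\n"
    prompt += "### Response:\n"
    return prompt, ex["output"]  # loss is typically computed on the response only

prompt, target = format_example({
    "instruction": "Translate this sentence to Spanish",
    "input": "Hello, how are you?",
    "output": "Hola, ¿cómo estás?",
})
print(prompt + target)
```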
## Using LLMs with HuggingFace

Loading a pre-trained model and generating text:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # auto-distribute across available GPUs
)

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    inputs["input_ids"],
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
Chat-style usage with the pipeline API:

```python
from transformers import pipeline

chatbot = pipeline("text-generation", model=model, tokenizer=tokenizer)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]

response = chatbot(messages, max_new_tokens=200)
print(response[0]["generated_text"][-1]["content"])  # the assistant's reply
```
Streaming tokens as they are generated:

```python
from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)

generation_kwargs = dict(
    inputs=inputs["input_ids"],
    streamer=streamer,
    max_new_tokens=100,
)

# generate() blocks, so run it in a background thread and
# consume tokens from the streamer as they arrive
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for text in streamer:
    print(text, end="", flush=True)
thread.join()
```
## Evaluating LLMs

Automatic metrics:

1. Perplexity: exp(average negative log-likelihood); lower is better

```python
import torch

def perplexity(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors="pt")
    # Passing labels makes the model return the mean cross-entropy loss
    outputs = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(outputs.loss).item()
```
2. BLEU/ROUGE (for summarization/translation): n-gram overlap between generated text and a reference
3. Exact Match / F1 (for QA): whether the answer matches the reference exactly, or how much it overlaps token by token (see the sketch below)
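A minimal sketch of token-level F1 as used in SQuAD-style QA scoring (simplified: no lowercasing or punctuation stripping here):

```python
from collections import Counter

def qa_f1(prediction: str, truth: str) -> float:
    p_tokens, t_tokens = prediction.split(), truth.split()
    common = sum((Counter(p_tokens) & Counter(t_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(p_tokens)
    recall = common / len(t_tokens)
    return 2 * precision * recall / (precision + recall)

print(qa_f1("9 apples", "9"))  # 0.667: partial credit for overlap
```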
Human evaluation is still needed where automatic metrics fall short.

Dimensions: fluency, coherence, factual accuracy, helpfulness
Methods: Likert-scale ratings, pairwise (A/B) comparisons
Benchmarks:

1. MMLU (Massive Multitask Language Understanding): multiple-choice questions across 57 subjects, from STEM to law
2. HumanEval (Code Generation): 164 Python programming problems, scored by whether generated code passes unit tests
3. TruthfulQA: measures whether models avoid repeating common human misconceptions
4. BBH (Big Bench Hard): a suite of tasks requiring multi-step reasoning
Strengths:

✅ Text generation: Essays, stories, dialogue
✅ Code generation: Write working programs
✅ Translation: Between languages
✅ Summarization: Condense long documents
✅ Question answering: Factual and reasoning
✅ Math: Solve equations, word problems
✅ Common sense reasoning: Everyday situations
✅ Creative writing: Poems, scripts, jokes
✅ Instruction following: Complex multi-step tasks
Limitations:

❌ Factual accuracy: Hallucinate plausible-sounding falsehoods
❌ Math (complex): Struggle with multi-step calculations
❌ Long-term memory: Forget earlier parts of long conversations
❌ Real-time info: No knowledge of events after training cutoff
❌ Common sense: Occasional absurd mistakes
❌ Reasoning: Can fail on logic puzzles
❌ Bias: Reflect biases in training data
Key risks:

1. Misinformation: confident, fluent output that is simply wrong (hallucination)
2. Bias and Fairness: models reproduce and can amplify biases in their training data
3. Privacy: models can memorize and leak personal data seen during training
4. Misuse: spam, phishing, disinformation, and malicious code at scale
5. Job Displacement: automation of writing, support, and coding tasks
Safety approaches:

1. RLHF (Reinforcement Learning from Human Feedback): train the model against human preference judgments (covered in Lesson 16)
2. Constitutional AI: the model critiques and revises its own outputs against a written set of principles
3. Red Teaming: adversarial testing to find harmful behaviors before release
4. Content Filtering: block harmful inputs and outputs at the API layer
5. Transparency: model cards, usage policies, and documented limitations
### Writing Good Prompts

1. Be specific:
❌ "Write about AI"
✅ "Write a 200-word introduction to transformer architectures for beginners"
2. Provide context:
❌ "Translate: bonjour"
✅ "Translate this French greeting to English: bonjour"
3. Use examples:
❌ "Classify sentiment"
✅ "Classify sentiment (examples: 'Great!' → Positive, 'Terrible' → Negative): '{text}'"
4. Specify format:
❌ "List benefits"
✅ "List 3 benefits in bullet points"
### Iterative Refinement
**Process:**
1. Start with simple prompt
2. Test output
3. Refine prompt based on errors
4. Add constraints/examples
5. Repeat
### Best Practices
✅ **Verify facts**: LLMs can hallucinate
✅ **Use for drafts**: Human review recommended
✅ **Test edge cases**: Check unusual inputs
✅ **Monitor costs**: API calls add up quickly
✅ **Version prompts**: Track what works
## Key Takeaways
1. **LLMs** are Transformers scaled to billions of parameters
2. **Scaling laws** show predictable improvements with size
3. **Emergent abilities** appear at scale (reasoning, few-shot learning)
4. **Prompt engineering** is key to eliciting desired behavior
5. **Chain-of-thought** prompting improves reasoning
6. **Fine-tuning** adapts models to specific tasks (LoRA/QLoRA for efficiency)
7. **Evaluation** requires both automatic metrics and human judgment
8. **Safety** is critical (RLHF, content filtering, red teaming)
## Looking Ahead
* **Lesson 16**: RLHF - how to align LLMs with human preferences using RL
* **Lesson 17**: Multi-modal AI - combining vision and language (CLIP, GPT-4)
* **Lesson 18**: Future of AI - integration, ethics, and responsible development
## Summary
* **LLMs** are large-scale Transformers (billions of parameters)
* **Evolution**: GPT-1 (117M) -> GPT-4 (1.7T estimated)
* **Scaling laws**: Bigger models + more data = better performance
* **Emergent abilities**: Reasoning, coding, few-shot learning appear at scale
* **Prompt engineering**: Zero-shot, few-shot, chain-of-thought
* **Fine-tuning**: Full, LoRA, QLoRA for adaptation
* **HuggingFace**: Easy access to pre-trained models
* **Evaluation**: Perplexity, BLEU, benchmarks (MMLU, HumanEval)
* **Capabilities**: Generation, reasoning, translation, coding
* **Limitations**: Hallucination, bias, privacy concerns
* **Safety**: RLHF, red teaming, content filtering