ℹ️ Definition
Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make optimal decisions by interacting with an environment and receiving rewards or penalties for its actions.
By the end of this lesson, you will:
Understand what Reinforcement Learning is and how it differs from supervised/unsupervised learning
Learn the key components of RL: agent, environment, state, action, and reward
Understand the concept of Markov Decision Processes (MDPs)
Implement a basic agent-environment interaction loop
Visualize agent performance over multiple episodes
In the AI courses you've taken so far, you've learned about supervised learning (learning from labeled examples) and unsupervised learning (finding patterns in unlabeled data). Reinforcement Learning introduces a completely different paradigm: learning through trial and error.
Imagine teaching a dog to fetch. You don't show the dog thousands of labeled examples of "correct fetch" vs "incorrect fetch." Instead, the dog tries different behaviors, and you reward it when it does something right. Over time, the dog learns which actions lead to rewards. This is the essence of reinforcement learning.
RL involves two main components interacting over time:
Agent : The learner or decision maker (e.g., a robot, game-playing AI, recommendation system)
Environment : Everything the agent interacts with (e.g., game world, physical world, user behavior)
At each time step:
The agent observes the current state of the environment
The agent chooses an action to take
The environment transitions to a new state
The agent receives a reward signal
The cycle repeats
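Below is a minimal sketch of this interaction loop in Python. It assumes the Gymnasium library and its CartPole-v1 environment purely as a convenient stand-in (neither is required by this lesson), and it uses a random policy as a placeholder for the learning algorithms covered later.

```python
import gymnasium as gym

# Any environment works here; CartPole-v1 is just a simple stand-in.
env = gym.make("CartPole-v1")

state, info = env.reset(seed=42)   # the agent observes the initial state
total_reward = 0.0
done = False

while not done:
    # A learning agent would pick an action based on the state;
    # here we sample randomly as a placeholder policy.
    action = env.action_space.sample()

    # The environment transitions to a new state and emits a reward.
    next_state, reward, terminated, truncated, info = env.step(action)

    total_reward += reward
    state = next_state
    done = terminated or truncated   # episode ends on termination or time limit

env.close()
print(f"Episode finished with total reward: {total_reward}")
```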
A description of the current situation.
Examples:
Chess : Current positions of all pieces on the board
Self-driving car : Sensor readings, speed, location, nearby objects
Stock trading : Current prices, portfolio holdings, market indicators
A decision the agent can make to interact with the environment.
Examples:
Chess : Move a specific piece to a specific square
Self-driving car : Accelerate, brake, turn left, turn right
Stock trading : Buy stock, sell stock, hold
A scalar feedback signal indicating how good or bad the action was.
Examples:
Chess : +1 for winning, -1 for losing, 0 otherwise
Self-driving car : +1 for staying in lane, -100 for collision, -1 for slow speed
Stock trading : Profit or loss from the trade
💡 Key Insight : The agent doesn't know which actions are best initially. It must discover good actions through experience by trying actions and observing rewards.
The agent's goal is to learn a policy (π) - a strategy for choosing actions - that maximizes the cumulative reward over time.
Policy : A mapping from states to actions (π: S -> A)
Return (G) : Sum of all future rewards (often discounted)
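Concretely, the discounted return from time t is G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ..., where γ is the discount factor introduced below. Here is a small illustrative sketch of that computation; the reward values and γ = 0.9 are made up for the example.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G = r_1 + gamma*r_2 + gamma^2*r_3 + ... for one episode."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example: three -1 step penalties followed by a +10 goal reward (made-up numbers)
print(discounted_return([-1, -1, -1, 10]))  # -1 - 0.9 - 0.81 + 7.29 = 4.58
```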
Most RL problems are formalized as Markov Decision Processes . An MDP has the Markov Property :
Markov Property : The future depends only on the current state, not on how we got there.
This means: Given the current state, the history of previous states doesn't matter for predicting what happens next.
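Formally: P(s' | s_t, a_t) = P(s' | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0), i.e., conditioning on the full history gives no more predictive power than the current state and action alone.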
Example - Chess:
Markov : You only need to see the current board position to decide your next move
Non-Markov : You would need to know the entire game history to decide
An MDP is defined by:
S : Set of states
A : Set of actions
P : State transition probabilities P(s'|s,a) = probability of reaching state s' when taking action a in state s
R : Reward function R(s,a,s') = reward received after taking action a in state s and reaching s'
γ : Discount factor (0 <= γ <= 1) - how much we value future rewards
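To make these five ingredients concrete, here is a tiny, entirely made-up MDP written out as plain Python data; every state name, probability, and reward below is invented for illustration.

```python
# A made-up MDP with two states and two actions.
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[(s, a)] maps each possible next state s' to P(s' | s, a); each row sums to 1.
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},   # moving sometimes fails
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

# R[(s, a, s')] is the reward for that transition; unlisted transitions give 0.
R = {("s0", "move", "s1"): 1.0}   # reaching s1 from s0 is rewarded

gamma = 0.9   # discount factor: future rewards are worth 90% per step

def reward(s, a, s_next):
    return R.get((s, a, s_next), 0.0)
```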
Episodic Tasks:
Have a defined ending (terminal state)
Examples: Games (win/lose/draw), robot reaching a goal
Continuing Tasks:
Run indefinitely
Examples: Stock trading, process control, recommendation systems
Model-Based:
Agent learns a model of the environment (how actions affect states)
Uses the model to plan ahead
Example: AlphaGo uses a model to simulate future game states
Model-Free:
Agent learns directly from experience without modeling the environment
Simpler but often requires more data
Examples: Q-Learning, Policy Gradients (which we'll learn soon)
Value-Based:
Learn the value of being in each state (or taking each action)
Derive policy from values
Examples: Q-Learning, DQN
Policy-Based:
Learn the policy directly
Examples: REINFORCE, PPO
On-Policy:
Learn about the policy being used to make decisions
Example: SARSA
Off-Policy:
Learn about one policy while following another
Example: Q-Learning
One of the central challenges in RL:
Exploitation : Use what you've learned to maximize immediate reward
Exploration : Try new actions to discover better options
Example - Restaurant Choice:
Exploit : Always go to your favorite restaurant (guaranteed satisfaction)
Explore : Try new restaurants (might find something better, or worse)
Good RL agents must balance both. Common strategies:
ε-greedy : With probability ε, choose random action (explore); otherwise, choose best known action (exploit)
Upper Confidence Bound (UCB) : Choose actions with uncertainty bonus
Softmax : Choose actions probabilistically based on their estimated values
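As a sketch of the simplest of these, here is ε-greedy action selection over a table of estimated action values; the value numbers are invented for the example.

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """With probability epsilon explore (random action), else exploit the best estimate."""
    if random.random() < epsilon:
        return random.randrange(len(action_values))                        # explore
    return max(range(len(action_values)), key=lambda a: action_values[a])  # exploit

# Example: estimated values for 4 actions (made-up numbers)
q_estimates = [0.2, 1.5, 0.7, -0.3]
print("Chosen action:", epsilon_greedy(q_estimates, epsilon=0.1))
```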
AlphaGo : Defeated world champion in Go
OpenAI Five : Superhuman performance in Dota 2
Atari Games : Human-level performance across 57 games
Robot manipulation : Grasping objects, assembly
Locomotion : Walking, running, navigating obstacles
Autonomous vehicles : Self-driving cars
Data center cooling : DeepMind reduced the energy Google used for cooling by up to 40%
Traffic light control : Optimizing traffic flow
Energy grid management : Balancing supply and demand
YouTube : Video recommendations
Netflix : Movie recommendations
E-commerce : Product recommendations
Algorithmic trading : Automated stock trading
Portfolio optimization : Asset allocation
Risk management : Dynamic hedging strategies
Let's walk through how an RL agent learns:
Step 1: Random Actions
Initially, the agent has no knowledge. It tries random actions and observes what happens.
Step 2: Learn from Experience
The agent stores experiences (state, action, reward, next state) and updates its understanding.
Step 3: Improve Policy
Based on accumulated experience, the agent identifies which actions led to high rewards.
Step 4: Repeat
The agent continues interacting with the environment, constantly improving its policy.
Convergence:
Over many episodes, the agent's performance improves until it finds an optimal (or near-optimal) policy.
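The loop below sketches these four steps in Python, again using Gymnasium's CartPole-v1 as a stand-in environment. The random action and the `update_from_experience` comment are placeholders for a real learning rule (Q-Learning, policy gradients, etc.) covered in later lessons; tracking per-episode returns is what lets you visualize improvement.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
experience = []        # (state, action, reward, next_state) tuples
episode_returns = []   # total reward per episode, for plotting learning progress

for episode in range(100):
    state, _ = env.reset()
    total_reward, done = 0.0, False

    while not done:
        # Step 1: early on this is essentially random exploration.
        action = env.action_space.sample()   # placeholder for a learned policy

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Step 2: store the experience for learning.
        experience.append((state, action, reward, next_state))

        # Step 3: a real agent would update its policy/value estimates here,
        # e.g. update_from_experience(experience[-1])  (hypothetical function)

        state = next_state
        total_reward += reward

    # Step 4: repeat across episodes; with a real learning rule, returns trend upward.
    episode_returns.append(total_reward)

env.close()
```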
Imagine a robot in a 4x4 grid trying to reach a goal:
```text
S . . .
. # # .
. . . .
. . . G
```
S : Start position
G : Goal position
. : Empty cell
# : Wall (can't enter)
States : 16 grid positions
Actions : Up, Down, Left, Right
Rewards : +10 for reaching goal, -1 for each step, -10 for hitting wall
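Here is a minimal sketch of this grid world as a Python class. The class name and method signatures are our own (not from any library), and the wall positions simply mirror the grid drawn above.

```python
class GridWorld:
    """4x4 grid: start at (0, 0), goal at (3, 3); walls cannot be entered."""

    def __init__(self):
        self.walls = {(1, 1), (1, 2)}    # the '#' cells in the grid above
        self.goal = (3, 3)
        self.moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
        self.state = (0, 0)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        """Apply an action; return (next_state, reward, done)."""
        dr, dc = self.moves[action]
        r, c = self.state[0] + dr, self.state[1] + dc

        # Hitting a wall or the grid boundary: penalty, stay in place.
        if (r, c) in self.walls or not (0 <= r < 4 and 0 <= c < 4):
            return self.state, -10, False

        self.state = (r, c)
        if self.state == self.goal:
            return self.state, +10, True   # goal reached, episode ends
        return self.state, -1, False       # step penalty encourages short paths
```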
The agent starts randomly moving around. Over many episodes, it learns:
Moving toward the goal is good
Hitting walls is bad
Shorter paths are better (due to -1 step penalty)
Eventually, it finds the optimal path from S to G.
Credit assignment : When you win a game after 100 moves, which moves were responsible? RL must figure out which past actions contributed to the final reward.
Delayed rewards : Rewards often come long after actions. A chess move might only be judged good or bad at the end of the game.
Sample efficiency : RL often requires millions of experiences to learn. This can be costly or dangerous in real-world applications.
Exploration : Finding good strategies in complex environments requires smart exploration, not just random actions.
Training stability : RL training can be unstable, with performance fluctuating significantly during learning.
| Aspect | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|--------|---------------------|-----------------------|------------------------|
| Training Data | Labeled examples | Unlabeled data | Interactions and rewards |
| Goal | Predict labels | Find patterns | Maximize cumulative reward |
| Feedback | Correct answers | None | Scalar rewards (delayed) |
| Examples | Image classification, spam detection | Clustering, dimensionality reduction | Game playing, robotics |
1950s : Dynamic programming (Bellman equations)
1960s-70s : Temporal difference learning concepts
1980s : Q-Learning algorithm (Watkins, 1989)
1992 : TD-Gammon learns world-class backgammon
1999 : Policy gradient theorem formalized (Sutton et al.)
2013 : Deep Q-Networks (DQN) plays Atari games
2015 : DQN achieves human-level performance on 29 Atari games
2016 : AlphaGo defeats world champion Lee Sedol
2017 : AlphaZero masters chess, shogi, and Go from scratch
2019 : OpenAI Five defeats professional Dota 2 team
2020s : RLHF enables aligned language models (ChatGPT, Claude)
RL is learning through trial and error - agents discover good strategies by interacting with environments
Core components : Agent, environment, state, action, reward
Goal : Learn a policy that maximizes cumulative reward
Trade-offs : Exploration vs exploitation, sample efficiency vs performance
Applications : Games, robotics, resource management, finance, and more
Convergence : RL + GenAI through RLHF is transforming modern AI
In the next lessons, we'll dive deeper into specific RL algorithms:
Lesson 2 : Tabular Q-Learning for small state spaces
Lesson 3 : Deep Q-Networks (DQN) for complex environments
Lessons 4-6 : Policy gradient methods and actor-critic architectures
By the end of this module, you'll be able to train agents that master complex tasks through reinforcement learning!
Reinforcement Learning is about learning from interaction with an environment
Key components : Agent, environment, state, action, reward, policy
Markov Decision Processes formalize RL problems mathematically
Exploration vs exploitation is a fundamental trade-off in RL
Applications range from game playing to robotics to resource optimization
Modern RL combined with deep learning has achieved superhuman performance in many domains