# COSC1125/1127: Artificial Intelligence

COSC1125/1127: Artificial Intelligence

Week 9: Reinforcement Learning

Instructor: Prof.

RMIT University, Melbourne, Australia

[These slides were created by and for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Please retain proper attribution, including the reference to ai.berkeley.edu. Thanks!


Week 9: From MDP to Reinforcement Learning

Some news…

Take Home Exercise (THE) done!

Well done all! 👏👏👏

Thanks for your professionalism!

Contest feedback is already running

Are you in already?

Check post #170

Preliminary contest due on Sunday @ 11:59pm

Friday is a public holiday.

Attend tutorials on Wed & Thursday.

Drop-in lab moved to Thursday 4:30pm-5:30pm.


Feedback contest is running!

Preliminary submission (10%) due end of this week!

Every day 2am + 2pm


Feedback from the past: What would you say to a fellow student wrt the Pacman Contest?

It’s loads of fun and you’ll learn a lot if you put the time in. Don’t underestimate the challenge though.

“Do it! You will love it!”

It’s really fun, but make sure you are prepared to spend a fair amount of time on it.

Go for it, it’s fun and you can learn a lot by reading, researching and exploring Pac-Man.

AI can be fun to learn through games. Take the course and you won’t regret it 🙂

It’s a challenge. Starts slow and simple.

It was fun, but a lot of work and will require heaps of time to do well.

Make sure to start thinking about the final part throughout the rest of the assignments

It’s hard, but it’s fun

Have fun, and refer to the UC materials to succeed.

Don’t start late.

It is a fun and practical way to learn and apply the concepts taught in the course.

Ask more questions about the algorithm than I did. I was confused at first but in the end my confusion stemmed from my over complicating the task.


From Project 2 to Project Contest

Please watch it to avoid issues

Course Survey is live!

Method of teaching:

Lectorials

Forum

Assessments:

Pacman Projects

Flexible THE

Technology:

EdStem + GitHub

Learning Materials:

Videos + book + tute

If you are an active participant in the course (lectorials, tutes), please don’t just ignore it, so I get a balanced view…


Extra 1-on-1 support for Pacman Project

Want 1-on-1 team support for your project?

Book a 1-on-1 team 15-minute consultation on Wednesdays 3pm-4pm, weeks 9 to 12.

Book here!

When you book:

Make a PRIVATE post beforehand to tell us what you want to chat about.

Include the link to your repo.

Use the 1-on-1 tutorial consultation time!

Remember 1-on-1 Mondays 4pm-5pm Consultation Time for tutorials, use that & book it HERE!

No tutorials this Friday Sept 24th: holiday!

For tutorials: attend the Wednesday and Thursday tutes.

Drop-in lab moved to Thursday 4:30pm


Questions?


IMPORTANT DISCLAIMER

This pack of slides is NOT core study material. The core and main material is the book or any specific resources given for the week.

The slides in this presentation were prepared only to support the instructor during the lectorial, and only some of them were used in the session as needed.

This pack of slides does not represent the complete material of the week, but only a few bits here and there around what was discussed in lectorials. Most of the week’s content will NOT show up in this pack of slides.

Please refer to the book (and any resources specifically given) for a comprehensive coverage of the week material.

CS 188: Artificial Intelligence

Reinforcement Learning

Instructors: and

University of California, Berkeley

[These slides were created by and for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Please retain proper attribution, including the reference to ai.berkeley.edu. Thanks!


Topics

Model-based vs Model-free learning.

Passive and active RL

Direct evaluation, ADP, Temporal Difference (TD).

Q-value Q(s,a)

Q-learning and TD Q-learning.

Exploration vs Exploitation.

Approximate Q-learning.


This week videos on RL from UCB!

Part I

Part II


Reinforcement Learning

Reinforcement Learning

Basic idea:

Receive feedback in the form of rewards

Agent’s utility is defined by the reward function

Must (learn to) act so as to maximize expected rewards

All learning is based on observed samples of outcomes!

Environment

Agent

Actions: a

State: s

Reward: r

Show robot-soccer learning to walk videos (UT Austin):

Show snake robot:


Example: Learning to Walk

Initial

A Learning Trial

After Learning [1K Trials]

[Kohl and Stone, ICRA 2004]

Example: Toddler Robot

[Tedrake, Zhang and Seung, 2005]

[Video: TODDLER – 40s]

The Crawler!

[Demo: Crawler Bot (L10D1)] [You, in Project 3]

At times no reward accumulated at all, just sitting still


Reinforcement Learning

Still assume a Markov decision process (MDP):

A set of states s ∈ S

A set of actions (per state) A

A model T(s,a,s’) = P(s’ | s, a)

A reward function R(s,a,s’)

Still looking for a policy π(s)

New twist: don’t know T or R

I.e. we don’t know which states are good or what the actions do

Must actually try actions and states out to learn

Offline (MDPs) vs. Online (RL)

Offline Solution

Online Learning

Model-Based Learning

Model-based vs Model-free Learning: Expected Age

Goal: Compute expected age of class students

P(a): probability of age a in the class/group – “the model”

Known P(A): compute the expectation directly, E[A] = Σ_a P(a) · a

Unknown P(A): “Model Based”: estimate P̂(a) from sample counts, then compute the expectation using P̂. Why does this work? Because eventually you learn the right model.

Unknown P(A): “Model Free”: collect samples [a1, a2, … aN] and average them directly. Why does this work? Because samples appear with the right frequencies.
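A minimal sketch of the three cases in Python, using a made-up three-age distribution (the distribution and sample size are illustrative, not from the slides):

```python
import random
from collections import Counter

def expected_age_known(p):
    """Known P(A): E[A] = sum_a P(a) * a."""
    return sum(prob * age for age, prob in p.items())

def expected_age_model_based(samples):
    """Model-based: estimate P-hat(a) from sample counts, then take the expectation."""
    counts = Counter(samples)
    n = len(samples)
    return sum((c / n) * age for age, c in counts.items())

def expected_age_model_free(samples):
    """Model-free: average the samples directly, never estimating P(a)."""
    return sum(samples) / len(samples)

random.seed(0)
p = {20: 0.3, 21: 0.4, 22: 0.3}
samples = random.choices(list(p), weights=p.values(), k=10000)

print(expected_age_known(p))              # exactly 21.0
print(expected_age_model_based(samples))  # close to 21.0
print(expected_age_model_free(samples))   # same number as the model-based estimate
```

Note that for this simple estimation problem the model-based and model-free estimates coincide algebraically; they diverge in MDPs, where the model-based route solves the learned MDP while the model-free route averages returns.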

Adaptive Dynamic Programming:

Model-Based Learning

Model-Based Idea:

Learn an approximate model based on experiences

Solve for values as if the learned model were correct

Step 1: Learn empirical MDP model

Count outcomes s’ for each s, a

Normalize to give an estimate of T̂(s, a, s’)

Discover each R̂(s, a, s’) when we experience (s, a, s’)

Step 2: Solve the learned MDP

For example, use value iteration, as before, to estimate value U(s) (or V(s))

How to estimate T(s,a,s’)?

A

B C D

E


Example: Model-Based Learning

Input Policy π

Assume: γ = 1

Observed Episodes (Training)

Learned Model

A

B C D

E

B, east, C, -1

C, east, D, -1

D, exit, x, +10

B, east, C, -1

C, east, D, -1

D, exit, x, +10

E, north, C, -1

C, east, A, -1

A, exit, x, -10

Episode 1

Episode 2

Episode 3

Episode 4

E, north, C, -1

C, east, D, -1

D, exit, x, +10

T(s,a,s’).

T(B, east, C) = 1.00

T(C, east, D) = 0.75

T(C, east, A) = 0.25

…

R(s,a,s’).

R(B, east, C) = -1

R(C, east, D) = -1

R(D, exit, x) = +10

…
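The learned model above can be reproduced in a few lines; the episode list is the one from the slide, and counting plus normalizing yields exactly the tabulated T̂ values:

```python
from collections import defaultdict

# The four observed episodes, as (s, a, s', r) transitions ('x' is the terminal state).
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
]

counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s'] = outcome tally
rewards = {}                                    # rewards[(s, a, s')]
for episode in episodes:
    for s, a, s2, r in episode:
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)] = r  # rewards happen to be deterministic here

def T(s, a, s2):
    """Normalized outcome counts: T-hat(s, a, s')."""
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s2] / total

print(T("B", "east", "C"))          # 1.0
print(T("C", "east", "D"))          # 0.75
print(T("C", "east", "A"))          # 0.25
print(rewards[("D", "exit", "x")])  # 10
```

Step 2 (solving the learned MDP with value iteration) then runs on T̂ and R̂ exactly as it would on the true model.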

Adaptive Dynamic Programming


Adaptive Dynamic Programming in AIMA 4G


Model-Free Learning

Passive Reinforcement Learning

Passive Reinforcement Learning

Simplified task: policy evaluation

Input: a fixed policy π(s)

You don’t know the transitions T(s,a,s’)

You don’t know the rewards R(s,a,s’)

Goal: learn the state values

In this case:

Learner is “along for the ride”

No choice about what actions to take

Just execute the policy and learn from experience

This is NOT offline planning! You actually take actions in the world.

Direct Evaluation

Goal: Compute values for each state under π

Idea: Average together observed sample values

Act according to π.

Every time you visit a state, write down what the sum of discounted rewards turned out to be.

Average those samples.

This is called direct evaluation

Example: Direct Evaluation

Input Policy π

Assume: γ = 1

Observed Episodes (Training)

Output Values

A

B C D

E

B, east, C, -1

C, east, D, -1

D, exit, x, +10

B, east, C, -1

C, east, D, -1

D, exit, x, +10

E, north, C, -1

C, east, A, -1

A, exit, x, -10

Episode 1

Episode 2

Episode 3

Episode 4

E, north, C, -1

C, east, D, -1

D, exit, x, +10

A: −10   B: +8   C: +4   D: +10   E: −2
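Direct evaluation is short enough to run by hand; this sketch computes the output values above from the four observed episodes (γ = 1):

```python
from collections import defaultdict

gamma = 1.0
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
]

totals = defaultdict(float)
visits = defaultdict(int)
for episode in episodes:
    for t, (s, a, s2, r) in enumerate(episode):
        # Return from this visit: sum of discounted rewards from here to the end.
        g = sum(gamma ** k * episode[t + k][3] for k in range(len(episode) - t))
        totals[s] += g
        visits[s] += 1

values = {s: totals[s] / visits[s] for s in totals}
print(values)  # {'B': 8.0, 'C': 4.0, 'D': 10.0, 'E': -2.0, 'A': -10.0}
```

C averages to +4 even though it almost always leads to D (+10): the single unlucky episode through A drags its sample mean down, which is exactly the inefficiency the next slide complains about.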

Problems with Direct Evaluation

What’s good about direct evaluation?

It’s easy to understand

It doesn’t require any knowledge of T, R

It eventually computes the correct average values, using just sample transitions

What’s bad about it?

It wastes information about state connections

Each state must be learned separately

So, it takes a long time to learn

Output Values: A: −10   B: +8   C: +4   D: +10   E: −2

If B and E both go to C under this policy, how can their values be different?

Why Not Use Policy Evaluation?

Simplified Bellman updates calculate V for a fixed policy:

V_{k+1}^π(s) ← Σ_s’ T(s, π(s), s’) [ R(s, π(s), s’) + γ V_k^π(s’) ]

Each round, replace V with a one-step-look-ahead layer over V

This approach fully exploited the connections between the states

Unfortunately, we need T and R to do it!

Key question: how can we do this update to V without knowing T and R?

In other words, how do we take a weighted average without knowing the weights?

π(s)

s

s, π(s)

s, π(s),s’

s’

Sample-Based Policy Evaluation?

We want to improve our estimate of V by computing these averages:

V_{k+1}^π(s) ← Σ_s’ T(s, π(s), s’) [ R(s, π(s), s’) + γ V_k^π(s’) ]

Idea: Take samples of outcomes s’ (by doing the action!) and average:

sample_i = R(s, π(s), s_i’) + γ V_k^π(s_i’),   V_{k+1}^π(s) ← (1/n) Σ_i sample_i

π(s)

s

s, π(s)

s1′

s2′

s3′

s, π(s),s’

s’

Almost! But we can’t rewind time to get sample after sample from state s.

Temporal Difference Learning

Temporal Difference Learning

Big idea: learn from every experience!

Update V(s) each time we experience a transition (s, a, s’, r)

Likely outcomes s’ will contribute updates more often

Temporal difference learning of values

Policy still fixed, still doing evaluation!

Move values toward value of whatever successor occurs: running average

π(s)

s

s, π(s)

s’

Sample of V(s): sample = R(s, π(s), s’) + γ V^π(s’)

Update to V(s): V^π(s) ← (1 − α) V^π(s) + α · sample

Same update, rewritten: V^π(s) ← V^π(s) + α · (sample − V^π(s))

Exponential Moving Average

Exponential moving average

The running interpolation update: x̄_n = (1 − α) · x̄_{n−1} + α · x_n

Makes recent samples more important:

Forgets about the past (distant past values were wrong anyway)

Decreasing learning rate (alpha) can give converging averages
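A tiny sketch of the running interpolation update, showing how a fixed α forgets the distant past (the data here is an arbitrary level shift, not from the slides):

```python
def ema(samples, alpha):
    """Exponential moving average: x_bar <- (1 - alpha) * x_bar + alpha * x."""
    x_bar = samples[0]
    for x in samples[1:]:
        x_bar = (1 - alpha) * x_bar + alpha * x
    return x_bar

# After the data shifts from 0 to 10, the estimate tracks the new level:
# each step halves the remaining gap (alpha = 0.5), so 50 steps close it completely.
print(ema([0.0] * 50 + [10.0] * 50, alpha=0.5))  # ~10.0
```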

Example: Temporal Difference Learning

Assume: γ = 1, α = 1/2

Observed Transitions

B, east, C, −2

C, east, D, −2

Values on the grid (layout: A on top; B, C, D across the middle; E below):

Initially: A=0, B=0, C=0, D=8, E=0

After (B, east, C, −2): V(B) ← ½·0 + ½·(−2 + 1·V(C)) = −1, giving A=0, B=−1, C=0, D=8, E=0

After (C, east, D, −2): V(C) ← ½·0 + ½·(−2 + 1·V(D)) = 3, giving A=0, B=−1, C=3, D=8, E=0
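The two updates from this example, as a runnable TD sketch:

```python
gamma, alpha = 1.0, 0.5
V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}

def td_update(s, s2, r):
    """Move V(s) toward the value of whatever successor actually occurred."""
    sample = r + gamma * V[s2]
    V[s] = (1 - alpha) * V[s] + alpha * sample

td_update("B", "C", -2)  # sample = -2 + 0 = -2, so V(B) = 0.5*0 + 0.5*(-2) = -1
td_update("C", "D", -2)  # sample = -2 + 8 = 6,  so V(C) = 0.5*0 + 0.5*6  = 3
print(V["B"], V["C"])  # -1.0 3.0
```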

Problems with TD Value Learning

TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages

However, if we want to turn values into a (new) policy, we’re sunk: choosing π(s) = argmax_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V(s’) ] needs T and R again

Idea: learn Q-values, not values

Makes action selection model-free too: π(s) = argmax_a Q(s,a)

a

s

s, a

s,a,s’

s’

Active Reinforcement Learning

Active Reinforcement Learning

Full reinforcement learning: optimal policies (like value iteration)

You don’t know the transitions T(s,a,s’)

You don’t know the rewards R(s,a,s’)

You choose the actions now

Goal: learn the optimal policy / values

In this case:

Learner makes choices!

Fundamental tradeoff: exploration vs. exploitation

This is NOT offline planning! You actually take actions in the world and find out what happens…

Detour: Q-Value Iteration

Value iteration: find successive (depth-limited) values

Start with V₀(s) = 0, which we know is right

Given V_k, calculate the depth k+1 values for all states:

V_{k+1}(s) ← max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V_k(s’) ]

But Q-values are more useful, so compute them instead

Start with Q₀(s,a) = 0, which we know is right

Given Q_k, calculate the depth k+1 q-values for all q-states:

Q_{k+1}(s,a) ← Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ max_a’ Q_k(s’,a’) ]

Q-Learning

Q-Learning: sample-based Q-value iteration

Learn Q(s,a) values as you go

Receive a sample (s,a,s’,r)

Consider your old estimate: Q(s,a)

Consider your new sample estimate: sample = r + γ max_a’ Q(s’,a’)

Incorporate the new estimate into a running average: Q(s,a) ← (1 − α) Q(s,a) + α · sample

[Demo: Q-learning – gridworld (L10D2)]

[Demo: Q-learning – crawler (L10D3)]
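A minimal tabular Q-learning sketch. The state and action names below informally reuse the earlier gridworld example for illustration; the specific transitions are hypothetical:

```python
from collections import defaultdict

gamma, alpha = 0.9, 0.5
Q = defaultdict(float)  # Q[(s, a)], implicitly 0 for unseen q-states

def q_update(s, a, r, s2, actions_in):
    """Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]."""
    future = max((Q[(s2, a2)] for a2 in actions_in(s2)), default=0.0)
    sample = r + gamma * future
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

actions = lambda s: [] if s == "x" else ["east", "exit"]  # 'x' is terminal

q_update("D", "exit", 10, "x", actions)
print(Q[("D", "exit")])  # 0.5 * 0 + 0.5 * 10 = 5.0
q_update("C", "east", -1, "D", actions)
print(Q[("C", "east")])  # 0.5 * 0 + 0.5 * (-1 + 0.9 * 5.0) = 1.75
```

Notice the second update already bootstraps off the first: information flows backward one q-state per experience.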

Video of Demo Q-Learning — Gridworld

Video of Demo Q-Learning — Crawler

Q-Learning Properties

Amazing result: Q-learning converges to optimal policy — even if you’re acting suboptimally!

This is called off-policy learning

Caveats:

You have to explore enough

You have to eventually make the learning rate small enough

… but not decrease it too quickly

Basically, in the limit, it doesn’t matter how you select actions (!)

CS 188: Artificial Intelligence

Reinforcement Learning II

Instructors: and — University of California, Berkeley

[These slides were created by and for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Please retain proper attribution and the reference to ai.berkeley.edu. Thanks!


Reinforcement Learning

We still assume an MDP:

A set of states s ∈ S

A set of actions (per state) A

A model T(s,a,s’)

A reward function R(s,a,s’)

Still looking for a policy π(s)

New twist: don’t know T or R, so must try out actions

Big idea: Compute all averages over T using sample outcomes

The Story So Far: MDPs and RL

Known MDP: Offline Solution

Goal Technique

Compute V*, Q*, π* Value / policy iteration

Evaluate a fixed policy π Policy evaluation

Unknown MDP: Model-Based

Unknown MDP: Model-Free

Goal Technique

Compute V*, Q*, π* VI/PI on approx. MDP

Evaluate a fixed policy π PE on approx. MDP

Goal Technique

Compute V*, Q*, π* Q-learning

Evaluate a fixed policy π Value Learning


Model-Free Learning

Model-free (temporal difference) learning

Experience world through episodes

Update estimates each transition

Over time, updates will mimic Bellman updates

r

a

s

s, a

s’

a’

s’, a’

s’’

Q-Learning

We’d like to do Q-value updates to each Q-state:

Q_{k+1}(s,a) ← Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ max_a’ Q_k(s’,a’) ]

But can’t compute this update without knowing T, R

Instead, compute average as we go

Receive a sample transition (s,a,r,s’)

This sample suggests Q(s,a) ≈ r + γ max_a’ Q(s’,a’)

But we want to average over results from (s,a) (Why?)

So keep a running average: Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_a’ Q(s’,a’) ]

Q-Learning Properties

Amazing result: Q-learning converges to optimal policy — even if you’re acting suboptimally!

This is called off-policy learning

Caveats:

You have to explore enough

You have to eventually make the learning rate small enough

… but not decrease it too quickly

Basically, in the limit, it doesn’t matter how you select actions (!)

[Demo: Q-learning – auto – cliff grid (L11D1)]

Video of Demo Q-Learning Auto

Exploration vs. Exploitation

How to Explore?

Several schemes for forcing exploration

Simplest: random actions (ε-greedy)

Every time step, flip a coin

With (small) probability ε, act randomly

With (large) probability 1-ε, act on current policy

Problems with random actions?

You do eventually explore the space, but keep thrashing around once learning is done

One solution: lower ε over time

Another solution: exploration functions

[Demo: Q-learning – manual exploration – bridge grid (L11D2)] [Demo: Q-learning – epsilon-greedy — crawler (L11D3)]
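The ε-greedy scheme fits in a few lines. The Q-table and action names here are made up for illustration:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon act randomly; otherwise act greedily on the current Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

random.seed(1)
Q = {("s0", "left"): 1.0, ("s0", "right"): 5.0}
picks = [epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.1) for _ in range(1000)]
# Greedy 90% of the time, plus half of the random 10%: "right" about 95% of picks.
print(picks.count("right") / len(picks))
```

The thrashing problem is visible here too: even after Q has converged, 10% of actions stay random unless ε is lowered over time.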

Video of Demo Q-learning – Manual Exploration – Bridge Grid

Video of Demo Q-learning – Epsilon-Greedy – Crawler

Exploration Functions

When to explore?

Random actions: explore a fixed amount

Better idea: explore areas whose badness is not (yet) established, eventually stop exploring

Exploration function

Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n

Note: this propagates the “bonus” back to states that lead to unknown states as well!

Regular Q-Update: Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_a’ Q(s’,a’) ]

Modified Q-Update: Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_a’ f(Q(s’,a’), N(s’,a’)) ]

[Demo: exploration – Q-learning – crawler – exploration function (L11D4)]
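A sketch of the modified update. The slide's example bonus is k/n; the `n + 1` in the denominator below is an implementation choice to keep the bonus finite for never-visited q-states, and the value k = 2 is arbitrary:

```python
from collections import defaultdict

gamma, alpha, k = 0.9, 0.5, 2.0
Q = defaultdict(float)
N = defaultdict(int)  # visit counts per (state, action)

def f(u, n):
    """Optimistic utility: value estimate u plus a bonus that fades with visits."""
    return u + k / (n + 1)

def q_update_explore(s, a, r, s2, actions):
    """Modified Q-update: the lookahead uses f(Q(s',a'), N(s',a')) instead of Q alone."""
    N[(s, a)] += 1
    future = max((f(Q[(s2, a2)], N[(s2, a2)]) for a2 in actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * future)

# Hypothetical transition into an unvisited state: the bonus alone makes it look good.
q_update_explore("s0", "go", 0.0, "s1", ["go"])
print(Q[("s0", "go")])  # 0.5 * (0 + 0.9 * (0 + 2/1)) = 0.9
```

Because the bonus enters through the max in the lookahead, states that merely *lead to* unknown states inherit optimism too, exactly as the slide notes.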

Video of Demo Q-learning – Exploration Function – Crawler

Regret

Even if you learn the optimal policy, you still make mistakes along the way!

Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards

Minimizing regret goes beyond learning to be optimal – it requires optimally learning to be optimal

Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret

Approximate Q-Learning

Approximate Q-Learning

Generalizing Across States

Basic Q-Learning keeps a table of all q-values

In realistic situations, we cannot possibly learn about every single state!

Too many states to visit them all in training

Too many states to hold the q-tables in memory

Instead, we want to generalize:

Learn about some small number of training states from experience

Generalize that experience to new, similar situations

This is a fundamental idea in machine learning, and we’ll see it over and over again

[demo – RL pacman]

Example: Pacman

[Demo: Q-learning – pacman – tiny – watch all (L11D5)]

[Demo: Q-learning – pacman – tiny – silent train (L11D6)]

[Demo: Q-learning – pacman – tricky – watch all (L11D7)]

Let’s say we discover through experience that this state is bad:

In naïve q-learning, we know nothing about this state:

Or even this one!

Video of Demo Q-Learning Pacman – Tiny – Watch All

Video of Demo Q-Learning Pacman – Tiny – Silent Train

Video of Demo Q-Learning Pacman – Tricky – Watch All

Feature-Based Representations

Solution: describe a state using a vector of features (properties)

Features are functions from states to real numbers (often 0/1) that capture important properties of the state

Example features:

Distance to closest ghost

Distance to closest dot

Number of ghosts

1 / (dist to dot)²

Is Pacman in a tunnel? (0/1)

…… etc.

Is it the exact state on this slide?

Can also describe a q-state (s, a) with features (e.g. action moves closer to food)

Linear Value Functions

Using a feature representation, we can write a q function (or value function) for any state using a few weights:

Q(s,a) = w₁ f₁(s,a) + w₂ f₂(s,a) + … + w_n f_n(s,a)

Advantage: our experience is summed up in a few powerful numbers

Disadvantage: states may share features but actually be very different in value!

Approximate Q-Learning

Q-learning with linear Q-functions:

difference = [ r + γ max_a’ Q(s’,a’) ] − Q(s,a)

Exact Q’s: Q(s,a) ← Q(s,a) + α · difference

Approximate Q’s: w_i ← w_i + α · difference · f_i(s,a)

Intuitive interpretation:

Adjust weights of active features

E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state’s features

Formal justification: online least squares
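A sketch of the approximate update. The feature names, feature values, and reward below are hypothetical Pacman-style stand-ins, not values from the slides:

```python
gamma, alpha = 0.9, 0.5

def q_value(w, feats):
    """Q(s,a) = sum_i w_i * f_i(s,a), with features given as a dict for the q-state."""
    return sum(w.get(name, 0.0) * v for name, v in feats.items())

def approx_q_update(w, feats, r, max_next_q):
    """difference = [r + gamma max_a' Q(s',a')] - Q(s,a); w_i += alpha * difference * f_i."""
    difference = (r + gamma * max_next_q) - q_value(w, feats)
    for name, v in feats.items():
        w[name] = w.get(name, 0.0) + alpha * difference * v
    return difference

# Something unexpectedly bad happens (r = -10): every feature that was "on"
# gets blamed in proportion to how strongly it was on.
w = {"dist-to-dot": 1.0, "ghost-near": 1.0}
feats = {"dist-to-dot": 0.5, "ghost-near": 1.0}
approx_q_update(w, feats, r=-10.0, max_next_q=0.0)
print(w)  # {'dist-to-dot': -1.875, 'ghost-near': -4.75}
```

Note that "ghost-near", the feature that was most strongly active, takes the biggest hit; all future states sharing these features are now dispreferred, which is the generalization the table-based learner could not do.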

Example: Q-Pacman

[Demo: approximate Q-learning pacman (L11D10)]

Video of Demo Approximate Q-Learning — Pacman

Q-Learning and Least Squares


Linear Approximation: Regression*

Prediction (one raw input): ŷ = w₀ + w₁ x

Prediction (features): ŷ = w₀ + w₁ f₁(x)


Optimization: Least Squares*

[Figure: regression line with one observation; the vertical gap between the observation and the prediction is the error or “residual”]


Minimizing Error*

Approximate q update explained:

Imagine we had only one point x, with features f(x), target value y, and weights w:

error(w) = ½ ( y − Σ_k w_k f_k(x) )²   — the “target” y minus the “prediction” Σ_k w_k f_k(x)

∂ error / ∂ w_m = − ( y − Σ_k w_k f_k(x) ) f_m(x)

w_m ← w_m + α ( y − Σ_k w_k f_k(x) ) f_m(x)

[Figure: 20 data points fit by a degree-15 polynomial, which oscillates wildly between the points]

Overfitting: Why Limiting Capacity Can Help*

Policy Search

Policy Search

Problem: often the feature-based policies that work well (win games, maximize utilities) aren’t the ones that approximate V / Q best

E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions

Q-learning’s priority: get Q-values close (modeling)

Action selection priority: get ordering of Q-values right (prediction)

We’ll see this distinction between modeling and prediction again later in the course

Solution: learn policies that maximize rewards, not the values that predict them

Policy search: start with an ok solution (e.g. Q-learning) then fine-tune by hill climbing on feature weights

Policy Search

Simplest policy search:

Start with an initial linear value function or Q-function

Nudge each feature weight up and down and see if your policy is better than before

Problems:

How do we tell the policy got better?

Need to run many sample episodes!

If there are a lot of features, this can be impractical

Better methods exploit lookahead structure, sample wisely, change multiple parameters…
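The simplest scheme above can be sketched as coordinate hill climbing. The `evaluate` function here is a toy stand-in objective; in real policy search it would run many sample episodes and average the rewards, which is exactly the expensive part:

```python
def hill_climb(weights, evaluate, step=0.1, iterations=50):
    """Nudge each feature weight up and down; keep a change only if the
    evaluated policy improves. `evaluate` would normally average returns
    over many sample episodes."""
    best = evaluate(weights)
    for _ in range(iterations):
        for i in range(len(weights)):
            for delta in (+step, -step):
                candidate = list(weights)
                candidate[i] += delta
                score = evaluate(candidate)
                if score > best:
                    weights, best = candidate, score
    return weights, best

# Toy stand-in objective with a known peak at weights [1.0, -0.5].
evaluate = lambda w: -((w[0] - 1.0) ** 2 + (w[1] + 0.5) ** 2)
w, score = hill_climb([0.0, 0.0], evaluate)
print([round(x, 2) for x in w], round(score, 4))  # near [1.0, -0.5], score near 0
```

With many features the inner loop multiplies the (already expensive) episode count, which is why the slide calls naïve weight-nudging impractical at scale.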

Policy Search


[Video: HELICOPTER]

Conclusion

We’ve seen how AI methods can solve problems by:

Search

Knowledge Representation

Automated Planning

Markov Decision Problems

Reinforcement Learning

Basic Probabilities reasoning: beyond true/false.

Next week: Reasoning with Uncertainty using Bayesian Networks.

Exploiting dependencies

Whiteboard used in lectorial
