
Markov Decision Processes

[AIMA 4G] Chapter 15.1-15.3, 16.1-16.2

Artificial Intelligence

COSC1127/1125

Semester 2, 2021

Prof.

* Many slides are based on those of [names omitted]. Thanks!

Today…

Some news…

Project 2 marking is done!

1st contest feedback this week!

Submit as per instruction.

Check #170

Course THE happening next week

Monday 9am – 9pm!

available today

PDF (spec) + Google form (answer)

Get familiar with form+process.

Deadline: Thursday 12pm


First feedback contest tomorrow!


Tag your version with “testing” to participate!

Feedback from Project 2

Week 9 THE

20% of course marks; individual assessment.

Monday Sept 20th @ 9am-9pm (12hrs!)

2-3 hour test; many targeted questions.

Link will be posted in EdStem.

Be ready: NO REPLACEMENT!

Submit before 6pm when the silence policy starts (don't risk it!)

Two parts:

PDF document with question specs.

Online Google Form to enter answers. Make 100% sure you are logged in to your RMIT account.

Try the THE this week: be ready, avoid surprises.

You can have your own pen & paper notes.

Content: Weeks 1 – 8 (included).

Theoretical & conceptual questions: similar to those in tutorials.

No negative marking.

Managing content in difficult times…

Simplified tute 6 a lot!

removed content on axioms

removed redundant tasks.

added practical question

Simplified Week 8 (MDPs):

Fewer videos (just 4 + UCB’s one)

Reduced policy iteration to its concept only; the focus is on value iteration.

Week 11: a review of content

Artificial Intelligence

8 Markov Decision Processes

9 Reinforcement Learning

10 Bayesian Reasoning

11 Review MDP, RL, BN

12 IA + AI Philosophy + Course recap


Questions?


IMPORTANT DISCLAIMER

This pack of slides is NOT core study material. The core & main material is the book or any specific resources given for the week.

The slides in this presentation were prepared only to support the instructor during the lectorial, and only some of them were used in the session as needed.

This pack of slides does not represent the complete material of the week, but only a few bits here and there around what was discussed in lectorials. Most of the week’s content will NOT show up in this pack of slides.

Please refer to the book (and any resources specifically given) for a comprehensive coverage of the week material.

This week…

Sequential decision problems.

Markov decision processes:

Definition

Policy

Utilities of Sequences

Value Iteration:

Utilities of States

Bellman equations

Iterative algorithm

Policy Evaluation & Policy Iteration

Week 6: Classical Planning

[Diagram: a State Model, described in a Language, is fed to the PLANNER (Algorithm), which outputs a Plan (a sequence of actions).]

Recap Previous Weeks

Automated planning:

Goals, planning, logic representation, search, …

E.g.: to be on the flight on time.

Properties/Assumptions:

World is fully observable.

Deterministic actions.

Static environment.

Goal is “all or nothing”.

But:

Knowledge is incomplete.

Actions are fallible & non-deterministic.

Environment is dynamic.

Agent uses preferences.

Traffic? Accidents? Security checks? Flight delays?

Probability

Probability theory is very well established.

Use it to assign degrees of beliefs and likelihoods to (uncertain) events.

E.g. Pr(coin = head) = ½

We saw this two weeks ago; let’s use it!

Making decisions? Utility Theory

A deterministic logical agent has a goal and executes (any) plan that achieves the goal.

In an uncertain world, this is no longer the case.

Flight A90: 95% chance of making it to the airport on time.

Flight A120: 98% chance of making it on time.

Flight A1440: 100% chance of making it on time.

Relax goals: consider agent preferences for different outcomes.

Utility theory = associate degree of usefulness (utility) to each outcome, with outcomes of higher utility preferred.


Simple vs Complex Sequential Decisions

Simple decisions: utility of plan based on one action/decision:

Whether to take a taxi or a train to the airport?

Where to go for a date?

Complex sequential decisions: the utility of outcomes depends on a sequence of actions/decisions:

Chess.

Financial planning.

Path navigation.

Deployment of mining trucks in an area.

Overview of next 2 weeks

Sequential Complex Decision Making: MDPs

Environment known, but uncertainty in outcomes of actions/decisions.

Utility of plan dependent on sequence of actions.

Reinforcement Learning

Environment unknown (transitions, rewards).

But feedback available!

MDP: Markov Decision Processes

Image from: A Primer on Reinforcement Learning in the Brain: Psychological, Computational, and Neural Perspectives

Outline of MDP Content

GridWorld Running Example.

Definition of Markov Decision Processes.

Policies.

Utilities of Sequences.

Value Iteration.

Policy Evaluation & Policy Iteration.

A maze-like world

Agent lives in a grid

Walls block agent’s path

Movement actions are noisy – they do not always go as planned.

When going north, 80% of the time the agent goes north, but 10% of the time it goes west and 10% of the time east (see the sketch after this list).

Agent receives rewards each time step

Small “living” reward each step (can be negative)

Big rewards come at the end (good or bad)

Goal: maximise sum of rewards
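Just to make the noise model concrete, here is a minimal Python sketch of the 80/10/10 movement noise described above (the direction names and the perpendicular-direction table are assumptions for illustration, not course code):

```python
# Hypothetical sketch of GridWorld's noisy movement model.
# The intended direction succeeds with probability 0.8; the two
# perpendicular directions each happen with probability 0.1.

PERPENDICULAR = {
    "N": ("W", "E"),
    "S": ("E", "W"),
    "E": ("N", "S"),
    "W": ("S", "N"),
}

def action_outcomes(intended):
    """Map the intended direction to a distribution over actual directions."""
    side1, side2 = PERPENDICULAR[intended]
    return {intended: 0.8, side1: 0.1, side2: 0.1}

print(action_outcomes("N"))   # {'N': 0.8, 'W': 0.1, 'E': 0.1}
```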

Illustrative Example: GridWorld

Question: How to find the best sequence of actions or plan an agent should take in GridWorld?

GridWorld Actions

Deterministic Grid World

Stochastic Grid World

Characteristics of GridWorld

Environment is fully observable.

Agent knows where it is at any time.

But actions are uncertain (movement).

Each grid location is a state that the agent can be in.

Has terminating grid locations/states.

Sequential decision problem.

Each grid location/state has a reward.

Final utility depends on the sequence of states the agent traversed.

Question: How to find the best sequence of actions an agent should take in GridWorld?

An MDP is a tuple M = (S, A, T, R, s0, F):

A set of states s ∈ S:

locations, clean, fuel, broken, ..

A set of actions a ∈ A:

move to, suck, pick, drop, …

A transition function T (“the model”):

T(s, a, s′): probability that doing action a in s leads to s′, i.e., P(s′ | s, a)

e.g., the pick action succeeds 90% of the time

A reward function R(s): value of state s.

clean living room gives +10

A start state s0 ∈ S.

robot’s initial location, etc.

Possibly terminal states F ⊆ S

Markov Decision Processes (MDP)

Rewards on states or transitions?

R(s) : rewards on states (book R&N & here).

R(s, a, s’): rewards on transitions (“standard”)
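To make the tuple M = (S, A, T, R, s0, F) concrete, here is a minimal Python sketch of how such an MDP could be stored for a tiny GridWorld fragment, with rewards on states as in these slides; the dictionary layout, cell names, probabilities and rewards are assumptions for illustration only:

```python
# Hypothetical 2-cell GridWorld fragment.
# T maps (s, a) -> {s': probability}, R maps s -> reward.
mdp = {
    "S": [(0, 0), (1, 0)],                        # set of states (grid cells)
    "A": ["N", "S", "E", "W"],                    # set of actions
    "T": {                                        # transition model: P(s' | s, a)
        ((0, 0), "E"): {(1, 0): 0.8, (0, 0): 0.2},
        ((1, 0), "W"): {(0, 0): 0.8, (1, 0): 0.2},
        # ... entries for the remaining (state, action) pairs
    },
    "R": {(0, 0): -0.04, (1, 0): +1.0},           # reward on each state, R(s)
    "s0": (0, 0),                                 # start state
    "F": {(1, 0)},                                # terminal states
}

def T(s, a, s_next):
    """Look up P(s' | s, a); unlisted outcomes have probability 0."""
    return mdp["T"].get((s, a), {}).get(s_next, 0.0)

print(T((0, 0), "E", (1, 0)))   # 0.8
```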

Transition System / Model

Note: this is only a portion of the whole transition model for GridWorld. States here only show the location; a state may include more features.

What is Markovian in MDP?

“Markov” generally means that given the present state, the future and the past are independent.

For MDPs, “Markov” means action outcomes depend only on the current state.

E.g., in GridWorld, the outcome of the agent’s movement action (going north can end up north, west, or east) depends only on its current grid cell.

Outline of MDP Content

GridWorld Running Example

Definition of Markov Decision Processes

Policies

Utilities of Sequences

Value Iteration

Policy Evaluation & Policy Iteration

MDP Solution: A “Good” Policy

In deterministic single agent problems, an (optimal) sequence of actions from start to goal is enough.

In MDP, we want an optimal policy π*: S → A

A policy specifies what an agent should do (action) for each state.

An optimal policy is one that maximises “expected utility” (more on this soon)

Optimal policy when

R(s) = -0.03 for all non-terminal states s

Four different policies
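Since a policy is just a map π: S → A, a simple way to picture it is as a dictionary from (non-terminal) states to actions; the cells and actions below are hypothetical:

```python
# A hypothetical policy for a 2x2 grid: one action per non-terminal state.
policy = {
    (0, 0): "N",
    (0, 1): "E",
    (1, 0): "N",
    # (1, 1) is a terminal state, so no action is specified for it.
}

def act(state):
    """What the agent should do in the given state, according to the policy."""
    return policy[state]

print(act((0, 0)))   # 'N'
```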

Outline of MDP Content

GridWorld Running Example

Definition of Markov Decision Processes

Policies

Utilities of Sequences

Value Iteration

Policy Evaluation & Policy Iteration

Utilities of Sequences

Recall that basic rewards represent the agent’s “preferences”.

How to measure preferences/rewards/utilities over a sequence of states?

Less now, but better after:

[-10, 0, 1000] or [10, 1, 1]

Now or Later?

[10, 0, 0] or [0, 0, 10]

Discounting

It’s reasonable to maximise the sum of rewards.

It’s also reasonable to prefer rewards now to later.

One solution: values of rewards decay exponentially.

Worth now: 1. Worth next step: 𝛾. Worth in two steps: 𝛾².

How does discounting work?

Given a sequence of states [s1, s2, s3, …, sn] and discount factor 𝛾 ∈ (0, 1], the utility of starting at state s1 can be defined as:

U(s1) = R(s1) + 𝛾R(s2) + 𝛾²R(s3) + … + 𝛾ⁿ⁻¹R(sn)
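A tiny sketch of this discounted sum applied to the reward sequences from the earlier slide ([-10, 0, 1000] vs [10, 1, 1]); the function name is illustrative only:

```python
def discounted_utility(rewards, gamma):
    """Sum of gamma**t * rewards[t]: R(s1) + gamma*R(s2) + gamma^2*R(s3) + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# "Less now, but better after": the preference flips as gamma shrinks.
print(discounted_utility([-10, 0, 1000], 1.0))   # 990.0
print(discounted_utility([10, 1, 1], 1.0))       # 12.0
print(discounted_utility([-10, 0, 1000], 0.1))   # ≈ 0.0   (-10 + 0 + 10)
print(discounted_utility([10, 1, 1], 0.1))       # ≈ 10.11
```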

Stationary Preferences Assumption

Theorem: if we assume stationary preferences (i.e., if two state sequences start with the same state, the preference between them does not change when that first state is removed):

Then: there are only two ways to define utilities:

Additive utility: U([s1, s2, s3, …]) = R(s1) + R(s2) + R(s3) + …

Discounted utility: U([s1, s2, s3, …]) = R(s1) + 𝛾R(s2) + 𝛾²R(s3) + …

Infinite Horizons

Problem: what if the game lasts forever (no terminal state, or the agent never reaches one)? Additive utilities would become infinite!

Solutions:

Proper policies: policies that guarantee the agent always reaches a terminal state. **Not always possible**.

e.g., classical planning or control.

Finite horizon: terminate game after a fixed number of steps.

But this gives non-stationary policies (they depend on time)

Discounted rewards: utility is finite for 0 < γ < 1 (if rewards are bounded by Rmax, the discounted sum is at most Rmax / (1 − γ))
What are you worried about and looking forward to?
9836 8086 @ menti.com
Recap: MDPs
Markov decision processes (MDP):
Set of states S
Start state s0
Set of actions A
Transitions T(s,a,s’)
Rewards R(s)
MDP concepts so far:
Policy = choice of action for each state
Utility = sum of discounted rewards
Outline of MDP Content
GridWorld Running Example
Definition of Markov Decision Processes
Policies
Utilities of Sequences
Value Iteration
Policy Evaluation & Policy Iteration
Utility of states & Value Iteration
Given an MDP: how do we find the optimal policy for it?
An optimal policy indicates, for each state, the best action to perform.
To determine the best action, we need to calculate the utility of each state. How do we calculate this? (Recursively, as we will see.)
Value Iteration
Utility of states - Deterministic Actions
Assume deterministic actions, that is:
For all s, a, s’: T(s, a, s’) = 1 or T(s, a, s’) = 0.
Then a policy π followed from s1 gives a sequence s1, s2, s3, …
What is the utility of state s1 when following policy π?
Uπ(s1) = R(s1) + 𝛾R(s2) + 𝛾²R(s3) + … = Σt≥1 𝛾ᵗ⁻¹ R(st)

si+1 = the state after doing action π(si) in state si,

that is, the state such that T(si, π(si), si+1) = 1.

Utility of states – Deterministic Actions

Uπ(s1) = R(s1) + 𝛾R(s2) + 𝛾²R(s3) + 𝛾³R(s4) + …

Uπ(s1) = R(s1) + 𝛾[R(s2) + 𝛾R(s3) + 𝛾²R(s4) + …]

Uπ(s1) = R(s1) + 𝛾Uπ(s2)

This is based on using a particular policy π.

But π is what we are trying to find!

The utility of a state, U(s), is obtained when we choose the optimal action:

U(s) = R(s) + 𝛾 maxₐ {T(s, a, s′) U(s′)}, where s′ is the successor of s under action a.

The maximum value across all possible actions.

Utilities of States

Now let’s go back to MDPs, where actions are uncertain.

T(s, a, s’) = P(s’ | s, a)

Deterministic domain: U(s) = R(s) + 𝛾 maxₐ {T(s, a, s′) U(s′)}

Stochastic domain: U(s) = R(s) + 𝛾 maxₐ {Σs′∈S T(s, a, s′) U(s′)}

Gridworld Utilities

Utilities of states for 𝛾 = 1 and R(s) = −0.04 for non-terminal states.

Value Iteration Gridworld App

Run VI step by step using this tool I refactored as a stand-alone Java application:

Maximum Expected Utility Principle

Suppose that we have the optimal utility at every state, that is, we have U(s). What is the optimal policy?

The Maximum Expected Utility (MEU) principle says that, to find the optimal action, we choose the action that maximises the expected utility of the successor states:

π*(s) = argmaxₐ Σs′∈S T(s, a, s′) U(s′)

Value Iteration

For each state s we have its Bellman equation (so we have |S| equations):

U(s) = R(s) + 𝛾 maxₐ {Σs′∈S T(s, a, s′) U(s′)}

We cannot solve these as simultaneous equations due to the max operator, but we can solve them iteratively:

U0(s) = 0, for all states s ∈ S

Ui+1(s) = R(s) + 𝛾 maxₐ {Σs′∈S T(s, a, s′) Ui(s′)}

Guaranteed to converge!

Value Iteration

Function valueIteration(MDP, 𝜖, 𝜸)

U(s) = 0, for all s ∊ S

Repeat

U’ = U // save existing U into U’

d = 0 // largest difference between U and U’

For each state s ∊ S do: // get new U

U(s) = R(s) + 𝛾 maxₐ {Σs′∈S T(s, a, s′) U’(s′)}

d = max {d, |U(s) – U’(s)|}

Until d is very small (e.g., d < 𝜖)
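For concreteness, here is a minimal runnable Python version of the loop above, assuming the dictionary-style MDP sketched earlier (rewards on states, T stored as (s, a) → {s′: probability}); it is an illustrative sketch, not the course’s reference implementation:

```python
def value_iteration(mdp, gamma=0.9, epsilon=1e-6):
    """Return a dict mapping each state to its estimated utility U(s)."""
    S, A, T, R, F = mdp["S"], mdp["A"], mdp["T"], mdp["R"], mdp["F"]
    U = {s: 0.0 for s in S}                       # U0(s) = 0 for all states
    while True:
        U_prev = dict(U)                          # save existing U into U'
        delta = 0.0                               # largest change in this sweep
        for s in S:
            if s in F:
                U[s] = R[s]                       # terminal states keep their reward
            else:
                U[s] = R[s] + gamma * max(
                    sum(p * U_prev[s2]            # expected utility of action a
                        for s2, p in T.get((s, a), {}).items())
                    for a in A
                )
            delta = max(delta, abs(U[s] - U_prev[s]))
        if delta < epsilon:                       # "until d is very small"
            return U
```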

Extracting Optimal Policy

After value iteration, we know the utility of each state under an optimal policy; we have the U(s) function for all s.

To extract the optimal action for each state (i.e., the policy), we use:

The Maximum Expected Utility (MEU) principle: to find the optimal action, we choose the action that maximises the expected utility of the successor states:

π*(s) = argmaxₐ Σs′∈S T(s, a, s′) U(s′)
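Continuing the same sketch, once U has converged the MEU rule can be applied to read off a policy (again assuming the dictionary MDP representation used above):

```python
def extract_policy(mdp, U):
    """For each non-terminal state, pick the action maximising expected utility."""
    S, A, T, F = mdp["S"], mdp["A"], mdp["T"], mdp["F"]
    policy = {}
    for s in S:
        if s in F:
            continue                              # terminal states need no action
        policy[s] = max(
            A,
            key=lambda a: sum(p * U[s2] for s2, p in T.get((s, a), {}).items()),
        )
    return policy
```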

Gridworld using Value Iteration

States and the evolution of their utilities over the iterations.

Number of iterations required to guarantee an error of at most c · Rmax, as the discount factor is varied.

Outline of MDP Content

GridWorld Running Example

Definition of Markov Decision Processes

Policies

Utilities of Sequences

Value Iteration

Policy Evaluation & Policy Iteration

(just know it conceptually; no exercises)

Policy Evaluation

In the case where the policy is fixed (e.g., always move right if possible), policy evaluation is the calculation of the utility of each state under that fixed policy.

Given policy π, the utility of each state satisfies:

Uπ(s) = R(s) + 𝛾 Σs′∈S T(s, π(s), s′) Uπ(s′)

Policy Iteration

Value Iteration can be slow to converge and finding best actions can be costly.

Alternative: Policy Iteration:

Initialise with some policy

Given current policy, evaluate it (calculate U)

Calculate new policy using MEU and U

Repeat until no changes in policy.
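Since policy iteration is only needed conceptually this week, here is just a rough sketch of the loop, reusing extract_policy from the earlier sketch and using a simplified iterative policy evaluation instead of solving the linear system exactly; all names are illustrative:

```python
def policy_evaluation(mdp, policy, U, gamma=0.9, sweeps=20):
    """Approximate U^pi with repeated Bellman updates (no max: the action is fixed by pi)."""
    S, T, R, F = mdp["S"], mdp["T"], mdp["R"], mdp["F"]
    for _ in range(sweeps):
        for s in S:
            if s in F:
                U[s] = R[s]
            else:
                U[s] = R[s] + gamma * sum(
                    p * U[s2] for s2, p in T.get((s, policy[s]), {}).items()
                )
    return U

def policy_iteration(mdp, gamma=0.9):
    """Alternate evaluation and improvement until the policy stops changing."""
    S, A, F = mdp["S"], mdp["A"], mdp["F"]
    policy = {s: A[0] for s in S if s not in F}       # initialise with some policy
    U = {s: 0.0 for s in S}
    while True:
        U = policy_evaluation(mdp, policy, U, gamma)  # evaluate the current policy
        new_policy = extract_policy(mdp, U)           # improve it using MEU
        if new_policy == policy:                      # no changes: done
            return policy, U
        policy = new_policy
```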

Value Iteration vs Policy Iteration

Both value iteration and policy iteration compute the optimal values

Value iteration

Every iteration updates utility values and policies implicitly

We don’t track the policy but taking the max over all actions implicitly computes it

In policy iteration

Utilities are updated over a fixed policy (fast)

After the utilities are calculated, compute an improved policy using MEU (similar to the max step in value iteration)

Complexity Comparison

Policy iteration

Generally fewer iterations than value iteration:

Policy evaluation can be solved in O(|S|^3)

MEU can be costly O(|A||S|^2) per run

Value iteration:

O(|A||S|^2) per iteration

Can have many iterations to convergence

Conclusion

Sequential decision problems.

Markov decision processes:

Definition

Policy: map S -> A

Utilities of Sequences

Value Iteration:

Utilities of States

Bellman equations

Iterative algorithm

Policy Evaluation & Policy Iteration.

Next lecture: Reinforcement learning!

Whiteboard used…