
Optimal policies

If we have the correct utility values, the agent has a simple decision process: it just picks the action a that maximises the expected utility of the next state:

$$\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U^{\pi^*}(s')$$

Only have to consider the next step.

The big question is how to compute $U^{\pi^*}(s)$.
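The argmax above can be sketched directly in Python. Here `states`, `A`, `P`, and `U` are hypothetical stand-ins for the state set, action set, transition model, and utility table (the names are not from the slides):

```python
def best_action(s, states, A, P, U):
    """Pick the action maximising expected utility of the next state:
    argmax over a in A(s) of sum over s' of P(s'|s,a) * U(s')."""
    return max(A(s), key=lambda a: sum(P(s1, s, a) * U[s1] for s1 in states))

# Toy two-state example: from 'x', action 'go' reaches the high-utility state 'y'.
states = ['x', 'y']
U = {'x': 0.0, 'y': 1.0}
A = lambda s: ['stay', 'go']
P = lambda s1, s, a: 1.0 if (a == 'go' and s1 == 'y') or (a == 'stay' and s1 == s) else 0.0

print(best_action('x', states, A, P, U))  # → go
```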

© -Trenn, King's College London

Optimal policies

Note that this is specific to the value of the reward R(s) for non-terminal states — different rewards will give different values and policies.


Bellman equation

How do we find the best policy (for a given set of rewards)?

Turns out that there is a neat way to do this, by first computing the utility of each state.

We compute this using the Bellman equation:

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

γ is a discount factor.


Not this Bellman

"Just the place for a Snark!" the Bellman cried,
As he landed his crew with care;
Supporting each man on the top of the tide
By a finger entwined in his hair.

"Just the place for a Snark! I have said it twice:
That alone should encourage the crew.
Just the place for a Snark! I have said it thrice:
What I tell you three times is true."

( ’s illustrations to “The Hunting of the Snark”).


Bellman equation

Apply:

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

and we get:


Bellman equation

$$U(1,1) = -0.04 + \gamma \max \begin{cases} 0.8\,U(1,2) + 0.1\,U(2,1) + 0.1\,U(1,1) & (\text{Up}) \\ 0.9\,U(1,1) + 0.1\,U(1,2) & (\text{Left}) \\ 0.9\,U(1,1) + 0.1\,U(2,1) & (\text{Down}) \\ 0.8\,U(2,1) + 0.1\,U(1,2) + 0.1\,U(1,1) & (\text{Right}) \end{cases}$$
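Plugging in concrete numbers makes the max explicit. The current utility values below are made up for illustration; only the transition probabilities, R = -0.04, and the four action expressions follow the slide:

```python
# Hypothetical current utilities for the three squares involved (illustrative only).
U = {(1, 1): 0.0, (1, 2): 0.4, (2, 1): 0.2}
R, gamma = -0.04, 1.0

# Expected utility of the next state for each action, as in the equation above.
expected = {
    'Up':    0.8 * U[(1, 2)] + 0.1 * U[(2, 1)] + 0.1 * U[(1, 1)],
    'Left':  0.9 * U[(1, 1)] + 0.1 * U[(1, 2)],
    'Down':  0.9 * U[(1, 1)] + 0.1 * U[(2, 1)],
    'Right': 0.8 * U[(2, 1)] + 0.1 * U[(1, 2)] + 0.1 * U[(1, 1)],
}
best = max(expected, key=expected.get)   # action achieving the max
new_u = R + gamma * expected[best]       # the Bellman backup for U(1,1)
print(best, round(new_u, 2))  # → Up 0.3
```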


Value iteration

In an MDP with n states, we will have n Bellman equations.


Hard to solve these simultaneously because of the max operation:

- Makes them non-linear


Value iteration

Luckily an iterative approach works. Start with arbitrary values for states. Then apply the Bellman update:

$$U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')$$

simultaneously to all the states.

Continue until the values of states do not change.

The values are guaranteed to converge on the optimal values (but might take some time).

Value iteration

```
procedure VALUE-ITERATION
    for s in S do
        U(s) ← 0
    end for
    repeat
        Ucopy ← U
        for s in S do
            U(s) ← R(s) + γ max_{a ∈ A(s)} Σ_{s'} P(s'|s,a) Ucopy(s')
        end for
    until U == Ucopy
end procedure
```

States, S, reward, R(s), set of actions, A(s), and transition model, P(s'|s,a), are exactly as defined earlier.

γ is the discount factor.
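The pseudocode above translates almost line for line into Python. This is a sketch, assuming S, A, P, and R are supplied as plain collections/functions (hypothetical names); the stopping test uses a small tolerance rather than exact equality, since floating-point utilities only approach their fixed point:

```python
def value_iteration(S, A, P, R, gamma=0.9, eps=1e-6):
    """Repeat the Bellman update until utilities stop changing (within eps)."""
    U = {s: 0.0 for s in S}          # arbitrary starting values
    while True:
        U_copy = dict(U)             # snapshot, so all states update "simultaneously"
        for s in S:
            # Terminal states (no actions) just keep their reward.
            best = max(sum(P(s1, s, a) * U_copy[s1] for s1 in S)
                       for a in A(s)) if A(s) else 0.0
            U[s] = R(s) + gamma * best
        if max(abs(U[s] - U_copy[s]) for s in S) < eps:
            return U

# Tiny MDP: from 'a', action 'go' reaches the goal 'g' with probability 0.9.
S = ['a', 'g']
A = lambda s: [] if s == 'g' else ['go']
P = lambda s1, s, a: {('g', 'a'): 0.9, ('a', 'a'): 0.1}.get((s1, s), 0.0)
R = lambda s: 1.0 if s == 'g' else -0.04

U = value_iteration(S, A, P, R)
```

At the fixed point U('a') satisfies U = -0.04 + 0.9(0.9·U('g') + 0.1·U), i.e. U = 0.77/0.91 ≈ 0.846, which the iteration approaches.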


Value iteration

How the values of states change as updates occur.


Value iteration

U(4,3) is pinned to 1.

U(3,3) quickly settles to a value close to 1.

U(1,1) becomes negative, and then grows as positive utility from the goal feeds back to it.


Rewards

The example so far has a negative reward R(s) for each state: encouragement for an agent not to stick around.

Can also think of R(s) as being the cost of moving to the next state (where we obtain the utility):

$$R(s) = -c(s, a)$$

where a is the action used. Bellman becomes:

$$U_{i+1}(s) \leftarrow \gamma \max_{a \in A(s)} \left( \sum_{s'} P(s' \mid s, a)\, U_i(s') \right) - c(s, a)$$

Note that the action can be dependent on the state.

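When the cost does not actually vary with the action (say a flat cost of 0.04 per move), the reward form and the cost form give identical updates. A quick sketch checking this on a toy MDP (all names hypothetical; the cost is charged inside the max so each action pays its own cost):

```python
gamma = 0.9
states = ['x', 'y']
U = {'x': 0.2, 'y': 0.8}     # current utility estimates (made up)
A = lambda s: ['stay', 'go']
P = lambda s1, s, a: 1.0 if (a == 'go' and s1 == 'y') or (a == 'stay' and s1 == s) else 0.0

R = lambda s: -0.04          # reward form: R(s) = -c(s, a)
c = lambda s, a: 0.04        # cost form: flat cost per move

def reward_form(s):
    # U_{i+1}(s) = R(s) + γ max_a Σ_{s'} P(s'|s,a) U_i(s')
    return R(s) + gamma * max(sum(P(s1, s, a) * U[s1] for s1 in states) for a in A(s))

def cost_form(s):
    # U_{i+1}(s) = max_a ( γ Σ_{s'} P(s'|s,a) U_i(s') - c(s,a) )
    return max(gamma * sum(P(s1, s, a) * U[s1] for s1 in states) - c(s, a) for a in A(s))

print(reward_form('x'), cost_form('x'))  # both ≈ 0.68
```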