# CS代考 Other notions of “rational” – cscodehelp代写

Other notions of “rational”

There are other criteria for decision-making than maximising expected utility. One approach is to look at the option which has the least-bad worst outcome.

This maximin criterion can be formalised in the same framework as MEU (Maximum Expected Utility), making the rational (in this sense) action:

a ̊ “argmaxtminups1qu aPA s1 Psa

Its effect is to ignore the probability of outcomes and concentrate on optimising the worst case outcome.

Example (utilities are in red):

s a2 7s6 a1 5

4 53s5 4 s s4

Here we would pick a1 ⃝c -Trenn, King’s College London

2

s1 3 s2

Other notions of “rational”

The opposite attitude, that of optimistic risk-seeker, is captured by the maximax criterion:

a ̊ “ arg maxtmax ups1qu aPA s1 Psa

This will ignore possible bad outcomes and just focus on the best outcome of each action.

Example:

s a2 7s6 a1 5

4 53s5 4 s s4

Here we would pick a2 ⃝c -Trenn, King’s College London

3

s1 3 s2

Sequential decision problems

These approaches give us a battery of techniques to apply to individual decisions by agents.

However, they aren’t really sufficient.

Agents aren’t usually in the business of taking single decisions

‚ Lifeisaseriesofdecisions.

The best overall result is not necessarily obtained by a greedy approach to a series of decisions.

The current best option isn’t the best thing in the long-run.

⃝c -Trenn, King’s College London 4

Sequential decision problems

Otherwise I’d only ever eat cherry pie

(pillsbury.com)

(Damn fine pie.)

⃝c -Trenn, King’s College London 5

An example

The agent has to pick a sequence of actions.

for all states s.

⃝c -Trenn, King’s College London

6

Apsq “ tU p, Down, Lef t, Rightu

An example

The world is fully observable. End states have values `1 or ́1.

⃝c -Trenn, King’s College London 7

An example

If the world were deterministic, the choice of actions would be easy here.

But actions are stochastic.

⃝c -Trenn, King’s College London

8

U p, U p, Right, Right, Right

An example

80% of the time the agent moves as intended.

20% of the time the agent moves perpendicular to the intended direction. Half the time to the left, half the time to the right.

The agent doesn’t move if it hits a wall.

⃝c -Trenn, King’s College London 9

An example

So U p, U p, Right, Right, Right succeeds with probability: 0.85 “ 0.32768

⃝c -Trenn, King’s College London

10

An example

Also a small chance of going around the obstacle the other way.

⃝c -Trenn, King’s College London 11

An example

We can write a transition model to describe these actions. Since the actions are stochastic, the model looks like:

Pps1|s,aq

where a is the action that takes the agent from s to s1. Transitions are assumed to be first order Markovian. That is, they only depend on the current and next states.

So, we could write a large set of probability tables that would describe all the possible actions executed in all the possible states.

This would completely specify the actions.

⃝c -Trenn, King’s College London 12

An example

The full description of the problem also has to include the utility function.

This is defined over sequences of states — runs in the terminology of the first lecture.

We will assume that in each state s the agent receives a reward Rpsq. This may be positive or negative.

⃝c -Trenn, King’s College London 13

An example

The reward for non-terminal states is ́0.04.

We will assume that the utility of a run is the sum of the utilities of states, so the ́0.04 is an incentive to take fewer steps to get to the terminal state.

(You can also think of it as the cost of an action).

⃝c -Trenn, King’s College London 14

How do we tackle this?

( /Cartoon Network)

⃝c -Trenn, King’s College London 15

Markov decision process

The overall problem the agent faces here is a Markov decision process (MDP) Mathematically we have

‚ a set of states s P S with an initial state s0. ‚ AsetofactionsApsqineachstate.

‚ AtransitionmodelPps1|s,aq;and

‚ ArewardfunctionRpsq.

Captures any fully observable non-deterministic environment with a Markovian transition model and additive rewards.

⃝c -Trenn, King’s College London

16

Kaelbling

Markov decision process

What does a solution to an MDP look like?

⃝c -Trenn, King’s College London 17

Markov decision process

A solution is a policy, which we write as π. This is a choice of action for every state.

‚ thatwayifwegetofftrack,westillknowwhattodo.

In any state s, πpsq identifies what action to take.

Example policy π: πps0q “ left, πps1q “ left, πps2q “ right, . . . Another example policy π1: For all states s, π1psq “ left

⃝c -Trenn, King’s College London 18

Markov decision process

Naturally we’d prefer not just any policy but the optimum policy. ‚ Buthowtofindit?

Need to compare policies by the reward they generate.

Since actions are stochastic, policies won’t give the same reward every time.

‚ Socomparetheexpectedvalue.

The optimum policy π ̊ is the policy with the highest expected value. At every stage the agent should perform π ̊psq.

⃝c -Trenn, King’s College London 19

Markov decision process

(40 Acres and a Mule Filmworks/Universal Pictures)

π ̊psq is the right thing.

⃝c -Trenn, King’s College London 20

An example

(a) An optimal policy for the stochastic environment with Rpsq “ ́0.04. (b) Optimal policies for different values of Rpsq.

⃝c -Trenn, King’s College London 21

An example

Rpsq ď ́1.6284, life is painful so the agent heads for the exit, even if it is a bad state.

́0.4278 ď Rpsq ď ́0.0850, life is unpleasant so the agent heads for the `1 state and is prepared to risk falling into the ́1 state.

́0.0221 ă Rpsq ă 0, life isn’t so bad, and the optimal policy doesn’t take any risks.

Rpsq ą 0, the agent doesn’t want to leave.

⃝c -Trenn, King’s College London 22