How utilities are calculated

So far we have assumed that utilities are summed along a run.

• Not the only way.

In general we need to compute U_r([s_0, s_1, ..., s_n]) for a general U_r(·). That is, the utility of a run.

Before, U_r(·) was just the sum of the rewards of every state. We can consider finite and infinite horizons.

• Is it "game over" at some point?

Turns out that infinite horizons are mostly easier to deal with.

• That is what we will use.

© Trenn, King's College London

How utilities are calculated

Also have to consider whether utilities are stationary or non-stationary.

‚ Think of: does the same state always have the same value?

• E.g., in Pacman, when you pick up a fruit there is a large reward for that tile. That changes after you have picked up the fruit.

Example:

• Normally we prefer one state to another.

• Passing the AI module is preferred to failing it.

• In this case, when the exam takes place (today or next week) is irrelevant.

We assume utilities are stationary.


But are they?

Not clear that utilities are always stationary.

In truth, I don’t always most want to eat cherry pie. Despite this, we will assume that utilities are stationary.


How utilities are calculated

With stationary utilities, there are two ways to establish U_r([s_0, s_1, ..., s_n]) from R(s).

Additive rewards:

U_r([s_0, s_1, ..., s_n]) = R(s_0) + R(s_1) + ... + R(s_n)

as above.

Discounted rewards:

U_r([s_0, s_1, ..., s_n]) = R(s_0) + γ R(s_1) + ... + γ^n R(s_n)

where the discount factor γ is a number between 0 and 1.

The discount factor models the preference of the agent for current over future rewards.
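As a small sketch (my own illustration, not from the slides; per-state rewards along a run are assumed to be given as a Python list), the two schemes can be computed like this:

```python
def additive_utility(rewards):
    """U_r([s_0, ..., s_n]) = R(s_0) + R(s_1) + ... + R(s_n)."""
    return sum(rewards)

def discounted_utility(rewards, gamma):
    """U_r([s_0, ..., s_n]) = R(s_0) + gamma*R(s_1) + ... + gamma**n * R(s_n)."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Hypothetical run: no reward until the final state.
rewards = [0.0, 0.0, 1.0]
print(additive_utility(rewards))         # 1.0
print(discounted_utility(rewards, 0.9))  # 0.9**2 * 1.0, i.e. about 0.81
```

Note how the same final reward counts for less the later it arrives: that is exactly the preference for current over future rewards that γ models.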


How utilities are calculated

There is an issue with infinite sequences with additive, undiscounted rewards.

• What will the utility of a policy be?


How utilities are calculated

There is an issue with infinite sequences with additive, undiscounted rewards.

• What will the utility of a policy be?

Unbounded: +∞ or −∞.

This is problematic if we want to compare policies.


How utilities are calculated

Some solutions are (definitions follow):

• Proper policies

• Average reward

• Discounted rewards


How utilities are calculated

Proper policies always end up in a terminal state eventually. Thus they have a finite expected utility.


How utilities are calculated

We can compute the average reward per time step. Even for an infinite policy this will (usually) be finite.
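A minimal sketch of the idea (my own illustration, not from the slides): over a long run prefix, the additive total diverges while the per-step average stays bounded.

```python
def average_reward(rewards):
    """Mean reward per time step over a run prefix."""
    return sum(rewards) / len(rewards)

# Hypothetical run that earns reward 1 at every step: the additive
# utility grows without bound as the horizon grows, but the average
# reward per step is always 1.
for horizon in (10, 1000):
    prefix = [1.0] * horizon
    print(sum(prefix), average_reward(prefix))
```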


How utilities are calculated

Assume: 0 ≤ γ < 1 and rewards are bounded by R_max.

With discounted rewards the utility of an infinite sequence is finite:

U_r([s_0, s_1, ..., s_n]) = Σ_{t=0}^{n} γ^t R(s_t)
                          ≤ Σ_{t=0}^{∞} γ^t R(s_t)
                          ≤ Σ_{t=0}^{∞} γ^t R_max
                          = R_max / (1 − γ)
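A quick numerical sanity check of the bound (the values of γ and R_max are arbitrary choices of mine, and the infinite sum is truncated at a large horizon):

```python
gamma, r_max = 0.9, 1.0

# Worst case: every state pays the maximum reward R_max.
truncated = sum(gamma**t * r_max for t in range(1000))
bound = r_max / (1 - gamma)

# The truncated discounted sum approaches, but never exceeds, R_max/(1-gamma) = 10.
print(truncated, bound)
```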


Optimal policies

With discounted rewards we compare policies by computing their expected values. The expected utility of executing π starting in s is given by:

U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(S_t) ]

where S_t is the state the agent gets to at time t.

S_t is a random variable, and we compute the probability of all its values by looking at all the runs which end up there after t steps.
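This expectation can be approximated by sampling runs. A sketch under my own assumptions (the `policy`, `step`, and `reward` callables are a hypothetical interface, not anything defined in the slides):

```python
def estimate_utility(s, policy, step, reward, gamma=0.9, n_runs=100, horizon=200):
    """Monte Carlo estimate of U_pi(s) = E[ sum_t gamma^t R(S_t) ].

    policy(s) -> action; step(s, a) -> sampled successor state; reward(s) -> R(s).
    The infinite sum is truncated at `horizon`, which introduces only a small
    error because gamma**t shrinks geometrically.
    """
    total = 0.0
    for _ in range(n_runs):
        state, discount, u = s, 1.0, 0.0
        for _ in range(horizon):
            u += discount * reward(state)
            discount *= gamma
            state = step(state, policy(state))
        total += u
    return total / n_runs
```

For example, if `step` deterministically stays in a state with reward 1 and γ = 0.9, the estimate converges to 1/(1 − γ) = 10, matching the bound from the previous slide.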

Optimal policies

The optimal policy is then:

π* = argmax_π U^π(s)

It turns out that this is independent of the state the agent starts in.


Optimal policies

Here we have the values of the states if the agent executes an optimal policy:

U^{π*}(s)


Optimal policies

Here we have the values of the states if the agent executes an optimal policy:

U^{π*}(s)

What should the agent do if it is in (3, 1)?


Example

The answer is Left.

The best action is the one that maximises the expected utility.

(You have to calculate the expected utility of all the actions to see why Left is the best choice.)
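That calculation is a one-step lookahead: a* = argmax_a Σ_{s'} P(s' | s, a) · U(s'). A sketch (the transition-model interface and the tiny example below are my own assumptions; the slides do not define one):

```python
def best_action(s, actions, transition, U):
    """One-step lookahead: argmax_a sum_{s'} P(s' | s, a) * U(s').

    transition(s, a) returns a list of (probability, successor) pairs.
    """
    def expected_utility(a):
        return sum(p * U[s2] for p, s2 in transition(s, a))
    return max(actions, key=expected_utility)

# Tiny hypothetical example: Left reaches the good state with high probability.
U = {"good": 1.0, "bad": -1.0}
def transition(s, a):
    if a == "Left":
        return [(0.8, "good"), (0.2, "bad")]
    return [(0.2, "good"), (0.8, "bad")]

print(best_action("s", ["Left", "Right"], transition, U))  # Left
```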
