# 程序代写代做代考 Bayesian Reinforcement Learning

Reinforcement Learning

Explora3on and Controlled Sensing

Subramanian Ramamoorthy

School of Informa3cs

21 March, 2017

Example Applica,on: Sampling

Spa,otemporal Fields

21/03/2017 2

Ques,ons for Ocean Sampling

• How to represent the objec,ve that the goal of mo,on

planning is to acquire informa,on which is then used in

model learning?

• Concretely, how to decide where and when to sample on the

basis of this?

21/03/2017 3

Example Problem: Preference Elicita,on

Luggage Capacity?

Two Door? Cost?

Engine Size?

Color? Options?

Shopping for a Car:

21/03/2017 4

Preference Elicita,on Problem

… the process of determining a user’s preferences/u,li,es to

the extent necessary to make a decision on her behalf

• Issues:

– preferences vary widely

– large (mul,-aTribute) outcome spaces

– quan,ta,ve u,li,es (the “numbers”) difficult to assess

• Preference elicita,on can be posed as a POMDP

– Let us try to formulate the state-ac,on-observa,on space…

21/03/2017 5

Plan for This Lecture

1. A look (recap) at what Bayesian upda,ng of model

parameters achieves

2. Informa,on acquisi,on problems and the value of

informa,on (VoI)

3. Policies based on informa,on gain, e.g., for robots sampling

in a naviga,on seang

21/03/2017 6

Bayesian Upda,ng

21/03/2017 7

Recap of Background

• Learning problem: probabilis,c statement of what we believe

about parameters that characterise system behaviours

• Focus is on uncertainty about performance:

– Choice: e.g., of person, technology

– Design: e.g., policies for running business opera,ons

– Policy: e.g., when to sell an asset, maintenance decisions

• Beliefs are influenced by observa,ons we make

• Two key ways of thinking about learning problems:

frequen,st and Bayesian

• Bayesian: start with ini,al beliefs regarding parameters and

combine prior with measurements to compute posteriors

21/03/2017 8

Key Ideas in Bayesian Models

• Begin with a prior distribu,on over unknown parameter µ

• Any number whose value is unknown is a random variable

• Distribu,on of the random variable ~ our belief about how

likely µ is to take on certain values

• Bayesian perspec,ve is well suited to informa,on collec,on

• We always start with some sort of prior knowledge or history

• More important is the conceptual framework that there exists

some truth that we are trying to discover

• Op,mal learning: learn µ as efficiently as possible

21/03/2017 9

µ ⇠ N(✓0,�20) Prior belief

Updates for Independent Beliefs

• Consider a random variable, e.g., observa,on W, normally

distributed. We can write its variance and precision as,

• Having seen n observa,ons, we believe mean of µ is θn and

variance is

• Aker observing the next measurement we update to,

21/03/2017 10

�2W ,�W =

1

�2W

1/�n

✓n+1 =

�n✓n + �WWn+1

�n + �W

�n+1 = �n + �W

Updates for Independent Beliefs

• We could combine these into the more compact form,

• Now, consider the variance of the form,

• This is the variance, given that we have collected n

measurements already, so the only random variable at this

point is Wn+1. Also, think of it as change in variance of θn.

21/03/2017 11

✓n+1 = (�n+1)

�1(�n✓n + �WWn+1)

V arn[·] = V ar[·|W1,W2, …,Wn]

�̃2n = V arn[✓n+1 � ✓n]

Updates for Independent Beliefs

• We could also write θn+1 in a different way by defining the

variable,

• This is a random variable only because we have not yet

observed Wn+1.

• So that we have the update,

21/03/2017 12

Z =

✓n+1 � ✓n

�̃n

✓n+1 = ✓n + �̃nZ

What Happens to Variance

aker a Measurement

V ar(µ) = E[µ2]� (E[µ])2

= E(µ2)� E[(E[µ|W ])2] + E[(E[µ|W ])2]� (E[µ])2

= E[E[µ2|W ]� (E[µ|W ])2] + E[(E[µ|W ])2]� (E[E[µ|W ]])2

= E[V ar(µ|W )] + V ar[E(µ|W )]

21/03/2017 13

E[V ar(µ|W )] = V ar(µ)� V ar(E[µ|W ])

i.e., variance after measurement will, on average, always be smaller

than the original variance. The last term could be zero (if W is irrelevant),

but with a sensible signal this is the benefit to measurements.

Informa,on Acquisi,on and VoI

21/03/2017 14

Informa,on Acquisi,on

• We want to understand the “economics of informa,on”

• Cost of informa,on is highly problem dependent

• Benefits of informa,on can oken be captured using models

that combine the issues of uncertainty in the context of

simple decision problems

• We will look at a simple problem to illustrate key ideas

regarding these benefits

21/03/2017 15

Example: Simple Game as a Decision Tree

• We need to decide whether

to first acquire a signal that

provides informa,on into

the probability of winning

• Illustrated in decision tree

• Game has two outcomes:

– If we win (“W”), we receive

reward R

– If we lose (“L”), we lose -1

– Lack of informa,on is the

informa,on state “N”

21/03/2017 16

Expected Value

• Without any informa,on signal (“N”), probability of winning is

known to be p

• Expected value is,

– where we assume we will not play if expected value is nega,ve

21/03/2017 17

E[V |N ] = max{0, pR� (1� p)}

Remark on nota0on:

Unlike in our previous discussions where V represented value as in

expectation of discounted return, here value will stand for a reward

at the end of the game (following convention in litt. on this topic)

Informa,ve Signal

• Before we play the game, we have the op,on of acquiring an

informa,on signal S (e.g., purchasing a report or checking

informa,on on the internet)

• The signal may be good (“g”) or bad (“b”)

• We assume that this signal will correctly predict the outcome

of this game with probability q, i.e.,

21/03/2017 18

P [S = g|W ] = P [S = b|L] = q

We would like to understand:

• the value of purchasing the signal (elementary informa,on

acquisi,on problem)

• the value of the quality of signal, represented by probability q

Condi,onal Value

• We first need to understand how the signal changes the

expected payoff from the game.

• Condi,onal value of the game given the signal is,

• This equa,on captures our ability to observe the signal, and

then decide whether we want to play the game or not.

• If the signal is bad, expected winnings are,

21/03/2017 19

E[V |S = g] = max{0, R.P [W |S = g]� P [L|S = g]}

E[V |S = b] = max{0, R.P [W |S = b]� P [L|S = b]}

Decision to Acquire

• We next need to find the value of the game given that we

have decided to acquire the signal, but before we know its

realisa,on. This is given by,

• For this, we need the uncondi,onal probabili,es:

21/03/2017 20

E[V |S] = E[V |S = g]P [S = g] + E[V |S = b]P [S = b]

P [S = g] = P [S = g|W ]P [W ] + P [S = g|L]P [L]

P [S = g] = qp+ (1� q)(1� p)

P [S = b] = P [S = b|W ]P [W ] + P [S = b|L]P [L]

P [S = g] = (1� q)p+ q(1� p)

Condi,onal Probability of Win/Loss

Given the Outcome of Signal

• Use Bayes theorem to write,

• Correspondingly, for the bad signal,

21/03/2017 21

P [W |S = g] =

P [W ]P [S = g|W ]

P [S = g]

P [W |S = g] =

pq

qp+ (1� q)(1� p)

P [W |S = b] =

P [W ]P [S = b|W ]

P [S = b]

P [W |S = g] =

p(1� q)

(1� q)p+ q(1� p)

P [L|S = g] = 1� P [W |S = g], etc.

Value of the Signal

• Let S represent the decision to acquire the signal before we

know the outcome of the signal.

• Expected value of the game given that we have chosen to

acquire the signal is,

21/03/2017 22

E[V |S] = E[V |S = g]P [S = g] + E[V |S = b]P [S = b]

E[V |S] = max{0, RP [W |S = g]� P [L|S = g]}(qp+ (1� q)(1� p))+

max{0, RP [W |S = b]� P [L|S = b]}((1� q)p+ q(1� p))

E[V |S] = max{0, R

pq

qp+ (1� q)(1� p)

}(qp+ (1� q)(1� p))+

max{0, R

p(1� q)

(1� q)p+ q(1� p)

�

q(1� p)

(1� q)p+ q(1� p)

}+

((1� q)p+ q(1� p))

Value of the Signal

• The value of the signal which depends on the game reward R,

the probability of winning, p, and the quality of the signal, q,

21/03/2017 23

V s(R, p, q) = E[V |S]� E[V |N ]

p = 0.5 q = 0.7

Summary of the Simple Example

• We have computed the “value” of a discrete piece of

informa,on in a stylized seang.

– Note that the use of value here, while consistent with our

earlier usage, is slightly simpler nota,onally: the return for a

single piece of informa,on does not need a discounted sum

• Next, we turn to a variant where we are allowed to take

mul,ple measurements to increase the precision of the

informa,on gained

21/03/2017 24

Towards Marginal Value of Informa,on

• Imagine that we have a choice between doing nothing (with

reward 0) and choosing a random reward with mean µ.

• Assume that our prior belief about µ is normally distributed

with mean and precision,

• Before playing the game, we are allowed to collect a series of

measurements, (we’ll ignore cost for now)

• We assume that W has the unknown mean µ and a known

precision

21/03/2017 25

(✓0,�0 =

1

�20

)

W1,W2, …,Wn

�W

Es,ma,ng Reward aker n Measurements

• If we choose to make n measurements, the precision of our

es,mate of the reward would be,

• The updated es,mate of our reward (using a Bayesian model)

would be,

21/03/2017 26

�n = �0 + n�W

✓n =

�0✓0 + n�W W̄n

�0 + n�W

W̄n =

1

n

nX

k=1

Wk

• Create a random variable capturing belief about reward

• Use this to make a decision about whether to play the game

• Start with a known iden,ty,

where,

• We can write the change in variance (variance of θn given

what we knew before we took the n measurements),

21/03/2017 27

V ar(µ) = E[V ar(µ|W1, …,Wn)] + V ar[E[µ|W1, …,Wn]]

V ar(µ|W1, …,Wn) =

1

�n

= (�0 + n�W )

�1

E[µ|W1, …,Wn] = ✓n

�̃2(n) = V ar(✓n) = V ar(µ)� E[

1

�n

]

�̃2(n) = V ar(✓n) =

1

�0

�

1

�n

=

1

�0

�

1

�0 + n�W

Value of Informa,on

• With Z deno,ng a standard zero mean –unit variance normal

distribu,on, we can write,

• Aker our n measurements, we are going to choose to play the

game if we believe the value of the game is non-zero.

• That value is

• For each distribu,on family of interest, one could write down

such an expression and expand to get analy,cal formula,on

of VoI

21/03/2017 28

✓n = ✓0 + �̃

2(n)Z

Vn = E[max{0, ✓n}]

Example VoI Curves

21/03/2017 29

The slope of these curves provide a marginal VoI

Explora,on with a Mobile Robot

21/03/2017 30

Explora,on Problems

• Explora,on: control a mobile robot so as to maximize

knowledge about the external world

• Example: robot needs to acquire a map of a sta,c

environment. If we represent map as “occupancy grid”,

explora,on is to maximise cumula,ve informa,on we have

about each grid cell

• POMDPs already subsume this func,on but we need to define

an appropriate payoff func0on

• One good choice is informa,on gain:

Reduc,on in entropy of a robot’s belief as a func,on of its

ac,ons

21/03/2017 31

Explora,on Heuris,cs

• While POMDPs are conceptually useful here, we may not

want to use them directly – state/observa,on space is huge

• We will instead try to derive greedy heuris,c based on the

no,on of informa0on gain.

• Limit lookahead to just one explora,on ac,on

– The explora,on ac,on could itself involve a sequence of

control ac,ons (but logically, it will serve as one

explora,on ac,on)

– For instance, select a loca,on to explore anywhere in the

map, then go there

21/03/2017 32

Informa,on and Entropy

• The key to explora,on is informa,on.

• Entropy of expected informa,on:

• Entropy is at its maximum for a uniform distribu,on, p

• Condi,onal entropy is the entropy of a condi,onal distrib.

• In explora,on, we seek to minimize the expected entropy of

the belief aker execu,ng an ac,on

• So, condi,on on measurement z and control u that define the

belief state transi,on

21/03/2017 33

Hp(x) = �

Z

p(x) log p(x)dx �

X

x

p(x) log p(x)or

Condi,onal Entropy aker Ac,on/Observa,on

• With B(b,z,u) deno,ng the belief aker execu,ng control u and

observing z under belief b,

• Condi,onal entropy of state x’ aker execu,ng ac,on u and

measuring z is given by,

• The condi,onal entropy of the control is,

21/03/2017 34

Hb(x

0|z, u) = �

Z

B(b, z, u)(x0) logB(b, z, u)(x0)dx0

Hb(x

0|u) = Ez[Hb(x0|z, u)]

=

Z Z

Hb(x

0|z, u)p(z|x0)p(x0|u, x)b(x)dzdx0dx

Greedy Techniques

• Expected informa,on gain lets us phrase explora,on as a

decision theore,c problem.

• Informa,on Gain is

21/03/2017 35

Ib(u) = Hp(x)�Hb(x0|u)

= Hp(x)� Ez[Hb(x0|z, u)]

Greedy Techniques

• If r(x,u) is the cost of applying control u in state x (trea,ng

cost as nega,ve numbers), then op,mal greedy explora,on

for the belief b maximizes difference between informa,on

gain and cost,

21/03/2017 36

⇡(b) = argmax

u

↵(Hp(x)� Ez[Hb(x0|z, u)]) +

Z

r(x, u)b(x)dx

Expected information gain

(Original entropy – Cond. Entropy) Expected cost

Example: Combining Explora,on and

Mapping

• By reasoning about control, the mapping process can be

made much more effec,ve

• Ques,on: Where to move next in a map?

21/03/2017 37

Explora,on Problem: Visually

21/03/2017 38

Map Entropy

21/03/2017 39