
Reinforcement Learning

Introduction

Subramanian Ramamoorthy

School of Informatics

17 January 2017

Admin

Lecturer: Subramanian (Ram) Ramamoorthy

IPAB, School of Informatics

s.ramamoorthy@ed (preferred method of contact)

Informatics Forum 1.41, x505119

Teaching Assistant: Svetlin Penkov, SV.Penkov@ed.ac.uk

Class representative?

Mailing list: Are you on it? I will use it for announcements!


Admin

Lectures:

Tuesday and Friday 12:10 – 13:00 (Teviot Lecture Theatre)

Assessment: Homework/Exam 10+10% / 80%

– HW1: Out 3 Feb, Due 17 Feb

• Some pen+paper questions & setup for HW2

– HW2: Out 28 Feb, Due 28 Mar

• Implement an agent using ALE


Admin

Webpage: www.informatics.ed.ac.uk/teaching/courses/rl

• Lecture slides to be uploaded, typically, the day before

Main Textbook

– R. Sutton and A. Barto, Reinforcement Learning, 2nd ed.

Other Suggested Books

– See list on the course web page: these go into more detail on specific topics (e.g., dynamic programming, POMDPs, active sensing) than is possible or necessary in lectures. These books are FYI and optional.

Other readings, e.g., papers, to be suggested for specific lectures.

Background: Mathematics, Matlab, Exposure to machine learning?


Problem of Learning from Interaction

– with environment

– to achieve some goal

• Baby playing. No teacher. Sensorimotor connection to environment.

– Cause <-> effect

– Action <-> consequences

– How to achieve goals

• Learning to drive a car, hold a conversation, etc.

• Environment’s response affects our subsequent actions

• We find out the effects of our actions later


Rough History of RL Ideas

• Psychology – learning by trial and error

… actions followed by good or bad outcomes have their tendency to be reselected altered accordingly

- Selectional: try alternatives and pick good ones

- Associative: associate alternatives with particular situations

• Computational studies (e.g., credit assignment problem)

– Minsky’s SNARC, 1950

– Michie’s MENACE, BOXES, etc. 1960s

• Temporal Difference learning (Minsky, Samuel, Shannon, …)

– Driven by differences between successive estimates over time


Rough History of RL, contd.

• In the 1970s and 80s, many researchers, e.g., Klopf, Sutton & Barto, …, looked seriously at issues of “getting results from the environment” as opposed to supervised learning

– Although supervised learning methods such as backpropagation were sometimes used, the emphasis was different

• Stochastic optimal control (mathematics, operations research)

– Deep roots: Hamilton-Jacobi → Bellman/Howard

– By the 1980s, people began to realize the connection between MDPs and the RL problem as above…


What is the Nature of the Problem?

• As you can tell from the history, there are many ways to understand the problem – you will see this as we proceed through the course

• Unifying perspective:

– Sequential decision making

– Stochastic optimization over time

• Let us unpack this through a few application examples…


Example Domain: Robotics

[Figure: an agent's perception-action loop with the environment, subject to adversarial actions & other agents and driven by high-level goals]

Problem: How to generate actions, to achieve high-level goals, using limited perception and incomplete knowledge of the environment?


Example Domain:

Natural Language Dialogue

[Figure: dialogue management architecture. Source: http://www.agilingua.com/images/DM_EN.jpg]

How to Model Decisions, Computationally?

• Who makes them?

– Individual

– ‘Group’

• What are the conditions?

– Certainty

– Risk

– Uncertainty

For most of this course, we’ll take the ‘individual’ viewpoint and we’ll be somewhere in between risk and uncertainty


How to Model Decisions under Certainty?

• Given a set of possible acts

• Choose one that maximizes some given index

If a is a generic act in a set of feasible acts A, and f(a) is an index being maximized, then:

Problem: Find a* in A such that f(a*) ≥ f(a) for all a in A.

The index f plays a key role, e.g., think of buying a painting.

Essential problem: How should the subject select an index function such that her choice reduces to finding maximizers?
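
A minimal sketch, in Python, of this reduction of choice to maximization; the acts and index values below are invented for illustration:

```python
def best_act(acts, f):
    """Return an act a* with f(a*) >= f(a) for all feasible acts."""
    return max(acts, key=f)

# Hypothetical index over three paintings the subject might buy.
index = {"Monet": 7.0, "Turner": 9.5, "Klee": 4.2}
print(best_act(index, index.get))  # -> Turner
```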


An Operational Way to Find an Index Function

• Observe the subject’s behaviour in restricted settings and predict purchase behaviour from that:

• Instruct the subject as follows:

– Here are ten valuable reproductions

– We will present these to you in pairs

– You will tell us which one of the pair you prefer to own

– After you have evaluated all pairs, we will pick a pair at random and present you with the choice you previously made (it is to your advantage to remember your true tastes)

• The subject’s behaviour is as though there is a ranking over all paintings, so each painting can be summarized by a number


Some Desiderata of this Ranking

• Transitivity: The previous argument only makes sense if the rank is transitive – if A is preferred in (A, B) and B is preferred in (B, C), then A is preferred in (A, C); and this holds for all triples of alternatives A, B and C

• Ordinal nature of index: One is tempted to immediately turn the ranking into a latent measure of ‘satisfaction’, but that is premature as utilities need not be unique.

e.g., we could assign 3 utiles to A, 2 utiles to B and 1 utile to C to explain the choice behaviour

Equally, 30, 20.24 and 3.14 would yield the same choice

While it is OK to compare indices, it isn’t (yet) OK to add/multiply


What Happens if we Relax Transitivity?

• Assume Pandora says (in the pairwise comparisons):
– Apple < Orange
– Orange < Fig
– Fig < Apple
• Why is this a problem for Pandora?
• Assume a merchant who transacts with her as follows:
– Pandora has an Apple at the start of the conversation
– He offers to exchange Orange for Apple, if she gives him a penny
– He then offers an exchange of Fig for Orange, at the price of a penny
– Then, offers Apple for the Fig, for a penny
– Now, what is Pandora’s net position? (see the sketch below)
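
A small sketch of the resulting ‘money pump’, using the item names and penny prices from the story above:

```python
# With the intransitive preferences Apple < Orange < Fig < Apple,
# Pandora accepts each one-penny trade; after one full cycle she holds
# her original apple but is three pence poorer, and the cycle can repeat.
trades = [("Apple", "Orange"), ("Orange", "Fig"), ("Fig", "Apple")]
holding, pennies = "Apple", 0
for give, get in trades:
    assert holding == give               # she owns what the merchant asks for
    holding, pennies = get, pennies - 1  # swap, paying a penny
print(holding, pennies)  # -> Apple -3
```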
Decision Making under Risk
• Initially appeared as the analysis of fair gambles, which needed some notion of utility
• A gamble has n outcomes, worth a1, …, an
• The probability of each outcome is p1, …, pn
• How much is it worth to participate in this gamble?
b = a1·p1 + … + an·pn
One may treat this monetary expected value as a fair price
Is this a sufficient description of choice behaviour under risk?
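As a quick illustration, the fair price is just a dot product of outcomes and probabilities; the gamble below is hypothetical:

```python
# Fair price b = a1*p1 + ... + an*pn of a gamble whose outcomes a_i
# occur with probabilities p_i.
def expected_value(a, p):
    return sum(ai * pi for ai, pi in zip(a, p))

# Hypothetical gamble: win 10 with probability 0.3, lose 2 otherwise.
print(expected_value([10, -2], [0.3, 0.7]))  # -> 1.6
```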
St. Petersburg Paradox of D. Bernoulli
• A fair coin is tossed until a head appears
• The gambler receives 2^n if the first head appears on trial n
• Probability of this event = probability of a tail on each of the first (n-1) trials and a head on trial n, i.e., (1/2)^n
Expected value = 2·(1/2) + 4·(1/2)^2 + 8·(1/2)^3 + … = 1 + 1 + 1 + … = ∞
• Are you willing to bet in this way? Is anyone?
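One way to get a feel for the divergence is to simulate the gamble; a rough sketch (the sample mean keeps drifting upward rather than settling):

```python
import random

def petersburg():
    """Payoff of one St. Petersburg gamble: 2^n if first head is on trial n."""
    n = 1
    while random.random() < 0.5:  # tails: keep tossing
        n += 1
    return 2 ** n

for trials in (10**3, 10**5, 10**6):
    mean = sum(petersburg() for _ in range(trials)) / trials
    print(trials, mean)  # the running mean grows without settling down
```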
Defining Utility
• Bernoulli went on to argue that people do not act in this way
• The thing to average is the ‘intrinsic worth’ of the monetary values, not the absolute values
e.g., the intrinsic worth of money may increase with money, but at a diminishing rate
• Let us say the utility of m is log10(m); then the expected value is
log10(2)·(1/2) + log10(4)·(1/2)^2 + log10(8)·(1/2)^3 + … = b < ∞
The monetary fair price of the gamble is the a for which log10(a) = b.
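Numerically, the series can be checked in a couple of lines; since Σ n/2^n = 2, b = 2·log10(2) = log10(4), so the fair price works out to 4:

```python
from math import log10

# b = sum over n of log10(2^n) * (1/2)^n; truncating at n = 60 leaves a
# tail far below double precision.
b = sum(log10(2 ** n) * 0.5 ** n for n in range(1, 60))
print(b, 10 ** b)  # -> ~0.602, i.e., a monetary fair price of ~4
```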
Some Critiques of Bernoulli’s Formulation
von Neumann and Morgenstern (vNM), who initiated the formal study of game theory, raised the following questions:
• The assignment of utility to money is arbitrary and ad hoc
– There are infinitely many functions that capture a ‘diminishing rate’; how should we choose?
– The association may vary from person to person
• Why is the decision defined in terms of the expected value of the utility?
– Is this actually descriptive of a single gambler, in a one-shot choice?
– How to define/constrain utility?
von Neumann & Morgenstern Formulation
• If a person is able to express preferences between every possible pair of gambles,
where the gambles are taken over some basic set of alternatives,
• then one can introduce utility associations to the basic alternatives in such a manner that
• if the person is guided solely by the expected value of utility, he is acting in accord with his true tastes
– provided his tastes are consistent in some way
Constructing Utility Functions
• Suppose we know the following preference order:
– a < b ~ c < d < e
• The following are utility functions that capture this:

|   | a  | b   | c   | d   | e |
|---|----|-----|-----|-----|---|
| U | 0  | 1/2 | 1/2 | 3/4 | 1 |
| V | -1 | 1   | 1   | 2   | 3 |
| W | -8 | 0   | 0   | 1   | 8 |

– So, in situations like the St. Petersburg paradox, the revealed preference of any realistic player may differ from the case of infinite expected value
– Satisfaction at some large value, risk tolerance, time preference, etc.
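
A quick check, using the table’s values, that U, V and W encode the same ordinal ranking (any order-preserving transformation represents the same preferences):

```python
from itertools import combinations

U = {"a": 0, "b": 0.5, "c": 0.5, "d": 0.75, "e": 1}
V = {"a": -1, "b": 1, "c": 1, "d": 2, "e": 3}
W = {"a": -8, "b": 0, "c": 0, "d": 1, "e": 8}

def same_order(u, v):
    """True if u and v agree on every pairwise comparison, ties included."""
    return all((u[x] > u[y]) == (v[x] > v[y]) and
               (u[x] == u[y]) == (v[x] == v[y])
               for x, y in combinations(u, 2))

print(same_order(U, V) and same_order(U, W))  # -> True
```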
Certainty Equivalents and Indifference
• The previous statement applies equally well to certain events and to gambles or lotteries
• So even attitudes regarding tradeoffs between the two ought to be captured
• Basic issue – how to compare?
• Imagine the following choice (A > B > C pref.): (a) you get B for certain, (b) you get A with probability p and C otherwise

• If p is near 1, option (b) is better; if p is near 0, then option (a): there is a single point where we switch

• Indifference is described as something like

u(B) = (2/3)·(1) + (1 – 2/3)·(0) = 2/3
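
A sketch of the switch point, taking u(A) = 1, u(C) = 0 and an assumed u(B) = 2/3 for illustration:

```python
u_A, u_C, u_B = 1.0, 0.0, 2.0 / 3.0  # u_B is an assumed value

def prefers_gamble(p):
    """Does the gamble (A with probability p, else C) beat B for certain?"""
    return p * u_A + (1 - p) * u_C > u_B

print(prefers_gamble(0.9), prefers_gamble(0.5))  # -> True False
print("switch point at p =", u_B)                # -> 2/3
```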


[Figure: a utility scale with C and A at its ends and B lying between them, at 2/3 of the way from C to A]


Decision Making under Uncertainty

• A choice must be made from among a set of acts, A1, …, Am.

• The relative desirability of these acts depends on which state of nature prevails, one of s1, …, sn.

• As the decision maker, we know that one of several things is true and this influences our choice

• Key point about decision making: whether or not you have a probabilistic characterization of the alternatives has a big impact on how to approach the problem


Example: Savage’s Omelet Problem

Your friend has broken 5 good eggs into a bowl when you come in to volunteer and finish the omelet.

A sixth egg lies unbroken (you must use it or waste it altogether).

Your three acts: break it into the bowl; break it into a saucer, inspect it, and pour it into the bowl; or throw it away uninspected


Decision in Savage’s Omelet Problem

• To each outcome, we could assign a utility and maximize it

• What do we know about the state of nature?

– We may act as though there is one true state; we just don’t know it

– If we assume a probability over si, this is decision under risk

– If we do not assume a probability over si, what might one do?
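
One way to organize the problem is an acts-by-states payoff table; the utilities below are invented for illustration, and with no probability over the states one might, for instance, pick the act whose worst case is least bad (maximin):

```python
payoffs = {                       # (sixth egg good, sixth egg rotten)
    "break into bowl":   (6, 0),  # six-egg omelet vs. five eggs ruined
    "inspect in saucer": (5, 4),  # safe either way, but a saucer to wash
    "throw it away":     (3, 4),  # five-egg omelet, maybe a wasted egg
}

# Maximin: choose the act with the best worst-case payoff.
maximin = max(payoffs, key=lambda act: min(payoffs[act]))
print(maximin)  # -> "inspect in saucer"
```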
