# What is MACHINE LEARNING?


Prof. Dan A. Simovici

UMB


Outline

1 A Formal Model

2 Empirical Risk Minimization (ERM)

3 ERM with Inductive Bias

4 An Example : Regression


What is Machine Learning?

Machine learning (ML) studies the construction and analysis of algorithms that learn from data.

ML algorithms construct models starting from samples of data and use these models to make predictions or decisions.

ML and its applied counterpart, data mining, deal mostly with problems whose complexity makes it difficult to formulate algorithms that can be readily translated into programs.

ML techniques avoid a difficulty of standard problem-solving techniques: the requirement of a complete understanding of the data at the beginning of the problem-solving process.


Typical ML Activities

Examples

finding diagnoses for patients starting from a series of their symptoms;

determining the creditworthiness of customers based on their demographics and credit history;

classifying documents based on their main topic;

speech recognition;

computational biology applications.


Supervised Learning

Often ML aims to compute a label for each analyzed piece of data, a label that depends on the characteristics of the data.

The general approach, known as supervised learning, is to begin with a number of labelled examples (where answers are known or are provided by a supervisor), known as the training set.

The goal is to produce an algorithm that computes the function giving the labels of the remaining examples.


Unsupervised Learning

In unsupervised learning the challenge is to identify the structure hidden in the data, e.g., to identify groups of data such that strong similarity exists between objects that belong to the same group, while objects that belong to different groups are sufficiently distinct.

This activity is known as clustering and it is a typical example of unsupervised learning.

The term “unsupervised” refers to the fact that this type of learning does not require operator intervention. Other machine learning activities of this type include outlier identification, density estimation, etc.


Semi-supervised Learning

An intermediate type of activity, referred to as semi-supervised learning, requires a limited involvement of the operator.

For example, in the case of clustering, this may allow the operator to specify pairs of objects that must belong to the same group and pairs of objects that may not belong to the same group.


Quality of the Learning Process

The quality of the learning process is assessed through its capability for generalization, that is, the capacity of the produced algorithm for computing correct labels for yet unseen examples.

the correct behavior of an algorithm on the training data is, in general, no guarantee of its generalization prowess;

sometimes the pursuit of a perfect fit of the learning algorithm to the training data leads to overfitting; this term describes the situation when the algorithm acts correctly on the training data but is unable to predict labels for unseen data;

in an extreme case, a rote learner will memorize the labels of its training data and nothing else. Such a learner will be perfectly accurate on its training data but completely lack any generalization capability.
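The rote learner above can be sketched in a few lines of Python (a toy illustration; the training pairs are made up):

```python
# A rote learner memorizes the labels of its training data and nothing else.
# The training pairs below are arbitrary, made up for illustration.
train = [((1.0, 2.0), 1), ((2.0, 0.5), -1), ((0.3, 0.7), 1)]

memory = {x: y for x, y in train}

def rote_predict(x):
    # Perfect recall on training points; no answer at all on unseen ones.
    return memory.get(x)

# Perfectly accurate on its training data ...
train_error = sum(rote_predict(x) != y for x, y in train) / len(train)
# ... but with no generalization capability whatsoever.
unseen_answer = rote_predict((5.0, 5.0))
```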


Active and Reinforcement Learning

A machine learning algorithm can achieve greater accuracy with fewer training examples if it is allowed to choose the data from which it learns, that is, to apply active learning.

An active learner may pose queries soliciting a human operator to label a data instance. Since unlabelled data is abundant and, in many cases, easily obtained there are good reasons to use this learning paradigm.

Reinforcement learning is a machine-learning paradigm inspired by psychology, which emphasizes learning by an agent from its direct interaction with the data in order to attain certain learning goals, e.g., accuracy of label prediction.

The framework of this type of learning makes use of states and actions of an agent, and their rewards, and deals with uncertainty and nondeterminism.


A Formal Model

The Learner’s Input

The domain set: X consists of the objects that we wish to label; usually objects are represented by a vector of features. We refer to these objects as instances.

The label set: Y is generally a finite set, e.g. {0, 1} or {−1, 1}.

Training data: S = {(x1,y1),…,(xm,ym)} is a finite sequence of pairs in X × Y, that is, a sequence of labelled objects. Each pair (xi , yi ) is a training example.


The Learner’s Output

The learner is required to produce a function f : X −→ Y starting from

f(x1) = y1, f(x2) = y2, …, f(xm) = ym,

as provided by the training data S. This function is known as a predictor, a hypothesis, or a classifier.


A Data Generation Model

Assumptions:

the data has a probability distribution function D;

the learner does not know the probability distribution function D;

there exists some correct labelling function f : X −→ Y which we seek to determine, knowing that f(xi) = yi for 1 ≤ i ≤ m.


Measures of Success

Definition

The true error of a prediction rule h : X −→ Y is

L(D,f )(h) = D({x | h(x) ̸= f (x)}).

L(D,f )(h) is a loss measure: it evaluates the probability that the prediction rule h(x) will produce a result distinct from the labelling function f .

The error is measured with respect to probability distribution D and the correct labelling function f .

Alternative terms used for L(D,f)(h):

generalization error;

risk;

the true error of h.

The true error L(D,f )(h) of h is not known to the learner because D and f are unknown.
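Although the learner never knows D and f, we can play omniscient for illustration and compute the true error exactly on a tiny finite domain (distribution, labelling function, and prediction rule below are all made up):

```python
# True error L_(D,f)(h) = D({x | h(x) != f(x)}) on a tiny finite domain.
# The distribution D and labelling f below are made up for illustration.
D = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}   # probability of each instance
f = lambda x: x % 2                     # the "correct" labelling function
h = lambda x: 1                         # a candidate prediction rule

# Total probability mass of the instances that h mislabels (here x = 0, 2).
true_error = sum(p for x, p in D.items() if h(x) != f(x))
```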


Empirical Risk Minimization (ERM)

The Training Error

The empirical error of a predictor h can be considered only in relation with a labelled sample.

Definition

Let S = {(x1, y1), . . . , (xm, ym)} be a labelled sample. The empirical error of a predictor h on sample S is the number

LS(h) = |{i | 1 ≤ i ≤ m, h(xi) ≠ yi}| / m.

An alternative name for LS (h) is empirical risk.
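The empirical error is just the fraction of mislabelled sample points; a minimal sketch (the labelled sample is made up):

```python
# Empirical error L_S(h): fraction of sample points the predictor mislabels.
S = [(0, 0), (1, 1), (2, 0), (3, 0), (4, 0)]  # made-up labelled sample
h = lambda x: x % 2                            # predicts the parity of x

# h disagrees with the label only at (3, 0), so L_S(h) = 1/5.
L_S = sum(h(x) != y for x, y in S) / len(S)
```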


Empirical Risk Minimization (ERM)

ERM is an approach that seeks a predictor that minimizes the training error LS(h).

Let h be the predictor defined as h(xi) = yi for all xi that occur in S, and h(x) = k for any other x, where k does not label any object in S. The empirical error will be 0, but h will fail miserably on unseen data. This phenomenon is called overfitting: designing a predictor to fit the sample.


ERM with Inductive Bias

Inductive Bias

ERM can lead to overfitting. Therefore, we seek supplementary conditions that ensure that ERM will not overfit (conditions under which a predictor with good performance on the training data will have good performance on unseen data).

Common solution:

Use a restricted hypothesis class H chosen in advance, that is before seeing the data.

This approach is known as the inductive bias.


For a given class H (known as the hypothesis class) and a training sample S, the hypothesis

h = ERMH(S)

is chosen by the ERM rule as a predictor h ∈ H with the lowest possible error over S.

Both large LS(h) values and a strong inductive bias are negative; the question is how to achieve a balance between these factors.

Let argminh∈H LS(h) be the set of hypotheses in H that achieve the minimum value of LS(h). This approach aims to have

ERMH(S) ∈ argminh∈H LS(h).

Definition

hS is the hypothesis that results from applying ERMH to the sample S, namely

hS ∈ argminh∈H LS(h).
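For a finite class H, ERMH(S) can be implemented by exhaustive search over H; a sketch with a made-up sample and a made-up class of threshold predictors:

```python
# ERM over a finite hypothesis class: return a hypothesis in H with the
# smallest empirical error on S. The class and the sample are made up.
S = [(0.5, 0), (1.2, 0), (2.9, 0), (3.1, 1), (4.0, 1), (4.8, 1)]
H = [lambda x, t=t: int(x >= t) for t in range(6)]  # thresholds t = 0..5

def L_S(h):
    # Empirical error of h on S.
    return sum(h(x) != y for x, y in S) / len(S)

h_S = min(H, key=L_S)  # ERM_H(S): a minimizer of the empirical error
```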


Finite Hypothesis Classes

A simple inductive bias: the class H is finite.

Definition

The Realizability Assumption: there exists h∗ ∈ H such that

L(D,f)(h∗) = 0.

This implies that with probability 1 over random samples S, where the instances of S are sampled according to D and labelled by f, we have LS(h∗) = 0.

The realizability assumption implies that for every ERM hypothesis hS we have LS(hS) = 0 with probability 1.


Samples are obtained by drawing values from a distribution D independently of each other.

Since samples are drawn randomly from D, the true error L(D,f )(hS ) is a random variable.

We cannot predict with certainty that a sample S will suffice to direct the learner towards a good classifier.


Example

To predict whether a papaya fruit is tasty or not, without ever tasting papayas, one needs to decide which features of a papaya the prediction should be based on.

On the basis of your past experience with other fruits, two features will be used: the papaya’s color, ranging from dark green, through orange and red to dark brown, and the papaya’s softness, ranging from rock hard to mushy.

The input for figuring out the prediction rule is a sample of papayas that are examined for color and softness and then tasted and found out whether they were tasty or not.


Example

There is always some (small) chance that all the papayas we have happened to taste were not tasty (a non-representative sample), in spite of the fact that, say, 70% of the papayas are tasty. In such a case, ERMH(S) may be the constant function that labels every papaya as “not tasty” (and has 70% error on the true distribution of papayas).

We will therefore address the probability to sample a training set for which L(D,f)(hS) is not too large. Usually, we denote the probability of getting a nonrepresentative sample by δ, and refer to 1 − δ as the confidence parameter of our prediction.


Approximately Correct Predictors

Since we cannot guarantee perfect label prediction, we introduce another parameter for the quality of prediction, the accuracy parameter, denoted by ε.

The probability of getting a non-representative sample is denoted by δ. 1 − δ is the confidence parameter.

The accuracy parameter ε: the event involving the true error

L(D,f )(h) > ε

is a failure of the learner.

If L(D,f)(h) ≤ ε (the true error is at most ε), then the output of the learner is an approximately correct predictor.


Fix f and seek an upper bound for the probability of sampling m instances that will lead to a failure of the learner.

Let S = (x1, . . . , xm). We would like to upper bound the probability that the true error is larger than ε, that is, Dm({S | L(D,f )(hS ) > ε}).

The set Hb of bad hypotheses is defined as:

Hb = {h ∈ H | L(D,f)(h) > ε}.

Note that if h ∈ Hb we have 1 − L(D,f)(h) < 1 − ε.

Define the set of misleading examples:

M = {S | ∃h ∈ Hb, LS(h) = 0}.

M contains those samples that produce an empirical error 0 on some bad hypothesis, that is, make a bad hypothesis look like a good hypothesis.
Note that if a sample S is such that L(D,f)(hS) > ε, then there exists a bad hypothesis h ∈ Hb such that LS(h) = 0.

Goal: to bound the probability of the event L(D,f )(hS ) > ε.


The realizability assumption implies LS(hS) = 0. Therefore, the event

L(D,f)(hS) > ε

can happen only if for some h ∈ Hb we have LS(h) = 0, that is, only if

{S | L(D,f)(hS) > ε} ⊆ M.

The set of misleading examples M can be written as:

M = ⋃_{h∈Hb} {S | LS(h) = 0},

hence

Dm({S | L(D,f)(hS) > ε}) ≤ Dm(M) = Dm(⋃_{h∈Hb} {S | LS(h) = 0}).


Elementary probability (the union bound) implies that

P(A1 ∪ · · · ∪ Ak) ≤ P(A1) + · · · + P(Ak).


Therefore, by elementary probability theory we have

Dm(⋃_{h∈Hb} {S | LS(h) = 0}) ≤ Σ_{h∈Hb} Dm({S | LS(h) = 0}).

Fix some bad hypothesis h ∈ Hb. The event LS(h) = 0 is equivalent to h(xi) = f(xi) for 1 ≤ i ≤ m. Since the examples in the training set are sampled independently and identically distributed (iid), we get

Dm({S | LS(h) = 0}) = Dm({S | h(xi) = f(xi) for 1 ≤ i ≤ m}) = ∏_{i=1}^{m} D({xi | h(xi) = f(xi)}).


For each individual sampling we have:

D({xi | h(xi) = f(xi)}) = 1 − L(D,f)(h) < 1 − ε,

where the last inequality follows from the fact that h ∈ Hb. Note that 1 − ε ≤ e^{−ε}.

Thus,

Dm({S | LS(h) = 0}) ≤ (1 − ε)^m ≤ e^{−εm}.
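The inequality 1 − ε ≤ e^{−ε} behind this bound is easy to check numerically; a quick sketch for one illustrative choice of ε and m:

```python
# Numerical check of (1 - eps)^m <= e^(-eps * m) for illustrative values.
import math

eps, m = 0.1, 50
lhs = (1 - eps) ** m        # the exact bound before relaxation
rhs = math.exp(-eps * m)    # the relaxed exponential bound
```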


Since

Dm(⋃_{h∈Hb} {S | LS(h) = 0}) ≤ Σ_{h∈Hb} Dm({S | LS(h) = 0}),

we conclude that

Dm(⋃_{h∈Hb} {S | LS(h) = 0}) ≤ |Hb| e^{−εm} ≤ |H| e^{−εm}.


Theorem

Let H be a finite hypothesis class, δ ∈ (0, 1), ε > 0, and let m be an integer such that

m ≥ log(|H|/δ) / ε.

Then, for any labelling function f and for any distribution D for which the realizability assumption holds, with probability at least 1 − δ over the choice of an iid sample S of size m, we have that for every ERM hypothesis hS it holds that L(D,f)(hS) ≤ ε.
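The bound gives a concrete sample-size recipe; for instance, with the illustrative (made-up) values |H| = 1000, ε = 0.05, δ = 0.01:

```python
# Sufficient sample size from the theorem: m >= log(|H|/delta) / eps.
import math

def sample_size(H_size, eps, delta):
    # Smallest integer m satisfying the bound.
    return math.ceil(math.log(H_size / delta) / eps)

m = sample_size(H_size=1000, eps=0.05, delta=0.01)  # ceil(log(100000)/0.05)
```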


Thus, for sufficiently large m, the ERM rule over a finite hypothesis class H will be probably (with a confidence of 1 − δ) approximately correct (up to an error of ε).

An Example: Regression

Define the class of affine functions Ld that consists of functions of the form hw,b : Rd −→ R, where

hw,b(x) = (w, x) + b = Σ_{i=1}^{d} wi xi + b.

Note that:

Each function hw,b is parametrized by w ∈ Rd and b ∈ R.

Each function hw,b takes as input a vector x and returns a scalar (w,x)+b.


Different hypothesis classes are constructed using functions from Ld.

THE CLASS OF HALFSPACES:

The class of halfspaces is designed for binary classification problems, X = Rd and Y = {−1, 1}.

The realizability assumption means that it is possible to separate all positive examples from all negative examples with a hyperplane; this is the separable case.


The ERM problem for halfspaces in the realizable case can be expressed as a linear programming problem.

Let S = {(xi, yi) | 1 ≤ i ≤ m} be a training set of size m. Since we assume the realizable case, an ERM predictor should have zero errors on the training set. Thus, we are looking for w ∈ Rd such that

sign((w, xi)) = yi for 1 ≤ i ≤ m.


Equivalently, we are seeking w such that

yi (w, xi) > 0 for 1 ≤ i ≤ m.

Since we assume realizability, such a vector w must exist. Let w∗ be one.

Define

γ = min_i yi (w∗, xi) and w̃ = (1/γ) w∗.

For all i we have

yi (w̃, xi) = (1/γ) yi (w∗, xi) ≥ 1.

Thus, there exists a vector w̃ that is an ERM predictor.
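The rescaling argument can be checked on a toy separable dataset (the points and the separating vector w∗ below are made up):

```python
# Realizable halfspaces: rescaling a separating w* by the minimum margin
# gamma yields w~ with y_i * <w~, x_i> >= 1 for every training example.
S = [((1.0, 2.0), 1), ((2.0, 1.0), 1), ((-1.0, -1.0), -1), ((-2.0, 0.5), -1)]
w_star = (1.0, 1.0)  # assumed to satisfy sign(<w*, x_i>) = y_i on S

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

gamma = min(y * dot(w_star, x) for x, y in S)  # smallest (positive) margin
w_tilde = tuple(c / gamma for c in w_star)

margins = [y * dot(w_tilde, x) for x, y in S]  # all should be >= 1
```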


Linear Regression

X is a subset of Rd and the label set Y is a subset of R.

The goal is to learn a linear function h : Rd −→ R that best approximates the relationship between the variables (e.g., predicting the weight of a baby as a function of age and weight at birth).

The hypothesis class is

Hreg = Ld = {h : Rd −→ R, h(x) = w′x + b}.


A loss function for linear regression could be ℓ(h, (x, y)) = (h(x) − y)² (the squared loss).


Least Squares

Least squares is an algorithm that solves the ERM problem for the linear regression predictors with respect to the squared loss.

The ERM problem starts with a training set S and seeks to find w, where

w = argmin_w LS(h) = argmin_w (1/m) Σ_{i=1}^{m} (w′xi − yi)².

The minimum condition amounts to ∂LS(h)/∂wp = 0 for 1 ≤ p ≤ d.


We have

LS(h) = (1/m) Σ_{i=1}^{m} (w′xi − yi)² = (1/m) Σ_{i=1}^{m} (w1 x1i + · · · + wd xdi − yi)².

Therefore,

∂LS(h)/∂wp = (2/m) Σ_{i=1}^{m} (w1 x1i + · · · + wd xdi − yi) xpi = 0,

which implies

Σ_{i=1}^{m} (w1 x1i xpi + · · · + wd xdi xpi − yi xpi) = 0,

for every p, 1 ≤ p ≤ d.


The last equalities can be written in compact form as XX′w = Xy, which requires solving the equation Aw = b, where A = XX′ and b = Xy.

Note that:

if A is invertible, the solution to the ERM problem is w = A⁻¹b;

b = Xy is a linear combination of the columns of X.
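A minimal sketch of solving the normal equations A w = b for a 2×2 system in pure Python (the data is made up; y is exactly affine in x, so the fit is exact):

```python
# Least squares via the normal equations A w = b, A = X X', b = X y.
# X has one row per feature; the first row of 1s carries the intercept.
X = [[1.0, 1.0, 1.0, 1.0],   # constant feature
     [0.0, 1.0, 2.0, 3.0]]   # input values
y = [1.0, 3.0, 5.0, 7.0]     # generated as y = 1 + 2*x

m = len(y)
A = [[sum(X[i][k] * X[j][k] for k in range(m)) for j in range(2)]
     for i in range(2)]
b = [sum(X[i][k] * y[k] for k in range(m)) for i in range(2)]

# Solve the 2x2 system by Cramer's rule (A is invertible here).
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
w = [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
     (A[0][0] * b[1] - A[1][0] * b[0]) / det]
```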


If the linear system Aw = b has no solution, the “next best thing” is to find a vector c ∈ Rn such that ∥Ac − b∥₂ ≤ ∥Aw − b∥₂ for every w ∈ Rn, an approach known as the least squares method.

We will refer to the triple (A, w, b) as an instance of the least square problem.

Note that Aw ∈ range(A) for any w ∈ Rn. Thus, solving this problem amounts to finding a vector w in the subspace range(A) such that Aw is as close to b as possible.


Let A ∈ Rm×n be a full-rank matrix such that m > n, so the rank of A is n. The symmetric square matrix A′A ∈ Rn×n has the same rank n as the matrix A. Therefore, the system (A′A)w = A′b has a unique solution. Moreover, A′A is positive definite because

w′A′Aw = (Aw)′Aw = ∥Aw∥² > 0 for Aw ≠ 0.

Theorem

Let A ∈ Rm×n be a full-rank matrix such that m > n and let b ∈ Rm. If w is the unique solution of the system (A′A)w = A′b, then Aw equals the projection of the vector b on the subspace range(A).


Proof

The n columns of the matrix A = (v1 · · · vn) constitute a basis of the subspace range(A). Therefore, we seek the projection c of b on range(A) as a linear combination c = At, which allows us to reduce this problem to a minimization of the function

f(t) = ∥At − b∥² = (At − b)′(At − b) = (t′A′ − b′)(At − b) = t′A′At − b′At − t′A′b + b′b.

The necessary condition for the minimum is

(∇f )(t) = 2A′At − 2A′b = 0,

which implies A′At = A′b.

The linear system (A′A)t = A′b is known as the system of normal equations of A and b.


We represent, using the function plot, the number of calories consumed by a person per day vs. the gross national product per person in European countries:

[Figure: scatter plot of calories per day (roughly 2400–3800) vs. gdp per person in $10K units (0–9), with country labels such as PT, RO, CY, BA, NO, NL, FI, SK, YU.]

| ccode | gdp | cal | ccode | gdp | cal |
|-------|------|------|-------|------|------|
| AL | 0.74 | 2824 | IT | 3.07 | 3685 |
| AT | 4.03 | 3651 | LV | 1.43 | 3029 |
| BY | 1.34 | 2895 | LT | 1.59 | 3397 |
| BE | 3.79 | 3698 | LU | 8.18 | 3778 |
| BA | 0.66 | 2950 | MK | 0.94 | 2881 |
| BG | 1.28 | 2813 | MT | 2.51 | 3535 |
| HR | 1.75 | 2937 | MD | 0.25 | 2841 |
| CY | 1.21 | 3208 | NL | 4.05 | 3240 |
| CZ | 2.56 | 3346 | NO | 5.91 | 3448 |
| DK | 3.67 | 3391 | PL | 1.88 | 3375 |
| EE | 1.90 | 3086 | PT | 2.30 | 3593 |
| FI | 3.53 | 3195 | RO | 1.15 | 3474 |
| FR | 3.33 | 3602 | RU | 1.59 | 3100 |
| GE | 0.48 | 2475 | YU | 1.10 | 2689 |
| DE | 3.59 | 3491 | SK | 2.22 | 2825 |
| GR | 3.02 | 3694 | SI | 2.84 | 3271 |
| HU | 1.90 | 3420 | ES | 2.95 | 3329 |
| IS | 3.67 | 3279 | CH | 4.29 | 3400 |
| IE | 3.76 | 3685 |    |      |      |


We seek to approximate the calorie intake as a linear function of the gdp, of the form

cal = r1 + r2 · gdp.

This amounts to solving a linear system that consists of 37 equations and two unknowns:

r1 + 0.74 r2 = 2824
⋮
r1 + 4.29 r2 = 3400

and, clearly, such a system is inconsistent.


We augment the data sample matrix by a column that consists of 1s to accommodate a constant term r1; thus, we work with the data sample matrix B ∈ R37×2 given by

B = ( 1 0.74
      ⋮  ⋮
      1 4.29 ),

whose second column consists of the countries’ gross domestic products in $10K units. The matrix C = B′B is

C = ( 37.0000   94.4600
      94.4600  333.6592 ).

Solving the normal system using the statement w = C\(B′ ∗ b) yields

w = ( 2894.2
       142.3 ),

so the regression line is cal = 142.3 · gdp + 2894.2.
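The computation can be reproduced in pure Python from the data in the table above, solving the 2×2 normal system directly (the values are taken from the table; small rounding differences are possible):

```python
# Calories-vs-gdp regression: normal equations for cal = r1 + r2 * gdp.
# (gdp in $10K units, calories per day) for the 37 countries in the table.
data = [
    (0.74, 2824), (4.03, 3651), (1.34, 2895), (3.79, 3698), (0.66, 2950),
    (1.28, 2813), (1.75, 2937), (1.21, 3208), (2.56, 3346), (3.67, 3391),
    (1.90, 3086), (3.53, 3195), (3.33, 3602), (0.48, 2475), (3.59, 3491),
    (3.02, 3694), (1.90, 3420), (3.67, 3279), (3.76, 3685), (3.07, 3685),
    (1.43, 3029), (1.59, 3397), (8.18, 3778), (0.94, 2881), (2.51, 3535),
    (0.25, 2841), (4.05, 3240), (5.91, 3448), (1.88, 3375), (2.30, 3593),
    (1.15, 3474), (1.59, 3100), (1.10, 2689), (2.22, 2825), (2.84, 3271),
    (2.95, 3329), (4.29, 3400),
]

n = len(data)                          # 37 equations
s_g = sum(g for g, _ in data)          # off-diagonal entry of C (94.46)
s_gg = sum(g * g for g, _ in data)     # lower-right entry of C (333.6592)
s_c = sum(c for _, c in data)
s_gc = sum(g * c for g, c in data)

# Normal system  [n  s_g; s_g  s_gg] (r1, r2)' = (s_c, s_gc)'
det = n * s_gg - s_g * s_g
r1 = (s_c * s_gg - s_g * s_gc) / det
r2 = (n * s_gc - s_g * s_c) / det      # regression line: cal = r1 + r2*gdp
```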


[Figure: the data points together with the regression line cal = 142.3 · gdp + 2894.2; horizontal axis: gdp per person in $10K units (0–8); vertical axis: calories per day (2400–4200).]