Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
April 15, 2015
Today:
• Artificial neural networks
• Backpropagation
• Recurrent networks
• Convolutional networks
• Deep belief networks
• Deep Boltzmann machines
Reading:
• Mitchell: Chapter 4
• Bishop: Chapter 5
• Quoc Le tutorial:
• Ruslan Salakhutdinov tutorial:
Artificial Neural Networks to learn f: X → Y
• f might be non-linear function
• X (vector of) continuous and/or discrete vars
• Y (vector of) continuous and/or discrete vars
• Represent f by network of logistic units
• Each unit is a logistic function
• MLE: train weights of all units to minimize sum of squared errors of predicted network outputs
• MAP: train to minimize sum of squared errors plus weight magnitudes
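As a minimal sketch of "represent f by a network of logistic units", the forward pass of a network with one hidden layer can be written as below. The layer sizes, the random initial weights, and the names `forward`, `W_hidden`, and `W_out` are assumptions chosen for illustration, not taken from the slides.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, b_hidden, W_out, b_out):
    """Forward pass: one layer of logistic hidden units,
    followed by a layer of logistic output units."""
    h = logistic(W_hidden @ x + b_hidden)   # hidden unit outputs
    o = logistic(W_out @ h + b_out)         # network outputs
    return h, o

# Hypothetical sizes: 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.1, size=(4, 3))
b_hidden = np.zeros(4)
W_out = rng.normal(scale=0.1, size=(2, 4))
b_out = np.zeros(2)

x = np.array([0.5, -1.0, 2.0])
h, o = forward(x, W_hidden, b_hidden, W_out, b_out)
print(o)  # two values, each strictly between 0 and 1
```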
ALVINN [Pomerleau 1993]
M(C)LE Training for Neural Networks
• Consider regression problem f: X → Y, for scalar Y
y = f(x) + ε, with f(x) deterministic
assume noise ε ~ N(0, σε), iid
• Let’s maximize the conditional data likelihood
Learned neural network
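The step from maximizing the conditional data likelihood to minimizing squared error can be written out as follows. This is a sketch under the slide's assumptions (iid examples, zero-mean Gaussian noise with standard deviation σε); the example index l and the symbol f̂ for the learned network are notation assumed here.

```latex
\begin{align*}
W_{MCLE} &= \arg\max_W \; \ln \prod_l P(y^l \mid x^l, W) \\
         &= \arg\max_W \; \sum_l \ln \left[ \frac{1}{\sqrt{2\pi\sigma_\varepsilon^2}}
            \exp\!\left( -\frac{\big(y^l - \hat f(x^l; W)\big)^2}{2\sigma_\varepsilon^2} \right) \right] \\
         &= \arg\min_W \; \sum_l \big(y^l - \hat f(x^l; W)\big)^2
\end{align*}
```

The normalizing constant and the factor 1/(2σε²) do not depend on W, so they drop out of the arg max.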
MAP Training for Neural Networks
• Consider regression problem f: X → Y, for scalar Y
y = f(x) + ε, with f(x) deterministic
noise ε ~ N(0, σε)
Gaussian prior P(W) = N(0, σI)
ln P(W) ↔ c ∑i wi²
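The MAP objective described above, squared error plus a penalty on weight magnitudes, might be computed as in this small sketch. The function name `map_objective` and the example numbers are hypothetical; the constant c plays the role of the constant in ln P(W) ↔ c ∑i wi² and would be determined by the noise and prior variances.

```python
import numpy as np

def map_objective(outputs, targets, weights, c):
    """Sum of squared errors plus c times the sum of squared weights.
    With c = 0 this reduces to the MLE (squared error) objective."""
    squared_error = np.sum((targets - outputs) ** 2)
    weight_penalty = c * np.sum(weights ** 2)
    return squared_error + weight_penalty

# Hypothetical network outputs, targets, and weights.
outputs = np.array([0.9, 0.2, 0.4])
targets = np.array([1.0, 0.0, 0.5])
weights = np.array([0.5, -1.0, 2.0])

print(map_objective(outputs, targets, weights, c=0.0))  # squared error only
print(map_objective(outputs, targets, weights, c=0.1))  # larger: adds the weight penalty
```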
xd = input
td = target output
od = observed unit output
wi = weight i
(MLE)
xd = input
td = target output
od = observed unit output
wij = weight from i to j
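Using the notation above (input xd, target td, unit output od, weights wij), one stochastic gradient step of backpropagation for a network with a single hidden layer of logistic units can be sketched as follows. This is an illustrative sketch, not the slides' own code: bias weights are omitted for brevity, the learning rate `eta` and layer sizes are assumptions, and the matrices are stored so that `W[j, i]` holds the weight from unit i to unit j.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    h = sigmoid(W1 @ x)      # hidden unit outputs
    o = sigmoid(W2 @ h)      # output unit outputs o_d
    return h, o

def squared_error(x, t, W1, W2):
    _, o = forward(x, W1, W2)
    return 0.5 * np.sum((t - o) ** 2)

def backprop_gradients(x, t, W1, W2):
    """Gradients of 0.5 * sum((t - o)^2) with respect to W1 and W2, for one example."""
    h, o = forward(x, W1, W2)
    delta_out = (o - t) * o * (1.0 - o)               # error term at each output unit
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)    # error propagated back to hidden units
    return np.outer(delta_hid, x), np.outer(delta_out, h)

def sgd_step(x, t, W1, W2, eta=0.5):
    """One stochastic gradient descent update on a single example (x_d, t_d)."""
    g1, g2 = backprop_gradients(x, t, W1, W2)
    return W1 - eta * g1, W2 - eta * g2

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 2))   # 2 inputs -> 3 hidden units
W2 = rng.normal(scale=0.5, size=(1, 3))   # 3 hidden units -> 1 output
x_d, t_d = np.array([1.0, 0.0]), np.array([1.0])

before = squared_error(x_d, t_d, W1, W2)
for _ in range(100):
    W1, W2 = sgd_step(x_d, t_d, W1, W2)
after = squared_error(x_d, t_d, W1, W2)
print(before, after)  # the error on this example should shrink
```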
[Figure: w0; outputs: left, strt, right, up]
Semantic Memory Model Based on ANN’s [McClelland & Rogers, Nature 2003]
No hierarchy given.
Train with assertions, e.g., Can(Canary,Fly)
Training Networks on Time Series
• Suppose we want to predict next state of world
– and it depends on history of unknown length
– e.g., robot with forward-facing sensors trying to predict next sensor reading as it moves and turns
Recurrent Networks: Time Series
• Idea: use hidden layer in network to capture state history
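The idea of letting a hidden layer carry the state history can be sketched as a simple recurrent network in which the hidden vector is fed back at every time step. The weight names (`W_in`, `W_rec`, `W_out`), the sizes, and the use of a linear output layer are assumptions made for this illustration; the weights here are random rather than trained.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_predict(inputs, W_in, W_rec, W_out):
    """Run a simple recurrent network over a sequence.
    The hidden vector h is fed back at each step, so it can
    summarize a history of arbitrary length."""
    h = np.zeros(W_rec.shape[0])
    predictions = []
    for x_t in inputs:
        h = logistic(W_in @ x_t + W_rec @ h)   # new state from input and previous state
        predictions.append(W_out @ h)          # predicted next sensor reading
    return np.array(predictions), h

rng = np.random.default_rng(0)
n_sensors, n_hidden = 2, 5                     # hypothetical sizes
W_in = rng.normal(scale=0.3, size=(n_hidden, n_sensors))
W_rec = rng.normal(scale=0.3, size=(n_hidden, n_hidden))
W_out = rng.normal(scale=0.3, size=(n_sensors, n_hidden))

readings = rng.normal(size=(7, n_sensors))     # a sequence of 7 sensor readings
preds, final_state = rnn_predict(readings, W_in, W_rec, W_out)
print(preds.shape)  # (7, 2)
```

Because h at step t depends on h at step t-1, changing an early reading changes later predictions, which is the sense in which the hidden layer captures history.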
Recurrent Networks on Time Series
How can we train a recurrent net?
Convolutional Neural Nets for Image Recognition
[Le Cun, 1992]
• specialized architecture: mix different types of units, not completely connected, motivated by primate visual cortex
• many shared parameters, stochastic gradient training
• very successful! now many specialized architectures for vision, speech, translation, …
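The "many shared parameters" point can be illustrated with a single convolutional filter: the same small kernel is applied at every image position instead of giving each output its own weights. The sketch below is a plain NumPy illustration with a hypothetical 2x2 kernel and 4x4 image; it computes the cross-correlation form that neural network code commonly calls convolution.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1).
    Every output position reuses the same kernel weights."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])        # hypothetical 2x2 filter
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (3, 3)

# Parameter count: 4 shared weights, versus one weight per
# (input pixel, output unit) pair in a fully connected layer.
print(kernel.size, image.size * feature_map.size)
```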
Deep Belief Networks
[Hinton & Salakhutdinov, 2006]
• Problem: training networks with many hidden layers doesn’t work very well
– local minima, very slow training if initialize with zero weights
• Deep belief networks
– autoencoder networks to learn low dimensional encodings
– but more layers, to learn better encodings
Deep Belief Networks
[Hinton & Salakhutdinov, 2006]
[Figure: original image versus reconstruction from 2000-1000-500-30 DBN versus reconstruction from 2000-300 linear PCA]
Deep Belief Networks: Training
[Hinton & Salakhutdinov, 2006]
Encoding of digit images in two dimensions
[Hinton & Salakhutdinov, 2006]
[Figure: 784-2 linear encoding (PCA) versus 784-1000-500-250-2 DBNet]
Very Large Scale Use of DBN’s
[Quoc Le, et al., ICML, 2012]
Data: 10 million 200×200 unlabeled images, sampled from YouTube
Training: use 1000 machines (16000 cores) for 1 week
Learned network: 3 multi-stage layers, 1.15 billion parameters
Achieves 15.8% accuracy (previous best was 9.5%) classifying 1 of 20k ImageNet items
Real images that most excite the feature:
Image synthesized to most excite the feature:
Restricted Boltzmann Machine
• Bipartite graph, logistic activation
• Inference: fill in any nodes, estimate other nodes
• consider vi, hj to be boolean variables
[Figure: bipartite graph with hidden units h1, h2, h3 and visible units v1, v2, …, vn]
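"Fill in any nodes, estimate other nodes" can be sketched as follows for a binary RBM. Because the graph is bipartite, the hidden units are conditionally independent given the visible units (and vice versa), so each conditional is a logistic function of a weighted sum. The weight matrix `W` (with `W[i, j]` connecting vi and hj), the bias vectors `a` and `b`, and the example values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_hidden_given_visible(v, W, a):
    """P(h_j = 1 | v) for every hidden unit j."""
    return sigmoid(a + v @ W)

def p_visible_given_hidden(h, W, b):
    """P(v_i = 1 | h) for every visible unit i."""
    return sigmoid(b + W @ h)

# Hypothetical RBM with 4 visible and 3 hidden units.
W = np.array([[ 2.0, -1.0, 0.0],
              [ 0.0,  1.0, 0.0],
              [-2.0,  0.0, 0.0],
              [ 1.0,  1.0, 0.0]])
a = np.zeros(3)   # hidden biases
b = np.zeros(4)   # visible biases

v = np.array([1, 0, 1, 0])                       # filled-in visible nodes
p_h = p_hidden_given_visible(v, W, a)            # estimate the hidden nodes
rng = np.random.default_rng(0)
h_sample = (rng.random(3) < p_h).astype(int)     # sample binary hidden states
p_v = p_visible_given_hidden(h_sample, W, b)     # estimate visibles from hiddens
print(p_h, p_v)
```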
Impact of Deep Learning
• Speech Recognition
• Computer Vision
• Recommender Systems
• Language Understanding
• Drug Discovery and Medical Image Analysis
[Courtesy of R. Salakhutdinov]
Feature Representations: Traditionally
Data → Feature extraction → Learning algorithm
Object detection: Image → vision features → Recognition
Audio classification: Audio → audio features → Speaker identification
Computer Vision Features
SIFT
Textons
HoG
GIST
RIFT
Audio Features
Flux
Spectrogram
MFCC
ZCR
Rolloff
Representation Learning: Can we automatically learn these representations?
Restricted Boltzmann Machines
Graphical Models: Powerful framework for representing dependency structure between random variables.
[Figure: hidden variables (feature detectors) connected to visible variables (image); pair-wise and unary terms labeled]
RBM is a Markov Random Field with:
• Stochastic binary visible variables
• Stochastic binary hidden variables
• Bipartite connections.
Related: Markov random fields, Boltzmann machines, log-linear models.
Learning Features
Observed Data: subset of 25,000 characters
Learned W: “edges” (subset of 1000 features)
New Image: = …
Sparse representations
Logistic Function: Suitable for modeling binary images
Model Learning
[Figure: hidden units above visible units (image)]
Given a set of i.i.d. training examples, we want to learn the model parameters.
Maximize log-likelihood objective:
Derivative of the log-likelihood:
Difficult to compute: exponentially many configurations
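To see why the derivative is difficult to compute, the sketch below evaluates it exactly for a tiny binary RBM by enumerating every configuration. It uses the standard result that the gradient of the average log-likelihood with respect to Wij is E_data[vi hj] − E_model[vi hj]; that formula, the parameter names (`W`, hidden biases `a`, visible biases `b`), and the toy data are assumptions here, since the slide's own equations did not survive extraction. The model expectation sums over 2^(visible + hidden) configurations, which is feasible only at toy sizes.

```python
import itertools
import numpy as np

def all_binary_vectors(n):
    """All 2**n binary vectors of length n, one per row."""
    return np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def scores(V, H, W, a, b):
    """Unnormalized log-probability v'Wh + b'v + a'h for every pair of rows."""
    return V @ W @ H.T + (V @ b)[:, None] + (H @ a)[None, :]

def avg_log_likelihood(data, W, a, b):
    V = all_binary_vectors(W.shape[0])
    H = all_binary_vectors(W.shape[1])
    log_Z = np.log(np.exp(scores(V, H, W, a, b)).sum())      # partition function
    log_unnorm = np.log(np.exp(scores(data, H, W, a, b)).sum(axis=1))
    return float(np.mean(log_unnorm) - log_Z)

def exact_gradient_W(data, W, a, b):
    """E_data[v_i h_j] - E_model[v_i h_j], computed by brute-force enumeration."""
    V = all_binary_vectors(W.shape[0])
    H = all_binary_vectors(W.shape[1])
    joint = np.exp(scores(V, H, W, a, b))
    joint /= joint.sum()                          # P(v, h) for every configuration
    model_term = V.T @ joint @ H                  # expectation under the model
    p_h = 1.0 / (1.0 + np.exp(-(data @ W + a)))   # P(h_j = 1 | v) per example
    data_term = data.T @ p_h / len(data)          # expectation under the data
    return data_term - model_term

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 2))            # 3 visible, 2 hidden units
a, b = np.zeros(2), np.zeros(3)
data = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1]], dtype=float)
grad = exact_gradient_W(data, W, a, b)
print(grad.shape)  # (3, 2)
```

In practice the model expectation is approximated (for example by sampling) rather than enumerated.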
RBMs for Real-valued Data
[Figure: hidden variables connected to visible variables (image); pair-wise and unary terms labeled]
Gaussian-Bernoulli RBM:
• Stochastic real-valued visible variables
• Stochastic binary hidden variables
• Bipartite connections.
(Salakhutdinov & Hinton, NIPS 2007; Salakhutdinov & Murray, ICML 2008)
4 million unlabelled images
Learned features (out of 10,000)
New Image = 0.9 * [learned feature] + 0.8 * [learned feature] + 0.6 * [learned feature] + …
RBMs for Word Counts
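$$P_\theta(\mathbf{v},\mathbf{h}) = \frac{1}{Z(\theta)} \exp\Big( \underbrace{\sum_{i=1}^{D}\sum_{k=1}^{K}\sum_{j=1}^{F} W_{ij}^{k} v_i^{k} h_j}_{\text{pair-wise}} + \underbrace{\sum_{i=1}^{D}\sum_{k=1}^{K} v_i^{k} b_i^{k} + \sum_{j=1}^{F} h_j a_j}_{\text{unary}} \Big)$$
$$P_\theta(v_i^{k} = 1 \mid \mathbf{h}) = \frac{\exp\big(b_i^{k} + \sum_{j=1}^{F} h_j W_{ij}^{k}\big)}{\sum_{q=1}^{K} \exp\big(b_i^{q} + \sum_{j=1}^{F} h_j W_{ij}^{q}\big)}$$
[Figure: 1-of-K visible vector, e.g. 0 0 0 1 0]
Replicated Softmax Model: undirected topic model:
• Stochastic 1-of-K visible variables.
• Stochastic binary hidden variables
• Bipartite connections.
(Salakhutdinov & Hinton, NIPS 2010; Srivastava & Salakhutdinov, NIPS 2012)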
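The conditional distribution over a 1-of-K visible variable given the hidden units is a softmax over the K possible values. A minimal sketch, assuming array shapes `W: (D, K, F)`, `b: (D, K)`, and `h: (F,)` for D positions, K vocabulary entries, and F hidden units (these shapes, names, and random values are illustrative assumptions, not from the slides):

```python
import numpy as np

def p_visible_given_hidden(h, W, b):
    """P(v_i^k = 1 | h): for each position i, a softmax over the K values."""
    logits = b + W @ h                               # shape (D, K)
    logits -= logits.max(axis=1, keepdims=True)      # subtract max for numerical stability
    expd = np.exp(logits)
    return expd / expd.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
D, K, F = 2, 4, 3                                    # hypothetical sizes
W = rng.normal(scale=0.5, size=(D, K, F))
b = np.zeros((D, K))
h = np.array([1.0, 0.0, 1.0])                        # binary hidden states

probs = p_visible_given_hidden(h, W, b)
print(probs.shape)          # (2, 4)
print(probs.sum(axis=1))    # each row sums to 1
```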
Learned features: “topics”
Reuters dataset: 804,414 unlabeled newswire stories
Bag-of-Words
russian russia moscow yeltsin soviet
clinton house president bill congress
computer system product software develop
trade country import world economy
stock wall street point dow
Different Data Modalities
• Binary/Gaussian/Softmax RBMs: All have binary hidden variables but use them to model different kinds of data.
[Figure: three kinds of visible units: Binary, Real-valued, 1-of-K (e.g. 0 0 0 1 0)]
• It is easy to infer the states of the hidden variables:
Product of Experts
The joint distribution is given by:
Marginalizing over hidden variables:
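For a binary RBM the two steps above can be written out as below. This is a sketch of the standard result, assuming visible biases b_i, hidden biases a_j, and weights W_ij; the slide's own equations did not survive extraction, so the notation is an assumption.

```latex
\begin{align*}
P_\theta(\mathbf{v},\mathbf{h}) &= \frac{1}{Z(\theta)} \exp\Big( \sum_{i,j} W_{ij} v_i h_j + \sum_i b_i v_i + \sum_j a_j h_j \Big) \\
P_\theta(\mathbf{v}) &= \sum_{\mathbf{h}} P_\theta(\mathbf{v},\mathbf{h})
 = \frac{1}{Z(\theta)} \prod_i \exp(b_i v_i) \prod_j \Big( 1 + \exp\big( a_j + \sum_i W_{ij} v_i \big) \Big)
\end{align*}
```

Each factor in the product over j depends on one hidden unit only, so every hidden unit acts as an "expert" and the marginal is a product of their contributions.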
government authority power empire putin
clinton house president bill congress
bribery corruption dishonesty putin fraud
oil barrel exxon putin drill
stock wall street point dow
…
Topics “government”, “corruption” and “oil” can combine to give very high probability to the word “Putin”.
(Srivastava & Salakhutdinov, NIPS 2012)
Deep Boltzmann Machines
Learn simpler representations, then compose more complex ones
Higher-level features: Combination of edges
Low-level features: Edges
Input: Pixels (image)
Built from unlabeled inputs.
(Salakhutdinov 2008; Salakhutdinov & Hinton, Neural Computation 2012)
Model Formulation
[Figure: visible layer v (input) and hidden layers h1, h2, h3, connected by weights W1, W2, W3]
Same as RBMs
model parameters
• Dependencies between hidden variables.
• All connections are undirected.
• Bottom-up and Top-down:
Requires approximate inference to train, but it can be done… and scales to millions of examples
Samples Generated by the Model
[Figure: training data versus model-generated samples]
Handwriting Recognition
MNIST Dataset: 60,000 examples of 10 digits
Learning Algorithm                          Error
Logistic regression                         12.0%
K-NN                                        3.09%
Neural Net (Platt 2005)                     1.53%
SVM (Decoste et al. 2002)                   1.40%
Deep Autoencoder (Bengio et al. 2007)       1.40%
Deep Belief Net (Hinton et al. 2006)        1.20%
DBM                                         0.95%
Optical Character Recognition: 42,152 examples of 26 English letters
Learning Algorithm                          Error
Logistic regression                         22.14%
K-NN                                        18.92%
Neural Net                                  14.62%
SVM (Larochelle et al. 2009)                9.70%
Deep Autoencoder (Bengio et al. 2007)       10.05%
Deep Belief Net (Larochelle et al. 2009)    9.68%
DBM                                         8.40%
Permutation-invariant version.
3-D Object Recognition
NORB Dataset: 24,000 examples
Learning Algorithm                          Error
Logistic regression                         22.5%
K-NN (LeCun 2004)                           18.92%
SVM (Bengio & LeCun 2007)                   11.6%
Deep Belief Net (Nair & Hinton 2009)        9.0%
DBM                                         7.2%
[Figure: pattern completion]
Learning Shared Representations Across Sensory Modalities
“Concept”
sunset, pacific ocean, baker beach, seashore, ocean
A Simple Multimodal Model
• Use a joint binary hidden layer.
• Problem: Inputs have very different statistical properties.
• Difficult to learn cross-modal features.
[Figure: real-valued input and 1-of-K input (e.g. 0 0 0 1 0)]
Multimodal DBM
[Figure: two pathways: Gaussian model over dense, real-valued image features; Replicated Softmax over word counts (e.g. 0 0 0 1 0)]
Bottom-up + Top-down
(Srivastava & Salakhutdinov, NIPS 2012, JMLR 2014)
Text Generated from Images
Given (image) → Generated (tags):
dog, cat, pet, kitten, puppy, ginger, tongue, kitty, dogs, furry
sea, france, boat, mer, beach, river, bretagne, plage, brittany
portrait, child, kid, ritratto, kids, children, boy, cute, boys, italy
insect, butterfly, insects, bug, butterflies, lepidoptera
graffiti, streetart, stencil, sticker, urbanart, graff, sanfrancisco
canada, nature, sunrise, ontario, fog, mist, bc, morning
portrait, women, army, soldier, mother, postcard, soldiers
obama, barackobama, election, politics, president, hope, change, sanfrancisco, convention, rally
water, glass, beer, bottle, drink, wine, bubbles, splash, drops, drop
Images Generated from Text
Given (tags) → Retrieved (images):
water, red, sunset
nature, flower, red, green
blue, green, yellow, colors
chocolate, cake
MIR-Flickr Dataset
• 1 million images along with user-assigned tags.
sculpture, beauty, stone
d80
nikon, green, light, photoshop, apple, d70
nikon, abigfave, goldstaraward, d80, nikond80
white, yellow, abstract, lines, bus, graphic
food, cupcake, vegan
sky, geotagged, reflection, cielo, bilbao, reflejo
anawesomeshot, theperfectphotographer, flash, damniwishidtakenthat, spiritofphotography
Huiskes et al.
Results
• Logistic regression on top-level representation.
• Multimodal Inputs
MAP = Mean Average Precision
Learning Algorithm          MAP      Precision@50
Labeled 25K examples:
Random                      0.124    0.124
LDA [Huiskes et al.]        0.492    0.754
SVM [Huiskes et al.]        0.475    0.758
DBM-Labelled                0.526    0.791
+ 1 Million unlabelled:
Deep Belief Net             0.638    0.867
Autoencoder                 0.638    0.875
DBM                         0.641    0.873
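For readers unfamiliar with the two metrics reported above, the sketch below shows one common way to compute them from a ranked list of binary relevance labels (1 = relevant, 0 = not relevant). The exact evaluation protocol used for these results is not given in the slides, so the function names, the definitions chosen, and the toy rankings are assumptions for illustration.

```python
import numpy as np

def precision_at_k(relevance_ranked, k):
    """Fraction of the top-k ranked items that are relevant."""
    return float(np.mean(relevance_ranked[:k]))

def average_precision(relevance_ranked):
    """Mean of precision@k over the ranks k at which a relevant item appears."""
    rel = np.asarray(relevance_ranked, dtype=float)
    if rel.sum() == 0:
        return 0.0
    hits = np.cumsum(rel)                   # relevant items seen so far at each rank
    ranks = np.arange(1, len(rel) + 1)
    return float(np.sum((hits / ranks) * rel) / rel.sum())

def mean_average_precision(rankings):
    """Average precision averaged over several queries (or classes)."""
    return float(np.mean([average_precision(r) for r in rankings]))

# Two hypothetical ranked result lists.
ranking_a = [1, 0, 1, 0, 0]
ranking_b = [0, 1, 0, 0, 1]

print(precision_at_k(ranking_a, 2))
print(average_precision(ranking_a), average_precision(ranking_b))
print(mean_average_precision([ranking_a, ranking_b]))
```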
Artificial Neural Networks: Summary
• Highly non-linear regression/classification
• Hidden layers learn intermediate representations
• Potentially millions of parameters to estimate
• Stochastic gradient descent, local minima problems
• Deep networks have produced real progress in many fields
– computer vision
– speech recognition
– mapping images to text
– recommender systems
– …
• They learn very useful non-linear representations