
Introduction to Deep Learning
Neural Networks and Deep Learning

Describe the big-picture view of how neural networks work

Identify the basic building blocks and notations of deep neural networks

Illustrating A Biological Neuron
[Figure: a biological neuron, with labels for the soma (cell body) and the axon terminals.]

Neural Network

Artificial Neural Networks

Building Artificial Neural Networks
[Figure: an artificial neuron with an input layer, an activation, and an output layer.]

Building Artificial Neural Networks (cont’d)
|What does this “neuron” do?
|The Perceptron model
[Figure: the Perceptron model, with an input layer, an activation, and an output layer.]

Logistic Neuron
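|In the Perceptron:
$$g_P(\boldsymbol{x}_i, \boldsymbol{w}) = \begin{cases} 1, & \text{if } \boldsymbol{w}^T \boldsymbol{x}_i > 0 \\ 0, & \text{otherwise} \end{cases}$$
|If we let:
$$g_\sigma(\boldsymbol{x}_i, \boldsymbol{w}) = \frac{1}{1 + e^{-\boldsymbol{w}^T \boldsymbol{x}_i}}$$
|The logistic function:
$$\sigma(t) = \frac{1}{1 + e^{-t}}$$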

Logistic Neuron (cont’d)
[Figure: a logistic neuron; the input layer feeds an activation whose output is a probability prediction.]

Learning in the Perceptron
|“Learning”: how does the neuron adapt its weights in response to the inputs?

The Perceptron Learning Algorithm
|Input
-Training set $D = \{(x_i, y_i),\ i \in \{1, 2, \dots, n\}\}$, with $y_i \in \{0, 1\}$
|Initialization
-Initialize the weights w(0) (and some thresholds)
-Weights may be set to 0 or small random values

The Perceptron Learning Algorithm (cont’d)
|Iterate for t until a stop criterion is met {
for each sample $x_i$ with label $y_i$: {
compute the output $\hat{y}_i$ of the network
estimate the error of the network: $e(w(t)) = y_i - \hat{y}_i$
update the weight: $w(t+1) = w(t) + e(w(t))\, x_i$
}
}
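The loop above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the names `perceptron_train` and `perceptron_predict` are hypothetical, the bias is folded into the weights by appending a constant 1 to each sample (an assumption, since the slide only mentions "some thresholds"), and the stop criterion is taken to be "no errors in a full pass, or a fixed number of epochs".

```python
import numpy as np

def perceptron_train(X, y, epochs=20):
    """Perceptron learning with the update w(t+1) = w(t) + e * x_i."""
    # Assumption: append a constant 1 so the bias is learned as a weight.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])                 # weights initialized to 0
    for _ in range(epochs):
        errors = 0
        for x_i, y_i in zip(Xb, y):
            y_hat = 1 if w @ x_i > 0 else 0   # output of the neuron
            e = y_i - y_hat                   # error: -1, 0, or +1
            w = w + e * x_i                   # weight update
            errors += abs(e)
        if errors == 0:                       # stop: every sample is correct
            break
    return w

def perceptron_predict(X, w):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ w > 0).astype(int)

# Logical AND is linearly separable, so the loop converges.
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])
w_and = perceptron_train(X_and, y_and)
pred_and = perceptron_predict(X_and, w_and)
```

Because the error is always -1, 0, or +1, each update simply adds or subtracts the misclassified sample from the weight vector.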

The Need for Multiple Layers
[Figure: the Perceptron algorithm finds a linear decision boundary, $w_1 x_1 + w_2 x_2 + w_0 = 0$.]

Extending to Multi-layer Neural Networks
|Question: Can a multi-layer version of the Perceptron (MLP) help solve the XOR problem?
The XOR Problem

An MLP Solving the XOR Problem
[Figure: an MLP with an input layer, a hidden layer, and an output layer that solves the XOR problem.]
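One way to see that a hidden layer is enough is to set the weights by hand. The sketch below is an assumption, not the weights from the slide figure: one hidden unit behaves like OR, the other like AND, and the output unit fires for "OR but not AND", which is XOR. The names `step` and `xor_mlp` are hypothetical.

```python
import numpy as np

def step(z):
    """Threshold activation, as in the Perceptron."""
    return (z > 0).astype(int)

# Hand-picked weights (illustrative assumption).
W_hidden = np.array([[1.0, 1.0],    # hidden unit 1: fires if x1 OR x2
                     [1.0, 1.0]])   # hidden unit 2: fires if x1 AND x2
b_hidden = np.array([-0.5, -1.5])
w_out = np.array([1.0, -1.0])       # output: OR minus AND
b_out = -0.5

def xor_mlp(X):
    H = step(X @ W_hidden.T + b_hidden)   # hidden layer
    return step(H @ w_out + b_out)        # output layer

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor_out = xor_mlp(X)
```

No single line $w_1 x_1 + w_2 x_2 + w_0 = 0$ separates the XOR classes, but the two hidden units carve the plane into regions that the output unit can separate.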

The Question of Learning
|How can the network learn proper parameters from the given samples?
-Can the Perceptron algorithm be used?

Difficulty in Learning for MLP
|Perceptron Learning Algorithm
-The weight update of the neuron is proportional to the “error”, computed as $y_i - \hat{y}_i$.
• This requires us to know the target output $y_i$.
|Multi-layer Perceptron
-Except for the neurons on the output layer, other neurons (on the hidden layers) do not really have a target output given.

Back-propagation (BP) Learning for MLP
|The key: Properly distribute the error computed at the output layer back to earlier layers, so that their weights can be updated in a way that reduces the error
-The basic philosophy of the BP algorithm
|Differentiable activation functions
-We can use – e.g.,
• the logistic neurons
• neurons with sigmoid activation
• or its variants

A Multi-Layer Neural Network
|Using logistic neurons
[Figure: a multi-layer network with an input layer, a hidden layer, and an output layer.]

Handling Multiple (>2) Classes

Softmax for Handling Multiple Classes
|Using softmax to normalize the outputs (so they add up to 1).
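As a minimal sketch, softmax exponentiates each output and divides by the sum, so the results are positive and add up to 1. Subtracting the maximum before exponentiating is a common numerical-stability step that is assumed here, not taken from the slide; it does not change the result.

```python
import numpy as np

def softmax(z):
    """Normalize raw outputs z into probabilities that sum to 1."""
    z = z - np.max(z, axis=-1, keepdims=True)   # stability shift (assumption)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])   # hypothetical outputs for 3 classes
probs = softmax(scores)
```

The largest raw output keeps the largest probability, so the predicted class is unchanged by the normalization.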

How to Compute “errors” in this Case?
|Consider the cross-entropy as a loss function:
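A common form of the cross-entropy loss for one sample, assumed here, is $L = -\sum_k y_k \log p_k$, where $y$ is the one-hot target and $p$ is the softmax output. The sketch below uses hypothetical probabilities to show that the loss is small when the correct class receives high probability and large otherwise; the small constant `eps` is only there to avoid `log(0)`.

```python
import numpy as np

def cross_entropy(p, y_onehot, eps=1e-12):
    """Cross-entropy between predicted probabilities p and a one-hot target."""
    return -np.sum(y_onehot * np.log(p + eps))

y = np.array([1, 0, 0])                 # true class is class 0
p_good = np.array([0.7, 0.2, 0.1])      # confident and correct
p_bad = np.array([0.1, 0.2, 0.7])       # confident and wrong
loss_good = cross_entropy(p_good, y)
loss_bad = cross_entropy(p_bad, y)
```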

Neural Networks and Deep Learning

Introduction to Deep Learning
Key Techniques Enabling Deep Learning

Explain how, in principle, learning is achieved in a deep network
Explain key techniques that enable efficient learning in deep networks

|Back-propagation algorithm
|Design of activation functions
|Regularization for improving performance
* Technological advancement in computing hardware is certainly another enabling factor but our discussion will focus on basic, algorithmic techniques.

Back Propagation (BP) Algorithm
|The simple Perceptron algorithm illustrates a path to learning by iterative optimization
-Weights are updated based on the network errors under the current weights; optimal weights are obtained when the errors become 0 (or small enough)
|Gradient descent is a general approach to iterative optimization
-Define a loss function J
-Iteratively update the weights W according to the gradient of J with respect to W
-W is the parameter of the network; J is the objective function
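In its simplest form the update is $W \leftarrow W - \alpha \, \partial J / \partial W$, where the step size $\alpha$ (the learning rate) is an assumed hyperparameter not named on the slide. The sketch below applies it to a toy quadratic loss $J(W) = \lVert W - W^* \rVert^2$, chosen only because its minimum is known; `gradient_descent` is a hypothetical helper.

```python
import numpy as np

def gradient_descent(grad_J, W0, lr=0.1, steps=100):
    """Repeatedly step against the gradient of the loss J."""
    W = np.array(W0, dtype=float)
    for _ in range(steps):
        W = W - lr * grad_J(W)
    return W

# Toy loss J(W) = ||W - target||^2, whose gradient is 2 (W - target).
target = np.array([3.0, -1.0])

def grad_J(W):
    return 2.0 * (W - target)

W_opt = gradient_descent(grad_J, [0.0, 0.0])
```

For a neural network the only change is that the gradient is no longer available in closed form and has to be computed layer by layer.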

Back Propagation (BP) Algorithm (cont’d)
|Generalizes/Implements the idea for multi-layer networks
-Gradient descent for updating weights in optimizing a loss function
[Figure: the feedforward operation runs from the input layer through the hidden layers to the output layer, where the outputs are compared with the target values; back error propagation runs in the reverse direction, starting from the loss gradient at the output layer.]

Illustrating the BP Algorithm 1/6
|Let’s consider a simple neural network with a single hidden layer. (We will only outline the key steps.)
-Let’s write the net input and activation for a hidden node:
|Let’s write the net input and activation for the hidden layer:
[Figure: a network with an input layer (inputs up to x3), a hidden layer, and an output layer.]

Illustrating the BP Algorithm 2/6
|Using matrix/vector notations, for the hidden layer:
|Similarly, for the output layer → Homework.

Illustrating the BP Algorithm 3/6
|Now consider m samples as input.
|Output layer is similarly done.
[Figure: the same network, with inputs x1(i), x2(i), x3(i) for the i-th sample.]

Illustrating the BP Algorithm 4/6
|Overall we have this flow of feedforward processing (note the notation change for simplicity: subscripts are for layers):
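$$A_0 = X \;\rightarrow\; Z_1 = W_1 A_0 + b_1 \;\rightarrow\; A_1 = f(Z_1) \;\rightarrow\; Z_2 = W_2 A_1 + b_2 \;\rightarrow\; A_2 = f(Z_2)$$
-Shapes: A0 is 3×m; Z1 and A1 are 4×m; Z2 and A2 are 1×m
-W1: 4×3, b1: 4×1
-W2: 1×4, b2: 1×1
|Consider $dW \triangleq \partial L / \partial W$
[Figure: the network with inputs x1(i), x2(i), x3(i), a hidden layer, and an output layer.]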
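The feedforward flow for this 3-4-1 network can be sketched directly in NumPy. The activation `f` is assumed to be the logistic function, and `W2` is taken as 1×4 with `b2` as 1×1 so that `Z2` comes out as 1×m; the random values are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Feedforward pass; columns of X are the m samples."""
    A0 = X                    # 3 x m
    Z1 = W1 @ A0 + b1         # 4 x m (b1 broadcasts over the m columns)
    A1 = sigmoid(Z1)          # 4 x m
    Z2 = W2 @ A1 + b2         # 1 x m
    A2 = sigmoid(Z2)          # 1 x m
    return Z1, A1, Z2, A2

rng = np.random.default_rng(0)
m = 5
X = rng.normal(size=(3, m))
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))
Z1, A1, Z2, A2 = forward(X, W1, b1, W2, b2)
```

Stacking the samples as columns is what lets a single matrix product handle all m inputs at once.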

Illustrating the BP Algorithm 5/6
|Back-propagation
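$$A_0 = X \;\rightarrow\; Z_1 = W_1 A_0 + b_1 \;\rightarrow\; A_1 = f(Z_1) \;\rightarrow\; Z_2 = W_2 A_1 + b_2 \;\rightarrow\; A_2 = f(Z_2)$$
-Shapes: A0 is 3×m; Z1 and A1 are 4×m; Z2 and A2 are 1×m
-W1: 4×3, b1: 4×1
-W2: 1×4, b2: 1×1
|Consider $dW \triangleq \partial L / \partial W$
[Figure: the same flow traversed backward, with the gradients dA2 and dZ2 marked at the output layer.]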

Illustrating the BP Algorithm 6/6
|A modular view of the layers
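$$A_0 = X \;\rightarrow\; Z_1 = W_1 A_0 + b_1 \;\rightarrow\; A_1 = f(Z_1) \;\rightarrow\; Z_2 = W_2 A_1 + b_2 \;\rightarrow\; A_2 = f(Z_2)$$
-Shapes: A0 is 3×m; Z1 and A1 are 4×m; Z2 and A2 are 1×m
-W1: 4×3, b1: 4×1
-W2: 1×4, b2: 1×1
-Layer k as a module: $Z_k = W_k A_{k-1} + b_k$, $A_k = f(Z_k)$; it takes $A_{k-1}$ in the forward pass and hands back $dA_{k-1}$ in the backward pass
[Figure: the input, hidden, and output layers drawn as modules, with dZ2 marked at the output layer.]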
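The backward pass for the same 3-4-1 network can be sketched as follows. Several choices here are assumptions: logistic activations in both layers, a binary cross-entropy loss averaged over the m samples, and `W2` of shape 1×4. With that loss, the chain rule gives the compact form `dZ2 = (A2 - Y) / m` at the output. A finite-difference check on one weight is included as a sanity test of the derivation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(X, Y, W1, b1, W2, b2):
    m = X.shape[1]
    # Feedforward
    Z1 = W1 @ X + b1
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    L = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
    # Back-propagation (chain rule, output layer first)
    dZ2 = (A2 - Y) / m                       # 1 x m
    dW2 = dZ2 @ A1.T                         # 1 x 4
    db2 = dZ2.sum(axis=1, keepdims=True)     # 1 x 1
    dA1 = W2.T @ dZ2                         # 4 x m, passed back to layer 1
    dZ1 = dA1 * A1 * (1 - A1)                # derivative of the logistic
    dW1 = dZ1 @ X.T                          # 4 x 3
    db1 = dZ1.sum(axis=1, keepdims=True)     # 4 x 1
    return L, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 6))
Y = rng.integers(0, 2, size=(1, 6))
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))
L, (dW1, db1, dW2, db2) = loss_and_grads(X, Y, W1, b1, W2, b2)

# Finite-difference estimate of dL/dW1[0, 0] for comparison.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_and_grads(X, Y, W1_plus, b1, W2, b2)[0]
           - loss_and_grads(X, Y, W1_minus, b1, W2, b2)[0]) / (2 * eps)
```

Each gradient has the same shape as the parameter it belongs to, which is a quick way to catch a misplaced transpose.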

BP Algorithm Recap
|The feedforward process: ultimately produces A[K], which leads to the prediction for Y.
|The backpropagation process:
-First compute the loss
-Then compute the gradients via back-propagation through the layers
-Key: use the chain rule of differentiation
|Essential to deep networks
|Suffers from several practical limitations
|Many techniques were instrumental in enabling learning with the BP algorithm for deep neural networks

Activation Functions: Importance
|Provides non-linearity
|Functional unit of input-output mapping
|Its form affects the gradients in the BP algorithm

Activation Functions: Choices
|Older types
-Thresholding
-Logistic function
-tanh
|Newer types
-Rectifier f(x) = max(0, x) and its variants
-Rectified Linear Unit (ReLU)

ReLU and Some Variants
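$$a_{\text{ReLU}}(x) = \max(0, x), \qquad a_s(x) = \log(1 + e^{x})$$
$$a_L(x) = \begin{cases} x, & \text{if } x > 0 \\ \delta x, & \text{otherwise} \end{cases}$$
with $\delta$ a small positive number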
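The three functions above are one-liners in NumPy. The names `relu`, `softplus`, and `leaky_relu` are the commonly used ones for these formulas and are not taken from the slide; the default `delta=0.01` is an assumed value for the "small positive number".

```python
import numpy as np

def relu(x):
    """Rectifier: max(0, x)."""
    return np.maximum(0.0, x)

def softplus(x):
    """Smooth variant: log(1 + e^x)."""
    return np.log1p(np.exp(x))

def leaky_relu(x, delta=0.01):
    """x for positive inputs, delta * x otherwise."""
    return np.where(x > 0, x, delta * x)

x = np.array([-2.0, 0.0, 3.0])
relu_out = relu(x)
leaky_out = leaky_relu(x)
softplus_at_zero = softplus(np.array([0.0]))[0]
```

Unlike the plain rectifier, the leaky variant keeps a small nonzero slope for negative inputs, so the gradient there is not exactly zero.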

The Importance of Regularization
|The parameter space is huge; if there is no constraint in the search for a solution, the algorithm may converge to poor solutions.
|Overfitting is a typical problem
-Converging to a local minimum that is good only for the training data

Some Ideas for Regularization
|Favoring a network with small weights
-Achieved by adding an L2-norm term of the weights to the original loss function
|Preventing neurons from “co-adaptation” ➔ Drop-out
|Making the network less sensitive to initialization, learning rate, etc.
-Batch normalization
|Such regularization techniques have been found to be not only helpful but sometimes critical to learning in deep networks

|Obtain (b) by randomly deactivating some hidden nodes in (a)
|For input x, calculate output y by using the activated nodes ONLY
|Use BP to update the weights (which connect to the activated nodes) of the network
|Activate all nodes
|Go back to the first step
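The deactivation step can be sketched as multiplying the hidden activations by a random 0/1 mask that is redrawn at every training step. The keep probability of 0.5 and the helper name `dropout_mask` are assumptions; the sketch covers only the masking itself, not the full training loop.

```python
import numpy as np

def dropout_mask(n_hidden, keep_prob, rng):
    """Random 0/1 mask: 1 keeps a hidden node active, 0 deactivates it."""
    return (rng.random(n_hidden) < keep_prob).astype(float)

rng = np.random.default_rng(0)
A1 = np.array([0.5, 0.9, 0.1, 0.7])      # hypothetical hidden activations
mask = dropout_mask(A1.size, keep_prob=0.5, rng=rng)
A1_dropped = A1 * mask                    # deactivated nodes output 0

# In the backward pass the same mask is applied to the gradient, so
# weights attached to deactivated nodes receive no update for this step.
dA1 = np.ones_like(A1)                    # placeholder upstream gradient
dA1_dropped = dA1 * mask
```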

Why Drop-out?
|Reducing co-adaptation of neurons
|Model averaging

Batch Normalization (BN)
|Inputs to network layers have varying distributions during training, the so-called internal covariate shift [Ioffe and Szegedy, 2015]
-As a result, careful parameter initialization and a low learning rate are required
|BN was developed to solve this problem by normalizing the layer inputs over a batch

The Simple Math of BN
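|For a mini-batch of size m, first calculate the mini-batch mean and variance, then normalize each input:
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \qquad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
|Up to this point, $\hat{x}_i$ has mean = 0 and standard deviation = 1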

How is BN Used in Learning?
|Define two parameters $\gamma$ and $\beta$ so that the output of the BN layer can be calculated as $y_i = \gamma \hat{x}_i + \beta$, where $\hat{x}_i$ is the normalized input
|Parameters $\gamma$ and $\beta$ can be learned by minimizing the loss function via gradient descent
|Usually used right before the activation functions
FC → BN → activation → FC → BN → activation
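A minimal sketch of the BN forward pass at training time, following the formulation in Ioffe and Szegedy: normalize each feature over the mini-batch, then scale and shift with the learnable parameters. The function name `batch_norm_forward`, the constant `eps`, and the batch layout (rows are samples) are assumptions; running statistics for inference are left out.

```python
import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-5):
    """X has shape (m, features); statistics are taken over the batch axis."""
    mu = X.mean(axis=0)                       # mini-batch mean
    var = X.var(axis=0)                       # mini-batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)     # mean 0, std close to 1
    return gamma * X_hat + beta, X_hat        # learnable scale and shift

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # badly scaled inputs
Y, X_hat = batch_norm_forward(X, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma = 1` and `beta = 0` the layer only normalizes; learning these two parameters lets the network recover a different scale and offset if that lowers the loss.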

Other Regularization Techniques
|Weight sharing
|Training data conditioning
|Sparsity constraints
|Ensemble methods (committee of networks)
➔ Some of these will be discussed in later examples of networks.

Introduction to Deep Learning
Some Basic Deep Architecture

Appraise the detailed architecture of a basic convolutional neural network
Explain the basic concepts and corresponding architecture for auto-encoders and recurrent neural networks

|Convolutional Neural Network (CNN)
-Will be given the most attention, for its wide range of applications
|Auto-encoder
|Recurrent Neural Networks (RNN)

Convolutional Neural Network (CNN)
|Most useful for input data defined on grid-like structures, like images or audio
|Built upon concept of “convolution” for signal/image filtering
|Invokes other concepts like pooling, weight-sharing, and (visual) receptive field, etc.

Image Filtering via Convolution 1/5
Kernel coefficients

Image Filtering via Convolution 2/5
New pixel value = (1*255 + 1*128 + 1*128 + 1*255 + 4*128 + 1*245 + 1*245 + 1*128 + 1*245)/12 = 2141/12 ≈ 178
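The same arithmetic can be written as an element-wise product between the kernel and the 3×3 image patch under it. The patch layout below is an assumption: the nine terms of the sum are read row by row, which puts the weight-4 coefficient at the center.

```python
import numpy as np

# Kernel coefficients: all ones with a 4 in the center; they sum to 12.
kernel = np.array([[1, 1, 1],
                   [1, 4, 1],
                   [1, 1, 1]])

# Pixel values under the kernel (layout assumed from the order of the terms).
patch = np.array([[255, 128, 128],
                  [255, 128, 245],
                  [245, 128, 245]])

weighted_sum = int(np.sum(kernel * patch))                   # 2141
new_pixel = int(round(weighted_sum / int(kernel.sum())))     # 2141 / 12, rounded
```

Dividing by the sum of the coefficients keeps the output in the same brightness range as the input.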

Image Filtering via Convolution 3/5

Image Filtering via Convolution 4/5
New pixel value = (1*128 + 1*128 + 1*240 + 1*178 + 4*245 + 1*240 + 1*128 + 1*245 + 1*240)/12 = 2507/12 ≈ 209

Image Filtering via Convolution 5/5

Image Filtering via Convolution: Kernels
|By varying the coefficients of the kernel, we can achieve different goals
– Smoothing, sharpening, detecting edges, etc.
|Better yet: can we learn proper kernels?
– Part of the CNN objective
|Examples of kernels:
– Smoothing/noise reduction
– (Vertical) edge detection

2D Convolutional Neuron
The number of kernels defines the number of feature maps (channels).
The sizes of the kernels define the receptive fields.

Convpool Layer
Convolution, pooling, and going through some activations
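A single-channel convpool layer can be sketched as three steps: slide a kernel over the image, pool, and apply an activation. The specifics are assumptions for illustration: no padding, stride 1, 2×2 max pooling, a ReLU activation, and a fixed averaging kernel in place of a learned one. As is usual in CNNs, the kernel is not flipped.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2x2(fmap):
    """Keep the maximum of each non-overlapping 2 x 2 block."""
    H2, W2 = fmap.shape[0] // 2, fmap.shape[1] // 2
    blocks = fmap[:H2 * 2, :W2 * 2].reshape(H2, 2, W2, 2)
    return blocks.max(axis=(1, 3))

def convpool(image, kernel):
    pooled = max_pool2x2(conv2d_valid(image, kernel))
    return np.maximum(0.0, pooled)            # ReLU activation

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6 x 6 image
kernel = np.ones((3, 3)) / 9.0                     # 3 x 3 averaging kernel
feature_map = convpool(image, kernel)
```

A 6×6 input shrinks to 4×4 after the 3×3 convolution and to 2×2 after pooling, the same kind of size reduction as the 28 → 24 step produced by a 5×5 kernel.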

Illustrating A Simple CNN
Some convpool layers plus some fully-connected layers

CNN Examples – Different Complexities
|LeNet |AlexNet

|Each pixel in layer 3 corresponds to 7/3 of a pixel in the input Second level
|Receptive field of layer 1 is 5X5
[Table: layer-by-layer summary of LeNet. Columns: Layer Number, Input Shape, Receptive Field, Number of Feature Maps, Type of Neuron.]
-Input 28 × 28 × 1; Convolutional
-Input 24 × 24 × 20; Convolutional
-Input 8 × 8 × 50; Fully Connected
LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86.11 (1998): 2278-2324.

Case Study: LeNet
|This is after training the network for 75 epochs with a learning rate of 0.01
|Produces an accuracy of 99.38% on the MNIST dataset.

Case Study: AlexNet
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.

Case Study: AlexNet 1/3
[Table: layer-by-layer summary of AlexNet. Columns: Layer Number, Input Shape, Receptive Field, Number of Kernels, Type of Neuron. Entries in order of appearance:]
-Input 224 × 224 × 3; 11 × 11, stride 4; Convolutional
-3 × 3, stride 2
-Input 55 × 55 × 96; Convolutional
-3 × 3, stride 2
-Input 13 × 13 × 256; Convolutional
-Input 13 × 13 × 384; Convolutional
-Input 13 × 13 × 384; Convolutional
-Fully Connected
-Fully Connected
|The receptive field of layer 7 is about 52 pixels, which is almost as big as an object part (about one-fourth of the input image)
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.

Case Study: AlexNet 2/3
|ImageNet
-15 million images in over 22,000 categories
|The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) used about 1000 of these categories
|ImageNet categories are much more complicated than those of other datasets
-Often difficult even for humans to categorize perfectly
-Average human-level performance is about 96% on this dataset
-Non-neural conventional techniques were unable to achieve such performance
|AlexNet was one of the earliest systems to break the …

Case Study: AlexNet 3/3
|AlexNet was huge at the time.
-Without proper regularization, its size could lead to instability during training or an inability to learn
|Some techniques were used to make it trainable
-AlexNet was the first prominent network to feature ReLU
-Features multi-GPU training (the network was originally trained on two Nvidia GTX 580 GPUs with 3 GB of memory each)

Case Study: AlexNet Filters
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.

|CNNs are similar to the basic MLP architecture illustrated earlier, but some key extensions include:
-The concept of weight-sharing through kernels
-Weight-sharing enables learnable kernels, which in turn define feature maps
-The idea of pooling

Auto-encoder 1/4
|Networks seen thus far are all trained via supervised learning
|Sometimes we may need to train a network without supervision:
→Unsupervised learning
|An auto-encoder is one such network
-Consider yi being an approximation of xi.

Auto-encoder 2/4
|A perfect auto-encoder would map xi to xi
|Learn good representations in the hidden layer

Auto-encoder 3/4
|Consider two cases
-Much fewer hidden nodes than input nodes
-As many hidden nodes as, or more hidden nodes than, input nodes
|Case 1: The encoder compresses the input, and the compressed data should still be able to reconstruct the input
– Similar to, e.g., PCA

Auto-encoder 4/4
|Consider two cases
-Fewer hidden nodes than input nodes
-More hidden nodes than input nodes
|Case 2: Allow more hidden nodes than input nodes
– Allows more freedom for the input-to-hidden layer mapping in exploring the structure of the input
– Additional “regularization” will be needed in order to find meaningful results
[Figure: an auto-encoder with outputs y1 … yd.]

Recurrent Neural Networks (RNNs) 1/4
|Feedforward networks: Neurons are interconnected without any cycle in the connection
|Recurrent neural networks: Allow directed cycles in connections between neurons
-Notion of “state” or temporal dynamics
-Necessity of internal memory
|One clear benefit: Such networks could naturally model variable-length sequential data
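The role of the state can be sketched with a basic recurrence, $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$. This particular form, the tanh activation, and the name `rnn_forward` are assumptions chosen for illustration. Because the same weights are reused at every time step, the same function handles sequences of any length.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b):
    """Run a basic recurrent layer over a sequence; returns all hidden states."""
    h = np.zeros(W_h.shape[0])                    # initial state (internal memory)
    states = []
    for x_t in xs:                                # one step per sequence element
        h = np.tanh(W_x @ x_t + W_h @ h + b)      # new state depends on old state
        states.append(h)
    return states

rng = np.random.default_rng(0)
W_x = rng.normal(size=(3, 2))     # input-to-hidden weights (2 inputs, 3 hidden)
W_h = rng.normal(size=(3, 3))     # hidden-to-hidden weights (the cycle)
b = np.zeros(3)

short_seq = rng.normal(size=(4, 2))    # sequence of length 4
long_seq = rng.normal(size=(9, 2))     # sequence of length 9
h_short = rnn_forward(short_seq, W_x, W_h, b)
h_long = rnn_forward(long_seq, W_x, W_h, b)
```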

Recurrent Neural Networks (RNNs) 2/4
|A basic, illustrative architecture for RNN (showing only one node each layer)
-QUESTION: What is this network equivalent to, if we “unfold” the cycles for a given sequence of data?

Recurrent Neural Networks (RNNs) 3/4
|Training with the BP algorithm may suffer from the so-called vanishing gradient problem
|Some RNN variants have sophisticated “recurrence” structures, invented in part to address such difficulties faced by basic RNN models

Recurrent Neural Networks (RNNs) 4/4
|Examples:
|The “Long short-term memory” (LSTM) model
-used to produce state-of-the-art results in speech and language applications
|The Gated Recurrent Unit (GRU) model
[Figure: a Gated Recurrent Unit.]