This lecture mainly covers how backpropagation (BP) works.
Backpropagation

Here, by the chain rule: $\frac{df}{dx} = \frac{df}{dq} \frac{dq}{dx}$
So, for the example circuit $f = qz$ with $q = x + y$: $\frac{df}{df} = 1$, and $\frac{df}{dz} = \frac{df}{df} \cdot q = 3$; the remaining gradients follow the same pattern (each local derivative is evaluated at the current neuron's value).
Add gate: gradient distributor
Max gate: gradient router (only the input that achieved the max receives the gradient)
Mul gate: gradient "switcher" (the inputs swap gradients: each input's gradient is the upstream gradient scaled by the other input's value)
BP gradient = [upstream gradient] × [local gradient]
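The upstream × local rule can be traced on the example circuit $f = qz$, $q = x + y$, in a few lines of plain Python (the input values x = -2, y = 5, z = -4 are the usual lecture example, an assumption here; they give q = 3 as above):

```python
# f(x, y, z) = (x + y) * z, with the usual example inputs (an assumption):
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y            # add gate: q = 3
f = q * z            # mul gate: f = -12

# Backward pass: each gradient = upstream gradient * local gradient
df_df = 1.0
df_dz = df_df * q    # mul gate: local gradient is the *other* input
df_dq = df_df * z
df_dx = df_dq * 1.0  # add gate distributes the upstream gradient unchanged
df_dy = df_dq * 1.0

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```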
Here sigmoid is used as the activation function; back-propagating the gradient gate by gate gives the same value as differentiating the sigmoid function directly. $\sigma(x)=\frac{1}{1+e^{-x}}$
$$\frac{d\sigma (x)}{dx} = \frac{e^{-x}}{(1+e^{-x})^2} = (1-\sigma (x))\sigma(x)$$
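This identity can be checked numerically, the same idea as gradient checking (a minimal sketch; the test point x = 1.0 is arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 1.0                                                 # arbitrary test point
analytic = (1.0 - sigmoid(x)) * sigmoid(x)              # closed form above
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # centered difference

print(abs(analytic - numeric) < 1e-8)  # True
```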
[Hint: each training loop includes a forward pass (FP) and a backward pass (BP), because each BP depends on the results of the current FP.]
(Before) Linear: $f = Wx$
(Now) 2-layer NN: $f = W_2 \max(0, W_1 x)$
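A forward-pass sketch of this 2-layer net (the sizes 3072 → 100 → 10 are illustrative assumptions, e.g. a CIFAR-10-shaped input):

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.standard_normal(3072)                # flattened input image
W1 = rng.standard_normal((100, 3072)) * 0.01  # first-layer weights
W2 = rng.standard_normal((10, 100)) * 0.01    # second-layer weights

h = np.maximum(0, W1 @ x)   # max(0, W1 x): ReLU hidden layer
f = W2 @ h                  # class scores

print(h.shape, f.shape)     # (100,) (10,)
```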
Vectorized: after vectorization, a layer's W can act on high-dimensional inputs, e.g. 4096; if the output is also 4096-dimensional, the Jacobian is 4096 × 4096. Images are also usually processed in batches, so with a batch size of 100 the total is 100 × the size of one Jacobian.
Lecture 5
A Bit of History
The BP algorithm, proposed in 1986, is now hugely popular. In the past, deep learning stalled because networks were complex and hard to make converge, and compute speed and data were also major constraints. With today's experimental understanding of CNNs and related models, many effective methods for training and preventing overfitting have been proposed.
In 2006 Hinton proposed RBMs: pre-train layer by layer, then assemble the network and fine-tune it with BP; the results were good.
In 2010 a Microsoft group added HMMs to the network architecture; in 2012 AlexNet's success on ImageNet ignited the current wave.
How to train a CNN:
1. One-time setup: activation functions, preprocessing, weight initialization, regularization, gradient checking
2. Training dynamics: babysitting the learning process, parameter updates, hyperparameter optimization
3. Evaluation: model ensembles
Activation Function
Comparing the activation functions:
Sigmoid: 1. Squashes numbers into the range [0,1] 2. Nice interpretation as a "firing rate"
Three major problems: 1. Saturated neurons "kill" the gradient
2. Sigmoid outputs are not zero-centered (the inputs x to the next layer all have the same sign, which makes the optimization path zigzag)
3. exp() is computationally expensive
tanh: 1. Squashes into [-1,1]
2. Zero-centered (nice)
3. Still kills the gradient when saturated
ReLU: 1. Does not saturate (in the + region)
2. Computationally efficient
3. Converges much faster
but 1. Not zero-centered
2. An annoyance: the gradient for x < 0 is killed
3. Neurons can sometimes "die" (roughly: once the input stays negative, the neuron can never update again; be careful not to make the learning rate too large)
Leaky ReLU: $f(x) = \max(\alpha x, x)$, typically $\alpha = 0.01$
will not die
Exponential Linear Units (ELU):
For $x < 0$, $f(x) = \alpha (\exp(x)-1)$
All the benefits of ReLU, outputs closer to zero mean, but requires exp()
Maxout “Neuron”:
$$max(w_1^Tx+b_1,w_2^Tx+b_2)$$
Operates in a linear regime; generalizes ReLU and Leaky ReLU, but doubles the number of parameters.
Summary: in general use ReLU; Leaky ReLU / Maxout / ELU are worth trying; don't use sigmoid; tanh does not perform well either.
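The activations compared above can be sketched in a few NumPy functions (the α defaults shown are common choices, an assumption where the notes don't state them):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # squashes to (0, 1), not zero-centered

def relu(x):
    return np.maximum(0.0, x)            # kills the gradient for x < 0

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)      # small slope for x < 0, will not die

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))  # closer to zero mean

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))          # [0. 0. 2.]
print(leaky_relu(x))    # small negative values survive for x < 0
```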
Data Preprocessing
Normalization: subtract mean and normalize; for images, one usually subtracts the mean image or the per-channel mean.
Dimensionality reduction: PCA and whitening (or similar methods) can reduce the dimensionality of high-dimensional data, but this is uncommon for images.
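A sketch of the two common mean-subtraction options (the batch shape N×H×W×C and all sizes are illustrative assumptions):

```python
import numpy as np

# Synthetic image batch: 100 images of 32x32x3 (illustrative sizes)
X = np.random.rand(100, 32, 32, 3).astype(np.float32)

# Option 1: subtract the mean image (one full 32x32x3 mean)
mean_image = X.mean(axis=0)
X_centered = X - mean_image

# Option 2: subtract the per-channel mean (just 3 numbers)
channel_mean = X.mean(axis=(0, 1, 2))
X_channel = X - channel_mean

print(mean_image.shape, channel_mean.shape)  # (32, 32, 3) (3,)
```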
Weight Initialization (important)
This part was overlooked in the past, but it is very important.
1. All zeros: every neuron then computes the same output, error, and update, so this fails.
2. Small random numbers (Gaussian with mean 0 and standard deviation 0.01): works okay for small networks, but leads to non-homogeneous distributions of activations across the layers of the network.
- With standard deviation 1.0, all the neurons saturate completely at either -1 or 1, and the gradients are all zero.
- Xavier initialization (the most suitable so far): scale the weights by $1/\sqrt{n_{in}}$; a reasonable initialization that assumes linear activations.
- Note the additional /2 (for ReLU): scale by $1/\sqrt{n_{in}/2}$ instead.
Here $n_{in}$ is the number of inputs from the previous layer; this keeps each layer's activation distribution similar to the previous layer's, which empirically improves the rate of convergence.
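A sketch of the three initialization schemes above (the layer sizes are illustrative; fan_in is the $n_{in}$ of the notes):

```python
import numpy as np

fan_in, fan_out = 512, 256      # illustrative layer sizes
rng = np.random.default_rng(0)

# Small Gaussian: fine for shallow nets, activations shrink in deep ones
W_small = rng.standard_normal((fan_in, fan_out)) * 0.01

# Xavier: keeps the activation variance roughly constant across layers
W_xavier = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

# The "/2" variant for ReLU (half of the units are zeroed out)
W_he = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in / 2)

print(round(float(W_xavier.std()), 3))  # ≈ 1/sqrt(512) ≈ 0.044
```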
Coming soon… proper initialization is an active area of research.
Batch Normalization

- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization and slightly reduces the need for dropout(Maybe)
The scale and shift parameters inside BN are learned by the network itself: after normalization, the scale and shift of the data distribution are re-adjusted according to what the network needs to learn.
Notice: at test time the BN layer behaves differently: the mean/std are not computed from the batch; instead, a single fixed empirical mean/std of the activations recorded during training is used.
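A minimal batch-norm sketch showing the train/test difference described above (the momentum value and sizes are assumptions; the backward pass is omitted):

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)  # learned scale/shift
        self.running_mean, self.running_var = np.zeros(dim), np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, train=True):
        if train:
            mu, var = x.mean(axis=0), x.var(axis=0)  # batch statistics
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mu
            self.running_var  = m * self.running_var  + (1 - m) * var
        else:
            # test time: fixed empirical statistics from training
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(4)
x = np.random.randn(32, 4) * 3 + 5        # batch with mean 5, std 3
out = bn.forward(x, train=True)
print(np.allclose(out.mean(axis=0), 0, atol=1e-6))  # True: normalized per batch
```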
Babysitting the Learning Process
A small in-class demo CNN: after preprocessing, choose 50 hidden neurons and 10 output neurons, and train on CIFAR-10 with SGD. The key is to first fit a small subset of the data. Examples of diagnosing situations met in practice: 1. Small loss, train accuracy 1.00 (nice) 2. Loss barely changing (learning rate too small) 3. Loss exploding to NaN or Inf (learning rate too high)
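The three diagnoses above can be wrapped in a tiny helper (`diagnose` and its threshold are my own illustration, not from the lecture):

```python
import math

def diagnose(losses, tol=1e-3):
    """Classify a loss history per the three cases above (illustrative)."""
    if any(math.isnan(l) or math.isinf(l) for l in losses):
        return "loss exploding: learning rate too high"
    if abs(losses[0] - losses[-1]) < tol:
        return "loss barely changing: learning rate too low"
    return "loss decreasing: looks okay"

print(diagnose([2.30, 1.80, 1.20, 0.70]))  # loss decreasing: looks okay
print(diagnose([2.30, 2.2999]))            # loss barely changing: learning rate too low
print(diagnose([2.30, float("inf")]))      # loss exploding: learning rate too high
```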
Hyperparameter Optimization
1. Cross-validation: coarse -> fine
First stage: only a few epochs, to get a rough idea of which params work
Second stage: longer running time, finer search (repeat as necessary)
(If the cost ever exceeds 3× the original cost, break out early)
- It is best to optimize in log space
2. Grid search vs. random search: random search is better (2012)
3.Hyperparameter play with: network architecture, learning rate, its decay schedule, update type, regularization(L2/Dropout strength)
4. Monitor and visualize the loss curve (a big train/validation gap = overfitting)
5. Track the ratio of weight updates to weight magnitudes (a ratio around 0.01 is okay)
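Two of the numbered points above, sketched together: log-space random sampling (points 1–2) and the update/weight-magnitude ratio (point 5). All ranges and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random search: sample learning rate and regularization log-uniformly
lrs  = [10 ** rng.uniform(-6, -2) for _ in range(10)]
regs = [10 ** rng.uniform(-5, 0) for _ in range(10)]

# Update ratio: ||lr * dW|| / ||W|| should come out around 0.01
W  = rng.standard_normal((100, 100)) * 0.02   # current weights
dW = rng.standard_normal((100, 100))          # synthetic gradient
lr = 2e-4
ratio = np.linalg.norm(lr * dW) / np.linalg.norm(W)
print(f"update/weight ratio: {ratio:.3f}")  # ≈ 0.01
```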
Side Notes
How to count a network's size: number of neurons = sum of the layer sizes; number of weights = sum over layers of (neurons in the current layer × neurons in the next layer); number of biases = number of neurons.
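The counting formulas above, for a hypothetical 3072 → 100 → 10 fully-connected net (the input layer is not counted as neurons here, an assumption):

```python
# Parameter counting for a fully-connected net (illustrative layer sizes)
layers = [3072, 100, 10]   # input -> hidden -> output

num_neurons = sum(layers[1:])   # sum of layer sizes, input excluded
num_weights = sum(a * b for a, b in zip(layers[:-1], layers[1:]))
num_biases  = num_neurons       # one bias per neuron

print(num_neurons, num_weights, num_biases)  # 110 308200 110
```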
Summary
Recommended workflow for setting up a network:
1. Activation functions (use ReLU)
2. Data preprocessing (subtract mean)
3. Weight initialization (Xavier (/2) init)
4. Batch normalization (use it)
5. Hyperparameter optimization (randomly sample hyperparameters, in log space when appropriate)
- Preprocess the data to [-1,1]: mean 0, variance 1
- Initialize the weights W from a Gaussian with standard deviation $\sqrt{2/n}$, where n is the number of input neurons from the previous layer
- Regularize with L2 and dropout (remember to rescale by p to recover the test output)
- Use Batch Normalization
- Choose the loss function according to the task
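A sketch of the dropout rescaling noted above: drop activations at train time and multiply by p at test time so the expected activations match (p = 0.5 and the sizes are assumptions; "inverted dropout" instead divides by p at train time):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                            # keep probability (assumed)
x = rng.standard_normal(1000)      # activations of some layer

# Train time: keep each activation with probability p
mask = rng.random(x.shape) < p
train_out = x * mask

# Test time: keep everything, but rescale by p so that
# E[test_out] matches E[train_out]
test_out = x * p

print(round(float(mask.mean()), 2))  # ≈ 0.5: fraction of units kept
```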
Open questions:
1. The /2 in the initialization variance ($1/\sqrt{n_{in}/2}$): see Kaiming He's paper for details.
2. BN and dropout are both work from the last couple of years; worth reading up on. Alex and LeCun are still the pioneers of CNNs; as my senior labmate said, read more of Alex's papers.
3. Hierarchical Softmax