This lecture mainly covers how backpropagation (BP) works.
Backpropagation

Here, by the chain rule: $\frac{df}{dx} = \frac{df}{dq} \frac{dq}{dx}$
So, for the example circuit $f = qz$ with $q = x + y$: $\frac{df}{df} = 1$, and $\frac{df}{dz} = \frac{df}{df} \cdot q = 3$; the remaining gradients follow the same pattern (each local derivative is evaluated at the current neuron's value).
Add gate: gradient distributor
Max gate: gradient router (only the input that achieved the max receives the gradient)
Mul gate: gradient "switcher" (the inputs swap gradients: each input's gradient is the upstream gradient scaled by the other input's value)
BP gradient = [upstream gradient] × [local gradient]
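The upstream × local rule can be traced on the example circuit $f = qz$, $q = x + y$, in a few lines of plain Python (the input values x = -2, y = 5, z = -4 are the usual lecture example, an assumption here; they give q = 3 as above):

```python
# f(x, y, z) = (x + y) * z, with the usual example inputs (an assumption):
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y            # add gate: q = 3
f = q * z            # mul gate: f = -12

# Backward pass: each gradient = upstream gradient * local gradient
df_df = 1.0
df_dz = df_df * q    # mul gate: local gradient is the *other* input
df_dq = df_df * z
df_dx = df_dq * 1.0  # add gate distributes the upstream gradient unchanged
df_dy = df_dq * 1.0

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```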
Here sigmoid is used as the activation function; back-propagating the gradient gate by gate gives the same value as differentiating the sigmoid function directly. $\sigma(x)=\frac{1}{1+e^{-x}}$
$$\frac{d\sigma (x)}{dx} = \frac{e^{-x}}{(1+e^{-x})^2} = (1-\sigma (x))\sigma(x)$$
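This identity can be checked numerically, the same idea as gradient checking (a minimal sketch; the test point x = 1.0 is arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 1.0                                                 # arbitrary test point
analytic = (1.0 - sigmoid(x)) * sigmoid(x)              # closed form above
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # centered difference

print(abs(analytic - numeric) < 1e-8)  # True
```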
[Hint: each training loop includes a forward pass (FP) and a backward pass (BP), because each BP depends on the results of the current FP.]
(Before) Linear: $f = Wx$
(Now) 2-layer NN: $f = W_2 \max(0, W_1 x)$
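A forward-pass sketch of this 2-layer net (the sizes 3072 → 100 → 10 are illustrative assumptions, e.g. a CIFAR-10-shaped input):

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.standard_normal(3072)                # flattened input image
W1 = rng.standard_normal((100, 3072)) * 0.01  # first-layer weights
W2 = rng.standard_normal((10, 100)) * 0.01    # second-layer weights

h = np.maximum(0, W1 @ x)   # max(0, W1 x): ReLU hidden layer
f = W2 @ h                  # class scores

print(h.shape, f.shape)     # (100,) (10,)
```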
Vectorized: after vectorization, a layer's W can act on high-dimensional inputs, e.g. 4096; if the output is also 4096-dimensional, the Jacobian is 4096 × 4096. Images are also usually processed in batches, so with a batch size of 100 the total is 100 × the size of one Jacobian.
Lecture 5
A Bit of History
The BP algorithm, proposed in 1986, is now hugely popular. In the past, deep learning stalled because networks were complex and hard to make converge, and compute speed and data were also major constraints. With today's experimental understanding of CNNs and related models, many effective methods for training and preventing overfitting have been proposed.
In 2006 Hinton proposed RBMs: pre-train layer by layer, then assemble the network and fine-tune it with BP; the results were good.
In 2010 a Microsoft group added HMMs to the network architecture; in 2012 AlexNet's success on ImageNet ignited the current wave.
How to train a CNN:
1. One-time setup: activation functions, preprocessing, weight initialization, regularization, gradient checking
2. Training dynamics: babysitting the learning process, parameter updates, hyperparameter optimization
3. Evaluation: model ensembles
Activation Function
Comparing the activation functions:
Sigmoid: 1. Squashes numbers into the range [0,1] 2. Nice interpretation as a "firing rate"
Three major problems: 1. Saturated neurons "kill" the gradient
2. Sigmoid outputs are not zero-centered (the inputs x to the next layer all have the same sign, which makes the optimization path zigzag)
3. exp() is computationally expensive
tanh: 1. Squashes into [-1,1]
2. Zero-centered (nice)
3. Still kills the gradient when saturated
ReLU: 1. Does not saturate (in the + region)
2. Computationally efficient
3. Converges much faster
but 1. Not zero-centered
2. An annoyance: the gradient for x < 0 is killed
3. Neurons can sometimes "die" (roughly: once the input stays negative, the neuron can never update again; be careful not to make the learning rate too large)
Leaky ReLU: $f(x) = \max(\alpha x, x)$, typically $\alpha = 0.01$
will not die
Exponential Linear Units (ELU):
For $x < 0$, $f(x) = \alpha (\exp(x)-1)$
All the benefits of ReLU, outputs closer to zero mean, but requires exp()
Maxout “Neuron”:
$$max(w_1^Tx+b_1,w_2^Tx+b_2)$$
Operates in a linear regime; generalizes ReLU and Leaky ReLU, but doubles the number of parameters.
Summary: in general use ReLU; Leaky ReLU / Maxout / ELU are worth trying; don't use sigmoid; tanh does not perform well either.
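The activations compared above can be sketched in a few NumPy functions (the α defaults shown are common choices, an assumption where the notes don't state them):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # squashes to (0, 1), not zero-centered

def relu(x):
    return np.maximum(0.0, x)            # kills the gradient for x < 0

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)      # small slope for x < 0, will not die

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))  # closer to zero mean

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))          # [0. 0. 2.]
print(leaky_relu(x))    # small negative values survive for x < 0
```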
Data Preprocessing
Normalization: subtract mean and normalize; for images, one usually subtracts the mean image or the per-channel mean.
Dimensionality reduction: PCA and whitening (or similar methods) can reduce the dimensionality of high-dimensional data, but this is uncommon for images.
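A sketch of the two common mean-subtraction options (the batch shape N×H×W×C and all sizes are illustrative assumptions):

```python
import numpy as np

# Synthetic image batch: 100 images of 32x32x3 (illustrative sizes)
X = np.random.rand(100, 32, 32, 3).astype(np.float32)

# Option 1: subtract the mean image (one full 32x32x3 mean)
mean_image = X.mean(axis=0)
X_centered = X - mean_image

# Option 2: subtract the per-channel mean (just 3 numbers)
channel_mean = X.mean(axis=(0, 1, 2))
X_channel = X - channel_mean

print(mean_image.shape, channel_mean.shape)  # (32, 32, 3) (3,)
```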
Weight Initialization (important)
This part was overlooked in the past, but it is very important.
1. All zeros: every neuron then computes the same output, error, and update, so this fails.
2. Small random numbers (Gaussian with mean 0 and standard deviation 0.01): works okay for small networks, but leads to non-homogeneous distributions of activations across the layers of the network.
- With standard deviation 1.0, all the neurons saturate completely at either -1 or 1, and the gradients are all zero.
- Xavier initialization (the most suitable so far): scale the weights by $1/\sqrt{n_{in}}$; a reasonable initialization that assumes linear activations.
- Note the additional /2 (for ReLU): scale by $1/\sqrt{n_{in}/2}$ instead.
Here $n_{in}$ is the number of inputs from the previous layer; this keeps each layer's activation distribution similar to the previous layer's, which empirically improves the rate of convergence.
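A sketch of the three initialization schemes above (the layer sizes are illustrative; fan_in is the $n_{in}$ of the notes):

```python
import numpy as np

fan_in, fan_out = 512, 256      # illustrative layer sizes
rng = np.random.default_rng(0)

# Small Gaussian: fine for shallow nets, activations shrink in deep ones
W_small = rng.standard_normal((fan_in, fan_out)) * 0.01

# Xavier: keeps the activation variance roughly constant across layers
W_xavier = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

# The "/2" variant for ReLU (half of the units are zeroed out)
W_he = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in / 2)

print(round(float(W_xavier.std()), 3))  # ≈ 1/sqrt(512) ≈ 0.044
```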
Coming soon… proper initialization is an active area of research.
Batch Normalization

- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization and slightly reduces the need for dropout(Maybe)
The scale and shift parameters inside BN are learned by the network itself: after normalization, the scale and shift of the data distribution are re-adjusted according to what the network needs to learn.
Notice: at test time the BN layer behaves differently: the mean/std are not computed from the batch; instead, a single fixed empirical mean/std of the activations recorded during training is used.
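A minimal batch-norm sketch showing the train/test difference described above (the momentum value and sizes are assumptions; the backward pass is omitted):

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)  # learned scale/shift
        self.running_mean, self.running_var = np.zeros(dim), np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, train=True):
        if train:
            mu, var = x.mean(axis=0), x.var(axis=0)  # batch statistics
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mu
            self.running_var  = m * self.running_var  + (1 - m) * var
        else:
            # test time: fixed empirical statistics from training
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(4)
x = np.random.randn(32, 4) * 3 + 5        # batch with mean 5, std 3
out = bn.forward(x, train=True)
print(np.allclose(out.mean(axis=0), 0, atol=1e-6))  # True: normalized per batch
```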
Babysitting the Learning Process
A small in-class demo CNN: after preprocessing, choose 50 hidden neurons and 10 output neurons, and train on CIFAR-10 with SGD. The key is to first fit a small subset of the data. Examples of diagnosing situations met in practice: 1. Small loss, train accuracy 1.00 (nice) 2. Loss barely changing (learning rate too small) 3. Loss exploding to NaN or Inf (learning rate too high)
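The three diagnoses above can be wrapped in a tiny helper (`diagnose` and its threshold are my own illustration, not from the lecture):

```python
import math

def diagnose(losses, tol=1e-3):
    """Classify a loss history per the three cases above (illustrative)."""
    if any(math.isnan(l) or math.isinf(l) for l in losses):
        return "loss exploding: learning rate too high"
    if abs(losses[0] - losses[-1]) < tol:
        return "loss barely changing: learning rate too low"
    return "loss decreasing: looks okay"

print(diagnose([2.30, 1.80, 1.20, 0.70]))  # loss decreasing: looks okay
print(diagnose([2.30, 2.2999]))            # loss barely changing: learning rate too low
print(diagnose([2.30, float("inf")]))      # loss exploding: learning rate too high
```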
Hyperparameter Optimization
1. Cross-validation: coarse -> fine
First stage: only a few epochs, to get a rough idea of which params work
Second stage: longer running time, finer search (repeat as necessary)
(If the cost ever exceeds 3× the original cost, break out early)
- It is best to optimize in log space
2. Grid search vs. random search: random search is better (2012)
3.Hyperparameter play with: network architecture, learning rate, its decay schedule, update type, regularization(L2/Dropout strength)
4. Monitor and visualize the loss curve (a big train/validation gap = overfitting)
5. Track the ratio of weight updates to weight magnitudes (a ratio around 0.01 is okay)
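Two of the numbered points above, sketched together: log-space random sampling (points 1–2) and the update/weight-magnitude ratio (point 5). All ranges and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random search: sample learning rate and regularization log-uniformly
lrs  = [10 ** rng.uniform(-6, -2) for _ in range(10)]
regs = [10 ** rng.uniform(-5, 0) for _ in range(10)]

# Update ratio: ||lr * dW|| / ||W|| should come out around 0.01
W  = rng.standard_normal((100, 100)) * 0.02   # current weights
dW = rng.standard_normal((100, 100))          # synthetic gradient
lr = 2e-4
ratio = np.linalg.norm(lr * dW) / np.linalg.norm(W)
print(f"update/weight ratio: {ratio:.3f}")  # ≈ 0.01
```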
Side Notes
How to count a network's size: number of neurons = sum of the layer sizes; number of weights = sum over layers of (neurons in the current layer × neurons in the next layer); number of biases = number of neurons.
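The counting formulas above, for a hypothetical 3072 → 100 → 10 fully-connected net (the input layer is not counted as neurons here, an assumption):

```python
# Parameter counting for a fully-connected net (illustrative layer sizes)
layers = [3072, 100, 10]   # input -> hidden -> output

num_neurons = sum(layers[1:])   # sum of layer sizes, input excluded
num_weights = sum(a * b for a, b in zip(layers[:-1], layers[1:]))
num_biases  = num_neurons       # one bias per neuron

print(num_neurons, num_weights, num_biases)  # 110 308200 110
```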
Summary
Recommended workflow for setting up a network:
1. Activation functions (use ReLU)
2. Data preprocessing (subtract mean)
3. Weight initialization (Xavier (/2) init)
4. Batch normalization (use it)
5. Hyperparameter optimization (randomly sample hyperparameters, in log space when appropriate)
- Preprocess the data to [-1,1]: mean 0, variance 1
- Initialize the weights W from a Gaussian with standard deviation $\sqrt{2/n}$, where n is the number of input neurons from the previous layer
- Regularize with L2 and dropout (remember to rescale by p to recover the test output)
- Use Batch Normalization
- Choose the loss function according to the task
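A sketch of the dropout rescaling noted above: drop activations at train time and multiply by p at test time so the expected activations match (p = 0.5 and the sizes are assumptions; "inverted dropout" instead divides by p at train time):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                            # keep probability (assumed)
x = rng.standard_normal(1000)      # activations of some layer

# Train time: keep each activation with probability p
mask = rng.random(x.shape) < p
train_out = x * mask

# Test time: keep everything, but rescale by p so that
# E[test_out] matches E[train_out]
test_out = x * p

print(round(float(mask.mean()), 2))  # ≈ 0.5: fraction of units kept
```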
Open questions:
1. The /2 in the initialization variance ($1/\sqrt{n_{in}/2}$): see Kaiming He's paper for details.
2. BN and dropout are both work from the last couple of years; worth reading up on. Alex and LeCun are still the pioneers of CNNs; as my senior labmate said, read more of Alex's papers.
3. Hierarchical Softmax