Before starting the new lecture, let's first fill in the material from the previous lecture's related reading notes.
Lecture 5 Notes 2 & 3
Regularization
L1 and L2 regularization terms added to the loss function, plus max-norm constraints: constrain the weight vector w so that $\|w\|_2 < c$, where c is typically 3 or 4.
Dropout: typically p = 0.5, i.e. each neuron is kept active with probability 0.5, so every sample trains a different masked sub-network. At test time all neurons are turned on, but the resulting activations must be scaled by p = 0.5. Intuitive benefits: 1. it forces the network to learn redundant representations; 2. it effectively trains a large ensemble of models that share parameters.
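A minimal numpy sketch of the vanilla dropout scheme described above (the two-layer toy network, layer shapes, and variable names are illustrative assumptions):

```python
import numpy as np

p = 0.5  # probability of keeping a neuron active

def train_forward(X, W1, W2):
    # sample a fresh binary mask, so each example trains a different sub-network
    H = np.maximum(0, X.dot(W1))
    mask = np.random.rand(*H.shape) < p   # drop each unit with probability 1 - p
    H = H * mask
    return H.dot(W2)

def test_forward(X, W1, W2):
    # all neurons are active at test time, so scale the activations by p
    H = np.maximum(0, X.dot(W1)) * p
    return H.dot(W2)
```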
Practice: use a single, global L2 regularization strength (cross-validated) + Dropout (p = 0.5).
Loss Functions
For classification tasks, use the Softmax or SVM loss. When there are very many classes, Hierarchical Softmax can be used.
Attribute classification can use a logistic regression classifier with two classes (0, 1) for each attribute.
For regression tasks, the L2 or L1 loss is generally used.
When faced with a regression task, first consider whether it is absolutely necessary. Instead, have a strong preference for discretizing your outputs into bins and performing classification over them whenever possible.
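A minimal sketch of the binning idea, assuming a 1-D continuous target and hypothetical bin edges:

```python
import numpy as np

# hypothetical continuous targets in [0, 10), discretized into 20 bins
bin_edges = np.linspace(0.0, 10.0, num=21)
y = np.array([0.3, 4.7, 9.2])

# class label = index of the bin each value falls into (0..19)
labels = np.digitize(y, bin_edges[1:-1])
# then train a softmax classifier over the 20 bins instead of an L2 regression
```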
Gradient Checks
Use the centered formula: estimate the gradient from both sides of the point. $$\frac{df(x)}{dx} = \frac{f(x+h)-f(x-h)}{2h}$$
Use the relative error for the comparison:
$$ \frac{|f_a - f_n|}{\max(|f_a|, |f_n|)} $$
where $f_a$ is the analytic gradient and $f_n$ is the numerical gradient.
In practice:
- relative error > 1e-2 usually means the gradient is probably wrong
- 1e-2>relative error>1e-4 should make you feel uncomfortable
- 1e-4>relative error is usually okay for objectives with kinks, but if there are no kinks (such as with tanh nonlinearities and softmax), then 1e-4 is too high.
- 1e-7 and less you should be happy
- if the gradients themselves are very small (on the order of 1e-10 or less in absolute value), the comparison becomes unreliable and deserves extra care
Relative errors build up with network depth; for a 10-layer network, 1e-2 may still be acceptable.
Summary: be careful with the step size h; gradient checking is important; don't let the regularization overwhelm the data; turn off dropout/augmentation when gradient checking; check only a few dimensions.
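A minimal sketch of a centered-difference gradient check that uses the relative-error criterion above and spot-checks only a few random dimensions (function and argument names are illustrative):

```python
import numpy as np

def grad_check_sparse(f, x, analytic_grad, num_checks=10, h=1e-5):
    # f(x) returns the scalar loss; x and analytic_grad are arrays of the same shape
    for _ in range(num_checks):
        ix = tuple(np.random.randint(n) for n in x.shape)  # a random dimension
        old = x[ix]
        x[ix] = old + h
        fxph = f(x)            # f(x + h)
        x[ix] = old - h
        fxmh = f(x)            # f(x - h)
        x[ix] = old            # restore

        grad_numeric = (fxph - fxmh) / (2.0 * h)   # centered formula
        grad_analytic = analytic_grad[ix]
        rel_error = abs(grad_numeric - grad_analytic) / \
            (max(abs(grad_numeric), abs(grad_analytic)) + 1e-12)
        print('numerical: %f analytic: %f, relative error: %e'
              % (grad_numeric, grad_analytic, rel_error))
```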
Lecture 6
How to do parameter updates
1. SGD: x += -learning_rate * dx
2. Momentum: allows velocity to build up along consistent, shallow directions; along steep directions the velocity is damped because the gradient keeps changing sign.
v = mu * v - learning_rate * dx
x += v
3. Nesterov Momentum:
$v_t = \mu v_{t-1} - \epsilon \nabla f(\theta_{t-1} + \mu v_{t-1})$
$\theta_t = \theta_{t-1} + v_t$
4. Adagrad: equalizes the effective step size along steep and shallow directions.
cache += dx ** 2
x += -learning_rate * dx / (np.sqrt(cache) + 1e-7)
5. Adam: works well; the full form adds a bias correction to m and v (see the sketch after this list).
m = beta1 * m + (1 - beta1) * dx
v = beta2 * v + (1 - beta2) * (dx ** 2)
x += -learning_rate * m / (np.sqrt(v) + 1e-7)
Typically beta1 and beta2 can be set to 0.9 and 0.995.
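The snippet above omits the bias correction; a minimal numpy sketch of the full update, with illustrative names and defaults (beta2 = 0.995 as suggested above):

```python
import numpy as np

def adam_update(x, dx, m, v, t, learning_rate=1e-3,
                beta1=0.9, beta2=0.995, eps=1e-7):
    # m and v start as zero arrays shaped like x; t is the 1-based step count
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * (dx ** 2)
    mt = m / (1 - beta1 ** t)   # bias-corrected first moment
    vt = v / (1 - beta2 ** t)   # bias-corrected second moment
    x += -learning_rate * mt / (np.sqrt(vt) + eps)
    return x, m, v
```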
Second Order Optimization
1. Taylor expansion (second-order approximation of the loss).
2. Newton's method: inverting the Hessian H costs O(n^3), and H itself has n^2 entries; with n on the order of millions this is infeasible.
3. Quasi-Newton methods (e.g. BFGS): approximate the inverse Hessian, reducing the cost to O(n^2).
- L-BFGS (Limited-memory BFGS): works well in full-batch mode, but does not transfer well to mini-batch training.
Practice: 1. Train multiple independent models.
2. At test time, average their results (see the sketch below).
Fun tricks: a small boost comes from averaging models trained from multiple initializations, and from keeping track of a running average of the parameter vector.
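A minimal sketch of test-time averaging across independently trained models, assuming each model exposes a hypothetical predict_proba(X) that returns class probabilities:

```python
import numpy as np

def ensemble_predict(models, X):
    # average the class probabilities of independently trained models
    probs = np.mean([model.predict_proba(X) for model in models], axis=0)
    return np.argmax(probs, axis=1)
```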
Annealing the learning rate
- Step decay: reduce the learning rate, e.g. by half, every t epochs
- Exponential decay $\alpha = \alpha_0 e^{-kt}$
- 1/t decay: $\alpha = \alpha_0 /(1+kt)$
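A minimal sketch of the three schedules (alpha0, k, and the step interval are hyperparameters to tune; names are illustrative):

```python
import numpy as np

def step_decay(alpha0, epoch, drop=0.5, every=10):
    # halve the learning rate every `every` epochs
    return alpha0 * (drop ** (epoch // every))

def exp_decay(alpha0, k, t):
    # alpha = alpha0 * exp(-k t)
    return alpha0 * np.exp(-k * t)

def inv_t_decay(alpha0, k, t):
    # alpha = alpha0 / (1 + k t)
    return alpha0 / (1.0 + k * t)
```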
Hyperparameter optimization
- Staged search from coarse to fine
- Bayesian hyperparameter optimization
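A minimal sketch of the coarse stage of random search, sampling the learning rate and regularization strength on a log scale (the ranges and the train_and_eval helper are hypothetical):

```python
import numpy as np

results = []
for _ in range(100):
    # sample hyperparameters on a log scale, over wide ranges for the coarse stage
    lr = 10 ** np.random.uniform(-6, -1)
    reg = 10 ** np.random.uniform(-5, 2)
    val_acc = train_and_eval(lr=lr, reg=reg)   # hypothetical training/eval helper
    results.append((val_acc, lr, reg))

# inspect the best settings, then narrow the ranges and repeat (coarse -> fine)
results.sort(reverse=True)
```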
Model ensembles can improve the performance of a neural network by a few percent:
- Same Model, Different Initializations
- Top models discovered during cross-validation
- Different checkpoints of a single model: when training is very expensive, take different checkpoints of a single network and use them to form an ensemble (cheap and practical: pick a few of the better epochs/checkpoints).
- Running average of parameters during training (keep an average of the weights over the course of training; intuitively, the optimizer keeps bouncing around a bowl-shaped basin, and the averaged point tends to be closer to the bottom).
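A minimal sketch of that running average, assuming x is the live parameter vector and x_test is the smoothed copy used for evaluation (the decay rate 0.995 is an illustrative choice):

```python
# inside the training loop, after each parameter update
# (x is the live parameter vector, x_test the smoothed copy evaluated at test time)
x_test = 0.995 * x_test + 0.005 * x
```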
Summary
1. Gradient checking on a small sample of data is important; also make sure the initialization is correct.
2. The ratio of the update magnitude to the parameter magnitude should be around 1e-3 (e.g. checked on the first-layer weights).
3. SGD + Nesterov Momentum or Adam is recommended.
4. Decay the learning rate over the course of training.
5. Search for good hyperparameters with random search (not grid search), staged from coarse to fine.
6.Form model ensembles for extra performance