Before starting the new lecture, let's first fill in the material from the previous lecture's related reading notes.
Lecture 5 Notes 2 & 3
Regularization
L1 and L2 regularization terms added to the loss function, plus max-norm constraints: constrain the weight vector w so that $\|w\|_2 < c$, where c is typically 3 or 4.
Dropout: typically p = 0.5, i.e. each neuron is kept active with probability 0.5, so every sample trains a different masked sub-network. At test time all neurons are turned on, but the resulting activations must be scaled by p = 0.5. Intuitive benefits: 1. it forces the network to learn redundant representations; 2. it effectively trains a large ensemble of models that share parameters.
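A minimal numpy sketch of the vanilla dropout scheme described above (the two-layer toy network, layer shapes, and variable names are illustrative assumptions):

```python
import numpy as np

p = 0.5  # probability of keeping a neuron active

def train_forward(X, W1, W2):
    # sample a fresh binary mask, so each example trains a different sub-network
    H = np.maximum(0, X.dot(W1))
    mask = np.random.rand(*H.shape) < p   # drop each unit with probability 1 - p
    H = H * mask
    return H.dot(W2)

def test_forward(X, W1, W2):
    # all neurons are active at test time, so scale the activations by p
    H = np.maximum(0, X.dot(W1)) * p
    return H.dot(W2)
```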
Practice: use a single, global L2 regularization strength (cross-validated) + Dropout (p = 0.5).
Loss Functions
For classification tasks, use the Softmax or SVM loss. When there are very many classes, Hierarchical Softmax can be used.
Attribute classification can use a logistic regression classifier with two classes (0, 1) for each attribute.
For regression tasks, the L2 or L1 loss is generally used.
When faced with a regression task, first consider whether it is absolutely necessary. Instead, have a strong preference for discretizing your outputs into bins and performing classification over them whenever possible.
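A minimal sketch of the binning idea, assuming a 1-D continuous target and hypothetical bin edges:

```python
import numpy as np

# hypothetical continuous targets in [0, 10), discretized into 20 bins
bin_edges = np.linspace(0.0, 10.0, num=21)
y = np.array([0.3, 4.7, 9.2])

# class label = index of the bin each value falls into (0..19)
labels = np.digitize(y, bin_edges[1:-1])
# then train a softmax classifier over the 20 bins instead of an L2 regression
```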
Gradient Checks
Use the centered formula: estimate the gradient from both sides of the point. $$\frac{df(x)}{dx} = \frac{f(x+h)-f(x-h)}{2h}$$
Use the relative error for the comparison:
$$ \frac{|f_a - f_n|}{\max(|f_a|, |f_n|)} $$
where $f_a$ is the analytic gradient and $f_n$ is the numerical gradient.
In practice:
- relative error > 1e-2 usually means the gradient is probably wrong
- 1e-2>relative error>1e-4 should make you feel uncomfortable
- 1e-4>relative error is usually okay for objectives with kinks, but if there are no kinks (such as with tanh nonlinearities and softmax), then 1e-4 is too high.
- 1e-7 and less you should be happy
- if the gradients themselves are very small (on the order of 1e-10 or less in absolute value), the comparison becomes unreliable and deserves extra care
Relative errors build up with network depth; for a 10-layer network, 1e-2 may still be acceptable.
Summary: be careful with the step size h; gradient checking is important; don't let the regularization overwhelm the data; turn off dropout/augmentation when gradient checking; check only a few dimensions.
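A minimal sketch of a centered-difference gradient check that uses the relative-error criterion above and spot-checks only a few random dimensions (function and argument names are illustrative):

```python
import numpy as np

def grad_check_sparse(f, x, analytic_grad, num_checks=10, h=1e-5):
    # f(x) returns the scalar loss; x and analytic_grad are arrays of the same shape
    for _ in range(num_checks):
        ix = tuple(np.random.randint(n) for n in x.shape)  # a random dimension
        old = x[ix]
        x[ix] = old + h
        fxph = f(x)            # f(x + h)
        x[ix] = old - h
        fxmh = f(x)            # f(x - h)
        x[ix] = old            # restore

        grad_numeric = (fxph - fxmh) / (2.0 * h)   # centered formula
        grad_analytic = analytic_grad[ix]
        rel_error = abs(grad_numeric - grad_analytic) / \
            (max(abs(grad_numeric), abs(grad_analytic)) + 1e-12)
        print('numerical: %f analytic: %f, relative error: %e'
              % (grad_numeric, grad_analytic, rel_error))
```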
Lecture 6
How to do parameter updates
1. SGD: x += -learning_rate * dx
2. Momentum: allows velocity to build up along consistent, shallow directions; along steep directions the velocity is damped because the gradient keeps changing sign.
v = mu * v - learning_rate * dx
x += v
3. Nesterov Momentum:
$v_t = \mu v_{t-1} - \epsilon \nabla f(\theta_{t-1} + \mu v_{t-1})$
$\theta_t = \theta_{t-1} + v_t$
4. Adagrad: equalizes the effective step size along steep and shallow directions.
cache += dx ** 2
x += -learning_rate * dx / (np.sqrt(cache) + 1e-7)
5. Adam: works well; the full form adds a bias correction to m and v (see the sketch after this list).
m = beta1 * m + (1 - beta1) * dx
v = beta2 * v + (1 - beta2) * (dx ** 2)
x += -learning_rate * m / (np.sqrt(v) + 1e-7)
Typically beta1 and beta2 can be set to 0.9 and 0.995.
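The snippet above omits the bias correction; a minimal numpy sketch of the full update, with illustrative names and defaults (beta2 = 0.995 as suggested above):

```python
import numpy as np

def adam_update(x, dx, m, v, t, learning_rate=1e-3,
                beta1=0.9, beta2=0.995, eps=1e-7):
    # m and v start as zero arrays shaped like x; t is the 1-based step count
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * (dx ** 2)
    mt = m / (1 - beta1 ** t)   # bias-corrected first moment
    vt = v / (1 - beta2 ** t)   # bias-corrected second moment
    x += -learning_rate * mt / (np.sqrt(vt) + eps)
    return x, m, v
```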
Second Order Optimization
1. Taylor expansion (second-order approximation of the loss).
2. Newton's method: inverting the Hessian H costs O(n^3), and H itself has n^2 entries; with n on the order of millions this is infeasible.
3. Quasi-Newton methods (e.g. BFGS): approximate the inverse Hessian, reducing the cost to O(n^2).
- L-BFGS (Limited-memory BFGS): works well in full-batch mode, but does not transfer well to mini-batch training.
Practice: 1. Train multiple independent models.
2. At test time, average their results (see the sketch below).
Fun tricks: a small boost comes from averaging models trained from multiple initializations, and from keeping track of a running average of the parameter vector.
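A minimal sketch of test-time averaging across independently trained models, assuming each model exposes a hypothetical predict_proba(X) that returns class probabilities:

```python
import numpy as np

def ensemble_predict(models, X):
    # average the class probabilities of independently trained models
    probs = np.mean([model.predict_proba(X) for model in models], axis=0)
    return np.argmax(probs, axis=1)
```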
Annealing the learning rate
- Step decay: reduce the learning rate, e.g. by half, every t epochs
- Exponential decay $\alpha = \alpha_0 e^{-kt}$
- 1/t decay: $\alpha = \alpha_0 /(1+kt)$
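A minimal sketch of the three schedules (alpha0, k, and the step interval are hyperparameters to tune; names are illustrative):

```python
import numpy as np

def step_decay(alpha0, epoch, drop=0.5, every=10):
    # halve the learning rate every `every` epochs
    return alpha0 * (drop ** (epoch // every))

def exp_decay(alpha0, k, t):
    # alpha = alpha0 * exp(-k t)
    return alpha0 * np.exp(-k * t)

def inv_t_decay(alpha0, k, t):
    # alpha = alpha0 / (1 + k t)
    return alpha0 / (1.0 + k * t)
```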
Hyperparameter optimization
- Staged search from coarse to fine
- Bayesian hyperparameter optimization
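A minimal sketch of the coarse stage of random search, sampling the learning rate and regularization strength on a log scale (the ranges and the train_and_eval helper are hypothetical):

```python
import numpy as np

results = []
for _ in range(100):
    # sample hyperparameters on a log scale, over wide ranges for the coarse stage
    lr = 10 ** np.random.uniform(-6, -1)
    reg = 10 ** np.random.uniform(-5, 2)
    val_acc = train_and_eval(lr=lr, reg=reg)   # hypothetical training/eval helper
    results.append((val_acc, lr, reg))

# inspect the best settings, then narrow the ranges and repeat (coarse -> fine)
results.sort(reverse=True)
```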
Model ensembles can improve the performance of a neural network by a few percent:
- Same Model, Different Initializations
- Top models discovered during cross-validation
- Different checkpoints of a single model: when training is very expensive, take different checkpoints of a single network and use them to form an ensemble (cheap and practical: pick a few of the better epochs/checkpoints).
- Running average of parameters during training (keep an average of the weights over the course of training; intuitively, the optimizer keeps bouncing around a bowl-shaped basin, and the averaged point tends to be closer to the bottom).
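A minimal sketch of that running average, assuming x is the live parameter vector and x_test is the smoothed copy used for evaluation (the decay rate 0.995 is an illustrative choice):

```python
# inside the training loop, after each parameter update
# (x is the live parameter vector, x_test the smoothed copy evaluated at test time)
x_test = 0.995 * x_test + 0.005 * x
```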
Summary
1. Gradient checking on a small sample of data is important; also make sure the initialization is correct.
2. The ratio of the update magnitude to the parameter magnitude should be around 1e-3 (e.g. checked on the first-layer weights).
3. SGD + Nesterov Momentum or Adam is recommended.
4. Decay the learning rate over the course of training.
5. Search for good hyperparameters with random search (not grid search), staged from coarse to fine.
6.Form model ensembles for extra performance