Stochastic Gradient Descent (SGD)
Slides
Blog post on SGD variants
Training ResNet-50 on ImageNet in one hour
Some theory on scaling learning rate with batch size
Adding Gradient Noise
Temperature Cycling in SGD
MCMC with momentum