Created by: sidgoyal78
This PR adds an implementation of the momentum operator.
In summary, the operator first computes a new velocity vector and then uses it to update the parameters:
velocity = mu * velocity + grad
param = param - learning_rate * velocity
(where mu is the momentum coefficient).
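
The update rule above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the operator code added in this PR: the function name `momentum_step`, the use of Python lists for tensors, and the sample values are all assumptions made for the example.

```python
def momentum_step(param, grad, velocity, learning_rate, mu):
    """Apply one momentum update and return (new_param, new_velocity).

    Hypothetical helper for illustration; inputs are equal-length lists of floats.
    """
    # velocity = mu * velocity + grad
    new_velocity = [mu * v + g for v, g in zip(velocity, grad)]
    # param = param - learning_rate * velocity (using the new velocity)
    new_param = [p - learning_rate * v for p, v in zip(param, new_velocity)]
    return new_param, new_velocity


# Two steps with a constant gradient, starting from zero velocity.
# All values here are made up for the example.
param, velocity = [1.0, 2.0], [0.0, 0.0]
grad = [0.5, -1.0]
for _ in range(2):
    param, velocity = momentum_step(param, grad, velocity, learning_rate=0.1, mu=0.5)
```

With zero initial velocity, the first step reduces to plain gradient descent; from the second step on, the accumulated velocity makes each step larger while the gradient keeps pointing the same way.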