Created by: sidgoyal78
This PR adds an alternative implementation of the SparseSGD kernel that does not use CUDA atomics. It is not the most efficient implementation. It partitions the work along the input dimension, so that each thread works in parallel on its own dimension across the sparse table.
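
For reviewers, here is a minimal CPU-side sketch of the partitioning idea, not the kernel in this PR. It assumes a row-major table and gradient, a `rows` index array that may contain duplicates, and a plain `param -= lr * grad` update. The function name `sparse_sgd_by_column` and all parameter names are hypothetical. The outer loop over `tid` stands in for the GPU threads. Since each thread only touches its own columns, duplicate row indices are applied one after another by the same thread, which is presumably why atomics are not needed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical emulation of a column-partitioned sparse SGD update.
//   table: num_table_rows x dim, row-major (the parameter being updated)
//   grad:  rows.size() x dim, row-major (values of the sparse gradient)
//   rows:  rows[i] is the table row that gradient row i applies to;
//          duplicates are allowed
void sparse_sgd_by_column(std::vector<float>& table,
                          const std::vector<float>& grad,
                          const std::vector<std::int64_t>& rows,
                          std::size_t dim, float lr,
                          std::size_t num_threads) {
  // On a GPU, each value of tid would be a separate thread running concurrently.
  for (std::size_t tid = 0; tid < num_threads; ++tid) {
    // Strided loop: thread tid owns columns tid, tid + num_threads, ...
    for (std::size_t col = tid; col < dim; col += num_threads) {
      // Walk every sparse row sequentially within this thread, so two
      // gradient rows that map to the same table row never race.
      for (std::size_t i = 0; i < rows.size(); ++i) {
        const std::size_t r = static_cast<std::size_t>(rows[i]);
        table[r * dim + col] -= lr * grad[i * dim + col];
      }
    }
  }
}
```

The cost of this layout is that parallelism is bounded by `dim` rather than by the number of sparse rows, which is consistent with the note above that this is not the most efficient implementation.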