Created by: yaoxuefeng6
This PR adds a newly designed version of the DownpourOpt worker to get better speed when a program contains multiple losses. In the Python front end, we fix the _minimize function of the distributed optimizer to support multiple losses in one program. To improve speed in the multi-loss case, we add additional sparse-table attributes:
1. "is_local": before a training pass, pull all feasign embeddings from the parameter server and build a local table; during training, embeddings are pulled directly from this local table. This attribute can be used when a sparse table is only used for feed-forward.
2. "is_async": pull this sparse table's embeddings asynchronously, so that other independent ops can run in the worker while the pull is in flight.
3. Use the DownpourOpt worker to reorder ops, gathering forward and backward ops by the loss they optimize, so that ops can be run flexibly in this worker. Currently, this worker has good support for the two attributes mentioned above.

example:
adam = fluid.optimizer.Adam(learning_rate=0.000005)
adam = fleet.distributed_optimizer(adam, strategy={"device_worker":"DownpourSGDOpt"})
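To illustrate the "is_local" attribute described above, here is a toy Python sketch (not Paddle's actual implementation; all class and method names are illustrative): instead of pulling embeddings from the parameter server on every lookup, the worker pulls the whole table once before a training pass and serves lookups from a local copy.

```python
class ParameterServer:
    """Stands in for the remote PS; each pull would be a network round trip."""
    def __init__(self, table):
        self.table = table
        self.pull_calls = 0

    def pull(self, keys):
        self.pull_calls += 1
        return {k: self.table[k] for k in keys}


class LocalSparseTable:
    """Mirrors a PS sparse table locally for read-only (feed-forward) use."""
    def __init__(self, ps, all_keys):
        # One bulk pull before the pass replaces many per-batch pulls.
        self.embeddings = ps.pull(all_keys)

    def lookup(self, keys):
        # During training, embeddings come from the local table, not the PS.
        return [self.embeddings[k] for k in keys]


ps = ParameterServer({i: [0.1 * i] * 4 for i in range(10)})
local = LocalSparseTable(ps, list(range(10)))
for batch in ([1, 3], [5, 7], [2, 4]):
    _ = local.lookup(batch)
print(ps.pull_calls)  # one pull for the whole pass, not one per batch
```

This only makes sense for tables whose embeddings are not updated during the pass, which is why the attribute is restricted to feed-forward-only sparse tables.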
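The "is_async" attribute can be sketched the same way (again a toy illustration, not Paddle's scheduler): the sparse pull is started in the background, ops that do not depend on its result run while it is in flight, and dependent ops wait on the result.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def pull_sparse(keys):
    """Simulated pull-sparse with network latency."""
    time.sleep(0.05)
    return {k: [0.0] * 4 for k in keys}


def independent_op():
    """An op with no data dependency on the pulled embeddings."""
    return "done"


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(pull_sparse, [1, 2, 3])  # start the pull asynchronously
    result = independent_op()                     # runs while the pull is in flight
    embeddings = future.result()                  # block only before dependent ops
print(result, sorted(embeddings))
```

The speedup comes from overlapping communication (the pull) with computation (the independent ops), which is exactly what the reordered op schedule in the DownpourOpt worker enables.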