How to give different workers different weights when running dygraph in distributed mode
Created by: Meiyim
Scenario: attempting multi-worker distillation, where each worker should use a different teacher model.
Observation: even when each worker is explicitly made to load a different model (roughly along the lines of the sketch below), the losses produced by forward are strictly identical across workers. Is there a way to avoid this behaviour?
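What I mean by "loading different models per worker" is roughly the following. This is only a minimal sketch: the per-rank checkpoint names teacher_0 / teacher_1 are hypothetical, and a single D.Linear layer stands in for the real teacher network.

    import paddle.fluid as F
    import paddle.fluid.dygraph as D

    env = D.parallel.Env()
    D.enable_dygraph(F.CUDAPlace(env.dev_id))

    teacher = D.Linear(100, 100)  # stand-in for the real teacher network
    # hypothetical per-rank checkpoints: teacher_0.pdparams / teacher_1.pdparams
    state, _ = D.load_dygraph('teacher_%d' % env.local_rank)
    teacher.set_dict(state)
    teacher.eval()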
Test script (paddle 1.8.1, python 3.6):
To reiterate: although the behaviour below is expected in normal data-parallel training, in my current training scenario I want to keep prepare_context enabled while still avoiding every model's forward output being exactly identical.
import sys
import logging
import numpy as np
import paddle
import paddle.fluid as F
import paddle.fluid.dygraph as D
import paddle.fluid.layers as L


class Net(D.Layer):
    def __init__(self):
        super().__init__()
        self.fc = D.Linear(100, 100)

    def forward(self, i):
        return self.fc(i)


env = D.parallel.Env()
place = F.CUDAPlace(env.dev_id)
D.enable_dygraph(place)

load_path = 'ernie-1.0'  # not used in this minimal reproduction

model = Net()   # will be wrapped in DataParallel
model2 = Net()  # deliberately left un-wrapped
model.eval()
ctx = D.parallel.prepare_context()
model = D.parallel.DataParallel(model, ctx)

# model2 is initialized independently in each worker process,
# so its weight mean is expected to differ across workers.
print(model2.fc.weight.numpy().mean())

ids = D.to_variable(np.ones([10, 100]).astype(np.float32))
logits = model2(ids)
# ...yet the forward output of the un-wrapped model2 turns out identical.
print(logits.numpy().mean())
Launch command:
python3 -m paddle.distributed.launch test.py
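(Optionally, the two workers can be pinned to specific GPUs via the launcher's selected_gpus argument, which also shows up in the configuration dump below; the output that follows is from the plain command above.)

    python3 -m paddle.distributed.launch --selected_gpus=0,1 test.py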
Output (from the command above):
------------- Configuration Arguments -----------
cluster_node_ips: 127.0.0.1
log_dir: None
log_level: 20
node_ip: 127.0.0.1
print_config: True
selected_gpus: None
started_port: None
training_script: test.py
training_script_args: []
use_paddlecloud: False
------------------------------------------------
INFO 2020-07-03 15:10:13,230 launch.py:210] get cluster from args:job_server:None pods:['rank:0 id:None addr:127.0.0.1 port:None visible_gpu:[] trainers:["gpu:[\'0\'] endpoint:127.0.0.1:43290 rank:0", "gpu:[\'1\'] endpoint:127.0.0.1:29884 rank:1"]'] job_stage_flag:None hdfs:None
INFO 2020-07-03 15:10:13,231 utils.py:370] start trainer proc:['/home/work/chenxuyi/playground/ernie-tiny/app/bin/python3', '-u', 'test.py'] env:{'FLAGS_selected_gpus': '0', 'PADDLE_TRAINER_ID': '0', 'PADDLE_CURRENT_ENDPOINT': '127.0.0.1:43290', 'PADDLE_TRAINERS_NUM': '2', 'PADDLE_TRAINER_ENDPOINTS': '127.0.0.1:43290,127.0.0.1:29884'}
INFO 2020-07-03 15:10:13,237 utils.py:370] start trainer proc:['/home/work/chenxuyi/playground/ernie-tiny/app/bin/python3', '-u', 'test.py'] env:{'FLAGS_selected_gpus': '1', 'PADDLE_TRAINER_ID': '1', 'PADDLE_CURRENT_ENDPOINT': '127.0.0.1:29884', 'PADDLE_TRAINERS_NUM': '2', 'PADDLE_TRAINER_ENDPOINTS': '127.0.0.1:43290,127.0.0.1:29884'}
W0703 15:10:15.122733 186931 device_context.cc:252] Please NOTE: device: 1, CUDA Capability: 61, Driver API Version: 10.2, Runtime API Version: 10.0
W0703 15:10:15.128784 186931 device_context.cc:260] device: 1, cuDNN Version: 7.6.
W0703 15:10:15.209643 186930 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.2, Runtime API Version: 10.0
W0703 15:10:15.215517 186930 device_context.cc:260] device: 0, cuDNN Version: 7.6.
I0703 15:10:17.459383 186930 nccl_context.cc:127] init nccl context nranks: 2 local rank: 0 gpu id: 0
I0703 15:10:17.459421 186931 nccl_context.cc:127] init nccl context nranks: 2 local rank: 1 gpu id: 1
0.001723706
0.0008732666
0.087326676
0.087326676
As the output shows, even though the two workers print different values for print(model2.fc.weight.numpy().mean()) (the two model2 instances are initialized independently), the forward outputs they produce are nevertheless identical.
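For reference, with an all-ones [10, 100] input and the (presumably) zero-initialized bias of D.Linear, the mean of the fc output should be roughly 100 × the mean of the weight, so the two workers would be expected to print different forward means:

    # Expected forward-output means, given mean(output) ≈ 100 * mean(weight)
    # for an all-ones input and a zero bias (assumption about the default init):
    print(100 * 0.001723706)    # ≈ 0.1723706  -- expected on one worker
    print(100 * 0.0008732666)   # ≈ 0.08732666 -- expected on the other worker
    # yet both workers actually printed 0.087326676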