Training in multiprocessing.Process failed to run startup program when using GPU (#20169) · Issue · PaddlePaddle / Paddle

Training in multiprocessing.Process failed to run startup program when using GPU

Created by: xueeinstein

I built a complicated reinforcement learning model in Paddle that trains two different models in parallel. One of them needs to sync the latest weights from the other in order to generate target values for training. I implemented the second model in a multiprocessing.Process.
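The two-process setup described above can be sketched with the standard library alone. This is a minimal illustration, not the code from this report: the names `learner_process`, `run_sync_demo`, and the list-of-floats "weights" are hypothetical stand-ins, and no Paddle calls are involved. The sketch requests the "fork" start method explicitly because the report concerns a forked process (that start method is only available on POSIX systems).

```python
import multiprocessing as mp


def learner_process(weight_queue, num_steps):
    """Hypothetical learner: 'trains' and publishes each new weight version."""
    weights = [0.0, 0.0]
    for step in range(1, num_steps + 1):
        # Stand-in for one training step; a real learner would run its model here.
        weights = [w + 1.0 for w in weights]
        weight_queue.put((step, weights))
    # Sentinel telling the consumer that training has finished.
    weight_queue.put(None)


def run_sync_demo(num_steps=3):
    """Start the learner in a forked child and keep only its latest weights."""
    ctx = mp.get_context("fork")  # assumption: POSIX, matching the forked setup
    weight_queue = ctx.Queue()
    process = ctx.Process(target=learner_process, args=(weight_queue, num_steps))
    process.start()

    latest = None
    while True:
        item = weight_queue.get(timeout=30)
        if item is None:
            break
        latest = item  # newer weights replace older ones

    process.join()
    return latest, process.exitcode
```

Calling `run_sync_demo(3)` returns the last published `(step, weights)` pair together with the child's exit code. The queue is drained before `join()` so the child is never blocked on unflushed data.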

However, the freshly forked process (no other Paddle program had been created before the fork) fails to run the startup program when using the GPU, but succeeds when using the CPU.

To make debugging easier, I reproduced the bug in the following unit test cases:

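import multiprocessing

import numpy as np
import paddle.fluid as fluid

# machine_info, DynamicsModel and DynamicsModelLearner are project-specific
# names whose definitions are not shown in this report.


class TestDynamicsModel(object):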
    def test_learner(self):
        self.learner_process()

    def test_learner_process(self):
        process = multiprocessing.Process(target=self.learner_process)
        process.start()
        process.join()
        assert process.exitcode == 0

    def learner_process(self):
        bs = 4
        if machine_info.is_gpu_available():
            place = fluid.CUDAPlace(0)
        else:
            place = fluid.CPUPlace()
        # place = fluid.CPUPlace()
        exe = fluid.Executor(place)

        main, startup = fluid.Program(), fluid.Program()
        with fluid.program_guard(main, startup):
            env = DynamicsModel("env", self.env_config, self.learner_config)
            learner = DynamicsModelLearner(env)
            obs = fluid.layers.data("obs", self.env_config["obs_dims"],
                                    dtype="float32")
            next_obs = fluid.layers.data(
                "next_obs", self.env_config["obs_dims"], dtype="float32")
            actions = fluid.layers.data(
                "actions", [self.env_config["action_dim"]], dtype="float32")
            rewards = fluid.layers.data("rewards", [], dtype="float32")
            dones = fluid.layers.data("dones", [], dtype="bool")
            losses = learner.learn(obs, next_obs, actions, rewards, dones)

        exe.run(startup)
        obs_shape = [bs, *self.env_config["obs_dims"]]
        act_shape = [bs, self.env_config["action_dim"]]
        feed_dict = {
            "obs": np.random.random(obs_shape).astype(np.float32),
            "next_obs": np.random.random(obs_shape).astype(np.float32),
            "actions": np.random.random(act_shape).astype(np.float32),
            "rewards": np.random.random([bs]).astype(np.float32),
            "dones": np.zeros([bs]).astype(bool)
        }
        inspects = exe.run(program=main, feed=feed_dict, fetch_list=losses)
        assert len(inspects) == 4

Results: The test case test_learner passes on both CPU and GPU, which means the process target function itself works.

The test case test_learner_process passes only on CPU. On GPU, it raises an error while initializing the learning rate variable. See the attachment for details: out.txt
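To compare how the same target behaves under different process start methods, a small helper along the following lines could be used. This is only a diagnostic sketch. The report does not establish that the start method is the cause of the failure. The names `exit_code_in_child`, `healthy_target`, and `failing_target` are hypothetical, and the two targets are stand-ins for `learner_process`.

```python
import multiprocessing as mp


def healthy_target():
    """Stand-in for a target function that completes normally."""
    pass


def failing_target():
    """Stand-in for a target that raises, for example during startup."""
    raise RuntimeError("simulated startup failure")


def exit_code_in_child(target, start_method="fork"):
    """Run target in a child process created with the given start method.

    Returns the child's exit code: 0 on success, non-zero if the target
    raised an exception or the process was terminated.
    """
    ctx = mp.get_context(start_method)
    process = ctx.Process(target=target)
    process.start()
    process.join()
    return process.exitcode
```

Passing `"spawn"` instead of `"fork"` starts a fresh interpreter for the child. That requires the target to be importable from the main module and the calling script to guard its entry point with `if __name__ == "__main__":`.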
