Commit 934f1f67, authored by typhoonzero

refine dist image classification

Parent commit: 9a4f5786
...@@ -7,13 +7,15 @@ large-scaled distributed training with two distributed mode: parameter server mo

Before getting started, please make sure you have gone through the ImageNet [Data Preparation](../README.md#data-preparation).

1. The entrypoint file is `dist_train.py`. Its command-line arguments are almost the same as those of the original `train.py`, with the following arguments specific to distributed training:

    - `update_method`, specify the update method; choose from local, pserver or nccl2.
    - `multi_batch_repeat`, set this greater than 1 to merge batches before pushing gradients to pservers.
    - `start_test_pass`, when to start running tests.
    - `num_threads`, how many threads will be used for ParallelExecutor.
    - `split_var`, in pserver mode, whether to split one parameter onto several pservers; defaults to True.
    - `async_mode`, do asynchronous training; defaults to False.
    - `reduce_strategy`, choose from "reduce" or "allreduce".

    An illustrative invocation is sketched below; you can check out more details of the flags by `python dist_train.py --help`.
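For example, a hypothetical invocation that combines a few of these flags with the usual `train.py` arguments (the flag values below are illustrative only, and pserver mode additionally needs the environment variables described in the next section):

``` bash
# Illustrative only; consult `python dist_train.py --help` for the full list.
python dist_train.py \
    --model DistResNet \
    --batch_size 32 \
    --update_method pserver \
    --multi_batch_repeat 2 \
    --start_test_pass 0 \
    --num_threads 4
```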
...@@ -21,66 +23,27 @@ Before getting started, please make sure you have gone through the ImageNet [Data Preparation](../README.md#data-preparation).

We use environment variables to distinguish the different training roles of a distributed training job.

- General envs:
    - `PADDLE_TRAINER_ID`, the unique trainer ID of a job, in the range [0, PADDLE_TRAINERS_NUM).
    - `PADDLE_TRAINERS_NUM`, the trainer count of a distributed job.
    - `PADDLE_CURRENT_ENDPOINT`, the current process endpoint.
- Pserver mode:
    - `PADDLE_TRAINING_ROLE`, the current training role, should be in [PSERVER, TRAINER].
    - `PADDLE_PSERVER_ENDPOINTS`, the parameter server endpoint list, separated by ",".
- NCCL2 mode:
    - `PADDLE_TRAINER_ENDPOINTS`, the endpoint list of all workers, separated by ",".
### Try Out Different Distributed Training Modes

You can test whether distributed training works on a single node before deploying to the "real" cluster.

***NOTE: for best performance, we recommend using multi-process mode (No. 4 in the list below) together with fp16.***

1. Simply run `python dist_train.py` to start local training with the default configuration.
2. For pserver mode, run `bash run_ps_mode.sh` to start 2 pservers and 2 trainers; these 2 trainers will use GPU 0 and 1 to simulate 2 workers.
3. For nccl2 mode, run `bash run_nccl2_mode.sh` to start 2 workers.
4. For local/distributed multi-process mode, run `bash run_mp_mode.sh` (this test uses 4 GPUs).
### Visualize the Training Process

...@@ -88,16 +51,10 @@ It's easy to draw the learning curve according to the training logs, for example

the logs of ResNet50 are as follows:

``` text
Pass 0, batch 30, loss 7.569439, acc1: 0.0125, acc5: 0.0125, avg batch time 0.1720
Pass 0, batch 60, loss 7.027379, acc1: 0.0, acc5: 0.0, avg batch time 0.1551
Pass 0, batch 90, loss 6.819984, acc1: 0.0, acc5: 0.0125, avg batch time 0.1492
Pass 0, batch 120, loss 6.9076853, acc1: 0.0, acc5: 0.0125, avg batch time 0.1464
```
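A minimal sketch of turning such log lines into a curve; it assumes `matplotlib` is installed and that the log was written to `logs/tr0.log` (the path used by the test scripts below), both of which are assumptions rather than part of this commit:

``` python
import re
import matplotlib.pyplot as plt

batches, losses = [], []
with open("logs/tr0.log") as f:
    for line in f:
        # lines look like: "Pass 0, batch 30, loss 7.569439, acc1: ..."
        m = re.search(r"batch (\d+), loss ([\d.]+)", line)
        if m:
            batches.append(int(m.group(1)))
            losses.append(float(m.group(2)))

plt.plot(batches, losses)
plt.xlabel("batch")
plt.ylabel("train loss")
plt.savefig("loss_curve.png")
```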
The figure below shows top-1 train accuracy for local training with 8 GPUs and distributed training
......
import numpy as np
import paddle.fluid as fluid


def copyback_repeat_bn_params(main_prog):
    # Collect the Mean/Variance inputs of every batch_norm op, then copy the
    # statistics accumulated in the ".repeat.0" duplicates back into the
    # original variables so that evaluation uses the updated values.
    repeat_vars = set()
    for op in main_prog.global_block().ops:
        if op.type == "batch_norm":
            repeat_vars.add(op.input("Mean")[0])
            repeat_vars.add(op.input("Variance")[0])
    for vname in repeat_vars:
        real_var = fluid.global_scope().find_var("%s.repeat.0" % vname).get_tensor()
        orig_var = fluid.global_scope().find_var(vname).get_tensor()
        orig_var.set(np.array(real_var), fluid.CUDAPlace(0))  # test on GPU0


def append_bn_repeat_init_op(main_prog, startup_prog, num_repeats):
    # Create "<name>.repeat.<i>" copies of every batch_norm Mean/Variance
    # variable and append matching fill_constant initializers to the startup
    # program, one copy per repeated batch.
    repeat_vars = set()
    for op in main_prog.global_block().ops:
        if op.type == "batch_norm":
            repeat_vars.add(op.input("Mean")[0])
            repeat_vars.add(op.input("Variance")[0])

    for i in range(num_repeats):
        for op in startup_prog.global_block().ops:
            if op.type == "fill_constant":
                for oname in op.output_arg_names:
                    if oname in repeat_vars:
                        var = startup_prog.global_block().var(oname)
                        repeat_var_name = "%s.repeat.%d" % (oname, i)
                        repeat_var = startup_prog.global_block().create_var(
                            name=repeat_var_name,
                            type=var.type,
                            dtype=var.dtype,
                            shape=var.shape,
                            persistable=var.persistable)
                        main_prog.global_block()._clone_variable(repeat_var)
                        startup_prog.global_block().append_op(
                            type="fill_constant",
                            inputs={},
                            outputs={"Out": repeat_var},
                            attrs=op.all_attrs())
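These helpers appear to support the `multi_batch_repeat` flag; the following is a purely hypothetical sketch of how an entrypoint might wire them in (the variable names and call sites are assumptions, not code taken from `dist_train.py`):

``` python
# Hypothetical wiring, assuming train_prog, startup_prog, test_prog and an
# executor `exe` already exist.
if args.multi_batch_repeat > 1:
    # create the per-repeat batch_norm statistics before initialization
    append_bn_repeat_init_op(train_prog, startup_prog, args.multi_batch_repeat)
exe.run(startup_prog)

# ... training passes ...

if args.multi_batch_repeat > 1:
    # fold the repeated batch_norm statistics back before evaluation
    copyback_repeat_bn_params(train_prog)
# ... run test_prog ...
```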
import os

import paddle.fluid as fluid


def nccl2_prepare(args, startup_prog):
    # Transpile the startup program for NCCL2 (collective) mode so that each
    # worker knows its own endpoint and the full worker endpoint list.
    config = fluid.DistributeTranspilerConfig()
    config.mode = "nccl2"
    t = fluid.DistributeTranspiler(config=config)

    envs = args.dist_env
    t.transpile(envs["trainer_id"],
                trainers=','.join(envs["trainer_endpoints"]),
                current_endpoint=envs["current_endpoint"],
                startup_program=startup_prog)


def pserver_prepare(args, train_prog, startup_prog):
    # Transpile for parameter server mode and return the program pair that the
    # current role (PSERVER or TRAINER) should actually run.
    config = fluid.DistributeTranspilerConfig()
    config.slice_var_up = args.split_var
    t = fluid.DistributeTranspiler(config=config)

    envs = args.dist_env
    training_role = envs["training_role"]

    t.transpile(
        envs["trainer_id"],
        program=train_prog,
        pservers=envs["pserver_endpoints"],
        trainers=envs["num_trainers"],
        sync_mode=not args.async_mode,
        startup_program=startup_prog)
    if training_role == "PSERVER":
        pserver_program = t.get_pserver_program(envs["current_endpoint"])
        pserver_startup_program = t.get_startup_program(
            envs["current_endpoint"], pserver_program, startup_program=startup_prog)
        return pserver_program, pserver_startup_program
    elif training_role == "TRAINER":
        train_program = t.get_trainer_program()
        return train_program, startup_prog
    else:
        raise ValueError(
            'PADDLE_TRAINING_ROLE environment variable must be either TRAINER or PSERVER'
        )
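A minimal sketch of how a training script could dispatch on the `update_method` flag using these helpers (the surrounding names `args`, `train_prog`, and `startup_prog` are assumptions for illustration, not the exact code in `dist_train.py`):

``` python
# Hypothetical dispatch; "local" mode needs no transpilation at all.
if args.update_method == "pserver":
    # returns either the pserver programs or the trainer program,
    # depending on PADDLE_TRAINING_ROLE
    prog_to_run, startup_to_run = pserver_prepare(args, train_prog, startup_prog)
elif args.update_method == "nccl2":
    nccl2_prepare(args, startup_prog)  # rewrites startup_prog in place
    prog_to_run, startup_to_run = train_prog, startup_prog
else:
    prog_to_run, startup_to_run = train_prog, startup_prog
```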
import os


def dist_env():
    """
    Return a dict of all the variables that distributed training may use.
    NOTE: you may rewrite this function to suit your cluster environments.
    """
    trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
    num_trainers = 1
    training_role = os.getenv("PADDLE_TRAINING_ROLE", "TRAINER")
    assert (training_role == "PSERVER" or training_role == "TRAINER")

    # - PADDLE_TRAINER_ENDPOINTS means nccl2 mode.
    # - PADDLE_PSERVER_ENDPOINTS means pserver mode.
    # - PADDLE_CURRENT_ENDPOINT means current process endpoint.
    trainer_endpoints = os.getenv("PADDLE_TRAINER_ENDPOINTS")
    pserver_endpoints = os.getenv("PADDLE_PSERVER_ENDPOINTS")
    current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
    if trainer_endpoints:
        trainer_endpoints = trainer_endpoints.split(",")
        num_trainers = len(trainer_endpoints)
    elif pserver_endpoints:
        num_trainers = int(os.getenv("PADDLE_TRAINERS_NUM"))

    return {
        "trainer_id": trainer_id,
        "num_trainers": num_trainers,
        "current_endpoint": current_endpoint,
        "training_role": training_role,
        "pserver_endpoints": pserver_endpoints,
        "trainer_endpoints": trainer_endpoints
    }
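For a concrete picture, this is roughly what `dist_env()` returns for trainer 0 under the environment exported by the nccl2 test script shown later; setting the variables inline here is only for illustration, and assumes `dist_env` is in scope:

``` python
import os

os.environ.update({
    "PADDLE_TRAINING_ROLE": "TRAINER",
    "PADDLE_TRAINER_ENDPOINTS": "127.0.0.1:7160,127.0.0.1:7161",
    "PADDLE_CURRENT_ENDPOINT": "127.0.0.1:7160",
    "PADDLE_TRAINER_ID": "0",
})
print(dist_env())
# {'trainer_id': 0, 'num_trainers': 2,
#  'current_endpoint': '127.0.0.1:7160', 'training_role': 'TRAINER',
#  'pserver_endpoints': None,
#  'trainer_endpoints': ['127.0.0.1:7160', '127.0.0.1:7161']}
```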
#!/bin/bash
# Multi-process single-node test: one trainer process per GPU, all using the
# nccl2 update method (see run_mp_mode.sh in the README).

# Test using 4 GPUs
export CUDA_VISIBLE_DEVICES="0,1,2,3"
export MODEL="DistResNet"
export PADDLE_TRAINER_ENDPOINTS="127.0.0.1:7160,127.0.0.1:7161,127.0.0.1:7162,127.0.0.1:7163"
# PADDLE_TRAINERS_NUM is only used by the reader in nccl2 mode.
export PADDLE_TRAINERS_NUM="4"

mkdir -p logs

for i in {0..3}
do
PADDLE_TRAINING_ROLE="TRAINER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:716${i}" \
PADDLE_TRAINER_ID="${i}" \
FLAGS_selected_gpus="${i}" \
python dist_train.py --model $MODEL --update_method nccl2 --batch_size 32 --fp16 1 --scale_loss 8 &> logs/tr$i.log &
done
#!/bin/bash
# NCCL2 single-node test: two single-GPU workers (see run_nccl2_mode.sh in the README).

export MODEL="DistResNet"
export PADDLE_TRAINER_ENDPOINTS="127.0.0.1:7160,127.0.0.1:7161"
# PADDLE_TRAINERS_NUM is only used by the reader in nccl2 mode.
export PADDLE_TRAINERS_NUM="2"

mkdir -p logs

# NOTE: set NCCL_P2P_DISABLE so that nccl2 distributed training can run on a single node.
PADDLE_TRAINING_ROLE="TRAINER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7160" \
PADDLE_TRAINER_ID="0" \
CUDA_VISIBLE_DEVICES="0" \
NCCL_P2P_DISABLE="1" \
python dist_train.py --model $MODEL --update_method nccl2 --batch_size 32 &> logs/tr0.log &

PADDLE_TRAINING_ROLE="TRAINER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7161" \
PADDLE_TRAINER_ID="1" \
CUDA_VISIBLE_DEVICES="1" \
NCCL_P2P_DISABLE="1" \
python dist_train.py --model $MODEL --update_method nccl2 --batch_size 32 &> logs/tr1.log &
#!/bin/bash
# Pserver-mode single-node test: two parameter servers plus two single-GPU
# trainers (see run_ps_mode.sh in the README).

export MODEL="DistResNet"
export PADDLE_PSERVER_ENDPOINTS="127.0.0.1:7160,127.0.0.1:7161"
export PADDLE_TRAINERS_NUM="2"

mkdir -p logs

PADDLE_TRAINING_ROLE="PSERVER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7160" \
python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/ps0.log &

PADDLE_TRAINING_ROLE="PSERVER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7161" \
python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/ps1.log &

PADDLE_TRAINING_ROLE="TRAINER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7160" \
PADDLE_TRAINER_ID="0" \
CUDA_VISIBLE_DEVICES="0" \
python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/tr0.log &

PADDLE_TRAINING_ROLE="TRAINER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7161" \
PADDLE_TRAINER_ID="1" \
CUDA_VISIBLE_DEVICES="1" \
python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/tr1.log &
...@@ -14,8 +14,9 @@ train_parameters = {
    "learning_strategy": {
        "name": "piecewise_decay",
        "batch_size": 256,
        "epochs": [30, 60, 80],
        "steps": [0.1, 0.01, 0.001, 0.0001],
        "warmup_passes": 5
    }
}
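For context, a minimal sketch of how such `learning_strategy` fields typically map to a step-based piecewise decay schedule; the `total_images` value and the use of `fluid.layers.piecewise_decay` here are assumptions for illustration, not code from this commit:

``` python
import paddle.fluid as fluid

params = {"batch_size": 256, "epochs": [30, 60, 80],
          "steps": [0.1, 0.01, 0.001, 0.0001], "warmup_passes": 5}

total_images = 1281167                      # ImageNet train set size (assumption)
steps_per_pass = total_images // params["batch_size"]

# Decay the learning rate at passes 30, 60 and 80, expressed in iterations.
boundaries = [e * steps_per_pass for e in params["epochs"]]
values = params["steps"]                    # one value per interval

lr = fluid.layers.piecewise_decay(boundaries=boundaries, values=values)
# "warmup_passes": 5 would additionally ramp the learning rate up over the
# first 5 passes, e.g. via fluid.layers.linear_lr_warmup in newer Fluid versions.
```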
...@@ -118,3 +119,4 @@ class DistResNet():
        short = self.shortcut(input, num_filters * 4, stride)
        return fluid.layers.elementwise_add(x=short, y=conv2, act='relu')
...@@ -139,7 +139,7 @@ def _reader_creator(file_list,
    if mode == 'train' and os.getenv('PADDLE_TRAINING_ROLE'):
        # distributed mode if the env var `PADDLE_TRAINING_ROLE` exists
        trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
        trainer_count = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
        per_node_lines = len(full_lines) // trainer_count
        lines = full_lines[trainer_id * per_node_lines:(trainer_id + 1)
                           * per_node_lines]
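A quick illustration of the sharding arithmetic in the hunk above (the file-list contents here are made up):

``` python
# With 10 file-list lines and 4 trainers, each trainer reads a disjoint slice.
full_lines = ["img_%d.jpeg %d" % (i, i) for i in range(10)]
trainer_count = 4
per_node_lines = len(full_lines) // trainer_count   # 2
for trainer_id in range(trainer_count):
    shard = full_lines[trainer_id * per_node_lines:(trainer_id + 1) * per_node_lines]
    print(trainer_id, shard)
# Trainer 0 gets lines 0-1, trainer 1 gets 2-3, and so on; the last
# 10 % 4 = 2 lines are dropped by this integer-division split.
```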
......