Unverified commit 97315e3d authored by Lyon, committed by GitHub

Dev test paddle dali (#78)

* update paddle-dali

* update scripts for paddle-dali

* update paddle-readme
Parent c661a155
......@@ -48,40 +48,32 @@ cd models/PaddleCV/image_classification
Place all the scripts from this page's `scripts` folder into the `models/PaddleCV/image_classification/` directory.
## Framework Installation
## Dependency Installation
### Framework
```shell
python3 -m pip install paddlepaddle-gpu==1.8.3.post107 -i https://mirror.baidu.com/pypi/simple
```
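After installing, a quick sanity check is to print the version string from the interpreter you will run with (a minimal sketch; it prints a notice rather than failing on machines where paddle is absent):

```python
# Confirm the installed PaddlePaddle build is visible to this interpreter.
# A sketch: it only reads the version string, nothing GPU-related runs.
try:
    import paddle
    print("paddle", paddle.__version__)
except ImportError:
    print("paddle is not installed")
```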
## DALI
### DALI
1. This test uses a conda environment to manage the test setup. Activate the paddle conda environment and download the DALI_extra source:
```
conda activate paddle
git clone https://github.com/NVIDIA/DALI_extra.git
```
2. Install the nightly DALI build with the python inside the conda environment:
```
/home/leinao/anaconda3/envs/paddle/bin/python -m pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/nightly nvidia-dali-nightly-cuda100
```
```
python3 -m pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/nightly nvidia-dali-nightly-cuda100
```
3. Set:
```
export FLAGS_fraction_of_gpu_memory_to_use=0.8
```
Otherwise you may hit the error:
```
AssertionError: Please leave enough GPU memory for DALI workspace, e.g., by setting `export FLAGS_fraction_of_gpu_memory_to_use=0.8
```
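For context, `FLAGS_fraction_of_gpu_memory_to_use` caps the fraction of device memory that Paddle pre-allocates, and lowering it to 0.8 leaves headroom for the DALI workspace. A rough illustration of the split on a 16 GB card (the V100 memory size used in this report):

```python
# Rough memory split implied by FLAGS_fraction_of_gpu_memory_to_use=0.8
# on a 16 GB card (illustrative arithmetic only).
total_gib = 16.0
fraction = 0.8

paddle_pool_gib = total_gib * fraction           # pre-allocated by Paddle
dali_headroom_gib = total_gib - paddle_pool_gib  # left for the DALI workspace

print(f"Paddle pool: {paddle_pool_gib:.1f} GiB, DALI headroom: {dali_headroom_gib:.1f} GiB")
# -> Paddle pool: 12.8 GiB, DALI headroom: 3.2 GiB
```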
## NCCL
Paddle's distributed training depends on the NCCL library underneath. Download and install an NCCL build that matches your operating system and CUDA version from the [NVIDIA NCCL download page](https://developer.nvidia.com/nccl/nccl-download).
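Once NCCL is installed, one way to confirm the runtime can find it is to load the shared library and query its version (a sketch assuming NCCL 2.x on Linux, where the library is `libnccl.so.2`):

```python
import ctypes

def nccl_version():
    """Return NCCL's version as an int (e.g. 2708 for 2.7.8), or None if unavailable."""
    try:
        lib = ctypes.CDLL("libnccl.so.2")
        get_version = lib.ncclGetVersion  # exported since NCCL 2.3.4
    except (OSError, AttributeError):
        return None
    ver = ctypes.c_int()
    get_version(ctypes.byref(ver))
    return ver.value

print("NCCL version:", nccl_version())
```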
......@@ -120,7 +112,7 @@ bash run_single_node.sh
Run 5 groups of tests each for single-node 1-GPU, 4-GPU, and 8-GPU configurations; by default, fp32 precision and batch_size=128 are tested.
### Mixed Precision
### Mixed Precision & DALI
fp16 and batch_size can be specified via arguments:
......@@ -130,6 +122,8 @@ bash run_single_node.sh 128 fp16
You can also pick the precision and batch_size yourself: `bash run_single_node.sh 64 fp16`, `bash run_single_node.sh 32 fp32`
DALI is disabled by default. To enable it, set the variable `USE_DALI=true` in the script single_node_train.sh.
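The run scripts presumably read the precision and batch size as positional arguments with defaults; a minimal sketch of that pattern (an assumed illustration, not the actual script):

```shell
#!/bin/bash
# Positional-argument handling matching the documented invocations:
#   bash run_single_node.sh            (defaults)
#   bash run_single_node.sh 64 fp16    (explicit batch size and precision)
BATCH_SIZE=${1:-128}   # first argument, defaulting to 128
DTYPE=${2:-fp32}       # second argument, defaulting to fp32

echo "batch_size=${BATCH_SIZE} dtype=${DTYPE}"
# with no arguments this prints: batch_size=128 dtype=fp32
```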
## 2 Nodes, 16 GPUs
......@@ -144,7 +138,7 @@ bash run_two_node.sh
In the `models/PaddleCV/image_classification/` directory on the NODE2 machine, set `CURRENT_NODE=$NODE2` in the run_two_node.sh script, then run `bash run_two_node.sh` to start the 2-node, 16-GPU training (likewise 5 groups by default, fp32 precision, batch_size=128).
### Mixed Precision
### Mixed Precision & DALI
fp16 and batch_size can be specified via arguments:
......@@ -152,7 +146,7 @@ In the NODE2 node's `models/PaddleCV/image_classification/` directory, modify run_two_node
bash run_two_node.sh 256 fp16
```
DALI is disabled by default. To enable it, set the variable `USE_DALI=true` in the script multi_node_train.sh.
## 4 Nodes, 32 GPUs
......@@ -164,7 +158,7 @@ bash run_multi_node.sh
This runs the 4-node, 32-GPU training; 5 groups are tested by default (fp32 precision, batch_size=128).
### Mixed Precision
### Mixed Precision & DALI
fp16 and batch_size can be specified via arguments:
......@@ -172,7 +166,7 @@ bash run_multi_node.sh
bash run_multi_node.sh 256 fp16
```
DALI is disabled by default. To enable it, set the variable `USE_DALI=true` in the script multi_node_train.sh.
# Result
......@@ -306,12 +300,28 @@ The README shows the numbers computed by extract_paddle_logs.py.
| 2 | 16 | 6358.43 | 6.15 |
| 4 | 32 | 10633.22 | 10.2 |
- Note: this test used the better-performing reader_thread settings (8 for single-node 1 GPU and 8 GPUs, 12 for 4 GPUs, 8 for all multi-node runs). However, because DALI and the paddle-dali plugin had not been installed successfully, the multi-GPU/multi-node speedup is still poor; for numbers with DALI enabled, see: [Paddle official fp16+dali results](https://github.com/PaddlePaddle/models/tree/release/1.8/PaddleCV/image_classification#%E6%B7%B7%E5%90%88%E7%B2%BE%E5%BA%A6%E8%AE%AD%E7%BB%83)
### batch size = 196 & with dali & without xla
| node_num | gpu_num | samples/s | speedup |
| -------- | ------- | --------- | ------- |
| 1 | 1 | 887.17 | 1 |
| 1 | 4 | 3598.0 | 4.06 |
| 1 | 8 | 6862.17 | 7.73 |
| 2 | 16 | 6018.46 | 6.78 |
| 4 | 32 | 11617.57 | 13.1 |
Notes:
- The maximum batch size in this evaluation could not reach the official Paddle value of 256: the official tests ran on Tesla V100 GPUs with 32 GB of memory, while ours have 16 GB, so batch_size=256 runs out of memory (OOM).
- The multi-node speedup we measured is low, possibly because the scripts and configuration provided by Paddle are not optimal. Since the original evaluation's container image was not published, we could not align our setup with theirs, and our numbers differ substantially from the official ones: [Paddle official fp16+dali results](https://github.com/PaddlePaddle/models/tree/release/1.8/PaddleCV/image_classification#%E6%B7%B7%E5%90%88%E7%B2%BE%E5%BA%A6%E8%AE%AD%E7%BB%83)
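The speedup column in these tables is simply throughput divided by the single-GPU baseline (887.17 samples/s in the batch-size-196 table); recomputing it:

```python
# Recompute the speedup column of the batch-size-196 DALI table:
# speedup = throughput / single-GPU throughput.
baseline = 887.17  # samples/s, 1 node x 1 GPU

throughputs = {
    (1, 4): 3598.0,
    (1, 8): 6862.17,
    (2, 16): 6018.46,
    (4, 32): 11617.57,
}

for (nodes, gpus), samples_per_s in sorted(throughputs.items()):
    # Printed values match the table: 4.06, 7.73, 6.78, 13.1
    print(f"{nodes} node(s), {gpus} GPU(s): speedup = {samples_per_s / baseline:.2f}")
```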
## Full Logs
- [resnet50-fp32.zip](https://oneflow-public.oss-cn-beijing.aliyuncs.com/DLPerf/logs/PaddlePaddle/resnet50/resnet50_fp32.zip)
- [resnet50-fp16.zip](https://oneflow-public.oss-cn-beijing.aliyuncs.com/DLPerf/logs/PaddlePaddle/resnet50/resnet50_fp16.zip)
- [resnet50-fp16-dali.zip](https://oneflow-public.oss-cn-beijing.aliyuncs.com/DLPerf/logs/PaddlePaddle/resnet50/resnet50-fp16-dali.zip)
......@@ -20,16 +20,7 @@ mkdir -p $LOG_FOLDER
LOGFILE=${LOG_FOLDER}/r50_b${BATCH_SIZE}_${DTYPE}_${TEST_NUM}.log
export CUDA_VISIBLE_DEVICES=${gpus}
export FLAGS_fraction_of_gpu_memory_to_use=0.98
DATA_DIR=/datasets/ImageNet/imagenet_1k/
if [ $node_num -le 2 ] ; then
THREAD=8
else
THREAD=8
fi
DATA_DIR=/datasets/ImageNet/Paddle
# bash run.sh train ResNet50_fp16
......@@ -44,12 +35,25 @@ else
FP16_PARAMS=" "
fi
USE_DALI=false
if ${USE_DALI}; then
export FLAGS_fraction_of_gpu_memory_to_use=0.8
export DALI_EXTRA_PATH=/home/leinao/paddle/DALI_extra
THREAD=10
else
export FLAGS_fraction_of_gpu_memory_to_use=0.98
THREAD=8
fi
echo "FLAGS_fraction_of_gpu_memory_to_use=$FLAGS_fraction_of_gpu_memory_to_use"
echo "Nodes : $nodes"
echo "Use gpus: $gpus, Batch size per device : $BATCH_SIZE, Total Batch size : $total_bz"
echo "Learning rate: $LR"
echo "Use fp16 : $use_fp16"
export CUDA_VISIBLE_DEVICES=${gpus}
python3 -m paddle.distributed.launch --cluster_node_ips=${nodes} \
--node_ip=$CURRENT_NODE \
train.py \
......
......@@ -16,4 +16,4 @@ do
echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
let i++
sleep 20
done
\ No newline at end of file
done
......@@ -32,4 +32,4 @@ do
echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
let i++
sleep 20
done
\ No newline at end of file
done
......@@ -7,7 +7,7 @@ NODE2='10.11.0.3'
CURRENT_NODE=$NODE1
i=3
i=1
while [ $i -le 5 ]
do
bash $SHELL_FOLDER/multi_node_train.sh $MODEL 0,1,2,3,4,5,6,7 ${BATCH_SIZE} 224 $NODE1,$NODE2 $CURRENT_NODE $i $DTYPE
......
......@@ -16,10 +16,7 @@ mkdir -p $LOG_FOLDER
LOGFILE=${LOG_FOLDER}/r50_b${BATCH_SIZE}_${DTYPE}_${TEST_NUM}.log
export CUDA_VISIBLE_DEVICES=${gpus}
export FLAGS_fraction_of_gpu_memory_to_use=0.98
export DALI_EXTRA_PATH=/home/leinao/paddle/DALI_extra
DATA_DIR=/datasets/ImageNet/imagenet_1k/
DATA_DIR=/datasets/ImageNet/Paddle
MULTI_PROCESS="-m paddle.distributed.launch"
if [ $GPU_COUNT -le 2 ] ; then
......@@ -43,10 +40,23 @@ else
FP16_PARAMS=" "
fi
USE_DALI=false
if ${USE_DALI}; then
export FLAGS_fraction_of_gpu_memory_to_use=0.8
export DALI_EXTRA_PATH=/home/leinao/paddle/DALI_extra
THREAD=10
else
export FLAGS_fraction_of_gpu_memory_to_use=0.98
fi
echo "FLAGS_fraction_of_gpu_memory_to_use=$FLAGS_fraction_of_gpu_memory_to_use"
echo "Use gpus: $gpus, Batch size per device : $BATCH_SIZE, Total Batch size : $total_bz"
echo "Learning rate: $LR"
# echo "Use fp16 : $use_fp16"
export CUDA_VISIBLE_DEVICES=${gpus}
python3 $MULTI_PROCESS \
train.py ${FP16_PARAMS} \
--data_format=${DATA_FORMAT} \
......@@ -58,8 +68,8 @@ python3 $MULTI_PROCESS \
--batch_size=${total_bz} \
--print_step=1 \
--save_step=10000 \
--reader_thread=$THREAD \
--lr_strategy=piecewise_decay \
--reader_thread=${THREAD} \
--lr_strategy=cosine_decay \
--lr=0.001 \
--momentum_rate=0.875 \
--image_shape 3 $IMAGE_SIZE $IMAGE_SIZE \
......@@ -69,5 +79,6 @@ python3 $MULTI_PROCESS \
--warm_up_epochs=1 \
--use_mixup=False \
--use_label_smoothing=True \
--use_dali=$USE_DALI \
--label_smoothing_epsilon=0.1 2>&1 | tee ${LOGFILE}
echo "Writing log to ${LOGFILE}"
\ No newline at end of file
echo "Writing log to ${LOGFILE}"