Commit d3d6748f authored by liangjianzhong

revise the timeline tool document for fleet training

Parent 3d298e3c
......@@ -60,9 +60,19 @@ python Paddle/tools/timeline.py --profile_path=/tmp/profile --timeline_path=time
## Distributed usage
Generally, a distributed training job consists of two kinds of programs: pserver and trainer. We provide a way to display the profile logs of both pserver and trainer on the timeline (a sketch of how the per-role profile files can be merged is given after the steps below).
1. Enabling the trainer profiler is the same as step 1 of the [local usage](#local) section.
1. Enabling the trainer profiler is basically the same as step 1 of the [local usage](#local) section, but because there are multiple trainers, each trainer needs to be distinguished (see the loop sketch after this list). For example:
```python
import os
import paddle.fluid.profiler as profiler

# PADDLE_TRAINER_ID, or any other method that yields the unique id of the current trainer
trainer_id = int(os.environ.get('PADDLE_TRAINER_ID'))
if pass_id == 0 and batch_id == 5:
    profiler.start_profiler("All")
elif pass_id == 0 and batch_id == 10:
    # each trainer writes its profile to its own file so the logs do not collide
    profiler.stop_profiler("total", "/tmp/profile_" + str(trainer_id))
```
1. The pserver profiler can be enabled by adding two environment variables, for example:
2. The pserver profiler can be enabled by adding two environment variables, for example:
```
FLAGS_rpc_server_profile_period=10 FLAGS_rpc_server_profile_path=./tmp/pserver python train.py
```
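
For context, the following is a minimal, self-contained sketch of where the start/stop calls from step 1 might sit in a trainer's main loop. The toy network, the CPU executor, and the pass/batch counts are illustrative placeholders and not part of the original document; the profiler calls themselves follow the snippet above.

```python
import os
import numpy
import paddle.fluid as fluid
import paddle.fluid.profiler as profiler

# A toy network so the sketch runs on its own; a real trainer builds its own model.
x = fluid.layers.data(name='x', shape=[1], dtype='float32')
loss = fluid.layers.mean(fluid.layers.fc(input=x, size=1))
fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())

# Fall back to 0 when the launcher has not set PADDLE_TRAINER_ID (single-process run).
trainer_id = int(os.environ.get('PADDLE_TRAINER_ID', '0'))
for pass_id in range(1):
    for batch_id in range(20):
        if pass_id == 0 and batch_id == 5:
            profiler.start_profiler("All")
        elif pass_id == 0 and batch_id == 10:
            profiler.stop_profiler("total", "/tmp/profile_" + str(trainer_id))
        exe.run(feed={'x': numpy.random.random((4, 1)).astype('float32')},
                fetch_list=[loss])
```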
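
Once each trainer (and the pserver) has written its profile file, the same timeline tool can merge them into a single view. Treat the invocation below as a sketch: the file names are illustrative, and it assumes the comma-separated `name=file` multi-file form accepted by `--profile_path` in Paddle/tools/timeline.py.

```
python Paddle/tools/timeline.py \
    --profile_path=trainer0=/tmp/profile_0,trainer1=/tmp/profile_1,pserver=./tmp/pserver \
    --timeline_path=timeline
```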
......
......@@ -62,7 +62,17 @@ python Paddle/tools/timeline.py --profile_path=/tmp/profile --timeline_path=time
## Distributed
This tool supports distributed training programs (pserver and trainer) too.
1. Open the trainer profiler just as described in the [local](#local) section.
1. Open the trainer profiler just as described in the [local](#local) section, but give each trainer its own profile path, since there may be more than one trainer on the same node. For example:
```python
import os
import paddle.fluid.profiler as profiler

# PADDLE_TRAINER_ID, or any other method that yields the unique id of the current trainer
trainer_id = int(os.environ.get('PADDLE_TRAINER_ID'))
if pass_id == 0 and batch_id == 5:
    profiler.start_profiler("All")
elif pass_id == 0 and batch_id == 10:
    # each trainer writes its profile to its own file so the logs do not collide
    profiler.stop_profiler("total", "/tmp/profile_" + str(trainer_id))
```
2. Open the pserver profiler by adding two environment variables, e.g.:
```
......