From d3d6748fcddd14f0b82489836bf008616e9e79c7 Mon Sep 17 00:00:00 2001
From: liangjianzhong
Date: Tue, 16 Jun 2020 08:55:03 +0000
Subject: [PATCH] revise the timeline tool document for fleet training

---
 .../analysis_tools/timeline_cn.md  | 14 ++++++++++++--
 .../analysis_tools/timeline_en.md  | 12 +++++++++++-
 2 files changed, 23 insertions(+), 3 deletions(-)
 mode change 100644 => 100755 doc/fluid/advanced_guide/performance_improving/analysis_tools/timeline_cn.md
 mode change 100644 => 100755 doc/fluid/advanced_guide/performance_improving/analysis_tools/timeline_en.md

diff --git a/doc/fluid/advanced_guide/performance_improving/analysis_tools/timeline_cn.md b/doc/fluid/advanced_guide/performance_improving/analysis_tools/timeline_cn.md
old mode 100644
new mode 100755
index e40afcf3f..a92494390
--- a/doc/fluid/advanced_guide/performance_improving/analysis_tools/timeline_cn.md
+++ b/doc/fluid/advanced_guide/performance_improving/analysis_tools/timeline_cn.md
@@ -60,9 +60,19 @@ python Paddle/tools/timeline.py --profile_path=/tmp/profile --timeline_path=time
 
 ## 分布式使用
 一般来说,分布式的训练程序都会有两种程序:pserver和trainer。我们提供了把pserver和trainer的profile日志用timeline来显示的方式。
-1. trainer打开方式与[本地使用](#local)部分的第1步相同
+1. trainer打开方式与[本地使用](#local)部分的第1步基本相同,但因为存在多个trainer,需要对每个trainer做区分。例如:
+    ```python
+    # or use any other method to get the unique id of the current trainer
+    trainer_id = int(os.environ.get('PADDLE_TRAINER_ID'))
+
+    if pass_id == 0 and batch_id == 5:
+        profiler.start_profiler("All")
+    elif pass_id == 0 and batch_id == 10:
+        profiler.stop_profiler("total", "/tmp/profile_" + str(trainer_id))
+
+    ```
 
-1. pserver可以通过加两个环境变量打开profile,例如:
+2. pserver可以通过加两个环境变量打开profile,例如:
 ```
 FLAGS_rpc_server_profile_period=10 FLAGS_rpc_server_profile_path=./tmp/pserver python train.py
 ```
diff --git a/doc/fluid/advanced_guide/performance_improving/analysis_tools/timeline_en.md b/doc/fluid/advanced_guide/performance_improving/analysis_tools/timeline_en.md
old mode 100644
new mode 100755
index fb51802a1..3ba1a8295
--- a/doc/fluid/advanced_guide/performance_improving/analysis_tools/timeline_en.md
+++ b/doc/fluid/advanced_guide/performance_improving/analysis_tools/timeline_en.md
@@ -62,7 +62,17 @@ python Paddle/tools/timeline.py --profile_path=/tmp/profile --timeline_path=time
 
 ## Distributed
 This tool can support distributed train programs(pserver and trainer) too.
-1. Open traniner profiler just like how to use in [local](#local).
+1. Open the trainer profiler just as in [local](#local), but remember to give each trainer its own profile path, since there may be more than one trainer on the same node.
+    ```python
+    # or use any other method to get the unique id of the current trainer
+    trainer_id = int(os.environ.get('PADDLE_TRAINER_ID'))
+
+    if pass_id == 0 and batch_id == 5:
+        profiler.start_profiler("All")
+    elif pass_id == 0 and batch_id == 10:
+        profiler.stop_profiler("total", "/tmp/profile_" + str(trainer_id))
+
+    ```
 
 2. Open pserver profiler: add two environment variables, e.g.:
 ```
-- 
GitLab
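
For reference, the snippet this patch adds to both pages is meant to sit inside an ordinary fluid training loop. Below is a minimal self-contained sketch, assuming the `paddle.fluid` 1.x API that these docs target; the toy network, the batch counts, and the fallback trainer id of 0 are illustrative placeholders, not part of the patch:

```python
import os

import numpy as np
import paddle.fluid as fluid
import paddle.fluid.profiler as profiler

# PADDLE_TRAINER_ID is set by the distributed launcher; fall back to '0'
# so the same script can also be run locally.
trainer_id = int(os.environ.get('PADDLE_TRAINER_ID', '0'))

# Toy network, just so the loop has something to execute.
x = fluid.layers.data(name='x', shape=[4], dtype='float32')
loss = fluid.layers.mean(fluid.layers.fc(input=x, size=1))
fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())

for pass_id in range(1):
    for batch_id in range(20):
        # Profile batches 5-9 of the first pass, writing one profile file
        # per trainer so concurrent trainers do not clobber each other.
        if pass_id == 0 and batch_id == 5:
            profiler.start_profiler("All")
        elif pass_id == 0 and batch_id == 10:
            profiler.stop_profiler("total", "/tmp/profile_" + str(trainer_id))
        feed = {'x': np.random.random((8, 4)).astype('float32')}
        exe.run(fluid.default_main_program(), feed=feed, fetch_list=[loss])
```

Each per-trainer file written this way (`/tmp/profile_0`, `/tmp/profile_1`, ...) can then be converted with `Paddle/tools/timeline.py` exactly as in the local case described earlier in both documents.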